
Re: [Condor-users] jobs don't run when using condor_credd



When your job tries to start, it probably uses a shared pool password to authenticate against the credd. Did you set the shared pool password on all machines?
 
condor_store_cred -c add
 
 
Mike


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jeffrey Stephen
Sent: 08 March 2007 06:40
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] jobs don't run when using condor_credd

Hi,
 
I am trying to set up condor_credd on Windows XP. I have a central manager machine (nes30700) and one submit/execute (i.e. slave) machine (nes15300). The slave machine is configured to always run jobs:
 
=================================================================
> condor_status
 
Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime
 
vm1@NES30700. WINNT51     INTEL  Owner      Idle       0.040  1023  0+00:05:15
vm2@NES30700. WINNT51     INTEL  Owner      Idle       0.000  1023  0+00:05:16
nes15300.land WINNT51     INTEL  Unclaimed  Idle       -0.010 1022  0+00:09:55
=================================================================
 
To run jobs I had to use "condor_store_cred" to set my password. I did this on both the central manager and the slave machine. (Is that correct?)
Once that was done, I could successfully run a test program using condor_submit.
 
I want to use a shared filesystem, so I tried to set up condor_credd. I did the following:
1. copied the example file (etc/condor_config.local.credd) into condor_config.local in the condor main directory on both the central manager and the slave machines;
2. added the following lines to the condor_config file (on both the central manager and the slave machines):
    STARTER_ALLOW_RUNAS_OWNER = True
    CREDD_HOST = nes30700.lands.resnet.qg
    CREDD_CACHE_LOCALLY = True
    SEC_CLIENT_AUTHENTICATION_METHODS = NTSSPI, PASSWORD
3. modified the condor_config file (on both the central manager and the slave machines):
   COLLECTOR_NAME = QCCCE_condor
   where "QCCCE_condor" is the name of my condor pool
4. started condor on both the central manager and the slave machines (using net start condor)
The condor_master, condor_collector, condor_credd, condor_negotiator, condor_schedd and condor_startd daemons started on both machines. I thought condor_negotiator and condor_collector were only supposed to run on the central manager machine, but they were running on both the central manager and the slave machine.
5. added "run_as_owner = true" to the job config file
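
Since the same settings have to be present on both machines, a quick consistency check can help rule out a typo in one of the files. The following is only a rough sketch in Python: the function names (parse_config, missing_or_different) are made up for illustration, the expected values are simply the ones listed in step 2, and it handles plain "NAME = value" lines only (no macro expansion, includes or line continuations).

```python
import re

# Settings listed in step 2 above; the values are the ones from this post.
EXPECTED = {
    "STARTER_ALLOW_RUNAS_OWNER": "True",
    "CREDD_HOST": "nes30700.lands.resnet.qg",
    "CREDD_CACHE_LOCALLY": "True",
    "SEC_CLIENT_AUTHENTICATION_METHODS": "NTSSPI, PASSWORD",
}

def parse_config(text):
    """Parse simple 'NAME = value' lines; later definitions override earlier ones."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = re.match(r"([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*)$", line)
        if match:
            # Assumption: setting names are treated as case-insensitive,
            # so they are normalised to upper case here.
            settings[match.group(1).upper()] = match.group(2).strip()
    return settings

def missing_or_different(text, expected=EXPECTED):
    """Return {name: (expected, found)} for settings that are absent or differ."""
    found = parse_config(text)
    problems = {}
    for name, want in expected.items():
        have = found.get(name)
        if have is None or have.lower() != want.lower():
            problems[name] = (want, have)
    return problems
```

Reading condor_config on each machine and passing its text to missing_or_different() would list any of the four settings that are absent or spelled differently. An empty result only means the lines are present, not that the credd is working.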
 
When I submit a job, it appears in the queue as idle and never runs:
=================================================================
> condor_q

-- Submitter: NES30700.lands.resnet.qg : <131.242.63.124:1144> : NES30700.lands.resnet.qg
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD              
   6.0   jeffreysj       3/7  14:07   0+00:00:00 I  0   9.8  output_name.exe  
 
1 jobs; 1 idle, 0 running, 0 held
=================================================================
 
This same job executed immediately before I installed the condor_credd.
 
The credd log file contains an authentication error:
 
=================================================================
3/8 11:53:30 ******************************************************
3/8 11:53:30 ** condor_credd.exe (CONDOR_CREDD) STARTING UP
3/8 11:53:30 ** D:\condor\bin\condor_credd.exe
3/8 11:53:30 ** $CondorVersion: 6.9.1 Jan  8 2007 $
3/8 11:53:30 ** $CondorPlatform: INTEL-WINNT50 $
3/8 11:53:30 ** PID = 2180
3/8 11:53:30 ** Log last touched 3/8 11:34:43
3/8 11:53:30 ******************************************************
3/8 11:53:30 Using config source: D:\condor\condor_config
3/8 11:53:30 Using local config sources:
3/8 11:53:30    D:\condor/condor_config.local
3/8 11:53:30 DaemonCore: Command Socket at <131.242.63.124:9620>
3/8 11:53:30 main_init() called
3/8 11:53:30 Calling Timer handler 0 (dc_touch_log_file)
3/8 11:53:31 Return from Timer handler 0 (dc_touch_log_file)
3/8 11:53:31 Calling Timer handler 1 (check_session_cache)
3/8 11:53:31 Return from Timer handler 1 (check_session_cache)
3/8 11:53:31 Calling Timer handler 2 (handle_cookie_refresh)
3/8 11:53:31 Return from Timer handler 2 (handle_cookie_refresh)
3/8 11:53:31 Calling Timer handler 3 (self_monitor)
3/8 11:53:31 Return from Timer handler 3 (self_monitor)
3/8 11:53:31 Calling Timer handler 6 (update_collector)
3/8 11:53:31 Return from Timer handler 6 (update_collector)
3/8 11:53:31 Calling Timer handler 5 (DaemonCore::SendAliveToParent)
3/8 11:53:31 Return from Timer handler 5 (DaemonCore::SendAliveToParent)
3/8 11:53:31 Calling Handler <<131.242.63.124:9618>>
3/8 11:53:31 Return from Handler <<131.242.63.124:9618>>
3/8 11:54:31 Calling Timer handler 7 (dc_touch_log_file)
3/8 11:54:31 Return from Timer handler 7 (dc_touch_log_file)
3/8 11:55:31 Calling Timer handler 8 (dc_touch_log_file)
3/8 11:55:31 Return from Timer handler 8 (dc_touch_log_file)
3/8 11:56:12 Calling Handler <DaemonCore::HandleReqSocketHandler>
3/8 11:56:12 getStoredCredential(): Could not locate credential for user 'condor_pool@xxxxxxxxxxxxxxx'
3/8 11:56:12 getStoredCredential(): Could not locate credential for user 'condor_pool@xxxxxxxxxxxxxxx'
3/8 11:56:32 AUTHENTICATE: no available authentication methods succeeded, failing!
3/8 11:56:32 DC_AUTHENTICATE: authenticate failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using PASSWORD
3/8 11:56:32 Return from Handler <DaemonCore::HandleReqSocketHandler>
3/8 11:56:32 Calling Timer handler 9 (dc_touch_log_file)
3/8 11:56:32 Return from Timer handler 9 (dc_touch_log_file)
3/8 11:57:13 Calling Handler <DaemonCore::HandleReqSocketHandler>
3/8 11:57:13 Calling HandleReq <store_cred_handler> (0)
3/8 11:57:13 Return from HandleReq <store_cred_handler>
3/8 11:57:13 Return from Handler <DaemonCore::HandleReqSocketHandler>
3/8 11:57:31 Calling Timer handler 3 (self_monitor)
3/8 11:57:31 Return from Timer handler 3 (self_monitor)
3/8 11:57:32 Calling Timer handler 11 (dc_touch_log_file)
3/8 11:57:32 Return from Timer handler 11 (dc_touch_log_file)
=================================================================
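
The interesting lines in a log like this are easy to miss among the timer-handler noise. As a rough sketch (not an official Condor tool), a few lines of Python can pull out just the credential and authentication failures. The function name scan_credd_log is made up, and the two patterns are taken only from the messages visible in the excerpt above, so other failure messages would not be caught.

```python
import re

# Message fragments taken from the credd log excerpt above.
PATTERNS = {
    "missing_credential": re.compile(r"Could not locate credential for user '([^']*)'"),
    "auth_failed": re.compile(r"DC_AUTHENTICATE: authenticate failed: (.*)$"),
}

def scan_credd_log(lines):
    """Return a list of (timestamp, kind, detail) for lines matching PATTERNS."""
    hits = []
    for line in lines:
        # Timestamps in this log look like '3/8 11:56:12 ' (month/day time).
        stamp = re.match(r"(\d+/\d+ \d+:\d+:\d+) ", line)
        if not stamp:
            continue  # not a regular log line
        for kind, pattern in PATTERNS.items():
            found = pattern.search(line)
            if found:
                hits.append((stamp.group(1), kind, found.group(1)))
    return hits
```

Run over the excerpt above, this would report the two failed lookups for 'condor_pool@...' at 11:56:12 and the PASSWORD authentication failure at 11:56:32, which are the lines that matter for diagnosing the idle job.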
 
 
Does anyone know what the problem could be?
 
cheers
steve

