
[Condor-users] [Condor 6.8.4] Job not executed because user could not authenticate (Linux with Windows authentication)



Hi, 
We are having a problem where a particular Linux box on our network is rejecting jobs that have been sent to it.
 
000 (443.000.000) 05/11 12:01:04 Job submitted from host: <xxx.xxx.xxx.xxx:1061>
...
022 (443.000.000) 05/11 12:01:12 Job disconnected, attempting to reconnect
    Socket between submit and execute hosts closed unexpectedly
    Trying to reconnect to vm1@xxxxxxxxxxx <xxx.xxx.xxx.xxx:32773>
...

024 (443.000.000) 05/11 12:01:12 Job reconnection failed
    Job not found at execution machine
    Can not reconnect to vm1@xxxxxxxxxxx, rescheduling job
...
 
This box differs from our other Linux box in that it authenticates users against a Windows domain instead of a plain /etc/passwd file. This is something we cannot change.
Condor works fine on our other Linux box (which uses /etc/passwd + NIS) and on our Windows boxes (which use the domain). According to StarterLog.vm1, the original error that caused jobs to be evicted was:
 
5/11 13:35:31 ******************************************************
5/11 13:35:31 ** condor_starter (CONDOR_STARTER) STARTING UP
5/11 13:35:31 ** /opt/condor-6.8.4/sbin/condor_starter
5/11 13:35:31 ** $CondorVersion: 6.8.4 Feb  1 2007 $
5/11 13:35:31 ** $CondorPlatform: I386-LINUX_RHEL3 $
5/11 13:35:31 ** PID = 4580
5/11 13:35:31 ** Log last touched 5/11 13:31:18
5/11 13:35:31 ******************************************************
5/11 13:35:31 Using config source: /opt/condor-6.8.4/etc/condor_config
5/11 13:35:31 Using local config sources: 
5/11 13:35:31    /opt/condor-6.8.4/local.xxx/condor_config.local
5/11 13:35:31 DaemonCore: Command Socket at <xxx.xxx.xxx.xxx:32821>
5/11 13:35:31 Done setting resource limits
5/11 13:35:31 Communicating with shadow <xxx.xxx.xxx.xxx:3550>
5/11 13:35:31 Submitting machine is "xxx.xxx.xxx.xxxl"
5/11 13:35:31 passwd_cache::cache_uid(): getpwnam("xxx") failed: user not found
5/11 13:35:31 ERROR: Uid for "xxx" not found in passwd file and SOFT_UID_DOMAIN is False
5/11 13:35:31 ERROR: Failed to determine what user to run this job as, aborting
5/11 13:35:31 Failed to initialize JobInfoCommunicator, aborting
5/11 13:35:31 Unable to start job.
5/11 13:35:31 **** condor_starter (condor_STARTER) EXITING WITH STATUS 1
5/11 13:40:31 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
 
5/11 13:40:31 passwd_cache::cache_uid(): getpwnam("condor") failed: user not found
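The getpwnam() failures in the log above can probably be checked outside Condor. Below is a minimal sketch in Python, assuming the standard pwd module resolves accounts through the same system lookup that condor_starter uses on this box. The helper names lookup_uid and check_accounts are hypothetical and only for illustration.

```python
import pwd


def lookup_uid(username):
    """Return the numeric UID for username, or None if the system lookup fails."""
    try:
        # pwd.getpwnam() raises KeyError when the account cannot be resolved.
        return pwd.getpwnam(username).pw_uid
    except KeyError:
        return None


def check_accounts(usernames):
    """Map each account name to its UID (None means 'user not found')."""
    return {name: lookup_uid(name) for name in usernames}
```

Calling check_accounts() with the job owner's name and "condor" should show whether either account is visible to the system at all. A None for both would suggest the problem lies in how the box resolves domain users, not in Condor itself.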

I then set SOFT_UID_DOMAIN to TRUE in the global config file and restarted Condor on this machine. Now the error is:
 
5/11 13:52:27 ******************************************************
5/11 13:52:27 ** condor_starter (CONDOR_STARTER) STARTING UP
5/11 13:52:27 ** /opt/condor-6.8.4/sbin/condor_starter
5/11 13:52:27 ** $CondorVersion: 6.8.4 Feb  1 2007 $
5/11 13:52:27 ** $CondorPlatform: I386-LINUX_RHEL3 $
5/11 13:52:27 ** PID = 4785
5/11 13:52:27 ** Log last touched 5/11 13:43:40
5/11 13:52:27 ******************************************************
5/11 13:52:27 Using config source: /opt/condor-6.8.4/etc/condor_config
5/11 13:52:27 Using local config sources: 
5/11 13:52:27    /opt/condor-6.8.4/local.xxx/condor_config.local
5/11 13:52:27 DaemonCore: Command Socket at <xxx.xxx.xxx.xxx:32863>
5/11 13:52:27 Done setting resource limits
5/11 13:52:28 Communicating with shadow <xxx.xxx.xxx.xxx:xxx>
5/11 13:52:28 Submitting machine is "xxx.xxx.xxx.xxx"
5/11 13:52:28 passwd_cache::cache_uid(): getpwnam("xxx") failed: user not found
5/11 13:52:28 user_info ClassAd does not contain Uid!
5/11 13:52:28 ERROR: Failed to determine what user to run this job as, aborting
5/11 13:52:28 Failed to initialize JobInfoCommunicator, aborting
5/11 13:52:28 Unable to start job.
5/11 13:52:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 1

Following a post in the Condor archive about a similar error message, I tried putting VM1_USER = domainname\username in the local config file and restarted Condor, but the error still persists.
 
Any ideas?

Thanks.
---- 
Fabrice Bouyé (http://fabricebouye.cv.fm/) 
Research Officer (Data) 
Tel: +687 26 20 00 (Ext 411) 
Oceanic Fisheries, Pacific Community 
http://www.spc.int/