
[Condor-users] schedd changes owner to a regular user and results in queue crash



Hi all,

A serious problem just hit my cluster and caused a complete shutdown 
of Condor: the schedd ended up running as a regular user (pwang) 
instead of condor. How could this happen?

[root@master1 y-61.1]# ps -ef | grep condor
pwang    26763     1  0 Nov18 ?        00:00:00 condor_shadow -f 886.0 <10.10.20.1:34661> -
pwang    26766     1  0 Nov18 ?        00:00:00 condor_shadow -f 886.2 <10.10.20.1:34661> -
pwang    26772     1  0 Nov18 ?        00:00:00 condor_shadow -f 886.1 <10.10.20.1:34661> -
pwang    29394     1  0 Nov18 ?        00:00:00 condor_shadow -f 886.4 <10.10.20.1:34661> -
condor   19319     1  0 Nov21 ?        00:34:54 /home2/condor/sbin/condor_master
condor   19320 19319  0 Nov21 ?        01:43:02 condor_collector -f
pwang    19393 19319  0 Dec09 ?        00:00:06 condor_schedd -f
condor   19401 19319  0 Dec09 ?        00:02:31 condor_negotiator -f
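
For anyone who wants to spot this condition automatically, here is a 
minimal sketch in Python that scans "ps -ef" style text and flags core 
daemons not owned by the expected account. The function name, the daemon 
list and the expected owner "condor" are my own assumptions for 
illustration, not anything provided by Condor. It also assumes the 
eight-column layout shown above (UID PID PPID C STIME TTY TIME CMD) with 
each process on a single line. condor_shadow is left out of the list on 
the assumption that shadows may run under the submitting user's account.

```python
# Hypothetical helper, not part of Condor: flag daemons with an unexpected owner.
EXPECTED_OWNER = "condor"  # assumption: the account the daemons should run as

# Assumption: the daemons worth checking on this machine.
DAEMONS = ("condor_master", "condor_collector",
           "condor_schedd", "condor_negotiator")


def misowned_daemons(ps_output, expected_owner=EXPECTED_OWNER):
    """Return (owner, pid, daemon) tuples for daemons not owned by expected_owner."""
    flagged = []
    for line in ps_output.splitlines():
        fields = line.split()
        # UID PID PPID C STIME TTY TIME CMD -> at least 8 fields are needed.
        if len(fields) < 8:
            continue
        owner, pid = fields[0], fields[1]
        # Strip any directory part, e.g. /home2/condor/sbin/condor_master.
        name = fields[7].rsplit("/", 1)[-1]
        if name in DAEMONS and owner != expected_owner:
            flagged.append((owner, pid, name))
    return flagged
```

Feeding it the listing above (with each process on one line) should 
report only the condor_schedd entry owned by pwang.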


Restarting the Condor daemons still left the schedd with the wrong 
owner. However, Condor started correctly after I deleted job 886 from 
the transaction log file job_queue.log.

According to job_queue.log, job 886 was terminated by "condor_rm".

The latest log covers the period from 2am on Dec 9. The SchedLog file 
reports shadow exceptions:
ERROR: Shadow exited with job exception code!

As far as I can remember, job 886 was in state X last week. It looks to 
me as if "condor_rm" can affect the schedd. Is that true?

Junjun