
[Condor-users] SCHEDD crash - "write job_queue.log failed, errno = 2"



Jobs in the Condor queue were restarted over and over at around 2am, 
almost every day. Further investigation revealed that the scheduler 
(condor_schedd) on the central manager had crashed and been restarted.

Here is what is in MasterLog:
10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored
10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
10/14 02:02:21 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/14 02:02:21 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/14 02:02:31 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 5510
10/14 13:52:48 Preen pid is 11906
10/14 13:52:51 Child 11906 died, but not a daemon -- Ignored
10/15 13:52:48 Preen pid is 3619
10/15 13:52:52 Child 3619 died, but not a daemon -- Ignored
10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4
10/16 02:02:29 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/16 02:02:30 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/16 02:02:40 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 19703
10/16 11:20:15 DaemonCore: Command received via TCP from host <10.10.20.1:41725>
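
To check whether the crashes really cluster around the same time of day, the exit lines can be pulled out of MasterLog with a short script. The following is only a sketch: the regular expression is written against the line format shown in the excerpt above, and the function name find_daemon_exits and the SAMPLE list are made up for illustration, not part of Condor.

```python
import re

# Matches MasterLog lines of the form seen in the excerpt, e.g.
#   10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
# The pattern is an assumption based on that excerpt only.
EXIT_RE = re.compile(
    r"^(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) The (\w+) \(pid (\d+)\) "
    r"exited with status (\d+)"
)


def find_daemon_exits(lines):
    """Return (date, time, daemon, pid, status) for each daemon exit line."""
    exits = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            date, time, daemon, pid, status = match.groups()
            exits.append((date, time, daemon, int(pid), int(status)))
    return exits


# Lines copied from the MasterLog excerpt above.
SAMPLE = [
    "10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored",
    "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4",
    "10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4",
]

for entry in find_daemon_exits(SAMPLE):
    print(entry)
```

Running it over the whole MasterLog (for example with open(path) as the lines argument) would show whether every exit falls near 02:02, which would point at something scheduled at that hour.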

In SchedLog:
10/16 02:02:27 (pid:5510) Sent ad to 1 collectors for cbriscoe@xxxxxxxx
10/16 02:02:29 (pid:5510) ERROR "write to /home2/condor/hosts/master1/spool/job_queue.log failed, errno = 2" at line 150 in file classad_log.C
10/16 02:02:41 (pid:19703) ******************************************************
10/16 02:02:41 (pid:19703) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 02:02:41 (pid:19703) ** /home2/condor/sbin/condor_schedd
10/16 02:02:41 (pid:19703) ** $CondorVersion: 6.7.18 Mar 22 2006 $
10/16 02:02:41 (pid:19703) ** $CondorPlatform: I386-LINUX_RH9 $
10/16 02:02:41 (pid:19703) ** PID = 19703
10/16 02:02:41 (pid:19703) ******************************************************

I don't understand why the write to job_queue.log fails with 
errno = 2: the file has the right permissions and is not too big. 
It is on a shared file system.
-rw-------    1 condor   condor     836527 Oct 16  2006 /home2/condor/hosts/master1/spool/job_queue.log
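
For reference, errno 2 on Linux is ENOENT ("No such file or directory"), which is a different error from a permission or size problem. The sketch below decodes the number and reports which components of a path exist at the moment it runs. The helper names describe_errno and path_status are hypothetical, and running such a check at the time of the crash is only a suggestion for narrowing down whether the spool path on the shared file system is briefly unavailable. It does not establish the cause.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def path_status(path):
    """Report, component by component, whether `path` exists right now."""
    results = []
    current = os.sep if os.path.isabs(path) else ""
    for part in path.strip(os.sep).split(os.sep):
        current = os.path.join(current, part)
        results.append((current, os.path.exists(current)))
    return results


if __name__ == "__main__":
    name, message = describe_errno(2)
    print("errno 2 is", name, "-", message)

    # Path taken from the SchedLog error message above.
    spool_log = "/home2/condor/hosts/master1/spool/job_queue.log"
    for component, exists in path_status(spool_log):
        print("exists" if exists else "MISSING", component)
```

Run from cron shortly before and after 02:00 on the central manager, the output would show whether any directory above job_queue.log is missing at that time.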

Any ideas?

Junjun