
Re: [Condor-users] SCHEDD crash - "write job_queue.log failed, errno = 2"




The error number (2) reported by the schedd indicates "No such file or directory". The schedd treats any such failure to access its job_queue.log as a critical error, so I would recommend putting your SPOOL directory on a local filesystem, if you can.
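
For reference, the mapping from a numeric errno to its symbolic name and system message can be checked with a few lines of Python. This is only a minimal sketch; the helper name `describe_errno` is illustrative and is not part of Condor.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for a numeric errno value."""
    # errno.errorcode maps numbers to names such as "ENOENT".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message text for that number.
    return name, os.strerror(code)


name, message = describe_errno(2)
print(name, "-", message)
```

On Linux, errno 2 is ENOENT, which matches the "No such file or directory" reading of the schedd error.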

--Dan

Junjun Mao wrote:

The Condor schedd turned out to be crashing during the system backup, when the disk was simply too busy. Throttling the backup speed resolved it. That is understandable, but I still hope Condor can tolerate slow disk I/O the way other applications do.

Junjun

On Wednesday 01 November 2006 16:21, Junjun Mao wrote:
Jobs in the Condor queue were being restarted over and over around 2am,
almost every day. Further investigation revealed that the scheduler on the
central manager had crashed and been restarted.

Here is what is in MasterLog:
10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored
10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
10/14 02:02:21 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/14 02:02:21 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/14 02:02:31 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 5510
10/14 13:52:48 Preen pid is 11906
10/14 13:52:51 Child 11906 died, but not a daemon -- Ignored
10/15 13:52:48 Preen pid is 3619
10/15 13:52:52 Child 3619 died, but not a daemon -- Ignored
10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4
10/16 02:02:29 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/16 02:02:30 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/16 02:02:40 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 19703
10/16 11:20:15 DaemonCore: Command received via TCP from host <10.10.20.1:41725>
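
To check whether the exits line up with a nightly job such as a backup, the exit events can be pulled out of MasterLog. This is a rough sketch, assuming the "MM/DD HH:MM:SS The SCHEDD (pid N) exited with status S" line format seen in the log above; the function name `schedd_exits` and the sample list are illustrative only.

```python
import re

# Matches MasterLog lines of the form seen above, for example:
# 10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
EXIT_RE = re.compile(
    r"^(\d{2}/\d{2}) (\d{2}:\d{2}:\d{2}) "
    r"The SCHEDD \(pid (\d+)\) exited with status (\d+)"
)


def schedd_exits(lines):
    """Return (date, time, pid, status) for each schedd exit line."""
    events = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            date, time, pid, status = match.groups()
            events.append((date, time, int(pid), int(status)))
    return events


# Sample lines copied from the MasterLog excerpt above.
sample = [
    "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4",
    "10/14 13:52:48 Preen pid is 11906",
    "10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4",
]

for event in schedd_exits(sample):
    print(event)
```

Both exits in the sample fall within a few seconds of 02:02, which is the kind of clustering that points at a scheduled job.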

In SchedLog:
10/16 02:02:27 (pid:5510) Sent ad to 1 collectors for cbriscoe@xxxxxxxx
10/16 02:02:29 (pid:5510) ERROR "write to /home2/condor/hosts/master1/spool/job_queue.log failed, errno = 2" at line 150 in file classad_log.C
10/16 02:02:41 (pid:19703) ******************************************************
10/16 02:02:41 (pid:19703) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 02:02:41 (pid:19703) ** /home2/condor/sbin/condor_schedd
10/16 02:02:41 (pid:19703) ** $CondorVersion: 6.7.18 Mar 22 2006 $
10/16 02:02:41 (pid:19703) ** $CondorPlatform: I386-LINUX_RH9 $
10/16 02:02:41 (pid:19703) ** PID = 19703
10/16 02:02:41 (pid:19703) ******************************************************

I don't understand why errno = 2 is reported for the job_queue.log file,
which has the right permissions and is not too big. The file is on a
shared file system.
-rw-------    1 condor   condor     836527 Oct 16 2006 /home2/condor/hosts/master1/spool/job_queue.log

Any ideas?

Junjun



