Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] SCHEDD crash - "write job_queue.log failed, errno = 2"

Date: Thu, 02 Nov 2006 08:36:55 -0600
From: Dan Bradley <dan@xxxxxxxxxxxx>
Subject: Re: [Condor-users] SCHEDD crash - "write job_queue.log failed, errno = 2"

The error number (2) reported by the schedd indicates "No such file or directory". The schedd treats any such failure to access its job_queue.log as a critical error, so I would recommend putting your SPOOL directory on a local filesystem, if you can.
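
For anyone who wants to check what a numeric errno from a log means on their own machine, here is a minimal sketch in Python. The helper name describe_errno is made up for illustration; the errno and os calls are from the standard library, and the message text returned by os.strerror depends on the platform.

```python
import errno
import os


def describe_errno(code: int) -> str:
    """Return a readable description of a numeric errno value.

    describe_errno is a hypothetical helper, not part of Condor.
    """
    # errno.errorcode maps numbers to symbolic names such as ENOENT.
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror returns the platform's message for the code.
    return f"errno {code} = {name}: {os.strerror(code)}"


# The value reported in the SchedLog error above.
print(describe_errno(2))
```

On Linux, errno 2 is ENOENT, which matches the "No such file or directory" reading given above.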


--Dan

Junjun Mao wrote:

We found that the Condor schedd crashed during the system backup, when the disk was simply too busy. Throttling the backup speed resolved it. While this is understandable, I still hope Condor can tolerate slow disk I/O the way other applications do.


Junjun

On Wednesday 01 November 2006 16:21, Junjun Mao wrote:

Jobs in the Condor queue were restarted over and over around 2am, almost
every day. Further investigation revealed that the scheduler on the central
manager had crashed and been restarted.

Here is what is in MasterLog:
10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored
10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
10/14 02:02:21 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/14 02:02:21 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/14 02:02:31 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 5510
10/14 13:52:48 Preen pid is 11906
10/14 13:52:51 Child 11906 died, but not a daemon -- Ignored
10/15 13:52:48 Preen pid is 3619
10/15 13:52:52 Child 3619 died, but not a daemon -- Ignored
10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4
10/16 02:02:29 Sending obituary for "/home2/condor/sbin/condor_schedd"
10/16 02:02:30 restarting /home2/condor/sbin/condor_schedd in 10 seconds
10/16 02:02:40 Started DaemonCore process "/home2/condor/sbin/condor_schedd", pid and pgroup = 19703
10/16 11:20:15 DaemonCore: Command received via TCP from host <10.10.20.1:41725>
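
To see whether schedd exits cluster at a particular time of day (for example, during a nightly backup), the MasterLog can be scanned for exit lines. The following is a rough sketch, assuming the line format shown in the excerpt above; the function name schedd_exits and the regular expression are illustrative and not part of Condor.

```python
import re
from collections import Counter

# Matches lines such as:
#   10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4
# The pattern is an assumption based on the MasterLog excerpt above.
EXIT_RE = re.compile(
    r"^(\d{2}/\d{2}) (\d{2}):\d{2}:\d{2} "
    r"The SCHEDD \(pid (\d+)\) exited with status (\d+)"
)


def schedd_exits(lines):
    """Return (date, hour, pid, status) for each schedd exit line."""
    exits = []
    for line in lines:
        match = EXIT_RE.match(line.strip())
        if match:
            date, hour, pid, status = match.groups()
            exits.append((date, int(hour), int(pid), int(status)))
    return exits


# Sample lines copied from the MasterLog excerpt above.
sample = [
    "10/13 13:52:51 Child 15457 died, but not a daemon -- Ignored",
    "10/14 02:02:21 The SCHEDD (pid 10264) exited with status 4",
    "10/16 02:02:29 The SCHEDD (pid 5510) exited with status 4",
]

exits = schedd_exits(sample)
# Count exits per hour of day to spot a recurring time window.
by_hour = Counter(hour for _, hour, _, _ in exits)
print(exits)
print(by_hour)
```

To run it against a real log, pass an open file object instead of the sample list; the path to MasterLog depends on the local installation.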

In SchedLog:
10/16 02:02:27 (pid:5510) Sent ad to 1 collectors for cbriscoe@xxxxxxxx
10/16 02:02:29 (pid:5510) ERROR "write to /home2/condor/hosts/master1/spool/job_queue.log failed, errno = 2" at line 150 in file classad_log.C
10/16 02:02:41 (pid:19703) ******************************************************
10/16 02:02:41 (pid:19703) ** condor_schedd (CONDOR_SCHEDD) STARTING UP
10/16 02:02:41 (pid:19703) ** /home2/condor/sbin/condor_schedd
10/16 02:02:41 (pid:19703) ** $CondorVersion: 6.7.18 Mar 22 2006 $
10/16 02:02:41 (pid:19703) ** $CondorPlatform: I386-LINUX_RH9 $
10/16 02:02:41 (pid:19703) ** PID = 19703
10/16 02:02:41 (pid:19703) ******************************************************

I don't understand why errno = 2 is reported for the job_queue.log file,
which has the correct permissions and is not too big. This file is on a
shared file system.
-rw-------    1 condor   condor     836527 Oct 16 2006 /home2/condor/hosts/master1/spool/job_queue.log

Any ideas?

Junjun






