
[condor-users] scheduler crashing



Hello,

Something very unusual happened to our Condor system: the scheduler
(condor_schedd) crashed, and it keeps crashing after every restart.
The following is from the MasterLog:

4/21 14:45:27 The SCHEDD (pid 12019) exited with status 4
4/21 14:45:27 Sending obituary for "/data0/opt/condor-6.6.0/sbin/condor_schedd"
4/21 14:45:28 restarting /data0/opt/condor-6.6.0/sbin/condor_schedd in 10 seconds
4/21 14:45:38 Started DaemonCore process "/data0/opt/condor-6.6.0/sbin/condor_schedd", pid and pgroup = 1942
4/21 14:48:08 The SCHEDD (pid 1942) exited with status 4
4/21 14:48:08 Sending obituary for "/data0/opt/condor-6.6.0/sbin/condor_schedd"
4/21 14:48:08 restarting /data0/opt/condor-6.6.0/sbin/condor_schedd in 11 seconds
4/21 14:48:19 Started DaemonCore process "/data0/opt/condor-6.6.0/sbin/condor_schedd", pid and pgroup = 1954
4/21 14:50:26 The SCHEDD (pid 1954) exited with status 4
...


and so forth. Below is the e-mail that Condor sent:

This is an automated email from the Condor system
on machine "hydra.crump.ucla.edu".  Do not reply.

"/data0/opt/condor-6.6.0/sbin/condor_schedd" on "hydra.crump.ucla.edu" exited with status 4.
Condor will automatically restart this process in 10 seconds.

*** Last 20 line(s) of file SchedLog:
4/21 14:40:01 Scheduler::Relinquish - mrec is NULL, can't relinquish
4/21 14:40:01 Null parameter --- match not deleted
4/21 14:40:20 Activity on stashed negotiator socket
4/21 14:40:20 Negotiating for owner: mpetct@xxxxxxxxxxxxxxxx
4/21 14:40:20 Checking consistency running and runnable jobs
4/21 14:40:20 Tables are consistent
4/21 14:40:20 Out of servers - 0 jobs matched, 3 jobs idle, 3 jobs rejected
4/21 14:40:20 Scheduler::Relinquish - mrec is NULL, can't relinquish
4/21 14:40:20 Null parameter --- match not deleted
4/21 14:45:11 Sent ad to central manager for rannou@xxxxxxxxxxxxxxxx
4/21 14:45:11 Sent ad to central manager for mpetct@xxxxxxxxxxxxxxxx
4/21 14:45:11 Started shadow for job 243.6422 on "<192.168.10.7:44184>", (shadow pid = 1935)
4/21 14:45:17 Sent ad to central manager for rannou@xxxxxxxxxxxxxxxx
4/21 14:45:17 Sent ad to central manager for mpetct@xxxxxxxxxxxxxxxx
4/21 14:45:17 DaemonCore: Command received via TCP from host <192.168.10.7:34359>
4/21 14:45:17 DaemonCore: received command 443 (VACATE_SERVICE), calling handler (vacate_service)
4/21 14:45:17 Got VACATE_SERVICE from <192.168.10.7:34359>
4/21 14:45:17 Sent RELEASE_CLAIM to startd on <192.168.10.7:44184>
4/21 14:45:17 Match record (<192.168.10.7:44184>, 243, 6422) deleted
4/21 14:45:27 ERROR "write inside a transaction failed, errno = 0" at line 122 in file log_transaction.C
*** End of file SchedLog
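
To put numbers on the restart loop shown in the MasterLog excerpt, a small script along these lines could tally each exit and how long the daemon stayed up beforehand. This is only a sketch: the regular expressions assume the exact line format quoted above, the names summarize_crashes and parse_ts are made up for illustration, and because the log lines carry no year, the year used for parsing is an arbitrary placeholder.

```python
import re
from datetime import datetime

# Patterns assume the MasterLog line format quoted in this message.
EXIT_RE = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) The (\w+) \(pid (\d+)\) exited with status (\d+)"
)
START_RE = re.compile(
    r'^(\d+/\d+ \d+:\d+:\d+) Started DaemonCore process "[^"]+", '
    r"pid and pgroup = (\d+)"
)


def parse_ts(text, year=2000):
    # The log has no year; 2000 is an arbitrary placeholder (assumption).
    return datetime.strptime(f"{year} {text}", "%Y %m/%d %H:%M:%S")


def summarize_crashes(lines):
    """Return (daemon, pid, status, uptime_seconds) for every exit line.

    uptime_seconds is None when the matching start line is not in the input.
    """
    started = {}  # pid -> start time
    crashes = []
    for line in lines:
        m = START_RE.match(line)
        if m:
            started[int(m.group(2))] = parse_ts(m.group(1))
            continue
        m = EXIT_RE.match(line)
        if m:
            pid = int(m.group(3))
            begun = started.pop(pid, None)
            uptime = None
            if begun is not None:
                uptime = (parse_ts(m.group(1)) - begun).total_seconds()
            crashes.append((m.group(2), pid, int(m.group(4)), uptime))
    return crashes


if __name__ == "__main__":
    import sys

    # Usage: python crash_summary.py /path/to/MasterLog
    with open(sys.argv[1]) as fh:
        for daemon, pid, status, uptime in summarize_crashes(fh):
            print(daemon, pid, status, uptime)
```

Run against the MasterLog lines quoted above, this reports that the two restarted schedd processes (pids 1942 and 1954) each survived only about two minutes before exiting with status 4 again.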


Thanks for any help.


Fernando Rannou


