
[HTCondor-users] schedd stopped working (died?) with SchedLog filled with 'WriteUserLog checking for event, log rotation, but no lock'



Hi all,

tonight one of our schedulers (on a SL6.7 grid ARC CE) stopped working.
While I cannot rule out the ARC itself, I suspect a problem with the
Condor Schedd.

Job submission failed around 2:00 tonight, after which the SchedLog [1]
consisted only of lines such as
 > WriteUserLog checking for event log rotation, but no lock
which had occurred before as well, but not exclusively.

A bit later, at ~2:16, the MasterLog [2] started to log schedd processes
being reaped (dying?) with exit code 44. Restarts of the schedd went on
for ~20m, after which the MasterLog went silent until the service was
restarted.
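
For reference, the "status 11264" and "status 1024" values in the MasterLog
seem to be raw POSIX wait statuses, with the exit code in the high byte. The
MasterLog's own follow-up lines ("exited with status 44" / "status 4") are
consistent with that. A minimal sketch of the decoding, assuming that layout
(the function name is only illustrative and not part of HTCondor):

```python
def decode_wait_status(status: int) -> tuple[int, int]:
    """Split a raw POSIX wait status into (exit_code, signal_number).

    Assumes the conventional layout: exit code in bits 8-15,
    terminating signal in the low 7 bits.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    return exit_code, signal_number


# Raw values taken from the MasterLog excerpt [2]
for raw in (11264, 1024):
    code, sig = decode_wait_status(raw)
    print(f"status {raw}: exit code {code}, signal {sig}")
```

A signal number of 0 here would suggest the daemon called exit() itself
rather than being killed by a signal.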

So far I have found no information on schedd exit code 44, only for
the shadow [3].
There was a mail exchange more than 10 years ago, in which a shadow
exiting with error 44 took down the schedd as well

https://www-auth.cs.wisc.edu/lists/htcondor-users/2004-November/msg00229.shtml
but I don't know whether this is still relevant (I found no
"*dprintf*" cores).

The system logs contained no messages with obvious correlations.

Does anybody have an idea how to debug further what went wrong?
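
One way to get an overview of the restart loop might be to tally the
"exited with status" lines of the MasterLog per daemon and exit code. The
following is only a sketch: the regular expression is derived from the
lines in excerpt [2], it assumes each log entry sits on a single line (as in
the file on disk, not as wrapped in this mail), and the function name is
made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on MasterLog lines such as:
#   10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44
EXIT_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"The (?P<daemon>\w+) \(pid (?P<pid>\d+)\) exited with status (?P<code>\d+)"
)


def summarize_exits(lines):
    """Count daemon exits, keyed by (daemon name, exit code)."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.match(line)
        if match:
            counts[(match["daemon"], int(match["code"]))] += 1
    return counts


# Sample lines copied from excerpt [2]
sample = [
    "10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44",
    "10/17/16 02:28:34 The SCHEDD (pid 1712607) exited with status 4",
    "10/17/16 02:35:49 The SHARED_PORT (pid 1739390) exited with status 44",
    '10/17/16 02:35:48 Sending obituary for "/usr/sbin/condor_schedd"',
]

for (daemon, code), n in sorted(summarize_exits(sample).items()):
    print(f"{daemon} exited with status {code}: {n}x")
```

For a real log one would pass the open file object instead of the sample
list, e.g. summarize_exits(open(path)) with path pointing at the MasterLog.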

Cheers and thanks,
  Thomas


[1]
> condor/SchedLog
10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
log rotation, but no lock
10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
log rotation, but no lock
10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event
log rotation, but no lock

...

10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
rotation, but no lock
10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
rotation, but no lock
10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
rotation, but no lock
10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
rotation, but no lock
10/17/16 10:42:19 (pid:1968938) Starting add_shadow_birthdate(253638.0)
10/17/16 10:42:19 (pid:1968938) Started shadow for job 253638.0 on
<131.169.161.82:9620?addrs=131.169.161.82-9620&noUDP&sock=5184_1b9d_3>
for group_CMS.cms_multicore.cmsplt036, (shadow pid = 1969011)
10/17/16 10:42:19 (pid:1968938) TransferQueueManager stats: active
up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
10/17/16 10:42:19 (pid:1968938) TransferQueueManager upload 1m I/O load:
0 bytes/s  0.000 disk load  0.000 net load
10/17/16 10:42:19 (pid:1968938) TransferQueueManager download 1m I/O
load: 0 bytes/s  0.000 disk load  0.000 net load
10/17/16 10:42:19 (pid:1968938) Sent ad to central manager for
group_BELLE2.other.belleprd001@xxxxxxx
10/17/16 10:42:19 (pid:1968938) Sent ad to 1 collectors for
group_BELLE2.other.belleprd001@xxxxxxx


[2]
condor/MasterLog
10/15/16 12:38:39 DefaultReaper unexpectedly called on pid 1197280,
status 0.
10/16/16 12:38:39 Preen pid is 3854283
10/16/16 12:38:39 DefaultReaper unexpectedly called on pid 3854283,
status 0.
10/17/16 02:16:36 DefaultReaper unexpectedly called on pid 4168039,
status 11264.
10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44
10/17/16 02:16:37 Sending obituary for "/usr/sbin/condor_schedd"
10/17/16 02:16:37 restarting /usr/sbin/condor_schedd in 10 seconds
10/17/16 02:16:47 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 1712607
10/17/16 02:28:34 DefaultReaper unexpectedly called on pid 1712607,
status 1024.
10/17/16 02:28:34 The SCHEDD (pid 1712607) exited with status 4
10/17/16 02:28:34 Sending obituary for "/usr/sbin/condor_schedd"
10/17/16 02:28:34 restarting /usr/sbin/condor_schedd in 10 seconds
10/17/16 02:28:44 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 1728616
10/17/16 02:35:26 DefaultReaper unexpectedly called on pid 1728616,
status 11264.
10/17/16 02:35:26 The SCHEDD (pid 1728616) exited with status 44
...
10/17/16 02:35:48 SharedPortEndpoint: failed to open
/var/lock/condor/shared_port_ad: No such file or directory
10/17/16 02:35:48 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 1739389
10/17/16 02:35:48 Collector port not defined, will use default: 9618
10/17/16 02:35:48 Started DaemonCore process
"/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739390
10/17/16 02:35:48 DefaultReaper unexpectedly called on pid 1739389,
status 11264.
10/17/16 02:35:48 The SCHEDD (pid 1739389) exited with status 44
10/17/16 02:35:48 Sending obituary for "/usr/sbin/condor_schedd"
10/17/16 02:35:49 restarting /usr/sbin/condor_schedd in 13 seconds
10/17/16 02:35:49 DefaultReaper unexpectedly called on pid 1739390,
status 11264.
10/17/16 02:35:49 The SHARED_PORT (pid 1739390) exited with status 44
10/17/16 02:35:49 Sending obituary for
"/usr/libexec/condor/condor_shared_port"
10/17/16 02:35:49 restarting /usr/libexec/condor/condor_shared_port in
13 seconds
10/17/16 02:36:02 SharedPortEndpoint: failed to open
/var/lock/condor/shared_port_ad: No such file or directory
10/17/16 02:36:02 Started DaemonCore process "/usr/sbin/condor_schedd",
pid and pgroup = 1739409
10/17/16 02:36:02 Collector port not defined, will use default: 9618
10/17/16 02:36:02 Started DaemonCore process
"/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739410
10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739409,
status 11264.
10/17/16 02:36:02 The SCHEDD (pid 1739409) exited with status 44
10/17/16 02:36:02 Sending obituary for "/usr/sbin/condor_schedd"
10/17/16 02:36:02 restarting /usr/sbin/condor_schedd in 17 seconds
10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739410,
status 11264.
10/17/16 02:36:02 The SHARED_PORT (pid 1739410) exited with status 44
10/17/16 02:36:02 Sending obituary for
"/usr/libexec/condor/condor_shared_port"
10/17/16 02:36:02 restarting /usr/libexec/condor/condor_shared_port in
17 seconds
10/17/16 02:36:19 SharedPortEndpoint: failed to open
/var/lock/condor/shared_port_ad: No such file or directory
10/17/16 02:36:19 Started DaemonCore process "/usr/sbin/condor_schedd",
pid
10/17/16 10:21:02 ******************************************************
10/17/16 10:21:02 ** condor_master (CONDOR_MASTER) STARTING UP


[3]
http://pages.cs.wisc.edu/~adesmet/status.html
> shadow
44 	DPRINTF_ERROR 	There is a fatal error with dprintf()
> GRAM Error Codes
44	STAGING_STDIN	the job manager failed to stage the stdin file
