[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] schedd stopped working (died?) with SchedLog filled with 'WriteUserLog checking for event, log rotation, but no lock'



ah, sorry - I forgot to mention the ShadowLog, which also stopped at
2:35 [1] with function doWriteEvent() failing to write logs.
AFAIS disk space/inodes have been sufficient and memory consumption
stayed also normal


[1]
10/17/16 02:35:16 (262870.0) (1731674): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:16 (262870.0) (1731674): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:18 (262865.0) (1729344): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:18 (262865.0) (1729344): WARNING:
WriteUserLog::writeEvent global doWriteEvent() failed on global log! The
global event log will be missing an event.
10/17/16 02:35:18 (262865.0) (1729344): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:18 (262813.0) (1731484): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:18 (262813.0) (1731484): WARNING:
WriteUserLog::writeEvent global doWriteEvent() failed on global log! The
global event log will be missing an event.
10/17/16 02:35:18 (262813.0) (1731484): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:20 (255809.0) (1730692): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:20 (255809.0) (1730692): WARNING:
WriteUserLog::writeEvent global doWriteEvent() failed on global log! The
global event log will be missing an event.
10/17/16 02:35:20 (255809.0) (1730692): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:25 (255722.0) (1731636): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:25 (255722.0) (1731636): WARNING:
WriteUserLog::writeEvent global doWriteEvent() failed on global log! The
global event log will be missing an event.
10/17/16 02:35:25 (255722.0) (1731636): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:26 (255721.0) (1731158): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:26 (255721.0) (1731158): WARNING:
WriteUserLog::writeEvent global doWriteEvent() failed on global log! The
global event log will be missing an event.
10/17/16 02:35:26 (255721.0) (1731158): WriteUserLog checking for event
log rotation, but no lock
10/17/16 02:35:26 (255737.0) (1731084): Write10/17/16 10:21:41
******************************************************
10/17/16 10:21:41 ** condor_shadow (CONDOR_SHADOW) STARTING UP


On 2016-10-17 11:44, Thomas Hartmann wrote:
> Hi all,
> 
> tonight one of our schedulers (on a SL6.7 grid ARC CE) stopped working.
> While I cannot rule out the ARC itself, I suspect a problem with the
> Condor Schedd.
> 
> Job submission failed around 2:00 tonight after which the SchedLog [1]
> contained only of lines as
>  > WriteUserLog checking for event log rotation, but no lock
> which occured before as well but not solely.
> 
> A bit later at ~2:16 the MasterLog [2] started to log sched daemons to
> be reaped/to die(?) exiting with code 44. Restarts of the schedd went on
> for ~20m after which the MasterLog went silent until the service got
> restarted.
> 
> I found so far no information on the schedd error code 44 but only for
> the shadow [3].
> There was a mail exchange >10 year ago, when a shadow exiting with error
> 44 took down the schedd as well
> 
> https://www-auth.cs.wisc.edu/lists/htcondor-users/2004-November/msg00229.shtml
> but I don't know, if this could still be of relevance (I found no
> "*dprintf*" cores)?
> 
> The system logs contained no messages with obvious correlations.
> 
> Maybe somebody has an idea how to debug further what went wrong?
> 
> Cheers and thanks,
>   Thomas
> 
> 
> [1]
>> condor/SchedLog
>    1856 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
> log rotation, but no lock
>    1857 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
> log rotation, but no lock
>    1858 10/17/16 01:59:54 (pid:4168039) WriteUserLog checking for event
> log rotation, but no lock
>    1859 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event
> log rotation, but no lock
>    1860 10/17/16 02:01:02 (pid:4168039) WriteUserLog checking for event
> log rotation, but no lock
> 
> ...
> 
> 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
> rotation, but no lock
> 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
> rotation, but no lock
> 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
> rotation, but no lock
> 10/17/16 10:42:19 (pid:1968938) WriteUserLog checking for event log
> rotation, but no lock
> 10/17/16 10:42:19 (pid:1968938) Starting add_shadow_birthdate(253638.0)
> 10/17/16 10:42:19 (pid:1968938) Started shadow for job 253638.0 on
> <131.169.161.82:9620?addrs=131.169.161.82-9620&noUDP&sock=5184_1b9d_3>
> for group_CMS.cms_multicore.cmsplt036, (shadow pid = 1969011)
> 10/17/16 10:42:19 (pid:1968938) TransferQueueManager stats: active
> up=0/10 down=0/10; waiting up=0 down=0; wait time up=0s down=0s
> 10/17/16 10:42:19 (pid:1968938) TransferQueueManager upload 1m I/O load:
> 0 bytes/s  0.000 disk load  0.000 net load
> 10/17/16 10:42:19 (pid:1968938) TransferQueueManager download 1m I/O
> load: 0 bytes/s  0.000 disk load  0.000 net load
> 10/17/16 10:42:19 (pid:1968938) Sent ad to central manager for
> group_BELLE2.other.belleprd001@xxxxxxx
> 10/17/16 10:42:19 (pid:1968938) Sent ad to 1 collectors for
> group_BELLE2.other.belleprd001@xxxxxxx
> 
> 
> [2]
> condor/MasterLog
> 10/15/16 12:38:39 DefaultReaper unexpectedly called on pid 1197280,
> status 0.
> 10/16/16 12:38:39 Preen pid is 3854283
> 10/16/16 12:38:39 DefaultReaper unexpectedly called on pid 3854283,
> status 0.
> 10/17/16 02:16:36 DefaultReaper unexpectedly called on pid 4168039,
> status 11264.
> 10/17/16 02:16:36 The SCHEDD (pid 4168039) exited with status 44
> 10/17/16 02:16:37 Sending obituary for "/usr/sbin/condor_schedd"
> 10/17/16 02:16:37 restarting /usr/sbin/condor_schedd in 10 seconds
> 10/17/16 02:16:47 Started DaemonCore process "/usr/sbin/condor_schedd",
> pid and pgroup = 1712607
> 10/17/16 02:28:34 DefaultReaper unexpectedly called on pid 1712607,
> status 1024.
> 10/17/16 02:28:34 The SCHEDD (pid 1712607) exited with status 4
> 10/17/16 02:28:34 Sending obituary for "/usr/sbin/condor_schedd"
> 10/17/16 02:28:34 restarting /usr/sbin/condor_schedd in 10 seconds
> 10/17/16 02:28:44 Started DaemonCore process "/usr/sbin/condor_schedd",
> pid and pgroup = 1728616
> 10/17/16 02:35:26 DefaultReaper unexpectedly called on pid 1728616,
> status 11264.
> 10/17/16 02:35:26 The SCHEDD (pid 1728616) exited with status 44
> ...
> 10/17/16 02:35:48 SharedPortEndpoint: failed to open
> /var/lock/condor/shared_port_ad: No such file or directory
> 10/17/16 02:35:48 Started DaemonCore process "/usr/sbin/condor_schedd",
> pid and pgroup = 1739389
> 10/17/16 02:35:48 Collector port not defined, will use default: 9618
> 10/17/16 02:35:48 Started DaemonCore process
> "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739390
> 10/17/16 02:35:48 DefaultReaper unexpectedly called on pid 1739389,
> status 11264.
> 10/17/16 02:35:48 The SCHEDD (pid 1739389) exited with status 44
> 10/17/16 02:35:48 Sending obituary for "/usr/sbin/condor_schedd"
> 10/17/16 02:35:49 restarting /usr/sbin/condor_schedd in 13 seconds
> 10/17/16 02:35:49 DefaultReaper unexpectedly called on pid 1739390,
> status 11264.
> 10/17/16 02:35:49 The SHARED_PORT (pid 1739390) exited with status 44
> 10/17/16 02:35:49 Sending obituary for
> "/usr/libexec/condor/condor_shared_port"
> 10/17/16 02:35:49 restarting /usr/libexec/condor/condor_shared_port in
> 13 seconds
> 10/17/16 02:36:02 SharedPortEndpoint: failed to open
> /var/lock/condor/shared_port_ad: No such file or directory
> 10/17/16 02:36:02 Started DaemonCore process "/usr/sbin/condor_schedd",
> pid and pgroup = 1739409
> 10/17/16 02:36:02 Collector port not defined, will use default: 9618
> 10/17/16 02:36:02 Started DaemonCore process
> "/usr/libexec/condor/condor_shared_port", pid and pgroup = 1739410
> 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739409,
> status 11264.
> 10/17/16 02:36:02 The SCHEDD (pid 1739409) exited with status 44
> 10/17/16 02:36:02 Sending obituary for "/usr/sbin/condor_schedd"
> 10/17/16 02:36:02 restarting /usr/sbin/condor_schedd in 17 seconds
> 10/17/16 02:36:02 DefaultReaper unexpectedly called on pid 1739410,
> status 11264.
> 10/17/16 02:36:02 The SHARED_PORT (pid 1739410) exited with status 44
> 10/17/16 02:36:02 Sending obituary for
> "/usr/libexec/condor/condor_shared_port"
> 10/17/16 02:36:02 restarting /usr/libexec/condor/condor_shared_port in
> 17 seconds
> 10/17/16 02:36:19 SharedPortEndpoint: failed to open
> /var/lock/condor/shared_port_ad: No such file or directory
> 10/17/16 02:36:19 Started DaemonCore process "/usr/sbin/condor_schedd",
> pid 10/17/16 10:21:02 ******************************************************
> 10/17/16 10:21:02 ** condor_master (CONDOR_MASTER) STARTING UP
> 
> 
> [3]
> http://pages.cs.wisc.edu/~adesmet/status.html
>> shadow
> 44 	DPRINTF_ERROR 	There is a fatal error with dprintf()
>> GRAM Error Codes
> 44	STAGING_STDIN	the job manager failed to stage the stdin file
> 
> 
> 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> 
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/
> 

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature