
[HTCondor-users] JobLeaseDuration expires immediately (LastJobLeaseRenewal very old)



Dear HTCondor experts,

We had an (announced) network outage of about 10 minutes, and I assumed that with JobLeaseDuration=2400 this would not be an issue for running jobs.
However, I found the following in the logs of all submit nodes right after the first connection error appeared:

Mar 20 02:06:16 condor_shadow[710892]: condor_read() failed: recv(fd=8) returned -1, errno = 110 Connection timed out, reading 21 bytes from startd slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxxx
Mar 20 02:06:16 condor_shadow[710892]: condor_read(): UNEXPECTED read timeout after 0s during non-blocking read from startd slot1_1@xxxxxxxxxxxxxxxxxxxxxxxxxx (desired timeout=300s)
Mar 20 02:06:16 condor_shadow[710892]: IO: Failed to read packet header
Mar 20 02:06:16 condor_shadow[710892]: Can no longer talk to condor_starter <natgw-of-starter-privnet.example.com:43553>
Mar 20 02:06:16 condor_shadow[710892]: Unable to log ULOG_JOB_DISCONNECTED event
Mar 20 02:06:16 condor_shadow[710892]: Trying to reconnect to disconnected job
Mar 20 02:06:16 condor_shadow[710892]: LastJobLeaseRenewal: 1553017431 Tue Mar 19 18:43:51 2019
Mar 20 02:06:16 condor_shadow[710892]: JobLeaseDuration: 2400 seconds
Mar 20 02:06:16 condor_shadow[710892]: JobLeaseDuration remaining: EXPIRED!
Mar 20 02:06:16 condor_shadow[710892]: Reconnect FAILED: Job disconnected too long: JobLeaseDuration (2400 seconds) expired
Mar 20 02:06:16 condor_shadow[710892]: Unable to log ULOG_JOB_RECONNECT_FAILED event
Mar 20 02:06:16 condor_shadow[710892]: Exiting with JOB_SHOULD_REQUEUE
Mar 20 02:06:16 condor_shadow[710892]: **** condor_shadow (condor_SHADOW) pid 710892 EXITING WITH STATUS 107

All shadows exited and all jobs are lost :-(. 

Can somebody explain to me what happened here?
My interpretation is as follows:
- Network connection timeout happens, read timeout is 300 s. 
- Reconnection attempt is scheduled... 
- LastJobLeaseRenewal is checked, and it is more than 7 hours old (why was the lease never renewed? When should this happen?)
- LastJobLeaseRenewal is therefore more than JobLeaseDuration seconds ago, and the reconnect is refused.
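
To make the arithmetic behind the last two points explicit, here is a minimal Python sketch that recomputes the lease age from the values in the log above. It is only my reading of the check, not HTCondor's actual code: the function name lease_status is made up, the year 2019 for the syslog timestamp is an assumption taken from the LastJobLeaseRenewal line, and both timestamps are assumed to be in the same local time zone.

```python
from datetime import datetime

# Values copied from the shadow log above.
LAST_RENEWAL_LOCAL = "Tue Mar 19 18:43:51 2019"  # LastJobLeaseRenewal as printed
DISCONNECT_LOCAL = "Mar 20 02:06:16"             # syslog time of the reconnect attempt
JOB_LEASE_DURATION = 2400                        # seconds

# Assumption: syslog omits the year, so it is taken from the renewal line.
ASSUMED_YEAR = 2019


def lease_status(last_renewal, disconnect, duration, year):
    """Return (seconds since last renewal, seconds of lease remaining).

    Hypothetical helper: both timestamps are treated as naive local times
    in the same zone. Parsing %a/%b requires an English (C) locale.
    """
    renewed = datetime.strptime(last_renewal, "%a %b %d %H:%M:%S %Y")
    seen = datetime.strptime(f"{year} {disconnect}", "%Y %b %d %H:%M:%S")
    elapsed = int((seen - renewed).total_seconds())
    return elapsed, duration - elapsed


elapsed, remaining = lease_status(
    LAST_RENEWAL_LOCAL, DISCONNECT_LOCAL, JOB_LEASE_DURATION, ASSUMED_YEAR
)
print(f"lease age: {elapsed} s, remaining: {remaining} s, expired: {remaining < 0}")
```

With these inputs the lease age comes out at roughly 7 hours 22 minutes, far beyond the 2400 seconds allowed, which matches the "EXPIRED!" line in the log.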

What I do not understand is why the job lease was never renewed. In this situation, any short disconnection of ~5 minutes immediately terminates the jobs,
and JobLeaseDuration has no effect at all.

Did we stumble upon a bug, or is there something broken in our setup?

Our setup has:
- all startds in a private network, using condor_shared_port everywhere, with CCBs running on the central managers, which are in the public network
- all submit nodes in the public network

Also, all startds are running HTCondor 8.6.13, while everything else (collector, negotiator, schedds) is on 8.8.
For the startds we have to stay on 8.6.13 for now due to the Singularity regressions (nsenter, fixed in 8.8.1; interactive jobs broken, no fix yet it seems).

Cheers,
	Oliver
