
[HTCondor-users] observed disconnect errors/shadow exceptions, JobEvictedEvents



One of the folks here tried to get this posted to the mailing list but ran into some problems doing so, so I'm forwarding it:

 

Hello htcondor-users,

   In our scaling studies within the LSST data management project, we have started to observe
several error modes as we use HTCondor and HTCondor DAGMan.

In our tests we are using DAGMan to run a collection of identical jobs, each of which takes about 6 minutes
to execute. We are processing on the TACC Lonestar cluster, using Glidein to
add Lonestar compute nodes to our working pool; our central manager runs on a machine at NCSA.
We are using CCB, with the collector on our central manager acting as the CCB server.
We are running 8064 test jobs at scales of 504, 1008, and 2016 cores / HTCondor processing slots,
using 42, 84, and 168 nodes (12 cores/node), respectively.
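
For reference, the glidein-side setup is essentially the stock CCB configuration, and the
DAG is just a flat list of independent node jobs. The fragments below are an illustrative
sketch rather than our exact files; all names are placeholders.

   # condor_config fragment on each Lonestar glidein (illustrative)
   COLLECTOR_HOST = <our central manager at NCSA>
   # route inbound connections via the collector, since the worker nodes
   # cannot accept direct connections from the shadows
   CCB_ADDRESS = $(COLLECTOR_HOST)

   # pipeline.dag (illustrative) -- one JOB line per test job, no dependencies
   JOB node0000 test_job.sub
   JOB node0001 test_job.sub
   # ... 8064 such lines in total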

1) One collection of error modes consists of messages written to the xxxxx.dag.nodes.log file, of the form

...
007 (92602.000.000) 04/07 19:49:35 Shadow exception!
       Error from slot11@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <206.76.195.45:33839>
....
007 (92299.000.000) 04/07 19:49:35 Shadow exception!
       Error from slot9@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: ProcD has failed
...
024 (95543.000.000) 04/07 20:19:43 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
   Can not reconnect to slot8@3632@xxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

We observed these errors at the 2016 core scale, but did not observe them at the 504 or 1008 core scales.
We have made some progress in mitigating these errors but would like to understand them further:
With a setting of

DAGMAN_USER_LOG_SCAN_INTERVAL=5

such errors occur at the 2016 core scale, but with a higher setting

DAGMAN_USER_LOG_SCAN_INTERVAL=40

we do not observe them.  Our hypothesis is that the errors result from a busy/backed-up
schedd, and we suspect the parameter change has lightened the load on the schedd, or perhaps on the
collector, and thereby diminished the load on the central manager.  Can Condor experts
shed light on the results with these various parameter settings and the errors observed?
Is there a more direct parameter change that might affect these errors?
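
In case it helps frame the question, these are the kinds of knobs we have been adjusting or
considering; the values below are illustrative only, not settings we are confident in.

   # DAGMan configuration (passed via condor_submit_dag -config), illustrative values
   DAGMAN_USER_LOG_SCAN_INTERVAL = 40
   # throttle how quickly DAGMan hands node jobs to the schedd
   DAGMAN_MAX_SUBMITS_PER_INTERVAL = 5
   DAGMAN_MAX_JOBS_IDLE = 1000

   # per-job submit description: give a disconnected job longer to reconnect
   # before hitting the "JobLeaseDuration (1200 seconds) expired" failure above
   job_lease_duration = 2400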


2) At any of these scales, but primarily at the high end (2016 cores), we observe
 JobEvictedEvents like

...
028 (95502.000.000) 04/07 20:24:32 Job ad information event triggered.
Proc = 0
EventTime = "2013-04-07T20:24:32"
TriggerEventTypeName = "ULOG_JOB_EVICTED"
TriggerEventTypeNumber = 4
RunRemoteUsage = "Usr 0 00:03:20, Sys 0 00:00:01"
RunLocalUsage = "Usr 0 00:00:00, Sys 0 00:00:00"
SentBytes = 0.0
MyType = "JobEvictedEvent"
Checkpointed = false
TerminatedAndRequeued = false
Cluster = 95502
MachineSlotName = "slot2@18857@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Subproc = 0
EventTypeNumber = 28
CurrentTime = time()
ReceivedBytes = 3242.000000
TerminatedNormally = false
...
004 (95403.000.000) 04/07 20:24:32 Job was evicted.

in xxxxx.dag.nodes.log. We have very little visibility into the cause of these
job evictions. Can the Condor team advise on how to debug the cause of these
evictions, and how they might be addressed?
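
For concreteness, our current idea is simply to raise daemon debug levels and cross-reference
the shadow and startd/starter logs around the eviction times, roughly as sketched below
(this is only our guess at a starting point; corrections welcome):

   # condor_config on the submit machine and on the glideins, respectively
   SHADOW_DEBUG = D_FULLDEBUG
   STARTD_DEBUG = D_FULLDEBUG
   STARTER_DEBUG = D_FULLDEBUG

   # after a run, pull the full classad history for an evicted job,
   # e.g. job 95502.0 from the event above
   condor_history -l 95502.0

Is that a reasonable approach, or is there a more direct way to get at the reason for an eviction?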