
[HTCondor-users] observed disconnect errors/shadow exceptions, JobEvictedEvents



One of the folks here tried to get this posted to the mailing list but ran into some problems doing so, so I'm forwarding it:

 

Hello htcondor-users,

   In our scaling studies within the LSST data management project, we have started to observe
several error modes as we use HTCondor and HTCondor DAGMan.

In our tests we are using DAGMan to run a collection of identical jobs, each of which takes about 6 minutes
to execute. We are processing on the TACC Lonestar cluster, using Glidein to
add Lonestar compute nodes to our working pool; our central manager runs on a machine at NCSA.
We are using CCB, with the collector on our central manager acting as the CCB server.
We are running 8064 test jobs at scales of 504, 1008, and 2016 cores / HTCondor processing slots,
using 42, 84, and 168 nodes (12 cores/node), respectively.
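
For reference, the glidein-side setup is essentially the stock CCB configuration, and the
DAG is just a flat list of independent node jobs. The fragments below are an illustrative
sketch rather than our exact files; all names are placeholders.

   # condor_config fragment on each Lonestar glidein (illustrative)
   COLLECTOR_HOST = <our central manager at NCSA>
   # route inbound connections via the collector, since the worker nodes
   # cannot accept direct connections from the shadows
   CCB_ADDRESS = $(COLLECTOR_HOST)

   # pipeline.dag (illustrative) -- one JOB line per test job, no dependencies
   JOB node0000 test_job.sub
   JOB node0001 test_job.sub
   # ... 8064 such lines in total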

1) One collection of error modes consists of messages written to the xxxxx.dag.nodes.log file, of the form

...
007 (92602.000.000) 04/07 19:49:35 Shadow exception!
       Error from slot11@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <206.76.195.45:33839>
....
007 (92299.000.000) 04/07 19:49:35 Shadow exception!
       Error from slot9@17868@xxxxxxxxxxxxxxxxxxxxxxxxxxxx: ProcD has failed
...
024 (95543.000.000) 04/07 20:19:43 Job reconnection failed
   Job disconnected too long: JobLeaseDuration (1200 seconds) expired
   Can not reconnect to slot8@3632@xxxxxxxxxxxxxxxxxxxxxxxxxxxx, rescheduling job

We observed these errors at the 2016 core scale, but did not observe them at the 504 or 1008 core scales.
We have made some progress in mitigating these errors but would like to understand them further:
With a setting of

DAGMAN_USER_LOG_SCAN_INTERVAL=5

such errors occur at the 2016 core scale, but with a higher setting

DAGMAN_USER_LOG_SCAN_INTERVAL=40

we do not observe them.  Our hypothesis is that the errors result from a busy/backed-up
schedd, and we suspect the parameter change has lightened the load on the schedd, or perhaps on the
collector, and thereby diminished the load on the central manager.  Can Condor experts
shed light on the results with these various parameter settings and the errors observed?
Is there a more direct parameter change that might affect these errors?
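
In case it helps frame the question, these are the kinds of knobs we have been adjusting or
considering; the values below are illustrative only, not settings we are confident in.

   # DAGMan configuration (passed via condor_submit_dag -config), illustrative values
   DAGMAN_USER_LOG_SCAN_INTERVAL = 40
   # throttle how quickly DAGMan hands node jobs to the schedd
   DAGMAN_MAX_SUBMITS_PER_INTERVAL = 5
   DAGMAN_MAX_JOBS_IDLE = 1000

   # per-job submit description: give a disconnected job longer to reconnect
   # before hitting the "JobLeaseDuration (1200 seconds) expired" failure above
   job_lease_duration = 2400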


2) At any of these scales, but primarily at the high end (2016 cores), we observe
 JobEvictedEvents like

...
028 (95502.000.000) 04/07 20:24:32 Job ad information event triggered.
Proc = 0
EventTime = "2013-04-07T20:24:32"
TriggerEventTypeName = "ULOG_JOB_EVICTED"
TriggerEventTypeNumber = 4
RunRemoteUsage = "Usr 0 00:03:20, Sys 0 00:00:01"
RunLocalUsage = "Usr 0 00:00:00, Sys 0 00:00:00"
SentBytes = 0.0
MyType = "JobEvictedEvent"
Checkpointed = false
TerminatedAndRequeued = false
Cluster = 95502
MachineSlotName = "slot2@18857@xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
Subproc = 0
EventTypeNumber = 28
CurrentTime = time()
ReceivedBytes = 3242.000000
TerminatedNormally = false
...
004 (95403.000.000) 04/07 20:24:32 Job was evicted.

in xxxxx.dag.nodes.log. We have very little visibility into the cause of these
job evictions. Can the Condor team advise on how to debug the cause of these
evictions, and how they might be addressed?
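
For concreteness, our current idea is simply to raise daemon debug levels and cross-reference
the shadow and startd/starter logs around the eviction times, roughly as sketched below
(this is only our guess at a starting point; corrections welcome):

   # condor_config on the submit machine and on the glideins, respectively
   SHADOW_DEBUG = D_FULLDEBUG
   STARTD_DEBUG = D_FULLDEBUG
   STARTER_DEBUG = D_FULLDEBUG

   # after a run, pull the full classad history for an evicted job,
   # e.g. job 95502.0 from the event above
   condor_history -l 95502.0

Is that a reasonable approach, or is there a more direct way to get at the reason for an eviction?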