[HTCondor-users] Dagman job resubmission after condor_rm

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

This morning, we saw a dagman process in our cluster that was stuck in an "X" state (via condor_q -dag) after it had been removed (condor_rm <dag cluster id>). A few of the nodes from the DAG were still running or idle. We weren't sure why these nodes were still executing after the dagman was removed, but we think it has something to do with the FINAL node and the way the dagman parses the DAG. Since these jobs take a very long time to complete, slots end up being held by jobs that aren't supposed to be running. After we cleaned up these extra processes, we were able to reproduce the problem with a simple job.

We created a simple bat script that sleeps for 30min (arbitrary time), which each node in the DAG uses as the executable in its submit file. Our test DAG had 20 nodes total (also arbitrary), 10 of which were children of the first 10. The most important part is the FINAL statement at the end of the DAG. The FINAL node just sleeps for 15 seconds (also arbitrary, but short enough that we don't have to wait long to watch it finish).

When we submit the DAG, we see a dagman enter a running state, then begin submitting the nodes of the DAG. If we kill the dagman soon enough (we weren't able to pin down the exact window, but it is likely less than 5-10sec after the first nodes are submitted), the dagman exits and doesn't run the workflow, and we don't see any processes remaining from the DAG. If we wait a bit longer (maybe >10sec), we can do a condor_rm to put the dagman into an "X" state, and any running and/or idle nodes are removed. Up to this point, everything behaves as we expect. We then expect the FINAL node to be queued, execute, and return, ending the rest of the workflow.

This is where the problem occurs. What actually happens is that the FINAL node is queued, and any running/idle jobs that were in the queue when the dagman was removed are *also* queued again. I'm not certain if this is expected behavior, but we didn't anticipate it when removing a dagman; we assumed every node would be removed. We also noticed that these processes continue running after the FINAL node exits.
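For anyone trying to reproduce this, a test DAG shaped like the one described above could be generated with a short Python sketch along these lines. The node names, the submit file names (sleep.sub, final.sub), and the one-to-one pairing of parents to children are assumptions for illustration; the original DAG file was not posted.

```python
def build_dag(num_parents=10, sleep_submit="sleep.sub", final_submit="final.sub"):
    """Return the text of a DAG file with num_parents parent nodes,
    one child per parent, and a FINAL node at the end.

    All names here are hypothetical placeholders, not the original files.
    """
    lines = []
    # Declare every node; parents and children share the long-sleep submit file.
    for i in range(num_parents):
        lines.append(f"JOB parent{i} {sleep_submit}")
        lines.append(f"JOB child{i} {sleep_submit}")
    # Assumed dependency layout: each child depends on exactly one parent.
    for i in range(num_parents):
        lines.append(f"PARENT parent{i} CHILD child{i}")
    # The FINAL node, which the report suspects is involved in the problem.
    lines.append(f"FINAL cleanup {final_submit}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the DAG text; redirect to a file such as test.dag to use it.
    print(build_dag(), end="")
```

The output can be saved to a file (for example test.dag) and submitted with condor_submit_dag. Removing the resulting dagman job with condor_rm more than about 10 seconds after the first nodes are submitted should match the timing described above.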

We think this has something to do with the FINAL node, since we can't reproduce the issue without it. Also, since we don't see it if we kill the dagman early enough in the workflow, we think that the FINAL node might not be evaluated right away; maybe the DAG is still being parsed when the first set of jobs is queued?

Could this be caused by a config setting we are using? Has anyone else seen this behavior (can you reproduce it)?

Regards,

Eric Gross

System Engineer

Susquehanna International Group
