
[Condor-users] condor_rm on DAG fails to remove queued DAG nodes




I think I asked about this once before, but I don't recall (and can't
find in my email archives) any answer: sometimes when I run condor_rm
on a DAG job ID, the DAGMan job is removed but its queued node jobs are
not.

If I had to guess, I'd say it is a concurrency issue caused by a
recently submitted DAG that is actively spawning DAG node jobs.  The DAG
is removed, all nodes of that DAG that exist at that instant are
(possibly) removed, but any DAG nodes that are in the process of being
instantiated/created still come into existence.

Here are some log file excerpts.  You can see I held the job, then
realized that I actually had to remove it, so I ran condor_rm 3 minutes
later.

:::3c87.dag.dagman.log
...
012 (2510305.000.000) 07/06 17:06:35 Job was held.
    via condor_hold (by user ijstokes)
    Code 1 Subcode 0
...
009 (2510305.000.000) 07/06 17:09:32 Job was aborted by the user.
    via condor_rm (by user ijstokes)
...

:::3c87.dag.dagman.out
07/06/10 17:06:34 Got SIGTERM. Performing graceful shutdown.
07/06/10 17:06:35 Warning: ReadMultipleUserLogs destructor called, but still monitoring 1 log(s)!
07/06/10 17:06:35 **** condor_scheduniv_exec.2510305.0 (condor_DAGMAN) pid 1963 EXITING WITH STATUS 3

In fact, it looks like the hold itself may have triggered the problem:
the dagman.out output above shows DAGMan receiving SIGTERM and exiting
with status 3 at 17:06, 2 seconds after the condor_hold command was
issued.  If the DAG was already in an inconsistent state at that point,
it is surprising that nothing noticed this at 17:09, when the condor_rm
command was given, and reported it either in the log file or on the
console where the command was issued.

Looking in a different log file, the one that records the node jobs, I
can see that nothing happened between 17:06 and 17:14, at which point I
manually condor_rm'ed each of the DAG nodes that was still in the pool
(the DAG job itself was no longer present).

:::3c87.dag.nodes.log
...
000 (2511305.000.000) 07/06 17:06:34 Job submitted from host: <134.174.140.112:59673>
    DAG Node: 3c87-2a41a2
...
009 (2510306.000.000) 07/06 17:14:15 Job was aborted by the user.
    via condor_rm (by user ijstokes)
...
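
To find leftover node jobs without reading the log by eye, something
like the following might help.  It is only a minimal sketch: it assumes
the event header format shown in the excerpts above (a three-digit
event code, then the job ID in parentheses, then date and time) and
treats event 000 as "submitted" and events 005 and 009 as "terminated"
and "aborted".  The function name leftover_jobs is made up for this
illustration and is not part of Condor.

```python
import re

# Event header as it appears in the excerpts above, e.g.
# "000 (2511305.000.000) 07/06 17:06:34 Job submitted from host: ..."
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \d{2}/\d{2} \d{2}:\d{2}:\d{2} "
)

SUBMITTED = "000"
FINISHED = {"005", "009"}  # assumed: 005 = terminated, 009 = aborted


def leftover_jobs(log_text):
    """Return sorted (cluster, proc) pairs that have a submit event
    but no terminated or aborted event in the given log text."""
    submitted = set()
    finished = set()
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # continuation line or "..." separator
        code = match.group(1)
        job = (int(match.group(2)), int(match.group(3)))
        if code == SUBMITTED:
            submitted.add(job)
        elif code in FINISHED:
            finished.add(job)
    return sorted(submitted - finished)
```

Feeding it the contents of 3c87.dag.nodes.log would list the jobs that
still need a manual condor_rm.  If the node jobs carry the DAGManJobId
attribute, a constraint such as 'DAGManJobId == 2510305' passed to
condor_rm -constraint may achieve the same thing in one step, though I
have not verified that here.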

Any advice on how to avoid this recurring problem would be gratefully
received.  If it is believed that this is a Condor (DAGMan?) bug, I'm
happy to report it.

Regards,

Ian

-- 
Ian Stokes-Rees, PhD                       W: http://hkl.hms.harvard.edu
ijstokes@xxxxxxxxxxxxxxxxxxx               T: +1 617 432-5608 x75
NEBioGrid, Harvard Medical School          C: +1 617 331-5993