[Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
- Date: Thu, 17 Aug 2006 16:02:49 +0200
- From: Horvátth Szabolcs <szabolcs@xxxxxxxxxxxxx>
- Subject: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0
For quite a while, with the 6.7.x series, we have used a script to restart
child jobs that depend on parent jobs. It traverses the hierarchy and
restarts (using hold + release) the jobs that are required for the
completion of a child job. (Software license issues, disk problems or
data read/write errors can sometimes make a task temporarily unusable,
even though restarting it a short time later makes it work and lets the
whole DAG continue.)
The script restarts the parent jobs, waits for them to complete, then
modifies the child jobs' data using qedit and restarts the child jobs
(hold and release again). This worked fine with 6.7, but with 6.8 I get
a DAG error message in the dagman.out file and *all* tasks in the DAGMan
job go into the removed state, with the reason given as:
RemoveReason = "via condor_rm (by user szabolcs)"
The error in dagman.out is:
8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job (34202.0.0) executing, total end count != 0 (1))
8/17 15:53:02 Aborting DAG...
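For illustration only, the restart procedure described above might look roughly like the following Python sketch. It is not the actual script: the `parents` mapping, the function names, the job ids other than 34202 and the `MyRetryCount` attribute are all hypothetical. The sketch only builds the condor_hold, condor_release and condor_qedit command lines; it does not run them, and the step of waiting for the parent jobs to complete is left out.

```python
from typing import Dict, List

# Hypothetical dependency map: child job id -> ids of the parents it needs.
Parents = Dict[str, List[str]]


def restart_order(job: str, parents: Parents) -> List[str]:
    """Return every ancestor of `job`, parents first, followed by `job`."""
    order: List[str] = []
    seen = set()

    def visit(current: str) -> None:
        if current in seen:
            return
        seen.add(current)
        for parent in parents.get(current, []):
            visit(parent)
        order.append(current)

    visit(job)
    return order


def restart_commands(job_id: str) -> List[List[str]]:
    """Hold + release: the restart mechanism described in the post."""
    return [["condor_hold", job_id], ["condor_release", job_id]]


def child_restart_commands(job_id: str, attr: str, value: str) -> List[List[str]]:
    """Edit one job attribute with condor_qedit, then hold + release."""
    return [["condor_qedit", job_id, attr, value]] + restart_commands(job_id)


if __name__ == "__main__":
    # Assumed example hierarchy; only 34202 appears in the log above.
    parents = {"34202.0": ["34200.0", "34201.0"], "34201.0": ["34200.0"]}
    order = restart_order("34202.0", parents)
    for parent_id in order[:-1]:
        for cmd in restart_commands(parent_id):
            print(" ".join(cmd))
    # A real script would wait here until the parent jobs have completed.
    for cmd in child_restart_commands(order[-1], "MyRetryCount", "1"):
        print(" ".join(cmd))
```

A real script would pass each command list to something like `subprocess.run` and poll the queue between the parent and child steps. Keeping command construction separate from execution makes the traversal order easy to check without touching the queue.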
This is a real problem for me. Could you tell me what happens under
the hood? How can I avoid it and get my script working again, or simply
disable this "error" checking?
Thanks in advance!