
[Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



Hi,

For quite a while, with the 6.7.x series, we have used a script to restart the parent jobs that a child job depends on: it traverses the hierarchy and restarts (using hold + release) every job that is required for the completion of the child job. (Sometimes software license issues, disk problems or data read/write errors make a task fail, although restarting it a short time later makes it work and lets the whole DAG continue.)
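
Roughly, the traversal works like the sketch below. This is a simplified illustration, not our actual script: the `parents` mapping, the function names and the job IDs are made up for the example, and it only builds the list of condor_hold / condor_release commands instead of running them. Waiting for the parents to finish is left out.

```python
def ancestors_in_order(job, parents):
    """Return every job that `job` depends on, each parent before its children."""
    ordered, seen = [], set()

    def visit(node):
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                visit(parent)           # grandparents are collected first
                ordered.append(parent)  # then the parent itself

    visit(job)
    return ordered


def restart_commands(job, parents):
    """Build hold + release commands for all ancestors of `job`, then for `job`."""
    commands = []
    for job_id in ancestors_in_order(job, parents) + [job]:
        commands.append(["condor_hold", job_id])
        commands.append(["condor_release", job_id])
    return commands


if __name__ == "__main__":
    # Hypothetical hierarchy: 102.0 needs 101.0, which in turn needs 100.0.
    example_parents = {"102.0": ["101.0"], "101.0": ["100.0"]}
    for command in restart_commands("102.0", example_parents):
        print(" ".join(command))
```

In the real script each command is executed, and the child is only released once its parents have completed.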

The script restarts the parent jobs, waits for them to complete, then modifies the child jobs' data using condor_qedit and restarts the child jobs (hold and release again). This worked fine with 6.7, but with 6.8 I get a DAG error message in the dagman.out file and *all* tasks in the DAGMan job go into the removed state, with RemoveReason = "via condor_rm (by user szabolcs)". The messages in dagman.out are:

8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job (34202.0.0) executing, total end count != 0 (1))
8/17 15:53:02 Aborting DAG...

This is a real problem for me. Could you tell me what happens under the hood? How can I avoid it and get my script working, or simply disable this "error" checking?

Thanks in advance!

Cheers,
Szabolcs