
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



Horvátth,

I'm not sure I understand what you're doing -- but I'm not surprised it stopped working, as it's akin to brain surgery on a live, moving patient. :)

If the issue is jobs which fail sometimes due to factors outside your control, but which succeed if re-submitted, then why not use DAGMan's RETRY feature?
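
As a minimal sketch of what that could look like in a DAG input file: the node names (Parent1, Child1), the submit-file names, and the retry count of 3 below are hypothetical placeholders, not taken from your setup.

```text
# Hypothetical two-node DAG: Child1 depends on Parent1.
JOB Parent1 parent1.submit
JOB Child1  child1.submit
PARENT Parent1 CHILD Child1

# If a node's job fails, DAGMan resubmits it, up to 3 times here,
# before treating the node as failed.
RETRY Parent1 3
RETRY Child1  3
```

With something along these lines, a job that fails because of a transient license or disk problem is resubmitted by DAGMan itself, so no outside script has to hold, release, or edit jobs in the queue.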

If that's not sufficient, please describe the problem in a little more detail. I'm optimistic there's a better solution than using condor_qedit. DAGMan's underlying implementation is obviously subject to change, so relying on a script which circumvents the supported API & semantics is going to be fragile.

-Peter


On Aug 17, 2006, at 9:02 AM, Horvátth Szabolcs wrote:
For quite a while, using the 6.7.x series, we have used a script to restart
child jobs that depend on parent jobs: it traverses the hierarchy
and restarts (using hold + release) the jobs that are required for the
completion of a child job. (Sometimes software license issues,
disk problems or data read/write errors can make a task fail for a
while, although restarting it after a short time makes
it work and lets the whole DAG continue.)

The script restarts the parent jobs, waits for them to complete, and
then modifies the child jobs' data using condor_qedit
and restarts the child jobs (hold and release again). This worked fine
with 6.7, but with 6.8 I get a DAG error message in the dagman.out file
and *all* tasks in the DAGMan job go into the removed state. The
reason given is: RemoveReason = "via condor_rm (by user szabolcs)"

8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job (34202.0.0) executing, total end count != 0 (1))
8/17 15:53:02 Aborting DAG...

Now this is not really good for me. Could you tell me what happens under
the hood? How can I avoid it and get my script working or simply
disable this "error" checking?

Thanks in advance!

Cheers,
Szabolcs



--
Peter Couvares                        University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
pfc@xxxxxxxxxxx                       1210 W. Dayton St. Rm #4241
(608) 265-8936                        Madison, WI 53706-1685