
Re: [Condor-users] Restarting completed dag jobs does not work anymore with 6.8.0



On Thu, 17 Aug 2006, Horvátth Szabolcs wrote:

> For quite a while - using the 6.7.x series - we used a script to restart
> parent dependent child jobs by traversing the hierarchy
> and restarting jobs (using hold + release) that were required for the
> completion of a child job. (Sometimes software license issues,
> disk problems or data read / write errors can make a task unusable for a
> while although restarting after a short amount of time makes
> it work and the whole dag continue.)
>
> The script restarts the parent jobs, waits for their completion and
> after completion it modifies the child jobs' data using qedit
> and restarts the child jobs (hold and release again). Now this worked OK
> with 6.7 but with 6.8 I get a DAG error message in the dagman.out file
> and *all* tasks in the DAGMan job go into the removed state. The
> reason being: RemoveReason = "via condor_rm (by user szabolcs)"
>
> 8/17 15:53:02 BAD EVENT: job (34202.0.0) executing, total end count != 0 (1)
> 8/17 15:53:02 ERROR: aborting DAG because of bad event (BAD EVENT: job
> (34202.0.0) executing, total end count != 0 (1))
> 8/17 15:53:02 Aborting DAG...
>
> Now this is not really good for me. Could you tell me what happens under
> the hood? How can I avoid it and get my script working or simply
> disable this "error" checking?

Okay, disabling the checking is easy:  just set the config macro
DAGMAN_ALLOW_EVENTS to 5.  You can do that in your config file, or
by setting _CONDOR_DAGMAN_ALLOW_EVENTS in your environment before
running condor_submit_dag.
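
As a rough sketch of the environment route, assuming a Bourne-style
shell (the DAG file name my.dag is only a placeholder, so that line is
left commented out):

```shell
# Config-file alternative (goes in your Condor config file, not the shell):
#   DAGMAN_ALLOW_EVENTS = 5

# Environment override: must be set before condor_submit_dag is run.
export _CONDOR_DAGMAN_ALLOW_EVENTS=5

# Then submit the DAG as usual; my.dag is a hypothetical file name.
# condor_submit_dag my.dag
```

The environment setting only applies to DAGs submitted from that shell,
which makes it handy for trying this out without touching the config file.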

I'd like to actually diagnose what's going on, though.  Can you send
the relevant dagman.out and user log files?

Kent Wenger
Condor Team