
Re: [Condor-users] no rescue file created



On Thu, May 17, 2012 at 08:12:08AM -0400, Gautam Saxena wrote:

Hi Gautam:

> Are there circumstances when a rescue file fails to get created? And if so,
> is there a way to force its recreation?
> 
> This is what happened: We were running a reasonably large DAG over 10 days
> or so.
> One of the main machines (the submitting machine actually) rebooted.
> (Not sure if this reboot is relevant.)

Yes, it is relevant.

> Eventually, the dag seemed to finish
> (in that there was nothing actually running on any machine), but the "dag"
> job showed that there was 1 job on hold plus there was the actual dag job
> itself.

Was the DAGMan job itself on hold?

> So, I did a condor_rm on the job that was on hold.  That operation
> both removed the "holded" job as well as the "dag" job itself.
> However, no rescue file was created.

This definitely sounds wrong.  Can you send me the .dagman.out file from
the run?
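
If it helps to locate the files, here is a minimal sketch in Python that
looks next to a DAG file for the .dagman.out log and any numbered rescue
DAGs.  It assumes the default naming (<dag file>.dagman.out and
<dag file>.rescueNNN with a three-digit number); the function name
find_dag_artifacts is made up for illustration, and setups that change the
default file names are not covered.

```python
import glob
import os
import re


def find_dag_artifacts(dag_path):
    """Return (dagman_out, rescue_files) for the given DAG file path.

    Assumption: default naming, i.e. <dag>.dagman.out for the DAGMan log
    and <dag>.rescueNNN (three digits) for rescue DAGs.
    """
    # The DAGMan log sits next to the DAG file; report None if it is missing.
    dagman_out = dag_path + ".dagman.out"
    if not os.path.exists(dagman_out):
        dagman_out = None

    # Keep only names that end in ".rescue" followed by exactly three digits.
    pattern = re.compile(
        re.escape(os.path.basename(dag_path)) + r"\.rescue\d{3}$"
    )
    rescue_files = sorted(
        path
        for path in glob.glob(glob.escape(dag_path) + ".rescue*")
        if pattern.search(os.path.basename(path))
    )
    return dagman_out, rescue_files
```

An empty rescue list together with an existing .dagman.out would match the
situation described above, where the DAG ran but no rescue file was written.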

> Is this normal? (Also, I've noticed that if I do a condor_rm on the
> dag job itself, it will not produce a rescue file either -- is that
> normal too?)

If the condor_dagman job was on hold when you did a condor_rm, this is a
known bug.

Some relevant history is in the following tickets:

https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2765
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=1490
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2434
https://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=2213

> 
> -Gautam