
Re: [HTCondor-users] Stuck dagman jobs after restart



On 15/12/2014 15:50, R. Kent Wenger wrote:
On Mon, 15 Dec 2014, Brian Bockelman wrote:

Hi Brian,

It might be worth looking at the UserLog of these jobs - it's possible they are switching quickly between R (running) and I (idle)?

Hmm, you could look, but I'd be really surprised if that were happening.
Could you send us your SchedLog? I think that's the most likely log to give us some useful information.

We actually have a test for DAGs getting correctly restarted across a Condor restart, so I'm a little surprised this is happening.

Something else I just thought of -- you might want to try doing condor_hold and then condor_release on one of the DAGs, to see if that gets it to run (just a wild guess).
I thought I had condor_rm'd these jobs, but right now I see they're still there.

condor_hold and condor_release didn't help.

It's possible that the working directory for these two jobs has been removed. Oh well, not to worry. I've condor_rm'd them (again?) and I'll let you know if they resurface :-)

Thanks,

Brian.