
Re: [HTCondor-users] Vanilla universe jobs getting evicted then immediately aborted (8.6.11)



Hi Todd,

So from what you are saying it should work just fine.

The problem is that, judging by the time stamps, the job is aborted almost immediately, so there is no chance to look at it with condor_q.
This is a snippet from the ShadowLog on the submitting machine for another job, 52.44. Not sure if that provides any clues.

08/15/18 09:16:39 (52.44) (23772): Job 52.44 is being evicted from slot5@xxx
08/15/18 09:16:40 (52.44) (23772): ERROR: SharedPortEndpoint: Named pipe does not exist.
08/15/18 09:16:40 (52.44) (23772): SharedPortEndpoint: Destructor: Problem in thread shutdown notification: 0
08/15/18 09:16:40 (52.44) (23772): **** condor_shadow (condor_SHADOW) pid 23772 EXITING WITH STATUS 102

Andrew

On Wed, Aug 15, 2018 at 12:16 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 8/15/2018 1:15 PM, Andrew Cunningham wrote:
Hi,
I am struggling with a problem: when my vanilla jobs get evicted from a machine, they seem to get aborted rather than rescheduled.

Tue 14:27 job 51.101 evicted   case1.xml from machine1...
Rescheduling ...
Tue 14:27 cluster 51 status: 10/3 active/running, 268/268 aborted/failed, 88/366 finished/submitted (97.3% done)
...
Tue 14:27 job 51.101 aborted   case1.xml on machine1, killed by ? ...

I am using the Condor.pm Perl module. When the eviction callback is called, the Perl code then calls condor_reschedule.


It is not unreasonable to call condor_reschedule upon eviction of a job, although it is not really needed...

The documentation seems a little unclear on what happens to evicted then rescheduled vanilla jobs. Obviously they would have to transfer files again to the new renegotiated machine.


So from the above, it looks like job 51.101 got evicted from machine1. After being evicted, it should go back to state "Idle" if you look at the job with condor_q, at which point it will get rescheduled to run again (perhaps on a different machine). However, it appears that the second time it ran, it exited/abort...?

It would help to have more information than whatever the Perl module wants to tell you. What does condor_history 51.101 have to say? Or "condor_history -l 51.101"? Or you could also look in the event log at Condor_WINDOWS_X86_64.out (according to your submit file below) for entries for job 51.101. Hopefully that will give a better idea as to what is really happening the second time the job runs. Another place to look would be to grep c:\condor\log\ShadowLog* for "51.101" to see what errors appear there.
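
Since grep is not always available on a Windows submit machine, the ShadowLog search suggested above could be approximated with a few lines of Python. This is only a rough sketch: the line layout (timestamp, job ID in parentheses, pid in parentheses, message) is assumed from the ShadowLog snippet quoted in this thread, and the function names lines_for_job and scan_logs are made up for illustration. Lines that do not carry a job ID in that position are skipped.

```python
import glob
import re
from typing import Iterable, List

# Assumed line layout, taken from the ShadowLog snippet in this thread:
#   MM/DD/YY HH:MM:SS (cluster.proc) (pid): message
LINE_RE = re.compile(r"^(\S+ \S+) \((\d+\.\d+)\) \((\d+)\): (.*)$")


def lines_for_job(lines: Iterable[str], job_id: str) -> List[str]:
    """Return 'timestamp message' strings for lines that belong to job_id."""
    matches = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        # Compare the whole job ID so that "52.4" does not match "52.44".
        if m and m.group(2) == job_id:
            matches.append(f"{m.group(1)} {m.group(4)}")
    return matches


def scan_logs(pattern: str, job_id: str) -> List[str]:
    """Apply lines_for_job to every file matching a glob pattern."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps going if a log contains odd bytes.
        with open(path, errors="replace") as fh:
            hits.extend(lines_for_job(fh, job_id))
    return hits
```

For the case in this thread, something like scan_logs(r"c:\condor\log\ShadowLog*", "51.101") would collect the shadow's messages for that job from the current and rotated logs, assuming the logs live at the path mentioned above.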

Hope the above gives some clues,
regards,
Todd