
[HTCondor-users] Vanilla universe jobs getting evicted then immediately aborted (8.6.11)



Hi,
I am struggling with a problem: when my vanilla universe jobs get evicted from a machine, they seem to get aborted rather than rescheduled.

Tue 14:27 job 51.101 evicted   case1.xml from machine1...
Rescheduling ...
Tue 14:27 cluster 51 status: 10/3 active/running, 268/268 aborted/failed, 88/366 finished/submitted (97.3% done)
...
Tue 14:27 job 51.101 aborted   case1.xml on machine1, killed by ? ...

I am using the Condor.pm Perl module. When the eviction callback fires, my Perl code calls condor_reschedule.

The documentation seems a little unclear on what happens to vanilla jobs that are evicted and then rescheduled. Presumably they would have to transfer their input files again to the newly negotiated machine.
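To separate what HTCondor itself recorded from what my Perl wrapper prints, one option is to scan the job event log named by `log` in the submit file. The following is only a rough sketch, assuming the classic text log format where each event starts with a three-digit code (001 executing, 004 evicted, 009 aborted). The function names and the sample log lines are made up for illustration and are not taken from my actual log.

```python
import re
from collections import defaultdict

# Event codes of interest in the text-format job event log.
EVENTS = {"001": "executing", "004": "evicted", "009": "aborted"}

# Event header, e.g. "004 (051.101.000) 10/09 14:27:12 Job was evicted."
HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\)")


def job_histories(lines):
    """Map 'cluster.proc' to the ordered list of events seen for that job."""
    history = defaultdict(list)
    for line in lines:
        match = HEADER.match(line)
        if match and match.group(1) in EVENTS:
            job_id = f"{int(match.group(2))}.{int(match.group(3))}"
            history[job_id].append(EVENTS[match.group(1)])
    return dict(history)


def aborted_after_eviction(history):
    """Return jobs whose eviction is directly followed by an abort."""
    return sorted(
        job
        for job, events in history.items()
        if any(a == "evicted" and b == "aborted"
               for a, b in zip(events, events[1:]))
    )


# Hypothetical sample log, invented for this example.
SAMPLE_LOG = """\
001 (051.101.000) 10/09 14:20:03 Job executing on host: <10.0.0.1:9618>
...
004 (051.101.000) 10/09 14:27:12 Job was evicted.
...
009 (051.101.000) 10/09 14:27:13 Job was aborted by the user.
...
001 (051.102.000) 10/09 14:20:05 Job executing on host: <10.0.0.2:9618>
...
004 (051.102.000) 10/09 14:28:40 Job was evicted.
...
001 (051.102.000) 10/09 14:31:02 Job executing on host: <10.0.0.3:9618>
...
"""

if __name__ == "__main__":
    histories = job_histories(SAMPLE_LOG.splitlines())
    print(aborted_after_eviction(histories))
```

If a job shows 004 followed by 009 in the real log, the abort was recorded by HTCondor rather than invented by the wrapper; a job that shows 004 followed by another 001 was rescheduled normally.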

My slightly redacted submit file (generated by a script) is below.



Andrew


universe                = vanilla
executable              = RunSolver_Condor.bat
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
log                     = Condor_WINDOWS_X86_64.out
####################################################################
# Job #1
initial_dir = D:\QA\Case1
requirements = (OpSys=="WINDOWS") && (Arch=="X86_64")
arguments    = -arg1 -arg2
output       = case1_Condor_mswin64_2018.008.out
error        = case_2_Condor_mswin64_2018.008.err
transfer_input_files = case1.xml
queue
####################################################################
# Job #2
initial_dir = D:\Case2
requirements = (OpSys=="WINDOWS") && (Arch=="X86_64")
arguments    = -arg1 -arg2
output       = case_2_Condor_mswin64_2018.008.out
error        = case_2_Condor_mswin64_2018.008.err
transfer_input_files = case2.xml
queue

<etc, etc, etc> for many jobs