
Re: [Condor-users] Good way to start failed jobs from large cluster?



Carsten Aulbert wrote:
> Hi,
> 
> as an admin I've been out of Condor submit file magic for some time and
> would like to know if there is an easy way to accomplish this:
> 
> Imagine a user using vanilla universe and large clusters using a submit
> file like this:
> 
> universe                = vanilla
> Arguments               = -j $(Process)
> log                     = /home/user/log/$(Process).log
> error                   = /home/user/log/$(Process).err
> executable              = /home/user/bin/IWillFindIt.exe
> notification            = Never
> queue 45345
> 
> Now imagine this ran for a while, but 134 jobs with more or less random
> IDs failed, e.g.
> 
> 5.6, 5.1345, 5.8733, ...
> 
> What is a good way to restart only these? So far I've been helping
> myself with this:
> 
> for i in `magic_which_will_output_me_process_ids_only`; do
> cat <<EOF | condor_submit
> universe                = vanilla
> Arguments               = -j $i
> log                     = /home/user/log/$i.log
> error                   = /home/user/log/$i.err
> executable              = /home/user/bin/IWillFindIt.exe
> notification            = Never
> queue
> EOF
> done
> 
> Is there a better way to do this?
> 
> Please note: I need to get the log, error, and argument lines set correctly.
> 
> Cheers
> 
> Carsten

Two quick thoughts:

1) You can probably do condor_history -format "%d\n" ProcId -constraint
"ClusterId == ?? && [magic to identify the proc as a failure]" to get
your proc IDs.

2) Are your failed jobs being removed from the queue? Why not use an
on_exit_hold policy to put them on hold when [magic to identify the proc
as a failure] fires. That would let you avoid the resubmission entirely;
you'd just have to release the jobs for them to run again.
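
For instance (untested sketch, same guess at your failure test as above),
the original submit file would only need one extra line:

universe                = vanilla
Arguments               = -j $(Process)
log                     = /home/user/log/$(Process).log
error                   = /home/user/log/$(Process).err
executable              = /home/user/bin/IWillFindIt.exe
notification            = Never
# Hold (rather than leave the queue) on signal or nonzero exit code.
on_exit_hold            = (ExitBySignal == True) || (ExitCode != 0)
queue 45345

Failed jobs then sit held in the queue (JobStatus == 5) with their
original Arguments, log, and error settings intact, and once the
underlying problem is fixed a single

condor_release -constraint 'ClusterId == 5 && JobStatus == 5'

puts them all back to idle to run again.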

Best,


matt