
[Condor-users] Good way to start failed jobs from large cluster?



Hi,

As an admin, I have been out of touch with Condor submit file magic for
some time, and I would like to know if there is an easy way to accomplish
this:

Imagine a user running large clusters in the vanilla universe with a
submit file like this:

universe                = vanilla
Arguments               = -j $(Process)
log                     = /home/user/log/$(Process).log
error                   = /home/user/log/$(Process).err
executable              = /home/user/bin/IWillFindIt.exe
notification            = Never
queue 45345

Now imagine this ran for a while, but 134 jobs with more or less random
process numbers failed, e.g.

5.6, 5.1345, 5.8733, ...

What is a good way to restart only these? So far I have made do with this:

for i in `magic_which_will_output_me_process_ids_only`; do
cat <<EOF | condor_submit
universe                = vanilla
Arguments               = -j $i
log                     = /home/user/log/$i.log
error                   = /home/user/log/$i.err
executable              = /home/user/bin/IWillFindIt.exe
notification            = Never
queue
EOF
done
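
One variation I could imagine, only as a sketch: instead of one
condor_submit call per failed job, generate a single submit description
that sets the per-job lines again before each queue statement, so all
resubmitted jobs land in one new cluster. The function name
make_resubmit_description is made up for illustration, and the failed
process ids are assumed to be passed in as arguments.

```shell
#!/bin/sh
# Sketch: print one submit description covering all given process ids.
# make_resubmit_description is a hypothetical helper name, not a Condor tool.
make_resubmit_description() {
    # Settings shared by every job are emitted once.
    cat <<EOF
universe                = vanilla
executable              = /home/user/bin/IWillFindIt.exe
notification            = Never
EOF
    # Per-job settings are repeated before each queue statement,
    # so every job gets its own argument, log and error file.
    for id in "$@"; do
        cat <<EOF
Arguments               = -j $id
log                     = /home/user/log/$id.log
error                   = /home/user/log/$id.err
queue
EOF
    done
}
```

The output could be checked by eye first and then piped on, for example
"make_resubmit_description 6 1345 8733 | condor_submit". Note that the
resubmitted jobs get new process numbers in the new cluster; only the
argument and file names keep the old ids.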

Is there a better way to get this?

Please note: I need the log, error, and argument lines to come out
correctly.

Cheers

Carsten