
[HTCondor-users] Failing Jobs on hold due to output missing



Hi,

I have a problem with how HTCondor's transfer_output_files interacts with the hold mechanism when jobs fail. Our jobs often produce very large temporary files, so explicitly listing the actual output files is a must for us. However, when a job fails before creating an output file, the STARTER cannot transfer the "requested" file, and the job is put on hold as a result. [1] I realize this is documented behavior. [2]

As we have a multi-backend job manager wrapped around Condor, this is not ideal for us: it masks an error in the job itself as an error of Condor. We could catch the error, but that requires separate handling for jobs with transfer_output_files and those without it.

It would be much easier if we could have Condor treat the job as Completed regardless of whether all files exist, or if we could mark individual files as optional. Is there any way to do this?
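
For reference, the setup described above corresponds roughly to a submit description like the one below. This is only a minimal sketch: the executable name run_cmssw.sh is a placeholder, and the output file name is taken from the hold message in [1]. Our real submit files are generated by the job manager and contain more than this.

```
# Illustrative submit description; run_cmssw.sh is a hypothetical placeholder
executable              = run_cmssw.sh
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
# Only this file is sent back; the large temporary files stay on the worker node
transfer_output_files   = cmssw.log.gz
queue
```

If the executable exits before cmssw.log.gz is written, the transfer fails and the job ends up on hold instead of Completed.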

Cheers,
Max


[1]
8928.89 mfischer 3/27 14:49 Error from [WORKERNODE]: STARTER at [WORKERNODE] failed to send file(s) to <[SCHEDD]:9615>: error reading from /home/cmsusr189/home_cream_084391945/glide_p21124/execute/dir_20515/cmssw.log.gz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <[WORKERNODE]:52719>

[2]
http://research.cs.wisc.edu/htcondor/manual/current/2_5Submitting_Job.html#SECTION00354200000000000000