
Re: [HTCondor-users] Failing Jobs on hold due to output missing



Max,

This seems a bit hack-y, but could this be done with a custom file-transfer plugin? If you only need to limit the size of the files that get transferred, you could write a script that checks each file's size before sending and skips any file bigger than so-and-so MB. (You would also need to unset transfer_output_files so that the entire sandbox is processed.)
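
To make the size check concrete, here is a rough Python sketch of just the filtering step. It is not HTCondor's plugin interface: the function name files_under_limit, the sandbox path, and the byte limit are all made up for illustration, and how the result would be hooked into the actual transfer is left open.

```python
import os


def files_under_limit(sandbox_dir, max_bytes):
    """Return the sorted names of regular files in sandbox_dir
    whose size is at most max_bytes.

    Hypothetical helper: it only decides which files are small
    enough to send; it does not transfer anything itself.
    """
    keep = []
    for entry in os.scandir(sandbox_dir):
        # Only plain files are candidates; skip subdirectories and symlinks.
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_size <= max_bytes:
            keep.append(entry.name)
    return sorted(keep)
```

Something like files_under_limit("/path/to/sandbox", 50 * 1024 * 1024) would then give the list of files to hand to whatever does the sending. Only the top level of the sandbox is scanned here; subdirectories are ignored.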

I guess it would be possible to do it based on the filename too, but that would have pretty big scalability issues, most likely.

On Wed, Mar 27, 2013 at 10:39 AM, Max Fischer <mfischer@xxxxxxxxxxxxxxxxxxxx> wrote:
Hi,

I have a problem with how HTCondor's transfer_output_files and hold mechanism interact for failing jobs. We often have some very big temporary files in our jobs, so specifying the actual output files is a must for us in these cases. However, when a job fails before creating the output file, the STARTER subsequently fails to transfer the "requested" output file, and as a result the job is put on hold. [1] I realize this is documented behavior. [2]
As we have a multi-backend job manager wrapped around Condor, this is not exactly optimal for us. It masks what is an error in the job itself as an error of Condor. We could catch the error, but this requires separate handling for jobs with transfer_output_files and those without it. It would be much easier if we could have Condor treat the job as Completed regardless of whether all files exist, or if we could define individual files as optional. Is there any way to do this?

Cheers,
Max


[1]
8928.89  mfischer        3/27 14:49 Error from [WORKERNODE]: STARTER at [WORKERNODE] failed to send file(s) to <[SCHEDD]:9615>: error reading from /home/cmsusr189/home_cream_084391945/glide_p21124/execute/dir_20515/cmssw.log.gz: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <[WORKERNODE]:52719>

[2]
http://research.cs.wisc.edu/htcondor/manual/current/2_5Submitting_Job.html#SECTION00354200000000000000