
Re: [Condor-users] Behavior of Condor jobs held for file transfer errors



On 6/20/2012 3:31 PM, Todd Tannenbaum wrote:

Do you always want to simply remove held grid jobs?

If so, you can put the following into the submit file of a grid universe
job:

    +nonessential = true

This tells Condor to simply abort (remove) any problematic job instead
of putting the job on hold.  Condor will try to remove it nicely, but
will not let it stick around in the queue even if it fails to confirm
what happened on the execute node.  So placing the nonessential
attribute in the job ad is equivalent to doing condor_rm followed by
condor_rm -forcex any time the job would otherwise have gone on hold.
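
As a rough illustration of where that line goes, here is a minimal sketch
of a grid universe submit file. The executable name, the output file names,
and the grid_resource value are placeholders assumed for the example, not
taken from any real site; only the +nonessential line comes from the advice
above.

    # Sketch only: my_job.sh and the gatekeeper host are hypothetical.
    universe      = grid
    grid_resource = gt2 gatekeeper.example.org/jobmanager-fork
    executable    = my_job.sh
    output        = job.out
    error         = job.err
    log           = job.log

    # Remove the job instead of putting it on hold when something goes wrong.
    +nonessential = true

    queue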

regards,
Todd


Just for the record, to clarify: the nonessential attribute is only honored for jobs submitted to the grid universe (any grid type). It does not work in the vanilla universe, but if folks would like to see it there as well, we could add it to our list...

regards,
Todd






-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx
[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
Sent: Wednesday, June 20, 2012 2:56 PM
To: condor-users@xxxxxxxxxxx
Subject: Re: [Condor-users] Behavior of Condor jobs held for file
transfer errors


Jobs that are on hold can be removed automatically by using the
periodic_remove expression in the job submit file or the
SYSTEM_PERIODIC_REMOVE expression in the submit machine's Condor
configuration.

Example:

SYSTEM_PERIODIC_REMOVE = HoldReasonCode == 12 || HoldReasonCode == 14

The HoldReasonCodes are defined in the manual:

http://research.cs.wisc.edu/condor/manual/v7.6/10_Appendix_A.html#82773
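
For a per-job version, the same test can go into the submit file as
periodic_remove. This is a sketch under assumptions: the JobStatus == 5
guard (5 is the Held state) is added here so the expression only fires for
held jobs, and the two codes are simply copied from the example above.
Check the manual for what each code means before relying on them.

    # Sketch: remove this job if it is held with one of these reason codes.
    periodic_remove = (JobStatus == 5) && (HoldReasonCode == 12 || HoldReasonCode == 14)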

--Dan

On 6/20/12 12:25 PM, Myung Cho wrote:
Hi, I did a quick search for this topic but haven't found any
relevant posts. Is there a way to change/specify the default behavior
in Condor for jobs with file transfer errors? Any error in file
transfer for our jobs, for example a missing file specified in
transfer_output_files, seems to leave the job in the held state
forever. Is there a way for the job to just complete with an error? I would
rather see it finish with an error reported than have it just hang
around in the hold state.

Thanks.
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx
with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/







--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
Condor Project Technical Lead          1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685