
Re: [condor-users] jobs die due to low free memory

Anika Boehm wrote:
Once in a while, a long run of jobs dies on one workstation (up to 200 jobs
within 15 minutes). They mostly die from signal 6 or 11. As far as I can see,
the cause is that the workstation is running out of memory. I couldn't find
anything in the documentation saying that Condor checks a machine's actual
free memory rather than just its total installed memory (we're running
Condor 6.5.3). Is there a way to make Condor check this and not start a job
if free memory is smaller than the job's size?


I don't know whether setting the ImageSize macro in the submit file would help in this situation.

ImageSize : Estimate of the memory image size of the job in kbytes. The initial estimate may be specified in the job submit file. Otherwise, the initial value is equal to the size of the executable. When the job checkpoints, the ImageSize attribute is set to the size of the checkpoint file (since the checkpoint file contains the job's memory image).
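
As a rough, untested sketch of what that could look like: the submit file below supplies an initial ImageSize estimate with the image_size command and asks for a machine whose advertised Memory covers it. The executable name "my_job" and the 500000 kbyte figure are placeholders, not values from this thread. Note that Memory is the RAM the machine advertises, not the memory that is free at that moment, so this only helps if the estimate is realistic and the machine is not already overcommitted.

```
# Sketch only: "my_job" and the size below are placeholder values.
universe     = vanilla
executable   = my_job

# Expected maximum size of the job's memory image, in kbytes.
image_size   = 500000

# Memory is advertised in Mbytes and ImageSize is in kbytes,
# so convert before comparing.
requirements = (Memory * 1024) >= ImageSize

queue
```

Some versions of condor_submit may append a similar clause on their own; condor_q -l shows the final Requirements expression of a queued job.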

This may also be a case of what UW calls a 'black hole' machine: one that keeps accepting jobs and failing them almost immediately, so it drains the queue. Even if it's not a real black hole, putting this statement in the submit file will keep a job in the queue, instead of letting it leave as if it had completed, when it exits less than 10 minutes after it started:

on_exit_remove = (CurrentTime - JobStartDate) > (10 * 60)
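
In context, a minimal submit file using that line might look like the sketch below. The executable and file names are placeholders, and the 10-minute threshold is only sensible if it is shorter than the quickest legitimate run of the job; a job that really finishes in under 10 minutes would otherwise be put back in the queue as well.

```
# Sketch only: file names are placeholders.
universe       = vanilla
executable     = my_job
output         = my_job.out
error          = my_job.err
log            = my_job.log

# Leave the queue only if more than 10 minutes (600 seconds) have passed
# since the job started; otherwise go back to Idle and run again.
on_exit_remove = (CurrentTime - JobStartDate) > (10 * 60)

queue
```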

From the manual:

on_exit_remove = ClassAd Boolean Expression
This expression is checked when the job exits and if true, then it allows the job to leave the queue normally. If false, then the job is placed back into the Idle state. If the user job is a vanilla job then it restarts from the beginning. If the user job is a standard job, then it restarts from the last checkpoint.

For example: Suppose you have a job that occasionally segfaults but you know if you run it again on the same data, chances are it will finish successfully. This is how you would represent that with on_exit_remove (assuming the signal identifier for segmentation fault is 11, as it is on Linux):

on_exit_remove = (ExitBySignal == False) || (ExitSignal != 11)

The above expression will not let the job exit if it exited by a signal and that signal number was 11 (representing segmentation fault). In any other case of the job exiting, it will leave the queue as it normally would have done.

If left unspecified, this will default to True.

periodic_ expressions (defined elsewhere in this man page) take precedence over on_exit_ expressions, and a _hold expression takes precedence over a _remove expression.

This expression is available for the vanilla and java universes. It is additionally available, when submitted from a Unix machine, for the standard universe.
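
For the two signals mentioned in the original report (6 and 11), the same idea could be written as the untested sketch below. It is phrased so that a normal exit (ExitBySignal is False) evaluates to True and the job leaves the queue, while a death by either signal puts it back in the Idle state. The signal numbers are taken from the question and are assumed to mean the same thing on every machine in the pool.

```
# Sketch only: requeue the job if it was killed by signal 6 or 11,
# and let it leave the queue in every other case.
on_exit_remove = (ExitBySignal == False) || ((ExitSignal != 6) && (ExitSignal != 11))
```

A job that fails this way on every machine would be retried indefinitely, so it is worth watching the queue after adding a line like this.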

--
Chris Horn
p: 703.413.1100 x5193
f: 703.413.8111