
Re: [condor-users] jobs die due to low free memory



Hi Chris,

Thanks for your hints. I had hoped there would be a "cleaner" solution than
adding a command to one's submit file.
By the way, the expression
   on_exit_remove = (ExitBySignal == True) && (ExitSignal != 4)
does not remove a successfully completed job (ExitCode = 0) from the queue
(at least not with condor-6.5.3 on Solaris). For a normal exit, ExitBySignal
is False, so the whole expression evaluates to FALSE and the job is placed
back into the Idle state.
It seems that
   on_exit_remove = ExitCode =?= 0
is the solution to my problem.
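
To illustrate the difference, here is a minimal Python sketch of how the two
expressions evaluate. It is only a simplified model of ClassAd semantics, not
Condor code: the helper names and the sample job ads are made up, and it
assumes that an attribute missing from the job ad (such as ExitSignal after a
normal exit) is treated as undefined.

```python
# Simplified model of ClassAd evaluation for the two on_exit_remove
# expressions. UNDEFINED stands in for an attribute missing from the job ad
# (an assumption of this sketch).
UNDEFINED = object()


def equal(a, b):
    """ClassAd '==': undefined if either operand is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a == b


def not_equal(a, b):
    """ClassAd '!=': undefined if either operand is undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def logical_and(a, b):
    """ClassAd '&&': a False operand wins over an undefined one."""
    if a is False or b is False:
        return False
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return True


def meta_equal(a, b):
    """ClassAd '=?=': always True or False, never undefined."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return type(a) is type(b) and a == b


def manual_example(ad):
    """(ExitBySignal == True) && (ExitSignal != 4)"""
    by_signal = equal(ad.get("ExitBySignal", UNDEFINED), True)
    other_signal = not_equal(ad.get("ExitSignal", UNDEFINED), 4)
    return logical_and(by_signal, other_signal)


def exit_code_check(ad):
    """ExitCode =?= 0"""
    return meta_equal(ad.get("ExitCode", UNDEFINED), 0)


# Hypothetical job ads for three ways a job can end.
clean_exit = {"ExitBySignal": False, "ExitCode": 0}
failed_exit = {"ExitBySignal": False, "ExitCode": 1}
killed_by_4 = {"ExitBySignal": True, "ExitSignal": 4}

for name, ad in [("clean exit", clean_exit),
                 ("failed exit", failed_exit),
                 ("killed by signal 4", killed_by_4)]:
    print(name, manual_example(ad), exit_code_check(ad))
```

In this model the manual's expression is False for a clean exit, so the job
would be requeued, while "ExitCode =?= 0" is True only for the clean exit.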

Cheers
Anja

> 
> I don't know whether setting the ImageSize macro in the submit file 
> would help in this situation.
> 
> ImageSize : Estimate of the memory image size of the job in kbytes. The 
> initial estimate may be specified in the job submit file. Otherwise, the 
> initial value is equal to the size of the executable. When the job 
> checkpoints, the ImageSize attribute is set to the size of the 
> checkpoint file (since the checkpoint file contains the job's memory
> image).
> 
> This may also be a case of what UW calls a 'black hole' machine.  Even 
> if it's not a real black hole, putting this statement in the submit file 
> will prevent the uncompleted job from being removed from the queue if it 
> took less than 10 minutes to run:
> 
> on_exit_remove = (CurrentTime - JobStartDate) > (10 * 60)
> 
> 
> From the manual:
> 
> on_exit_remove = ClassAd Boolean Expression
>      This expression is checked when the job exits and if true, then it 
> allows the job to leave the queue normally. If false, then the job is 
> placed back into the Idle state. If the user job is a vanilla job then 
> it restarts from the beginning. If the user job is a standard job, then 
> it restarts from the last checkpoint.
> 
>      For example: Suppose you have a job that occasionally segfaults but 
> you know if you run it again on the same data, chances are it will 
> finish successfully. This is how you would represent that with 
> on_exit_remove (assuming the signal identifier for segmentation fault is
> 4):
> 
> 	on_exit_remove = (ExitBySignal == True) && (ExitSignal != 4)
> 
>      The above expression will not let the job exit if it exited by a 
> signal and that signal number was 4 (representing segmentation fault). In 
> any other case of the job exiting, it will leave the queue as it 
> normally would have done.
> 
>      If left unspecified, this will default to True.
> 
>      periodic_ expressions (defined elsewhere in this man page) take 
> precedence over on_exit_ expressions and a _hold expression takes 
> precedence over a _remove expression.
> 
>      This expression is available for the vanilla and java universes. It 
> is additionally available, when submitted from a Unix machine, for the 
> standard universe.
> 
> 
> -- 
> Chris Horn
> p: 703.413.1100 x5193
> f: 703.413.8111
> Condor Support Information:
> http://www.cs.wisc.edu/condor/condor-support/
> To Unsubscribe, send mail to majordomo@xxxxxxxxxxx with
> unsubscribe condor-users <your_email_address>
> 
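
The run-time check quoted above, (CurrentTime - JobStartDate) > (10 * 60),
can be sketched the same way. This is a plain Python stand-in, assuming both
attributes are timestamps in seconds; the function and constant names are
hypothetical.

```python
# Stand-in for: on_exit_remove = (CurrentTime - JobStartDate) > (10 * 60)
MIN_RUNTIME_SECONDS = 10 * 60


def ran_long_enough(current_time, job_start_date):
    """True if the job ran for more than ten minutes and may leave the queue."""
    return (current_time - job_start_date) > MIN_RUNTIME_SECONDS


start = 1_000_000                              # arbitrary start timestamp
print(ran_long_enough(start + 90, start))      # short run: job is requeued
print(ran_long_enough(start + 3600, start))    # one-hour run: job may leave
```

A job that dies within the first ten minutes therefore stays in the queue,
whatever its exit code was.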
