
Re: [HTCondor-users] Make runs fail?



Hi,

Some quick thoughts on the below --

1. Whenever I hear that someone is doing a parameter sweep drawn from 
a probability distribution and wants to throw out the jobs that run for a 
long time (or use a lot of memory, or whatever), alarm bells go off.  Be 
aware that by removing these longer-running jobs, you may be introducing 
a significant statistical bias into your sample that could invalidate 
your scientific results!  Proceed with caution....

2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold 
1.10", which would free the cores and hopefully still keep 
PEST/YAMR/Panther from submitting another job with the same parameters? 
Then once all your modeling work is done, you could simply condor_rm 
everything.

3. How do you go about telling if the jobs will be worthless to you 
after the first hour?  I ask because perhaps it is something that can be 
easily automated... i.e., you can tell HTCondor to automatically 
condor_hold or condor_rm jobs that run for more than an hour...
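
As a minimal sketch of what that automation might look like, assuming you 
control the submit description file for these jobs and that "running for 
more than an hour" really is the right test, something along these lines 
could be added to the submit file. The 3600-second threshold and the 
reason text are placeholders, and whether PEST/YAMR/Panther treats a held 
job as failed is untested here:

```
# Hypothetical additions to the submit description file (assumptions:
# you can edit this file, and a one-hour cutoff identifies the
# worthless runs).

# Put a job on hold once it has been in the Running state
# (JobStatus == 2) for more than 3600 seconds.
periodic_hold = (JobStatus == 2) && (time() - EnteredCurrentStatus > 3600)

# Placeholder text recorded as the job's hold reason.
periodic_hold_reason = "Ran longer than one hour"

# Alternative: remove the job outright instead of holding it.
# periodic_remove = (JobStatus == 2) && (time() - EnteredCurrentStatus > 3600)
```

If the submit file is not under your control, the same expression could 
be tried by hand against jobs already in the queue with 
condor_hold -constraint.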

Hope the above thoughts help,
Todd


On 10/19/2018 9:58 AM, Kitlasten, Wesley via HTCondor-users wrote:
> Hello,
> 
> I am using the parameter estimation software PEST to run multiple models
> (jobs). PEST uses YAMR and Panther, although I struggle to make sense of
> how everything works together.
> 
> The parameters are determined from a probability distribution. Some
> parameter combinations (jobs) can take 12+ hours to run, and from previous
> experience I can tell the results of those runs will be worthless to me.
> I can usually tell which jobs will be useless within the first hour. I
> would like to remove these jobs after about 1 hour to free up cores for
> other runs, but have them returned as failed jobs and not be resubmitted
> to the pool.
> 
> For example, if I type "condor_rm 1.10" it will remove that job, but the
> model with those parameters will just be resubmitted to another node and
> start over. However, if the job truly fails, a job with those parameters
> will not be resubmitted.
> 
> Is there a way to remove a job and have condor return a failed status,
> rather than have the same parameters run under a different job name?
> 
> References:
> https://github.com/dwelter/pestpp
> https://github.com/jtwhite79/pestpp/tree/master/bin/iwin
> 
> 


-- 
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685