Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?
- Date: Fri, 19 Oct 2018 11:17:11 -0700
- From: "Kitlasten, Wesley" <wkitlasten@xxxxxxxx>
- Subject: Re: [HTCondor-users] [EXTERNAL] Re: Make runs fail?
1) I too heard those alarm bells! The model is highly non-linear and it just does not converge for certain parameter combinations. As such, the "results" of the model with those parameter combinations are inherently flawed (i.e. no point in considering fluxes when your mass balance errors are 200%, right?). With a better understanding of the model I could potentially better inform the covariance matrix and avoid those combinations that fail to converge, but so far I have been unsuccessful.
2 and 3) These are helpful. I use a python script to extract the elapsed time of the simulation and the number of iterations, then estimate how long the model will take to complete. I can likely use Michael's presentation to guide me further, although honestly I only grok about 60% or less of it!
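As a rough illustration of that kind of estimate, here is a minimal sketch. It assumes run time scales linearly with the iteration count and that the total number of iterations is known in advance, which is a crude assumption for a highly non-linear model. The function names, arguments, and the 12 hour cutoff are hypothetical and are not the actual script:

```python
def estimate_total_seconds(elapsed_s, iters_done, iters_total):
    """Linearly extrapolate total run time from progress so far.

    Assumption: every iteration costs about the same wall-clock time.
    """
    if iters_done <= 0:
        raise ValueError("need at least one completed iteration to extrapolate")
    return elapsed_s * iters_total / iters_done


def looks_too_slow(elapsed_s, iters_done, iters_total, cutoff_s=12 * 3600):
    """Return True if the projected run time exceeds the (hypothetical) cutoff."""
    return estimate_total_seconds(elapsed_s, iters_done, iters_total) > cutoff_s


if __name__ == "__main__":
    # One hour in, 50 of 1000 iterations done: projected 20 hours in total.
    print(estimate_total_seconds(3600, 50, 1000) / 3600, "hours projected")
    print("too slow:", looks_too_slow(3600, 50, 1000))
```

A check like this could be run about an hour into each job to flag the ones worth stopping.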
Some quick thoughts on the below --
1. Whenever I hear someone saying they are doing a parameter sweep from
a probability distribution and want to throw out the jobs that run for a
long time (or use a lot of memory, or whatever), alarm bells go off. Be
aware that by removing these longer running jobs, you may be introducing
a significant statistical bias in your sample that could render your
scientific results as flawed! Proceed with caution....
2. Perhaps instead of doing "condor_rm 1.10", you could do "condor_hold
1.10" which would free the cores and hopefully still keep
PEST/YAMR/Panther from submitting another job with the same parameters?
Then once all your modeling work is done, you could simply condor_rm
3. How do you go about telling if the jobs will be worthless to you
after the first hour? I ask because perhaps it can be something that is
easily automated... i.e. you can tell HTCondor to automatically
condor_hold or condor_rm jobs that run for more than an hour...
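
A minimal sketch of how point 3 might be automated from outside the job, assuming a plain wall-clock threshold is a good enough test. The job list format, the helper names, and the one hour limit are hypothetical; `condor_hold` is the same command mentioned in point 2:

```python
import subprocess


def jobs_to_hold(jobs, limit_s=3600):
    """Pick the job IDs whose run time exceeds the limit.

    `jobs` is a hypothetical list of (job_id, runtime_seconds) pairs,
    e.g. [("1.10", 4000), ("1.11", 120)]. How the run times are
    collected is left out of this sketch.
    """
    return [job_id for job_id, runtime_s in jobs if runtime_s > limit_s]


def hold_job(job_id):
    """Put one job on hold by calling the condor_hold command-line tool."""
    subprocess.run(["condor_hold", job_id], check=True)


if __name__ == "__main__":
    # Made-up example data: job 1.10 has run for over an hour.
    running = [("1.10", 4000), ("1.11", 120)]
    for job_id in jobs_to_hold(running):
        print("would hold", job_id)  # replace with hold_job(job_id) on a real pool
```

Holding rather than removing is used here only because it keeps the job in the queue; whether that stops the same parameters from being resubmitted would need to be checked.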
Hope the above thoughts help,
On 10/19/2018 9:58 AM, Kitlasten, Wesley via HTCondor-users wrote:
> I am using the parameter estimation software PEST to run multiple models
> (jobs). PEST uses YAMR and Panther, although I struggle to make sense of
> how everything works together.
> The parameters are determined from a probability distribution. Some
> parameter combinations (jobs) can take 12+ hours to run, and from previous
> experience I can tell the results of those runs will be worthless to
> me. I can usually
> tell which jobs will be useless within the first hour. I would like to
> remove these jobs after about 1 hour to free up cores for other runs
> but have them returned as failed jobs and not be resubmitted to the pool.
> For example, if I type "condor_rm 1.10" it will
> remove that job, but the model with those parameters will just be
> resubmitted to another node and start over. However, if the job truly
> fails a job with
> those parameters will not be resubmitted.
> Is there a way to remove a job and have condor return a failed status,
> rather than have the same parameters run under a different job name?
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685
United States Geological Survey
2730 N. Deer Run Road
Carson City, NV 89701