Re: [HTCondor-users] No-idle jobs



Jim,

I think the best approach here will be a periodic_hold and periodic_release expression pair using the existing attributes of the job ClassAd.

For example, if the job doesn't get a connection to the database within five minutes, it will not start accumulating any CPU time. So you could have an expression that will put the job on hold if the JobCurrentStartDate attribute is more than, for example, 10 minutes ago, and the RemoteUserCpu is less than, say, a few seconds - something that's reliably less than whatever a job which finally got connected to the database after a five-minute wait will accumulate after a total runtime of 10 minutes.

If it's possible for such a job to be restarted and try again, then you can create a periodic_release expression. You'd set a subcode on the periodic hold (periodic_hold_subcode in the submit file), and then refer to it in the release expression for a job that failed to connect - JobStatus == 5 && HoldReasonSubCode == 12345 - something along those lines.
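
As a rough sketch of what that pair could look like in a submit description file: the executable name, the 600-second and 5-second thresholds, and the subcode 12345 are all placeholders to tune for your jobs. It assumes the subcode lands in the job's HoldReasonSubCode attribute and that a periodic hold sets HoldReasonCode to 3, so check both against your HTCondor version.

```condor
# Hypothetical job wrapper; substitute your own executable.
executable = run_db_job.sh

# Hold a running job (JobStatus == 2) that started more than
# 10 minutes ago but has used less than 5 seconds of user CPU.
periodic_hold = (JobStatus == 2) && (time() - JobCurrentStartDate > 600) && (RemoteUserCpu < 5)
periodic_hold_reason = "No database connection after 10 minutes"
periodic_hold_subcode = 12345

# Release only the jobs that were held by the expression above,
# so holds made for other reasons are left alone.
periodic_release = (HoldReasonCode == 3) && (HoldReasonSubCode == 12345)

queue
```

Keep in mind that RemoteUserCpu is only refreshed when the job's status is updated, so the CPU threshold should leave some margin for stale values.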

This way, the hung jobs can be cleared out in fairly short order, and allowed to make another attempt, while not clogging up useful slots.

If there are situations where you can tell a priori that the jobs won't be able to connect, you can set up a STARTD_CRON job that checks for that situation and adds a "DatabaseAvailable" Boolean attribute to all of the machines, and set the job requirements so that the jobs won't even attempt to start if the database is offline.
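
A minimal sketch of the execute-node configuration for that, assuming a site-written check script: the job name DBCHECK, the script path, and the five-minute period are all hypothetical. The script itself would just print a line such as "DatabaseAvailable = True" (or False) to standard output, which the startd merges into the machine ClassAd.

```condor
# Local configuration on each execute node.
# DBCHECK is an arbitrary name for this cron job.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DBCHECK

# Hypothetical script that tests the database connection and
# prints "DatabaseAvailable = True" or "DatabaseAvailable = False".
STARTD_CRON_DBCHECK_EXECUTABLE = /usr/local/libexec/check_database.sh
STARTD_CRON_DBCHECK_MODE = Periodic
STARTD_CRON_DBCHECK_PERIOD = 5m
```

The jobs would then carry something like requirements = (DatabaseAvailable =?= True) in their submit description, so they only match machines that currently advertise the database as reachable.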

	-Michael Pelletier.

-------------
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of James Ward - NOAA Affiliate
Sent: Wednesday, January 3, 2018 3:29 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] [HTCondor-users] No-idle jobs

We are trying HTCondor, but we have jobs that don't like to be idled - if the job doesn't get a connection to a database in 5 minutes, it hangs. For example, you start 6 no-idle jobs, 100 more jobs are queued by another user, the 6 no-idle jobs keep running, the 100 other jobs finish in several hours, and the 6 no-idle jobs are still running in the queue. There is a pool of users that will be submitting no-idle jobs, along with other jobs.
Thank you,
Jim Ward