
Re: [HTCondor-users] Efficiency & centralization of global information gathering?



Hey Edward,

 

I was able to confirm that in order to get the negotiator to recognize a new concurrency limit in a running job, you have to update the machine ad for the slot where the job is running, as opposed to the job ad via condor_qedit.

 

I ran condor_update_machine_ad as a queue superuser to insert a new ConcurrencyLimits string in the slot in question, which contains a suspended job:

 

condor2$ condor_status -long slot1_1@xxxxxxxxxxxxxx \
     | awk 'BEGIN{print "ConcurrencyLimits = \"matlab:999\""}{print}' | condor_update_machine_ad
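
If this needs to be repeated for different limits, the awk step could be wrapped in a small shell function. This is only a sketch: the name with_concurrency_limit is made up for illustration and is not part of HTCondor, and it assumes the ad arrives on standard input in the format printed by condor_status -long.

```shell
#!/bin/sh
# Hypothetical helper (not an HTCondor tool): read a ClassAd in long format
# on stdin and write it back out with a ConcurrencyLimits line prepended.
# $1 is the limit string, for example matlab:999
with_concurrency_limit() {
    awk -v limit="$1" 'BEGIN { printf "ConcurrencyLimits = \"%s\"\n", limit } { print }'
}

# Intended use, run on the execute node as a queue superuser:
#   condor_status -long slot1_1@xxxxxxxxxxxxxx \
#       | with_concurrency_limit "matlab:999" | condor_update_machine_ad
```

The function only rewrites text, so its output can be inspected before it is piped into condor_update_machine_ad.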

 

Once this is in place, the "condor_status -af name concurrencylimits" command shows the update, and no additional "matlab" concurrency-limited jobs will start.

 

Since I can't get the -name command-line option to work on 8.4.10, it appears to be necessary to run this on the exec node in question as a queue superuser, rather than on the CM server.

 

Interestingly, while watching the "condor_status" output during one test, I saw the slot's concurrency limit revert to "undefined" after a certain length of time. This may have been because the string didn't exactly match the one in the job ClassAd, so it would be reasonable to set both the job and machine ConcurrencyLimits attributes to matching strings. I've got another test going with precisely matched job and machine attribute strings, which has been holding steady for about 45 minutes, so that looks like the trick.

 

It also turns out that a job can condor_suspend itself, as long as its next step is a short sleep that gives the startd/starter time to send the suspend signals. Depending on the job characteristics, that would be an alternative to polling while waiting for the queue-superuser task to insert the requested concurrency limit into the machine ad. You'd just have the machine-ad-update task "condor_continue" the job after the update is complete.
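
A rough sketch of the self-suspend step, as it might appear in a job wrapper script, follows. The helper names job_id_from_ad and suspend_self are hypothetical, and the 30-second default sleep is an arbitrary guess. It also assumes that the starter exports _CONDOR_JOB_AD pointing at a copy of the job ad, and that condor_suspend run from the exec node can reach the schedd, both of which depend on the local configuration.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not HTCondor commands.

# Print "ClusterId.ProcId" read from a job ad file in "Attr = value" format.
# Fails (non-zero exit) if either attribute is missing.
job_id_from_ad() {
    awk -F' = ' '
        $1 == "ClusterId" { c = $2 }
        $1 == "ProcId"    { p = $2 }
        END { if (c != "" && p != "") print c "." p; else exit 1 }
    ' "$1"
}

# Suspend the calling job, then sleep so the startd/starter has time to
# deliver the suspend signal before the job's next real step runs.
# $1 is the sleep length in seconds (assumed default: 30).
suspend_self() {
    id=$(job_id_from_ad "${_CONDOR_JOB_AD:?not running under HTCondor}") || return 1
    condor_suspend "$id"
    sleep "${1:-30}"
}
```

The job would call suspend_self just before the step that needs the license, and the machine-ad-update task would run condor_continue with the same cluster.proc ID once the slot's ConcurrencyLimits has been set.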

 

Thanks again for your insights and suggestions!

 

                -Michael Pelletier.

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Edward Labao
Sent: Tuesday, January 10, 2017 12:07 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Efficiency & centralization of global information gathering?

 

Hi Michael!

We ran into the same issue a few years ago with user jobs tying up a particularly scarce license for hours before it was actually used. We tested the exact same thing you're thinking of by just running a condor_qedit on a long-running job to update its concurrency limit attribute, but it didn't look like the negotiator ever got an update of the concurrency limit.