
Re: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit



 

Hello,

 

I found in the documentation how to create custom machine resources (MACHINE_RESOURCE_<tag>), but could not find how to create a custom cluster-wide resource that is shared among all the computing nodes. Could you please let me know if this is possible?

 

Also, updating RequestMemory using

 

   condor_qedit ClusterID RequestMemory NewRAM

 

does not seem to have any effect on the number of running jobs. It also does not affect the “ChildMemory” list of a dynamic slot (the list stays identical to the one created at submission). Please let me know whether the RequestMemory attribute of a job has any effect on scheduling after the job cluster has been submitted. Even with dynamic slots, I find myself needing to adjust the resource allocation of running jobs.

 

Thank you,

Siarhei.

 

 

From: Vaurynovich, Siarhei
Sent: Monday, June 04, 2018 12:43 PM
To: Todd Tannenbaum <tannenba@xxxxxxxxxxx>; HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: RE: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit

 

 

Hello Todd,

 

Thank you very much for your reply! It contained a lot of useful advice.

 

As for reproducing the problem of jobs becoming idle after condor_qedit modifies the SlotID requirement: I did not study the problem systematically, but it happened to me about half a dozen times in the last 2-3 weeks, so it was certainly not a single occurrence. I noticed that simply changing the priority of my jobs with condor_prio makes them run again (such priority changes do not make the jobs any more or less prioritized relative to other jobs). After your last e-mail, I also started running condor_reschedule after every modification of job ClassAds, and so far no job cluster has gotten stuck.

 

I have switched my servers to use 100% dynamic slots, as you advised, and it works beautifully.

 

If you have time to help a bit more, I have a few questions about how to meet some requirements that come up in my practical use of HTCondor:

 

1) Some resources are naturally limited, and to achieve optimal overall computing progress, it is desirable to fully utilize those resources whenever they are available. Examples of such resources: a) the number of connections to a certain database (to avoid overloading or even crashing the DB, the number of simultaneous queries has to be limited), b) server-grade GPUs, which can run up to a certain number of jobs in parallel. How can I define such custom resources and require them in submit files? If I set the priority of jobs that use such a resource to the maximum, the resource would be fully utilized.

 

2) Can I ask HTCondor to always run a certain number of jobs from a specific cluster? This is needed to make sure that large (i.e. requiring a lot of RAM) lower priority jobs continue to make progress, while higher priority CPU-bound jobs utilize the remaining RAM.

 

3) Is it possible to dynamically put to sleep (or, in the worst case, kill and later restart) jobs that attempt to allocate so much RAM that less than some threshold percentage of RAM would remain available on a computing node? For example, if an allocation by a job would leave less than 10% of RAM free, I want the job to sleep until the allocation becomes possible. In practice, some jobs use a wide range of RAM while they run: the maximum can be almost an order of magnitude higher than the minimum. If each job requests its maximum RAM in the submit file, then most of the time a large fraction of the nodes' RAM would sit unused and only a small number of jobs would run (so the CPUs would not be fully utilized either), since such jobs need the large amount of RAM for only a small fraction of their run time. Or is it at least possible to automatically reschedule jobs that were killed by the OS due to memory allocation problems (for example, from the held or some other state)?

 

Thank you very much for your help,

Siarhei.

 

 

-----Original Message-----
From: Todd Tannenbaum [mailto:tannenba@xxxxxxxxxxx]
Sent: Thursday, May 31, 2018 12:50 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>; Vaurynovich, Siarhei <siarhei.vaurynovich@xxxxxxxxxxxxx>
Subject: Re: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit

 

On 5/30/2018 2:04 PM, Vaurynovich, Siarhei wrote:

> Hello,

>

> *Please, let me know if there is a way to force HTCondor matchmaker to

> consider a job cluster for scheduling.*

>

 

The command "condor_reschedule", issued on the submit host (i.e. where the schedd is running), will do that.  However, by default, this should happen automatically every few minutes.

 

> My jobs often sit unscheduled in the queue for many hours

> (indefinitely) if I use condor_qedit to adjust job requirements.

>

> To make sure jobs have enough RAM to run, I sometimes restrict allowed

> SlotID range in requirements. There is probably a better way to do it:

> i.e. somehow to declare RAM as a shared resource with certain number

> of units of the resource available, but for now this is my quick hack

> to do it. Setting ImageSize does not work since my jobs are almost

> always bigger than per slot RAM and so if I give realistic job size,

> my jobs would never start. Creating specialized slots is also a bad

> idea since my jobs vary strongly in size.

> 

 

The above sounds like pretty strange usage. As you suspect, there are

better ways to do this.  Assuming you are using a current version of

HTCondor (i.e. HTCondor v8.6 or above), instead of configuring your

nodes to partition resources like memory into statically sized slots,

you could configure your nodes to use dynamic (partitionable) slots.

See the HTCondor Manual section "Dynamic Provisioning: Partitionable and

Dynamic Slots" at URL http://tinyurl.com/y83a9ufo.  Once you have set up your

execute nodes to use a partitionable slot as described, then your

condor_submit file can look like:

 

   executable = foo

   # This job only needs one CPU core in the execute slot

   request_cpus = 1

   # This job needs 3.5 GB of RAM in the execute slot

   request_memory = 3500

   queue

 

and the execute node (startd) will carve off a new slot with 3.5GB of

memory for this job.  No messing around with ImageSize required.
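
For reference, a minimal sketch of the kind of execute-node configuration that manual section describes. It assumes you want to hand all of a machine's resources to a single partitionable slot; the percentage and slot numbering are illustrative and should be adapted to your own pool:

```
# Assumption: one slot type that owns 100% of the machine's
# CPUs, memory, and disk.
SLOT_TYPE_1 = 100%

# Advertise exactly one slot of that type ...
NUM_SLOTS_TYPE_1 = 1

# ... and mark it partitionable, so the startd can carve dynamic
# slots out of it to fit each job's request_cpus / request_memory.
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

After changing the configuration, the startd typically needs to be restarted for new slot definitions to take effect.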

 

> The problem is that often after such adjustment, my jobs would often

> stop being scheduled for running – they sit in the queue indefinitely

> and ‘condor_q -better-analyze clusterID’ gives “Job has not yet been

> considered by the matchmaker.” while claiming that there are slots

> “available to run your job”. If I do not use condor_qedit, jobs run

> fine. If I kill the same jobs and then submit them again with new

> requirements, they also run fine.

>

 

This sounds pretty strange.  Can you easily reproduce it?  Does it

happen every time or only sometimes? What version of HTCondor are you

using, on what platform?

 

regards,

Todd

 
