Re: [HTCondor-users] "Job has not yet been considered by the matchmaker" after condor_qedit

On 5/30/2018 2:04 PM, Vaurynovich, Siarhei wrote:

*Please let me know if there is a way to force the HTCondor matchmaker to consider a job cluster for scheduling.*

The command "condor_reschedule", issued on the submit host (i.e. where the schedd is running), will do that. However, by default, this should happen automatically every few minutes.

My jobs often sit unscheduled in the queue for many hours (indefinitely) if I use condor_qedit to adjust job requirements.

To make sure jobs have enough RAM to run, I sometimes restrict the allowed SlotID range in the job requirements. There is probably a better way to do it, e.g. somehow declaring RAM as a shared resource with a certain number of units available, but for now this is my quick hack. Setting ImageSize does not work, since my jobs are almost always bigger than the per-slot RAM, so if I gave a realistic job size my jobs would never start. Creating specialized slots is also a bad idea, since my jobs vary strongly in size.

The above sounds like pretty strange usage. As you suspect, there are better ways to do this. Assuming you are using a current version of HTCondor (i.e. HTCondor v8.6 or above), instead of configuring your nodes to partition resources like memory into statically sized slots, you could configure your nodes to use dynamic (partitionable) slots. See the HTCondor Manual section "Dynamic Provisioning: Partitionable and Dynamic Slots" at URL http://tinyurl.com/y83a9ufo. Once you have set up your execute nodes to use a partitionable slot as described there, your condor_submit file can look like:

  executable = foo
  # This job only needs one CPU core in the execute slot
  request_cpus = 1
  # This job needs 3.5 GB of RAM in the execute slot
  request_memory = 3500

and the execute node (startd) will carve off a new slot with 3.5GB of memory for this job. No messing around with ImageSize required.
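
For reference, a minimal sketch of what the execute-node configuration for a single partitionable slot might look like. This assumes you want one slot that owns all of the machine's CPUs, memory, and disk; your site may need different slot types or percentages, so treat it as a starting point and check it against the manual section above:

  # Assumed setup: one partitionable slot owning 100% of the machine
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = 100%
  SLOT_TYPE_1_PARTITIONABLE = TRUE

Slot type changes generally need a restart of the startd to take effect, not just a reconfig. With a setup like this, each job's request_cpus and request_memory determine the size of the dynamic slot carved out for it.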

The problem is that after such an adjustment, my jobs often stop being scheduled: they sit in the queue indefinitely, and ‘condor_q -better-analyze clusterID’ reports “Job has not yet been considered by the matchmaker.” while also claiming that there are slots “available to run your job”. If I do not use condor_qedit, the jobs run fine. If I kill the same jobs and then submit them again with the new requirements, they also run fine.

This sounds pretty strange. Can you easily reproduce it? Does it happen every time or only sometimes? What version of HTCondor are you using, on what platform?