
Re: [HTCondor-users] condor will not use preferred host at all



On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>
>> As a test I removed liszt from the config and it still will not use
>> chopin, even though a status shows all slots as 'Unclaimed Idle'.
>>
>
> What do you mean when you say "removed liszt from the config" ?

I set NUM_SLOTS to 0

> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on
> liszt) ?
>
> So now when you do a "condor_status", the only thing you see is slots on
> chopin?

Correct.

> And yet your job remains idle and refuses to run on chopin?

Correct.

> Maybe your job does not match any slots on chopin

What makes a match? Before the reboot the same jobs I am trying to run
today were running on chopin and when that was full, ran on liszt.

> -- perhaps because chopin
> has less memory than liszt, for instance.

The 2 machines are identical in every way - CPU, memory, etc. The
reason they were rebooted was that a 10G point to point connection was
installed between them.

> See the manual section "Why is
> the job not running?" at http://tinyurl.com/ycbut82r

That tiny url does not work, but I've been looking at this page:

http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs

and here is the output of condor_q -analyze:verbose when run on the master:

-- Schedd: bach.elucid.local : <192.168.10.2:20734>
No successful match recorded.
Last failed match: Mon Feb 19 17:19:52 2018

Reason for last match failure: no match found

21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
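
One thing that may be worth checking in that summary is whether the five category counts account for all of the machines the analysis says it considered. Below is a minimal sketch in Python that does the arithmetic on the text pasted above. The function name parse_summary and the regular expressions are illustrative assumptions of mine, not part of any HTCondor tool, and the parsing assumes the summary layout shown here.

```python
import re

# Summary text copied verbatim from the condor_q -analyze:verbose output above.
SUMMARY = """\
21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total_machines, {category: count}) from an analysis summary.

    Hypothetical helper: assumes each category line is indented and
    starts with an integer count, as in the output pasted above.
    """
    total_match = re.search(r"Of (\d+) machines", text)
    if total_match is None:
        raise ValueError("no 'Of N machines' line found")
    total = int(total_match.group(1))

    counts = {}
    for line in text.splitlines():
        # Category lines are indented; the header line is not.
        m = re.match(r"\s+(\d+)\s+(.+)", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)
unaccounted = total - sum(counts.values())
print(f"{total} machines considered, {unaccounted} not in any category")
```

For the output above this reports that none of the 132 machines fall into any of the five categories, so the summary by itself does not say why chopin's slots are being skipped.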

>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
>> wrote:
>>>
>>> I have a master and 2 execute hosts (chopin and liszt) and I have one
>>> host (chopin) preferred over the other with these settings:
>>>
>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>     (100 * Machine =?= "chopin")
>>> NEGOTIATOR_DEPTH_FIRST = True
>>>
>>> The preferred host is chopin. This has been working fine until Friday
>>> when both execute hosts were rebooted. Since then condor will only run
>>> jobs on liszt. Even if there are more jobs in the queue than slots on
>>> liszt, it will not use chopin. A condor_status shows all the slots on
>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
>>> errors in the logs.
>>>
>>> How can I debug and/or fix this?