
Re: [HTCondor-users] condor will not use preferred host at all



If I grep all the logs for the ID of one of the jobs in the queue, I
see this in the MatchLog:

Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
preempting none
<192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3>
slot1@chopin

This message repeats 132 times (once for each slot on chopin) and then
I see this:

Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
no match found

That sequence of 133 messages repeats over and over.

Is this a clue to anything?
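
For anyone wanting to check the same pattern without counting by hand, here is a minimal sketch that tallies the "Matched" and "Rejected" records for one job ID. It assumes each MatchLog record sits on a single line (the excerpts above are wrapped by the mail client) and that the job ID directly follows the keyword. The function name and the sample lines are hypothetical, made up for illustration.

```python
import re
from collections import Counter

# Assumed record shape: "<optional timestamp> Matched|Rejected <cluster.proc> ..."
RECORD = re.compile(r"\b(Matched|Rejected)\s+(\d+\.\d+)\b")

def tally_match_log(lines, job_id):
    """Count Matched/Rejected records for one job ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        m = RECORD.search(line)
        if m and m.group(2) == job_id:
            counts[m.group(1)] += 1
    return counts

# Hypothetical sample lines, shortened from the excerpts above.
sample = [
    "Matched 21283.0 prod_user@example <192.168.10.2:9618> preempting none <192.168.11.1:9618> slot1@chopin",
    "Matched 21283.0 prod_user@example <192.168.10.2:9618> preempting none <192.168.11.1:9618> slot2@chopin",
    "Rejected 21283.0 prod_user@example <192.168.10.2:9618>: no match found",
    "Matched 21284.0 prod_user@example <192.168.10.2:9618> preempting none <192.168.11.1:9618> slot3@chopin",
]
counts = tally_match_log(sample, "21283.0")
print(counts["Matched"], counts["Rejected"])
```

An open file handle for the MatchLog can be passed in place of the sample list; the log's location depends on the local configuration.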


On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> If I run ps -efal | grep condor on the 2 execute hosts the only
> difference is that on chopin (the one I cannot get condor to use) it
> has this:
>
> condor_shared_port
>
> That is not on liszt. Is that an issue?
>
> On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>>>
>>>> As a test I removed liszt from the config and it still will not use
>>>> chopin, even though condor_status shows all slots as 'Unclaimed Idle'.
>>>>
>>>
>>> What do you mean when you say "removed liszt from the config" ?
>>
>> I set NUM_SLOTS to 0
>>
>>> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on
>>> liszt) ?
>>>
>>> So now when you do a "condor_status", the only thing you see is slots on
>>> chopin?
>>
>> Correct.
>>
>>> And yet your job remains idle and refuses to run on chopin?
>>
>> Correct.
>>
>>> Maybe your job does not match any slots on chopin
>>
>> What makes a match? Before the reboot the same jobs I am trying to run
>> today were running on chopin and when that was full, ran on liszt.
>>
>>> -- perhaps because chopin
>>> has less memory than liszt, for instance.
>>
>> The 2 machines are identical in every way - CPU, memory, etc. The
>> reason they were rebooted was that a 10G point to point connection was
>> installed between them.
>>
>>> See the manual section "Why is
>>> the job not running?" at http://tinyurl.com/ycbut82r
>>
>> That tiny url does not work, but I've been looking at this page:
>>
>> http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
>>
>> and here is the output of condor_q -analyze:verbose when run on the master:
>>
>> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
>> No successful match recorded.
>> Last failed match: Mon Feb 19 17:19:52 2018
>>
>> Reason for last match failure: no match found
>>
>> 21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
>>       0 are rejected by your job's requirements
>>       0 reject your job because of their own requirements
>>       0 match and are already running your jobs
>>       0 match but are serving other users
>>       0 are available to run your job
>>
>>>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> I have a master and 2 execute hosts (chopin and liszt) and I have one
>>>>> host (chopin) preferred over the other with these settings:
>>>>>
>>>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>>>     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>>>     (100 * Machine =?= "chopin")
>>>>> NEGOTIATOR_DEPTH_FIRST = True
>>>>>
>>>>> The preferred host is chopin. This has been working fine until Friday
>>>>> when both execute hosts were rebooted. Since then condor will only run
>>>>> jobs on liszt. Even if there are more jobs in the queue than slots on
>>>>> liszt, it will not use chopin. A condor_status shows all the slots on
>>>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
>>>>> errors in the logs.
>>>>>
>>>>> How can I debug and/or fix this?
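
As a rough illustration of how the three terms in the NEGOTIATOR_PRE_JOB_RANK setting quoted above are weighted against each other, here is a small Python sketch of the arithmetic. It is not ClassAd evaluation: it assumes the last term is meant to be read as 100 * (Machine =?= "chopin"), counts each true comparison as 1, and uses a hypothetical function name and made-up inputs. A ranking like this only orders slots that already match; it does not decide whether a slot matches in the first place.

```python
def pre_job_rank(machine_rank, remote_owner, machine, preferred="chopin"):
    """Mimic the intended weighting: slot Rank first, then unclaimed slots, then the preferred host."""
    return (
        10_000_000 * machine_rank                 # My.Rank dominates everything else
        + 1_000_000 * int(remote_owner is None)   # RemoteOwner =?= UNDEFINED (slot not claimed)
        + 100 * int(machine == preferred)         # assumed reading: 100 * (Machine =?= "chopin")
    )

# Two unclaimed slots with equal Rank: the preferred host wins by the 100-point tiebreak.
chopin_score = pre_job_rank(0, None, "chopin")
liszt_score = pre_job_rank(0, None, "liszt")
print(chopin_score, liszt_score)
```

Under these assumptions an unclaimed slot on the other host still outranks a claimed slot on the preferred host, because the 1,000,000 term outweighs the 100 term.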