
Re: [HTCondor-users] condor will not use preferred host at all



Christophe, thank you so much. This is a great catch, and I am
thinking it may be the issue. The .11 network was added on Friday,
which is when this issue started happening. The .10 network is 1G and
the .11 is 10G. I do not have the authority to disconnect the 10G link,
and I need to make this work with the systems I am given. My sysadmin
said I should only use the .10 network, so now I am trying to figure
out how to force condor to use only that network.

I have tried setting BIND_ALL_INTERFACES = False and I have tried
setting NETWORK_INTERFACE = 192.168.10.15 on chopin but neither
setting had any effect. The jobs will still not run on chopin and I
still see this in the logs:

Matched 25421.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
preempting none
<192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=66121_15b3_3>
slot1@chopin
Rejected 25421.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
no match found

What is interesting is that my other execute host, liszt, is also on
both networks, but condor never tries to talk to it over the .11
network, only the .10, and that host will run jobs.

Does anyone know how I can force it to only use the .10 network for both hosts?
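In case it helps anyone spot a mistake, here is roughly what I put in the local config on chopin — a sketch; the file location and the exact IP are from my setup and may differ elsewhere:

```
# chopin's local config (e.g. /etc/condor/condor_config.local --
# wherever LOCAL_CONFIG_FILE points)

# Advertise and use only the 1G (.10) interface:
NETWORK_INTERFACE = 192.168.10.15

# Do not listen on every interface:
BIND_ALL_INTERFACES = False
```

As far as I understand, the daemons need a restart (condor_restart) after changing either of these, and `condor_status -autoformat Name MyAddress` should show which address each startd is actually advertising — on chopin it still shows the .11 address for me.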

On Tue, Feb 20, 2018 at 3:46 AM, Christophe DIARRA <diarra@xxxxxxxxxxxxx> wrote:
> Hello Larry,
>
> Did you change the network settings when you added the 10Gb link? What
> happens if you disconnect the 10Gb link?
>
> You seem to be using two distinct networks: 192.168.10.x (scheduler) and
> 192.168.11.x (exec nodes). Is there any reason for not having all the nodes
> on the same network?
>
> I'm not an expert at all and hope this input can help. I have my own
> questions coming soon on the list.
>
> Cheers,
>
> Christophe.
>
>
>
> On 20/02/2018 at 00:33, Larry Martell wrote:
>>
>> Here is the output of condor_q -better-analyze.
>>
>>
>> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
>> The Requirements expression for job 21283.000 is
>>
>>      ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) && (
>> TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
>>      ( TARGET.HasFileTransfer )
>>
>> Job 21283.000 defines the following attributes:
>>
>>      DiskUsage = 0
>>      ImageSize = 0
>>      RequestDisk = DiskUsage
>>      RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(
>> ImageSize + 1023 ) / 1024)
>>
>> The Requirements expression for job 21283.000 reduces to these conditions:
>>
>>           Slots
>> Step    Matched  Condition
>> -----  --------  ---------
>> [0]         132  TARGET.Arch == "X86_64"
>> [1]         132  TARGET.OpSys == "LINUX"
>> [3]         132  TARGET.Disk >= RequestDisk
>> [5]         132  TARGET.Memory >= RequestMemory
>> [7]         132  TARGET.HasFileTransfer
>>
>> No successful match recorded.
>> Last failed match: Mon Feb 19 18:30:13 2018
>>
>> Reason for last match failure: no match found
>>
>> 21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
>>        0 are rejected by your job's requirements
>>        0 reject your job because of their own requirements
>>        0 match and are already running your jobs
>>        0 match but are serving other users
>>        0 are available to run your job
>>
>> On Mon, Feb 19, 2018 at 6:09 PM, Larry Martell <larry.martell@xxxxxxxxx>
>> wrote:
>>>
>>> If I grep in all the logs for one of the ids for a job in the queue I
>>> see this in the MatchLog:
>>>
>>> Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx
>>>
>>> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
>>> preempting none
>>>
>>> <192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3>
>>> slot1@chopin
>>>
>>> This message repeats 132 times (once for each slot on chopin) and then
>>> I see this:
>>>
>>> Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx
>>>
>>> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
>>> no match found
>>>
>>> That sequence of 133 messages repeats over and over.
>>>
>>> Is this a clue to anything?
>>>
>>>
>>> On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>> wrote:
>>>>
>>>> If I run ps -efal | grep condor on the 2 execute hosts the only
>>>> difference is that on chopin (the one I cannot get condor to use) it
>>>> has this:
>>>>
>>>> condor_shared_port
>>>>
>>>> That is not on liszt. Is that an issue?
>>>>
>>>> On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>>>>>>
>>>>>>> As a test I removed liszt from the config and it still will not use
>>>>>>> chopin. Even though a condor_status shows all slots as 'Unclaimed Idle'.
>>>>>>>
>>>>>> What do you mean when you say "removed liszt from the config" ?
>>>>>
>>>>> I set NUM_SLOTS to 0
>>>>>
>>>>>> Do you mean you removed it from your HTCondor pool (i.e. did
>>>>>> condor_off on
>>>>>> liszt) ?
>>>>>>
>>>>>> So now when you do a "condor_status", the only thing you see is slots
>>>>>> on
>>>>>> chopin?
>>>>>
>>>>> Correct.
>>>>>
>>>>>> And yet your job remains idle and refuses to run on chopin?
>>>>>
>>>>> Correct.
>>>>>
>>>>>> Maybe your job does not match any slots on chopin
>>>>>
>>>>> What makes a match? Before the reboot the same jobs I am trying to run
>>>>> today were running on chopin and when that was full, ran on liszt.
>>>>>
>>>>>> -- perhaps because chopin
>>>>>> has less memory than liszt, for instance.
>>>>>
>>>>> The 2 machines are identical in every way - CPU, memory, etc. The
>>>>> reason they were rebooted was that a 10G point-to-point connection was
>>>>> installed between them.
>>>>>
>>>>>> See the manual section "Why is
>>>>>> the job not running?" at http://tinyurl.com/ycbut82r
>>>>>
>>>>> That tiny url does not work, but I've been looking at this page:
>>>>>
>>>>>
>>>>> http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
>>>>>
>>>>> and here is the output of condor_q -analyze:verbose when run on the
>>>>> master:
>>>>>
>>>>> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
>>>>> No successful match recorded.
>>>>> Last failed match: Mon Feb 19 17:19:52 2018
>>>>>
>>>>> Reason for last match failure: no match found
>>>>>
>>>>> 21283.000:  Run analysis summary ignoring user priority.  Of 132
>>>>> machines,
>>>>>        0 are rejected by your job's requirements
>>>>>        0 reject your job because of their own requirements
>>>>>        0 match and are already running your jobs
>>>>>        0 match but are serving other users
>>>>>        0 are available to run your job
>>>>>
>>>>>>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell
>>>>>>> <larry.martell@xxxxxxxxx>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I have a master and 2 execute hosts (chopin and liszt) and I have
>>>>>>>> one
>>>>>>>> host (chopin) preferred over the other with these settings:
>>>>>>>>
>>>>>>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>>>>>>      (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>>>>>>      (100 * (Machine =?= "chopin"))
>>>>>>>> NEGOTIATOR_DEPTH_FIRST = True
>>>>>>>>
>>>>>>>> The preferred host is chopin. This has been working fine until
>>>>>>>> Friday
>>>>>>>> when both execute hosts were rebooted. Since then condor will only
>>>>>>>> run
>>>>>>>> jobs on liszt. Even if there are more jobs in the queue than slots
>>>>>>>> on
>>>>>>>> liszt it will not use chopin. A condor_status shows all the slots on
>>>>>>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and
>>>>>>>> no
>>>>>>>> errors in the logs.
>>>>>>>>
>>>>>>>> How can I debug and/or fix this?