Re: [HTCondor-users] condor will not use preferred host at all



Here is the output of condor_q -better-analyze.


-- Schedd: bach.elucid.local : <192.168.10.2:20734>
The Requirements expression for job 21283.000 is

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )

Job 21283.000 defines the following attributes:

    DiskUsage = 0
    ImageSize = 0
    RequestDisk = DiskUsage
    RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,
        ( ImageSize + 1023 ) / 1024)

The Requirements expression for job 21283.000 reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]         132  TARGET.Arch == "X86_64"
[1]         132  TARGET.OpSys == "LINUX"
[3]         132  TARGET.Disk >= RequestDisk
[5]         132  TARGET.Memory >= RequestMemory
[7]         132  TARGET.HasFileTransfer

No successful match recorded.
Last failed match: Mon Feb 19 18:30:13 2018

Reason for last match failure: no match found

21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
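
The step table above can also be checked mechanically. Here is a minimal Python sketch, assuming the row layout shown in this output; `parse_steps` and `limiting_conditions` are hypothetical helper names of my own, not HTCondor tools. It reports any Requirements clause that matches fewer slots than the pool has.

```python
import re

# Rows copied from the condor_q -better-analyze output above.
ANALYSIS = """\
[0]         132  TARGET.Arch == "X86_64"
[1]         132  TARGET.OpSys == "LINUX"
[3]         132  TARGET.Disk >= RequestDisk
[5]         132  TARGET.Memory >= RequestMemory
[7]         132  TARGET.HasFileTransfer
"""

# Assumed row layout: "[step]  slots_matched  condition"
STEP_RE = re.compile(r"^\[(\d+)\]\s+(\d+)\s+(.+)$")


def parse_steps(text):
    """Return (step, slots_matched, condition) tuples for each table row."""
    rows = []
    for line in text.splitlines():
        m = STEP_RE.match(line.strip())
        if m:
            rows.append((int(m.group(1)), int(m.group(2)), m.group(3)))
    return rows


def limiting_conditions(rows, total_slots):
    """Return the conditions that match fewer slots than the pool has."""
    return [cond for _, matched, cond in rows if matched < total_slots]


rows = parse_steps(ANALYSIS)
narrowing = limiting_conditions(rows, total_slots=132)
print(narrowing or "no Requirements clause excludes any slot")
```

For this job the list comes back empty, which is consistent with the "0 are rejected by your job's requirements" line in the summary: none of the five clauses narrows the 132 slots.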

On Mon, Feb 19, 2018 at 6:09 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
> If I grep in all the logs for one of the ids for a job in the queue I
> see this in the MatchLog:
>
> Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx
> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
> preempting none
> <192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3>
> slot1@chopin
>
> This message repeats 132 times (once for each slot on chopin) and then
> I see this:
>
> Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx
> <192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
> no match found
>
> That sequence of 133 messages repeats over and over.
>
> Is this a clue to anything?
>
>
> On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>> If I run ps -efal | grep condor on the 2 execute hosts the only
>> difference is that on chopin (the one I cannot get condor to use) it
>> has this:
>>
>> condor_shared_port
>>
>> That is not on liszt. Is that an issue?
>>
>> On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
>>> On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
>>>> On 2/19/2018 4:10 PM, Larry Martell wrote:
>>>>>
>>>>> As a test I removed liszt from the config and it still will not use
>>>>> chopin, even though condor_status shows all slots as 'Unclaimed Idle'.
>>>>>
>>>>
>>>> What do you mean when you say "removed liszt from the config" ?
>>>
>>> I set NUM_SLOTS to 0
>>>
>>>> Do you mean you removed it from your HTCondor pool (i.e. did condor_off on
>>>> liszt) ?
>>>>
>>>> So now when you do a "condor_status", the only thing you see is slots on
>>>> chopin?
>>>
>>> Correct.
>>>
>>>> And yet your job remains idle and refuses to run on chopin?
>>>
>>> Correct.
>>>
>>>> Maybe your job does not match any slots on chopin
>>>
>>> What makes a match? Before the reboot the same jobs I am trying to run
>>> today were running on chopin and when that was full, ran on liszt.
>>>
>>>> -- perhaps because chopin
>>>> has less memory than liszt, for instance.
>>>
>>> The 2 machines are identical in every way - CPU, memory, etc. The
>>> reason they were rebooted was that a 10G point to point connection was
>>> installed between them.
>>>
>>>> See the manual section "Why is
>>>> the job not running?" at http://tinyurl.com/ycbut82r
>>>
>>> That tiny url does not work, but I've been looking at this page:
>>>
>>> http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs
>>>
>>> and here is the output of condor_q -analyze:verbose when run on the master:
>>>
>>> -- Schedd: bach.elucid.local : <192.168.10.2:20734>
>>> No successful match recorded.
>>> Last failed match: Mon Feb 19 17:19:52 2018
>>>
>>> Reason for last match failure: no match found
>>>
>>> 21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
>>>       0 are rejected by your job's requirements
>>>       0 reject your job because of their own requirements
>>>       0 match and are already running your jobs
>>>       0 match but are serving other users
>>>       0 are available to run your job
>>>
>>>>> On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> I have a master and 2 execute hosts (chopin and liszt) and I have one
>>>>>> host (chopin) preferred over the other with these settings:
>>>>>>
>>>>>> NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
>>>>>>     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
>>>>>>     (100 * Machine =?= "chopin")
>>>>>> NEGOTIATOR_DEPTH_FIRST = True
>>>>>>
>>>>>> The preferred host is chopin. This has been working fine until Friday
>>>>>> when both execute hosts were rebooted. Since then condor will only run
>>>>>> jobs on liszt. Even if there are more jobs in the queue than slots on
>>>>>> liszt, it will not use chopin. A condor_status shows all the slots on
>>>>>> chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
>>>>>> errors in the logs.
>>>>>>
>>>>>> How can I debug and/or fix this?