
Re: [HTCondor-users] condor will not use preferred host at all



Hello Larry,

Did you change the network settings when you added the 10Gb link? What happens if you disconnect the 10Gb link?

You seem to be using two distinct networks: 192.168.10.x (scheduler) and 192.168.11.x (exec nodes). Is there any reason for not having all the nodes on the same network?
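
To check which address each daemon is actually advertising, something like this may help (just a sketch; NETWORK_INTERFACE may simply be unset on your nodes):

    # on each node: which interface/IP is HTCondor told to bind to?
    condor_config_val NETWORK_INTERFACE

    # from anywhere in the pool: the address each slot advertises
    condor_status -af Name MyAddress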

I'm not an expert at all, but I hope this input helps. I have my own questions coming soon on the list.

Cheers,

Christophe.


On 20/02/2018 00:33, Larry Martell wrote:
Here is the output of condor_q -better-analyze.


-- Schedd: bach.elucid.local : <192.168.10.2:20734>
The Requirements expression for job 21283.000 is

     ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
     ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
     ( TARGET.HasFileTransfer )

Job 21283.000 defines the following attributes:

     DiskUsage = 0
     ImageSize = 0
     RequestDisk = DiskUsage
     RequestMemory = ifthenelse(MemoryUsage =!= undefined, MemoryUsage, (ImageSize + 1023) / 1024)

The Requirements expression for job 21283.000 reduces to these conditions:

          Slots
Step    Matched  Condition
-----  --------  ---------
[0]         132  TARGET.Arch == "X86_64"
[1]         132  TARGET.OpSys == "LINUX"
[3]         132  TARGET.Disk >= RequestDisk
[5]         132  TARGET.Memory >= RequestMemory
[7]         132  TARGET.HasFileTransfer

No successful match recorded.
Last failed match: Mon Feb 19 18:30:13 2018

Reason for last match failure: no match found

21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
       0 are rejected by your job's requirements
       0 reject your job because of their own requirements
       0 match and are already running your jobs
       0 match but are serving other users
       0 are available to run your job
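
If more detail would help, the full ClassAd for one of the chopin slots can be dumped like this (slot name taken from the MatchLog below; the real name may include the full domain):

    condor_status -long slot1@chopin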

On Mon, Feb 19, 2018 at 6:09 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
If I grep all the logs for the id of one of the jobs in the queue, I
see this in the MatchLog:

Matched 21283.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>
preempting none
<192.168.11.1:9618?addrs=192.168.11.1-9618+[--1]-9618&noUDP&sock=14430_5bb5_3>
slot1@chopin

This message repeats 132 times (once for each slot on chopin) and then
I see this:

Rejected 21283.0 prod_user@xxxxxxxxxxxxxxxxx
<192.168.10.2:9618?addrs=192.168.10.2-9618+[--1]-9618&noUDP&sock=522229_3c3e_4>:
no match found

That sequence of 133 messages repeats over and over.

Is this a clue to anything?
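
If it helps, I could pull the corresponding entries on the schedd and startd sides the same way, along these lines (log paths are an assumption; they depend on where LOG points in the config):

    # on the submit host (bach)
    grep 21283 /var/log/condor/SchedLog
    # on the execute host (chopin)
    grep 21283 /var/log/condor/StartLog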


On Mon, Feb 19, 2018 at 5:55 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
If I run ps -efal | grep condor on the 2 execute hosts, the only
difference is that chopin (the one I cannot get condor to use) has
this:

condor_shared_port

That is not on liszt. Is that an issue?
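
If it matters, I assume the shared-port setting on the two startds can be compared with something like this (guessing at the exact condor_config_val form):

    condor_config_val -name chopin -startd USE_SHARED_PORT
    condor_config_val -name liszt -startd USE_SHARED_PORT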

On Mon, Feb 19, 2018 at 5:31 PM, Larry Martell <larry.martell@xxxxxxxxx> wrote:
On Mon, Feb 19, 2018 at 5:22 PM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 2/19/2018 4:10 PM, Larry Martell wrote:
As a test I removed liszt from the config and it still will not use
chopin, even though a condor_status shows all slots as 'Unclaimed Idle'.

What do you mean when you say "removed liszt from the config" ?
I set NUM_SLOTS to 0
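That is, in liszt's local config, something like the following, followed by a restart of the startd there (I believe the slot layout is only re-read on restart):

    # liszt's local condor config
    NUM_SLOTS = 0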

Do you mean you removed it from your HTCondor pool (i.e., did condor_off on
liszt)?

So now when you do a "condor_status", the only thing you see is slots on
chopin?
Correct.

And yet your job remains idle and refuses to run on chopin?
Correct.

Maybe your job does not match any slots on chopin
What makes a match? Before the reboot, the same jobs I am trying to run
today were running on chopin and, when that was full, ran on liszt.

-- perhaps because chopin
has less memory than liszt, for instance.
The 2 machines are identical in every way - CPU, memory, etc. The
reason they were rebooted was that a 10G point to point connection was
installed between them.
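
If it is useful for comparison, the resources the slots advertise can be listed side by side with something like:

    condor_status -af Machine Memory Disk Arch OpSys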

See the manual section "Why is
the job not running?" at http://tinyurl.com/ycbut82r
That tinyurl does not work, but I've been looking at this page:

http://research.cs.wisc.edu/htcondor/CondorWeek2004/presentations/adesmet_admin_tutorial/#DebuggingJobs

and here is the output of condor_q -analyze:verbose when run on the master:

-- Schedd: bach.elucid.local : <192.168.10.2:20734>
No successful match recorded.
Last failed match: Mon Feb 19 17:19:52 2018

Reason for last match failure: no match found

21283.000:  Run analysis summary ignoring user priority.  Of 132 machines,
       0 are rejected by your job's requirements
       0 reject your job because of their own requirements
       0 match and are already running your jobs
       0 match but are serving other users
       0 are available to run your job

On Mon, Feb 19, 2018 at 4:46 PM, Larry Martell <larry.martell@xxxxxxxxx>
wrote:
I have a master and 2 execute hosts (chopin and liszt) and I have one
host (chopin) preferred over the other with these settings:

NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
     (100 * Machine =?= "chopin")
NEGOTIATOR_DEPTH_FIRST = True
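
A side note on that last term: in ClassAd syntax * binds tighter than =?=, and Machine normally holds the fully qualified hostname, so the chopin bonus is probably meant to be written along these lines (the full hostname is my assumption):

NEGOTIATOR_PRE_JOB_RANK = (10000000 * My.Rank) + \
     (1000000 * (RemoteOwner =?= UNDEFINED)) + \
     (100 * (Machine =?= "chopin.elucid.local"))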

The preferred host is chopin. This had been working fine until Friday,
when both execute hosts were rebooted. Since then condor will only run
jobs on liszt. Even if there are more jobs in the queue than slots on
liszt, it will not use chopin. A condor_status shows all the slots on
chopin as 'Unclaimed Idle'. I see all the proper daemons running and no
errors in the logs.

How can I debug and/or fix this?
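
Would turning up the negotiator's logging show more? I was thinking of something along these lines (guessing at the knob):

    # on the central manager
    NEGOTIATOR_DEBUG = D_FULLDEBUG
    # then: condor_reconfig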


--
Christophe DIARRA
Institut de Physique Nucleaire
15 Rue Georges Clemenceau
S2I/D2I - Bat 100A - Piece A108
F91406 ORSAY Cedex
Tel:    +33 (0)1 69 15 65 60 / +33 (0)6 31 26 23 69
Fax:    +33 (0)1 69 15 64 70 / E-mail: diarra@xxxxxxxxxxxxx