
Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE



Thank you John.

Our IT staff were able to fix the problem. Below is an edited excerpt of the resolution steps:
(The main problem was that the hostname was wrong.)

  • The response from the group was helpful.
  • The hostname on the submit machine needed to be changed.
  • There was also an old config setting on each compute node that was designed to try to kick non-local jobs off if there were waiting local jobs (i.e. it applied to jobs that flocked in).
  • Since the hostname was wrong, our jobs looked non-local and were kicked out.
  • The config files for each compute node were updated to use the new, correct hostname (a short verification sketch follows below).
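For the record, these are the kind of standard HTCondor commands that could be used to verify a fix like this; this is only a sketch, not the exact commands our IT ran, and the grep pattern is just illustrative:

    condor_config_val -dump | grep -i hostname     # list config entries that still mention a hostname
    condor_status -af:h Name Start                 # the name and Start expression each execute node now advertises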
I'm posting this here for the record in case it is useful in the future. Thank you for your help.

JYS



From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Wednesday, November 3, 2021 9:51 AM
To: JEAN-YVES SGRO <jsgro@xxxxxxxx>; htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
Despite the fact that the slot has 4096 MB of memory allocated to it, it seems that it is refusing jobs that request 2048 MB or more of memory unless they come from a specific schedd.

This bit is saying that jobs from the biochem schedd can request all of the slot memory, but jobs from other schedds must request less than 2048 MB.
This is something that the administrator of that execute node has added to the START expression explicitly.  It might be a mistake, so you should contact them.

[0]           0  Target.OriginSchedd is "submit.biochem.wisc.edu"
[2]           0  Target.RequestMemory < 2048
[4]           0  [0] || [2]
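Put back together, the slot's START expression presumably contains a clause along these lines (a reconstruction from the breakdown above, not copied from the node's actual configuration):

    # appended to whatever START was already defined as on the node
    START = $(START) && ( TARGET.OriginSchedd is "submit.biochem.wisc.edu" || TARGET.RequestMemory < 2048 )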

A power outage and reboot could explain this: a configuration change that had been made on the execute nodes may not have been applied yet, and the reboot would have caused the execute nodes to restart and re-read the configuration, putting the change into effect.
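To check that on a node, a couple of standard commands may help (run on the execute node itself; just a sketch of the usual tools):

    condor_config_val -verbose START    # prints the value of START and the config file/line that defines it
    condor_reconfig                     # tells the running daemons to re-read the configuration without a reboot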

-tj


From: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Sent: Tuesday, November 2, 2021 5:58 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
Thank you John.

Your message crossed another update that I posted:
The error comes from a line that was in all failed submissions:
request_memory          = 2048
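(If I read the submit manual correctly, a bare number here is taken as megabytes, so the same request could also be written with an explicit unit:)

    # the line used in the failing submissions (2048 MB):
    request_memory = 2048
    # equivalent form with an explicit unit:
    # request_memory = 2 GB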

I was not expecting that line to be the problem, because the command:
condor_q -better-analyze -reverse 363329.0

provided the following excerpt, containing the information that I thought was useful:

This slot defines the following attributes:

    CheckpointPlatform = "LINUX X86_64 3.10.0-1160.42.2.el7.x86_64 normal N/A avx ssse3 sse4_1 sse4_2"
    Cpus = 1
    Disk = 4364065
    GPUs = 0
    Memory = 4096

Job 363329.0 has the following attributes:

    TARGET.JobUniverse = 5
    TARGET.NumCkpts = 0
    TARGET.OriginSchedd = "biocwk-01093l.ad.wisc.edu"
    TARGET.RequestCpus = 1
    TARGET.RequestDisk = 5
    TARGET.RequestMemory = 2048

The Requirements expression for this slot reduces to these conditions:

       Clusters
Step    Matched  Condition
-----  --------  ---------
[0]           0  Target.OriginSchedd is "submit.biochem.wisc.edu"
[2]           0  Target.RequestMemory < 2048
[4]           0  [0] || [2]

What seems odd, from my understanding, is that the machine can offer:
Memory = 4096
while the job only requested:
TARGET.RequestMemory = 2048
But in the end the requirement ends up being:
Target.RequestMemory < 2048
with an "absolute less than" rather than and "equal" value...
Since it's not "equal" then it should fail..
At least that what I would uderstand as a reason, but I don't understand why this would be set that way in the final step.
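Spelling the evaluation out with the values from the output above (my own step-by-step reading):

    Target.OriginSchedd is "submit.biochem.wisc.edu"
        "biocwk-01093l.ad.wisc.edu" is "submit.biochem.wisc.edu"   ->  False
    Target.RequestMemory < 2048
        2048 < 2048                                                ->  False
    [0] || [2]
        False || False                                             ->  False   (so the slot rejects the job)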

The other command that you suggested indeed gave only zeros:

$ condor_q -analyze -reverse 363329.0

-- Schedd: biocwk-01093l.ad.wisc.edu : <128.104.119.165:9618?...
363329.0: Analyzing matches for 1 job
                                     Slot  Slot's Req    Job's Req     Both  
Name                                 Type  Matches Job Matches Slot    Match %
------------------------             ---- ------------ ------------ ----------
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx  Part            0            1       0.00
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx  Part            0            1       0.00
[... goes from 0001 to 009 ...]
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx  Part            0            1       0.00

But none of this was a problem before the Power Outage and the rebooting...
Something has changed, or perhaps something is missing from the reboot process.

JYS



From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 5:17 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
The full breakdown of the -reverse analysis will probably tell us something useful.   We can see from the summary

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.

that the slot does not match the job. The breakdown of each clause in the slot requirements should show why; we are looking for clauses that have a match count of 0.

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
Greetings,

Yesterday there was a general building Power Outage and the HTCondor Cluster system was eventually rebooted.

Now the same jobs (same .sub files) that worked yesterday no longer work and stay IDLE.

I used the command condor_q -better-analyze 363334.0 # where 363334.0 is the Job number
to try to understand, but I can't really figure out where the problem is.

I can see the following:

Reason for last match failure: no match found

363334.000:  Run analysis summary ignoring user priority.  Of 141 machines,
      0 are rejected by your job's requirements
    141 reject your job because of their own requirements

I don't understand the machines' "own requirements". I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

that provides a summary:
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

I find these 2 statements conflicting in their meaning...

The output for both commands is very long and rather cryptic.

These jobs are on "Universe = Docker", and I tested simpler .sub files that ran OK, so the Docker universe is available.
The two .sub files I sent this morning to test are the same as yesterday's.

What could have changed from the reboot? Is there any way to find this information?

Thanks
Jean-Yves