Thank you John.
Your message crossed another update that I posted:
The error comes from a line that was in all failed submissions:
request_memory = 2048
I was not expecting that line to be the problem, because the command:
condor_q -better-analyze -reverse 363329.0
provided the following, in which I have bolded the information that I thought was useful:
What seems odd from my understanding is that it seems that the machine can offer:
Memory = 4096
while the job only requested:
TARGET.RequestMemory = 2048
But in the end the requirement ends up being:
Target.RequestMemory < 2048
with a strict "less than" rather than a comparison that allows an "equal" value...
Since equal values are excluded, it should fail.
At least that is what I would understand as the reason, but I don't understand why it would be set that way in the final step.
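The difference between a strict and a non-strict comparison can be illustrated outside HTCondor. This is a minimal Python sketch (plain integer comparisons standing in for ClassAd evaluation, with made-up values), showing why a job requesting exactly the compared amount passes under "<=" but fails under "<":

```python
# Illustration only: plain Python comparisons standing in for
# ClassAd requirement evaluation (values are hypothetical).

request_memory = 2048   # what the job asks for (MB)
slot_memory = 4096      # what the machine advertises (MB)

# The intuitive clause: job fits if it asks for no more than
# the slot offers.
print(request_memory <= slot_memory)   # True: 2048 <= 4096

# A strict "<" against the requested amount itself rejects
# the boundary case.
print(request_memory < 2048)    # False: equal values fail strict less-than
print(request_memory <= 2048)   # True: equal values pass non-strict
```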
The other command that you suggested indeed gave only zeros:
$ condor_q -analyze -reverse 363329.0
-- Schedd: biocwk-01093l.ad.wisc.edu : <128.104.119.165:9618?...
363329.0: Analyzing matches for 1 job
Slot Slot's Req Job's Req Both
Name Type Matches Job Matches Slot Match %
------------------------ ---- ------------ ------------ ----------
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
[... goes from 0001 to 009 ...]
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
But none of this was a problem before the Power Outage and the rebooting...
Something has changed, or perhaps something is missing from the reboot process.
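One way to compare the pre- and post-reboot state is to dump what each slot now advertises and check it against the job's request. The real data would come from running, on the pool, something like condor_status -af Name Memory (and condor_config_val MEMORY on an execute node). The sketch below only shows the comparison itself, on canned sample output with made-up hostnames and values:

```python
# Sketch: compare slot memory (as "condor_status -af Name Memory"
# would print it) against the job's RequestMemory.
# sample_output is hypothetical; on a real pool it would come
# from running the condor_status command above.

sample_output = """\
slot1@exec-01.example.edu 4096
slot1@exec-02.example.edu 1024
"""

request_memory = 2048  # from the .sub file

for line in sample_output.splitlines():
    name, memory = line.rsplit(None, 1)
    fits = request_memory <= int(memory)
    print(f"{name}: Memory={memory} fits={fits}")
```

If every slot now advertises less memory than before the outage, that would explain jobs going IDLE with an unchanged .sub file.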
JYS
From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 5:17 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
The full breakdown of the -reverse analysis will probably tell us something useful. We can see from the summary that the slot does not match the job. The breakdown of each clause in the slot requirements should show why. We are looking for clauses that have a match count of 0.
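Scanning the breakdown for zero-match clauses can be mechanized. The exact layout of the -better-analyze output varies by HTCondor version, so this is just a sketch assuming lines of the form "[n]  count  condition", with a made-up sample:

```python
# Sketch: flag clauses with a match count of 0 in a
# -better-analyze -reverse style breakdown.
# sample_breakdown is hypothetical; real output layout varies.

sample_breakdown = """\
[0]  1234  TARGET.Arch == "X86_64"
[1]  1234  TARGET.OpSys == "LINUX"
[2]     0  TARGET.RequestMemory <= Memory
"""

for line in sample_breakdown.splitlines():
    step, count, condition = line.split(None, 2)
    if int(count) == 0:
        print(f"no matches at {step}: {condition}")
```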
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Greetings,
Yesterday there was a general building Power Outage and the HTCondor Cluster system was eventually rebooted.
Now the same jobs (same .sub files) that worked yesterday no longer work and stay IDLE.
I used the command condor_q -better-analyze 363334.0 # where 363334.0 is the Job number
to try to understand, but I can't figure out where the problem really is.
I can see the following:
I don't understand the machine's "own requirements". I also tried the extended command:
condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
that provides a summary:
I find these 2 statements conflicting in their meaning...
The output for both commands is very long and rather cryptic.
These jobs use "Universe = Docker", and I tested simpler .sub files that ran OK. Hence the Docker universe is available.
The 2 .sub files I submitted this morning to test are the same as yesterday's.
What can have been changed from rebooting? Is there any way to find this information?
Thanks,
Jean-Yves