At the moment, any job with a memory requirement is placed on hold.
I tried with as little as:
request_memory = 100M
The simple job is placed IDLE as well.
If the memory requirement is removed, then the job runs.
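For reference, the failing case can be reproduced with a minimal submit file along these lines (the executable, arguments, and file name here are illustrative, not the actual job):

```shell
# Hypothetical minimal reproduction; /bin/sleep and memtest.sub are illustrative.
cat > memtest.sub <<'EOF'
executable     = /bin/sleep
arguments      = 60
request_memory = 100M
queue
EOF
condor_submit memtest.sub
condor_q        # the job only stays idle when the request_memory line is present
```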
Other users' jobs are affected in the same way, i.e. it's not only me: "some setting somewhere" has changed the behavior that was fine last week.
I have alerted our Sys-Admin, who can perhaps take over finding the fault location.
Thank you to all who have replied.
Jean-Yves.
P.S. also confirming that adding 'unset TMPDIR' to the run.sh file fixed the Xmgrace problems that I had yesterday! Thanks Greg!
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 5:58 PM
To: John M Knoeller <johnkn@xxxxxxxxxxx>; htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
Thank you John.
Your message crossed another update that I posted:
The error comes from a line that was in all failed submissions:
request_memory = 2048
I was not expecting that line to be the problem because a command:
condor_q -better-analyze -reverse 363329.0
provided the following, in which I have bolded the information that I thought was useful:
What seems odd, from my understanding, is that the machine can offer:
Memory = 4096
while the job only requested:
TARGET.RequestMemory = 2048
But in the end the requirement ends up being:
Target.RequestMemory < 2048
with an "absolute less than" rather than and "equal" value...
Since it's not "equal" then it should fail..
At least that what I would uderstand as a reason, but I don't understand why this would be set that way in the final step.
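The point about the strict comparison can be illustrated in plain shell (this is ordinary shell arithmetic, not ClassAd syntax; the variable names just mirror the attributes in the analyzer output):

```shell
# Hedged illustration: a strict "<" comparison fails when both sides are equal,
# which matches what the -better-analyze output seems to show.
RequestMemory=2048   # what the job asks for
Threshold=2048       # the value in the slot's requirement clause
if [ "$RequestMemory" -lt "$Threshold" ]; then
    echo "MATCH"
else
    echo "NO_MATCH"   # 2048 < 2048 is false, so there is no match
fi
```

With equal values this prints NO_MATCH, which is consistent with the 0.00 match percentages below.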
The other command that you suggested indeed gave only zeros:
$ condor_q -analyze -reverse 363329.0
-- Schedd: biocwk-01093l.ad.wisc.edu : <128.104.119.165:9618?...
363329.0: Analyzing matches for 1 job
Slot Slot's Req Job's Req Both
Name Type Matches Job Matches Slot Match %
------------------------ ---- ------------ ------------ ----------
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
[... goes from 0001 to 009 ...]
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx Part 0 1 0.00
But none of this was a problem before the Power Outage and the rebooting...
Something has changed, or is missing from the reboot process perhaps.
JYS
From: John M Knoeller <johnkn@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 5:17 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
The full breakdown of the -reverse analysis will probably tell us something useful. We can see from the summary that the slot does not match the job. The breakdown of each clause in the slot requirements should show why. We are looking for clauses that have a match count of 0.
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Greetings,
Yesterday there was a general building Power Outage and the HTCondor Cluster system was eventually rebooted.
Now the same jobs (same .sub) files that worked yesterday no longer work and stay IDLE.
I used the command:
condor_q -better-analyze 363334.0   # where 363334.0 is the job number
to try to understand, but I can't figure out where the problem really is.
I can see the following:
I don't understand the machine's "own requirements". I also tried the extended command:
condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
that provides a summary:
I find these 2 statements conflicting in their meaning...
The output for both commands is very long and rather cryptic.
These are on "Universe = Docker" and I tested simpler
.sub files that ran OK. Hence the Docker Universe is available.
The 2 .sub file I sent this morning to test are the same as yesterday.
What can have been changed from rebooting? Is there any way to find this information?
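One way to look for post-reboot differences, assuming access to the relevant machines; these are standard HTCondor tools, though the grep pattern is only a guess at what to search for:

```shell
# Show what memory each slot currently advertises
condor_status -af:h Name Memory

# Dump the configuration actually in effect and look for memory-related knobs
condor_config_val -dump | grep -i memory

# List which config files were read (a file missing after reboot would show here)
condor_config_val -config
```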
Thanks
Jean-Yves