The full breakdown of the -reverse analysis will probably tell us something useful. We can see from the summary
that the slot does not match the job. The breakdown of each clause in the slot requirements should show why. We are looking for clauses that have a match count of 0.
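If the output is too long to scan by eye, a small script can pull out the zero-match clauses. The following is only a minimal sketch: it assumes the analysis table prints one row per clause in the form "[step]  match count  condition", which may not be the exact layout on your HTCondor version. The function name zero_match_clauses and the sample text are made up for illustration and are not output from this cluster.

```python
import re

# Assumed row layout: "[step]  match_count  condition". Adjust the pattern
# if the analysis table on your HTCondor version is laid out differently.
ROW = re.compile(r"^\s*\[(\d+)\]\s+(\d+)\s+(.*\S)\s*$")


def zero_match_clauses(analysis_text):
    """Return (step, condition) pairs for rows whose match count is 0."""
    hits = []
    for line in analysis_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(2)) == 0:
            hits.append((int(m.group(1)), m.group(3)))
    return hits


# Hypothetical sample text, NOT real output from the cluster in this thread.
SAMPLE = """\
Step    Matched  Condition
-----  --------  ---------
[0]           1  ExampleAttrA == true
[1]           0  ExampleAttrB >= 4
"""

if __name__ == "__main__":
    for step, cond in zero_match_clauses(SAMPLE):
        print(f"step {step}: {cond}")
```

To use it on real output, save the text printed by the condor_q -better-analyze ... -reverse command to a file and pass the file contents to zero_match_clauses in place of SAMPLE.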
-tj
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Greetings,
Yesterday there was a general building Power Outage and the HTCondor Cluster system was eventually rebooted.
Now the same jobs (same .sub files) that worked yesterday no longer run and stay IDLE.
I used the command
condor_q -better-analyze 363334.0 # where 363334.0 is the job number
to try to understand, but I can't figure out where the problem really is.
I can see the following:
I don't understand the machine's "own requirements". I also tried the extended command:
condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
that provides a summary:
I find these 2 statements conflicting in their meaning...
The output for both commands is very long and rather cryptic.
These jobs use "Universe = Docker", and I tested simpler
.sub files that ran OK, so the Docker universe is available.
The 2 .sub files I submitted this morning to test are the same as yesterday's.
What could have changed as a result of the reboot? Is there any way to find this information?
Thanks
Jean-Yves