
Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE



The full breakdown of the -reverse analysis will probably tell us something useful. We can see from this part of the summary:

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.

that the job does not match the slot's requirements. The breakdown of each clause in the slot requirements should show why. We are looking for clauses that have a match count of 0.
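
It may also help to read the slot's own policy straight from its ClassAd, so the clauses in the breakdown can be compared with what the machine currently advertises after the reboot. A minimal sketch, assuming condor_status is on the PATH of the node you run it from; slot_policy is only a helper name made up for this example, and the slot name in the usage line is a placeholder:

```shell
# Print the Requirements and Start expressions that a slot advertises.
# $1 is the slot name, e.g. the same name passed to -machine above.
slot_policy() {
    # -long dumps the full machine ClassAd; keep only the two policy attributes.
    condor_status -long "$1" | grep -i -E '^(Requirements|Start) *='
}

# Usage (placeholder slot name, substitute the real one):
# slot_policy slot1@execute-node.example.org
```

If the Start expression looks different from what you expect, that would point at configuration that changed or was reloaded when the machines came back up.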

-tj


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
Greetings,

Yesterday there was a general building Power Outage and the HTCondor Cluster system was eventually rebooted.

Now the same jobs (same .sub files) that worked yesterday no longer work and stay IDLE.

I used the command condor_q -better-analyze 363334.0 # where 363334.0 is the Job number
to try to understand, but I can't figure out where the problem really is.

I can see the following:

Reason for last match failure: no match found

363334.000:  Run analysis summary ignoring user priority.  Of 141 machines,
      0 are rejected by your job's requirements
    141 reject your job because of their own requirements

I don't understand the machine's "own requirements". I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

that provides a summary:
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

I find these 2 statements conflicting in their meaning...

The output for both commands is very long and rather cryptic.

These are on "Universe = Docker", and I tested simpler .sub files that ran OK. Hence the Docker Universe is available.
The 2 .sub files I sent this morning to test are the same as yesterday.

What could have changed as a result of the reboot? Is there any way to find this information?

Thanks
Jean-Yves