
[HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE



Greetings,

Yesterday there was a building-wide power outage, and the HTCondor cluster was eventually rebooted.

Now the same jobs (the same .sub files) that worked yesterday no longer run and stay IDLE.

I used the command condor_q -better-analyze 363334.0 # where 363334.0 is the job ID
to try to understand, but I can't figure out where the problem really is.

I can see the following:

Reason for last match failure: no match found

363334.000:  Run analysis summary ignoring user priority.  Of 141 machines,
      0 are rejected by your job's requirements
    141 reject your job because of their own requirements

I don't understand what the machines' "own requirements" are. I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

that provides a summary:
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

I find the last two statements conflicting in their meaning...

The output for both commands is very long and rather cryptic.

These jobs use "Universe = Docker". I also tested simpler .sub files that ran OK, so the Docker universe is available.
The 2 .sub files I submitted this morning to test are the same as yesterday's.

What could have changed as a result of the reboot? Is there any way to find this out?
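
In case it helps, here is a rough sketch of how I would put the two sides next to each other. It assumes the relevant expressions are the Start and Requirements attributes of the slot's ClassAd and the Requirements attribute of the job's ClassAd; that is my guess, not something the analysis output states. The function name show_requirements is made up for this example, and the slot name is the same masked placeholder as above.

```shell
#!/bin/sh
# Sketch only: print the slot-side and job-side expressions for comparison.
# show_requirements is a hypothetical helper name, not an HTCondor command.
show_requirements() {
    slot="$1"   # e.g. slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx (masked placeholder)
    job="$2"    # e.g. 363334.0

    # Stop early if the HTCondor command-line tools are not installed.
    if ! command -v condor_status >/dev/null 2>&1; then
        echo "HTCondor tools not found on PATH" >&2
        return 1
    fi

    echo "== Slot side: $slot =="
    # Full machine ClassAd, filtered to the Start and Requirements lines
    condor_status -long "$slot" | grep -i -E '^(Start|Requirements) ='

    echo "== Job side: $job =="
    # Full job ClassAd, filtered to the Requirements line
    condor_q -long "$job" | grep -i '^Requirements ='
}
```

It would be called as: show_requirements "slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx" 363334.0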

Thanks
Jean-Yves