[HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Date: Tue, 2 Nov 2021 18:51:13 +0000

From: JEAN-YVES SGRO <jsgro@xxxxxxxx>

Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Greetings,

Yesterday there was a building-wide power outage, and the HTCondor cluster was eventually rebooted.

Now the same jobs (same .sub files) that worked yesterday no longer run and stay IDLE.

To try to understand, I used the following command, where 363334.0 is the job ID:

condor_q -better-analyze 363334.0

but I can't figure out where the problem really is.

I can see the following:

Reason for last match failure: no match found

363334.000: Run analysis summary ignoring user priority. Of 141 machines,

0 are rejected by your job's requirements

141 reject your job because of their own requirements

I don't understand the machines' "own requirements". I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

that provides a summary:

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.

0 (0.00 %) match both slot and job requirements.

0 match the requirements of this slot.

1 have job requirements that match this slot.

I find the last two statements contradictory...
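One way to read the three counts, assuming HTCondor's usual two-sided matchmaking (the job's requirements are checked against the slot, and the slot's own requirements are checked against the job), is sketched below in Python. This is only a toy model: the function `analyze` and the attribute names `HasDocker` and `AllowedToRun` are hypothetical and are not part of HTCondor or of the actual slot policy.

```python
# Toy model of two-sided matching. All names here are hypothetical
# illustrations, not HTCondor APIs or real ClassAd attributes.

def analyze(jobs, slot, job_requirements, slot_requirements):
    """Count jobs the way the -reverse summary appears to report them."""
    # Jobs that the slot's own requirements accept.
    slot_accepts = [j for j in jobs if slot_requirements(slot, j)]
    # Jobs whose own requirements are satisfied by this slot.
    job_accepts = [j for j in jobs if job_requirements(j, slot)]
    # A match needs BOTH sides to agree.
    both = [j for j in jobs
            if slot_requirements(slot, j) and job_requirements(j, slot)]
    return {
        "match_both": len(both),
        "match_slot_requirements": len(slot_accepts),
        "job_requirements_match_slot": len(job_accepts),
    }


# The job only asks for a Docker-capable slot (hypothetical attribute).
def job_requirements(job, slot):
    return slot["HasDocker"]


# The slot's policy rejects this job (hypothetical attribute).
def slot_requirements(slot, job):
    return job["AllowedToRun"]


jobs = [{"id": "363334.0", "AllowedToRun": False}]
slot = {"name": "slot1", "HasDocker": True}

summary = analyze(jobs, slot, job_requirements, slot_requirements)
print(summary)
```

Under this reading the two lines are not in conflict: the slot satisfies what the job asks for (1), but the job does not satisfy what the slot asks for (0), so nothing matches on both sides (0).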

The output for both commands is very long and rather cryptic.

These jobs use "Universe = Docker". I tested simpler .sub files and they ran OK, so the Docker universe is available.

The two .sub files I submitted this morning as a test are the same as yesterday's.

What could have changed as a result of the reboot? Is there any way to find this information?

Thanks,

Jean-Yves
