Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE



The recurring "motto" in the movie "Galaxy Quest" was:
"Never give up. Never surrender!"

I made some more inquiries and found that the problem is related to the memory request.
It is not related to the presence of "/" in the Docker image name. All of these .sub files had the same request:

request_memory          = 2048

With this line removed, at least the jobs with a small memory footprint run.
I am not sure about the larger ones.

However, this "request_memory" line was present before and did not prevent the jobs from running.

Therefore, something changed after the reboot. Perhaps it is part of a start-up script?

I am not yet sure how this would affect other jobs...

I have contacted our sysadmin, who can hopefully find where the change occurred and why it occurred after the reboot.

JYS


From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 3:24 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
After more testing, this is what I have found so far about this strange behavior:

Summary of tests:

Docker Universe is available.

Docker images that are referenced by a single word, such as these found on hub.docker.com,

debian
ubuntu
busybox

are running OK.

The .sub files that used to work and no longer do all use images from hub.docker.com with a forward slash in their name:

 hindrek/bowtie2_samtools:1.0.0
 jysgro/mf36c
 jysgro/xmgrace-c7
 pegi3s/clustalomega

All of these remain in HOLD, whereas before yesterday they worked (with the same, unchanged .sub files).

They all have the same matching problem that I mentioned in the first message:

$ condor_q -analyze 363337.0
363337.000:  Run analysis summary ignoring user priority.  Of 131 machines,
      0 are rejected by your job's requirements
    131 reject your job because of their own requirements

One exception was gromacs/gromacs, but this might be part of a more "official" naming convention.

I noted that for the Singularity jobs the image had to be labeled as:
 "docker://ubuntu"
or
"docker://docker.io/ubuntu"

This makes me wonder whether a definition or environment variable specific to the Docker Hub address needs to be updated, added, or triggered for images beyond the "official" ones, which mostly fit in one word.

That is what makes the most sense at the moment...
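
For reference, the slash is ordinary Docker naming: a single-word name is shorthand for an image in the "library" namespace on docker.io, and user/name is shorthand for docker.io/user/name. The Python sketch below is a simplified illustration of that expansion, not HTCondor or Docker code; `normalize_image` is a hypothetical helper, and digests and default tags are not handled.

```python
def normalize_image(ref: str) -> str:
    """Expand a short Docker image reference to registry/namespace/name[:tag].

    Simplified illustration only: no digest handling, no default tag.
    """
    first, sep, rest = ref.partition("/")
    # The first component counts as a registry host only if it looks like
    # one (contains a dot or a port, or is "localhost").
    if sep and ("." in first or ":" in first or first == "localhost"):
        registry, path = first, rest
    else:
        registry, path = "docker.io", ref
    # Single-word names on Docker Hub live in the "library" namespace.
    if registry == "docker.io" and "/" not in path:
        path = "library/" + path
    return f"{registry}/{path}"


for name in ("ubuntu", "jysgro/mf36c", "hindrek/bowtie2_samtools:1.0.0"):
    print(name, "->", normalize_image(name))
```

Under this reading, ubuntu and jysgro/mf36c both resolve to the same registry, so the slash alone would not be expected to need a separate setting.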

JYS




From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
 
Greetings,

Yesterday there was a building-wide power outage, and the HTCondor cluster was eventually rebooted.

Now the same jobs (same .sub files) that worked yesterday no longer run and stay IDLE.

I used the command condor_q -better-analyze 363334.0 (where 363334.0 is the job number)
to try to understand, but I can't figure out where the problem really is.

I can see the following:

Reason for last match failure: no match found

363334.000:  Run analysis summary ignoring user priority.  Of 141 machines,
      0 are rejected by your job's requirements
    141 reject your job because of their own requirements

I don't understand the machines' "own requirements". I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

that provides a summary:
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Run analysis summary of 1 jobs.
    0 (0.00 %) match both slot and job requirements.
    0 match the requirements of this slot.
    1 have job requirements that match this slot.

I find these 2 statements conflicting in their meaning...
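
The two lines need not conflict, because HTCondor matchmaking is two-sided: the job's requirements must accept the slot, and the slot's requirements must accept the job. The Python sketch below is only a toy model of that check, not HTCondor code. The attribute values and the slot policy (a cap on the requested memory, here called `MaxRequestMemory`) are hypothetical and chosen purely for illustration.

```python
# Toy model of two-sided matchmaking. All values below are hypothetical.

def job_accepts_slot(job, slot):
    # Job side: the slot must offer Docker and enough memory.
    return slot["HasDocker"] and slot["Memory"] >= job["RequestMemory"]


def slot_accepts_job(slot, job):
    # Slot side: a made-up policy that refuses large memory requests.
    return job["RequestMemory"] <= slot["MaxRequestMemory"]


def analyze(job, slot):
    # A job runs only when both sides agree.
    job_ok = job_accepts_slot(job, slot)
    slot_ok = slot_accepts_job(slot, job)
    return {
        "job_requirements_match_slot": job_ok,
        "slot_requirements_match_job": slot_ok,
        "both": job_ok and slot_ok,
    }


slot = {"HasDocker": True, "Memory": 8192, "MaxRequestMemory": 1024}
job = {"RequestMemory": 2048}
result = analyze(job, slot)
print(result)
```

In this sketch the job side passes while the slot side fails, which is the pattern in the summary above: 1 job whose requirements match the slot, and 0 jobs that match the requirements of the slot.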

The output for both commands is very long and rather cryptic.

These jobs use "Universe = Docker", and I tested simpler .sub files that ran OK. Hence the Docker Universe is available.
The 2 .sub files I submitted this morning as a test are the same as yesterday's.

What could the reboot have changed? Is there any way to find this information?

Thanks
Jean-Yves