The recurring "motto" in the movie "Galaxy Quest" was:
"Never give up. Never surrender!"
I have made some more inquiries, and I found that the problem is related to the memory request.
It is not related to the presence of "/" within the Docker image name... all of these .sub files had the same request:
request_memory = 2048
By removing this line, at least the jobs with a small memory footprint can run.
I am not sure about the larger ones.
However, this "request_memory" line was present before and did not prevent the jobs from running.
Therefore, something changed after the reboot... Could it be part of a start-up script, perhaps?
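For reference, here is a minimal sketch of the kind of Docker-universe .sub file involved (the image and file names are placeholders for illustration, not my actual jobs):

```
# Hypothetical Docker-universe submit file, for illustration only.
universe        = docker
docker_image    = gromacs/gromacs
executable      = run.sh
output          = job.out
error           = job.err
log             = job.log
# This is the line that had to be removed before small jobs would match again.
# With no units given, HTCondor interprets the value as MiB:
# request_memory = 2048
queue
```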
I am not yet sure how this would affect other jobs...
I have contacted our sys-admin, who hopefully can find where the change occurred, and why it occurred after the reboot...
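In case it helps the search, a few commands that show where HTCondor's live configuration comes from (a sketch; whether the memory-rewriting knob is actually set on our pool is a guess on my part):

```
# List the configuration files actually read at daemon start-up:
condor_config_val -config

# See whether the pool rewrites request_memory at match time
# (this knob, if set, can override what the .sub file asks for):
condor_config_val -verbose MODIFY_REQUEST_EXPR_REQUESTMEMORY

# Dump everything memory-related from the live configuration:
condor_config_val -dump | grep -i memory
```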
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 3:24 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: Re: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE
After more testing this is what I found so far about this strange behavior:
Summary of tests:
- The Docker universe is available.
- Docker images referenced by a single word, as found on hub.docker.com, run OK.
- The previous .sub files that used to work but no longer do all reference images with a forward slash in their name, as found on hub.docker.com.
- All of these now remain on HOLD, while until yesterday they ran (with the same, unchanged .sub file).
- They all show the same matching problem that I mentioned in my first message:
- One exception was gromacs/gromacs, but this might be part of a more "official" naming convention.
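When jobs sit on HOLD, the recorded hold reason usually says more than the analyzer output; a quick way to pull it (the job ID here is the one from my earlier message, as an example):

```
# Show held jobs and the reason HTCondor recorded for each:
condor_q -hold

# Or print just the HoldReason attribute of one job:
condor_q 363334.0 -af HoldReason
```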
I noted that for the Singularity jobs the image had to be labeled as "docker://ubuntu" or "docker://docker.io/ubuntu".
This makes me wonder if there is a definition or environment variable, specific to the Docker Hub address, that needs to be updated, added, or triggered, going beyond the "official" images whose names mostly fit in one word.
That is what makes more sense at the moment...
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of JEAN-YVES SGRO via HTCondor-users <htcondor-users@xxxxxxxxxxx>
Sent: Tuesday, November 2, 2021 1:51 PM
To: htcondor-users@xxxxxxxxxxx <htcondor-users@xxxxxxxxxxx>
Cc: JEAN-YVES SGRO <jsgro@xxxxxxxx>
Subject: [HTCondor-users] After complete reboot same jobs that worked yesterday now stay IDLE

Greetings,
Yesterday there was a building-wide power outage, and the HTCondor cluster was eventually rebooted.
Now the same jobs (same .sub files) that worked yesterday no longer run and stay IDLE.
I used the command:

condor_q -better-analyze 363334.0   # where 363334.0 is the job number

to try to understand, but I can't figure out where the problem really is.
I can see the following:
I don't understand the machine's "own requirements". I also tried the extended command:

condor_q -better-analyze 363334.0 -reverse -machine slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

which provides a summary:
I find these two statements conflicting in their meaning...
The output for both commands is very long and rather cryptic.
These jobs use "universe = docker", and I tested simpler .sub files that ran OK. Hence the Docker universe is available.
The two .sub files I sent this morning to test are the same as yesterday's.
What can have been changed from rebooting? Is there any way to find this information?
Thanks,
Jean-Yves