Re: [HTCondor-users] HTCondor doesn't run a job after some time



Dear all,

Sorry, this is my fault. The job was not started because the condition NumJobStarts == 0 was not met.
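
For anyone who hits the same symptom, here is a minimal sketch of why such a condition leaves jobs idle while every slot is free. It is a toy model in plain Python, not the HTCondor API: the ads are ordinary dicts, the node names and the NumJobStarts values are made up, and where the expression NumJobStarts == 0 actually lives in the configuration is not stated in this thread, so the sketch simply treats it as a gate on matching.

```python
# Toy model only: ClassAds are represented as plain dicts, and the
# matching logic is reduced to the single condition from this thread.

def first_start_only(job_ad):
    """Mimic the expression `NumJobStarts == 0` for one job ad."""
    return job_ad["NumJobStarts"] == 0


def matching_slots(job_ad, slots):
    """Return the names of unclaimed slots the job may use under that condition."""
    if not first_start_only(job_ad):
        # The job has already been started at least once, so the
        # condition is false and no slot is offered, however idle it is.
        return []
    return [slot["Name"] for slot in slots if slot["State"] == "Unclaimed"]


# Three free slots, as in the condor_status output (hypothetical names).
slots = [{"Name": f"slot1@node{i}", "State": "Unclaimed"} for i in range(3)]

# Hypothetical job ads: one never started, one already started once.
fresh_job = {"ClusterId": 91, "ProcId": 4, "NumJobStarts": 0}
restarted_job = {"ClusterId": 91, "ProcId": 4, "NumJobStarts": 1}

print(matching_slots(fresh_job, slots))      # all three slots
print(matching_slots(restarted_job, slots))  # empty list
```

On a real pool the current value for a job can probably be read with something like condor_q 91.4 -af NumJobStarts; I have not re-checked that against this setup.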

----- Original Message -----
From: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Cc: "Dmitry Golubkov" <dmitry.golubkov@xxxxxxxxxxxxxx>
Sent: Tuesday, June 15, 2021 6:38:42 PM
Subject: [HTCondor-users] HTCondor doesn't run a job after some time

Dear all, 

After several successive runs, HTCondor ended up in a strange state:

--- 
Name OpSys Arch State Activity LoadAv Mem ActvtyTime 

slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:38 
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:39 
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX X86_64 Unclaimed Idle 0.000 1800 0+00:54:39 

Total Owner Claimed Unclaimed Matched Preempting Backfill Drain 

X86_64/LINUX 3 0 0 3 0 0 0 0 

Total 3 0 0 3 0 0 0 0 


-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxxxxxx : <10.42.0.171:46693?... @ 06/15/21 15:24:40 
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS 
user20001 ID: 91 6/15 14:53 _ _ 5 _ 5 91.0-4 

Total for query: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended 
Total for all users: 5 jobs; 0 completed, 0 removed, 5 idle, 0 running, 0 held, 0 suspended
----

As you can see, I have three partitionable slots, all completely free, and one job in the idle state that could be started, but nothing happens for a long time (I have waited for 30 minutes). In the condor_q details I found:


---

         Slots
Step    Matched  Condition
-----  --------  ---------
[2]           3  OpSys == "LINUX"
[5]           3  Arch == "X86_64"
[7]           3  DA__P7__RUNENV_PYTHON3 >= 13
[9]           3  DA__P7__CLUSTER_NODE == "True"
[11]          3  TARGET.Disk >= RequestDisk
[13]          3  TARGET.Memory >= RequestMemory
[15]          3  TARGET.FileSystemDomain == MY.FileSystemDomain

No successful match recorded.
Last failed match: Tue Jun 15 14:55:24 2021

Reason for last match failure: PREEMPTION_REQUIREMENTS == False 

091.004:  Run analysis summary ignoring user priority.  Of 3 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      3 are able to run your job


--- 


However, all previous jobs finished successfully, and I use dynamic slots to run the jobs. Any ideas?

Thanks in advance,
Dmitry.