[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] job has all necessary slots claimed but still is idle



Hi,

I performed additional tests, tried HTCondor 8.9.11 and 9.0.1, tried different combinations of setting CLAIM_PARTITIONABLE_LEFTOVERS and CONSUMPTION_POLICY (False-True, True-False, False-False) and the problem was always there.
For some jobs schedd request and claims additional slots it won't use. If there are resources for such additional slots - the job will run, if no - the job will be IDLE forever.

Any ideas?


Best regards,
Stanislav Markevich


----- Original Message -----
From: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
Sent: Thursday, 10 June, 2021 23:15:54
Subject: [HTCondor-users] job has all necessary slots claimed but still is	idle

Hi all,

My job stuck in IDLE state and I cannot understand why.


It's the only job:

    -- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... @ 06/10/21 18:26:37
    OWNER     BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
    user20002 ID: 49       6/10 17:06      _      _      2      2 49.0-1

    Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
    Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended


There are suitable slots (dynamic slots created for this job):

    Name                         OpSys State     Activity Cpus Memory
    slot1@xxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     75   1544  
    slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     25   256   
    slot1@xxxxxxxxxxxxxxxxxxxx   LINUX Unclaimed Idle     0    8     
    slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed   Idle     100  1792 


But "condor_q -better-analyze" says there is no successful match (and last failed match was an hour ago):

    -- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?...
    The Requirements expression for job 49.000 is
    ...

    Job 49.000 defines the following attributes:

        DiskUsage = 0
        FileSystemDomain = "p7-htcondor"
        NumJobStarts = 0
        RequestCpus = 25
        RequestDisk = DiskUsage
        RequestMemory = 256

    The Requirements expression for job 49.000 reduces to these conditions:
    ...

    No successful match recorded.
    Last failed match: Thu Jun 10 17:26:33 2021

    Reason for last match failure: PREEMPTION_REQUIREMENTS == False 

    049.000:  Run analysis summary ignoring user priority.  Of 2 machines,
          0 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match and are already running your jobs
          0 match but are serving other users
          2 are able to run your job

    The Requirements expression for job 49.001 is
    ...

    Job 49.001 defines the following attributes:

        DiskUsage = 42
        FileSystemDomain = "p7-htcondor"
        NumJobStarts = 0
        RequestCpus = 100
        RequestDisk = DiskUsage
        RequestMemory = 1792

    The Requirements expression for job 49.001 reduces to these conditions:
    ...

    No successful match recorded.
    Last failed match: Thu Jun 10 17:26:33 2021

    Reason for last match failure: no match found 

    049.001:  Run analysis summary ignoring user priority.  Of 2 machines,
          0 are rejected by your job's requirements
          0 reject your job because of their own requirements
          0 match and are already running your jobs
          0 match but are serving other users
          2 are able to run your job


And there are successful matches in logs:

    17:57:50.766 condor_negotiator[51]: ---------- Started Negotiation Cycle ----------
    17:57:50.766 condor_negotiator[51]: Phase 1:  Obtaining ads from collector ...
    17:57:50.766 condor_negotiator[51]:   Getting startd private ads ...
    17:57:50.771 condor_negotiator[51]:   Getting Scheduler, Submitter and Machine ads ...
    17:57:50.777 condor_negotiator[51]:   Sorting 4 ads ...
    17:57:50.777 condor_negotiator[51]: Got ads: 4 public and 2 private
    17:57:50.777 condor_negotiator[51]: Public ads include 1 submitter, 2 startd
    17:57:50.777 condor_negotiator[51]: Phase 2:  Performing accounting ...
    17:57:50.789 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
    17:57:50.789 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
    17:57:50.789 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
    17:57:50.789 condor_negotiator[51]: Phase 3:  Sorting submitter ads by priority ...
    17:57:50.789 condor_negotiator[51]: Starting prefetch round; 1 potential prefetches to do.
    17:57:50.789 condor_negotiator[51]: Starting prefetch negotiation for DedicatedScheduler@parallel_schedd@submit.p7-htcondor.
    17:57:50.791 condor_schedd[143]: Finished sending rrls to negotiator
    17:57:50.791 condor_schedd[143]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 0 matched, 0 rejected
    17:57:50.791 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399>
    17:57:50.791 condor_schedd[143]: Using negotiation protocol: NEGOTIATE
    17:57:50.791 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor
    17:57:50.791 condor_negotiator[51]:     Got NO_MORE_JOBS;  schedd has no more requests
    17:57:50.791 condor_negotiator[51]: Prefetch summary: 1 attempted, 1 successful.
    17:57:50.791 condor_negotiator[51]: Phase 4.1:  Negotiating with schedds ...
    17:57:50.791 condor_negotiator[51]:   Negotiating with DedicatedScheduler@parallel_schedd@submit.p7-htcondor at <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6>
    17:57:50.791 condor_negotiator[51]: 0 seconds so far for this submitter
    17:57:50.791 condor_negotiator[51]: 0 seconds so far for this schedd
    17:57:50.791 condor_negotiator[51]:     Request 00049.00000: autocluster -1 (request count 1 of 0)
    17:57:50.791 condor_negotiator[51]:       Matched 49.0 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.17:33779?addrs=10.42.0.17-33779&alias=p7-htcondorexecute-gb8z4.p7-htcondor&noUDP&sock=startd_99_cc5f> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.800 condor_startd[135]: slot1_1: New machine resource of type -1 allocated
    17:57:50.801 condor_startd[135]: slot1_1: Request accepted.
    17:57:50.802 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.804 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.805 condor_schedd[143]: Found 2 potential dedicated resources in 0 seconds
    17:57:50.805 condor_schedd[143]: Adding submitter DedicatedScheduler@parallel_schedd@submit.p7-htcondor to the submitter map for default pool.
    17:57:50.807 condor_negotiator[51]:       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.807 condor_negotiator[51]:     Request 00049.00001: autocluster -1 (request count 1 of 0)
    17:57:50.807 condor_negotiator[51]:       Matched 49.1 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.19:37885?addrs=10.42.0.19-37885&alias=p7-htcondorexecute-vxbqp.p7-htcondor&noUDP&sock=startd_87_913c> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.811 condor_startd[123]: slot1_1: New machine resource of type -1 allocated
    17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2
    17:57:50.818 condor_startd[135]: slot1_1: Remote owner is user20002@p7-htcondor
    17:57:50.819 condor_startd[135]: slot1_1: State change: claiming protocol successful
    17:57:50.819 condor_startd[135]: slot1_1: Changing state: Owner -> Claimed
    17:57:50.819 condor_startd[123]: slot1_1: Request accepted.
    17:57:50.826 condor_negotiator[51]:       Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    17:57:50.826 condor_negotiator[51]:  negotiateWithGroup resources used submitterAds length 0
    17:57:50.827 condor_negotiator[51]: ---------- Finished Negotiation Cycle ----------

Maybe this is important, in the configuration files I have:
    CLAIM_PARTITIONABLE_LEFTOVERS = FALSE
    CONSUMPTION_POLICY = TRUE

Am I missing something? Is this a bug?


PS: after two hours of waiting I submitted another job and surprisingly both jobs ran pretty soon. 


Best regards, 
Stanislav V. Markevich 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/