
Re: [HTCondor-users] job has all necessary slots claimed but still is idle



Hi David,

Thank you for your reply!

> You have slots that are claimed and stay idle, and you have enough resources available for that job?

Yes. Sometimes a job runs without any problems, and sometimes it claims all the necessary slots and then tries to claim more. Sometimes it tries to start and fails, then tries a second time and runs OK.


My first thought was that there was some problem with resource quantization; I had indeed changed MODIFY_REQUEST_EXPR_REQUESTMEMORY and CONSUMPTION_MEMORY:

  MODIFY_REQUEST_EXPR_REQUESTMEMORY = quantize(RequestMemory, {64})
  CONSUMPTION_MEMORY = quantize(target.RequestMemory,{64})

but resetting them to default values doesn't make any difference.
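
For reference, the effective values can be double-checked with condor_config_val on the node in question, e.g.:

  # Print the expressions HTCondor is actually using for these two knobs.
  condor_config_val MODIFY_REQUEST_EXPR_REQUESTMEMORY
  condor_config_val CONSUMPTION_MEMORY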


There are some errors in the logs. This time I ran the same job twice. The first time it ran without any problem. The second time it tried to start but failed; after a while it made a second attempt and then ran.


2021-06-15T05:29:18.888907541Z condor_shadow[1175]: DaemonCore: command socket at <10.244.0.161:35215?addrs=10.244.0.161-35215&alias=submit.p7-htcondor&noUDP&sock=shadow_153_8d5f_2>
2021-06-15T05:29:18.889071531Z condor_shadow[1175]: DaemonCore: private command socket at <10.244.0.161:35215?addrs=10.244.0.161-35215&alias=submit.p7-htcondor&noUDP&sock=shadow_153_8d5f_2>
2021-06-15T05:29:18.899472741Z condor_shadow[1175]: Initializing a PARALLEL shadow for job 2.0
2021-06-15T05:29:18.901317447Z condor_shadow[1175]: LIMIT_DIRECTORY_ACCESS = <unset>
2021-06-15T05:29:18.909500761Z condor_collector[51]: QueryWorker: forked new worker with id 798 (max 4 active 1 pending 0)
2021-06-15T05:29:18.911321126Z condor_collector[51]: Got QUERY_SCHEDD_ADS
2021-06-15T05:29:18.911348173Z condor_collector[51]: (Sending 1 ads in response to query)
2021-06-15T05:29:18.911354100Z condor_collector[51]: Query info: matched=1; skipped=0; query_time=0.000145; send_time=0.000089; type=Scheduler; requirements={((TotalRunningJobs > 0 || TotalIdleJobs > 0 || TotalHeldJobs > 0 || TotalRemovedJobs > 0 || TotalJobAds > 0))}; locate=0; limit=0; from=TOOL; peer=<10.244.0.161:39473>; projection={ScheddIpAddr CondorVersion Name Machine}; filter_private_ads=0
2021-06-15T05:29:18.911358189Z condor_collector[798]: (Sending 4 ads in response to query)
2021-06-15T05:29:18.911362278Z condor_collector[798]: Query info: matched=4; skipped=0; query_time=0.000409; send_time=0.000212; type=Machine; requirements={true}; locate=0; limit=0; from=TOOL; peer=<10.244.0.162:33279>; projection={Activity Cpus JobId Machine Memory Name OpSys State}; filter_private_ads=0
2021-06-15T05:29:18.946981574Z condor_schedd[153]: ZKM: In Condor_Auth_Passwd::wrap.
2021-06-15T05:29:18.947296808Z condor_schedd[153]: Number of Active Workers 0
2021-06-15T05:29:19.916226397Z condor_startd[167]: Error: can't find resource with ClaimId (<10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5>#1623734651#1#...) for 444 (ACTIVATE_CLAIM)
2021-06-15T05:29:19.916588962Z condor_shadow[1175]: Request to run on slot1_1@p7-7s4bb.p7-htcondor <10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5> was REFUSED
2021-06-15T05:29:19.917372796Z condor_shadow[1175]: Job 2.0 is being evicted
2021-06-15T05:29:19.931164599Z condor_startd[167]: Error: can't find resource with ClaimId (<10.244.0.162:38061?addrs=10.244.0.162-38061&alias=p7-7s4bb.p7-htcondor&noUDP&sock=startd_125_a1d5>#1623734651#1#...) -- perhaps this claim was already removed?
2021-06-15T05:29:19.931239776Z condor_startd[167]: Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
2021-06-15T05:29:19.931358337Z condor_startd[122]: Failed to write ToE tag to .job.ad file (2): No such file or directory
2021-06-15T05:29:19.931404129Z condor_startd[122]: slot1_1: Called deactivate_claim()
2021-06-15T05:29:19.934665824Z condor_shadow[1175]: RemoteResource::killStarter(): Could not send command to startd
2021-06-15T05:29:19.934711905Z condor_shadow[1175]: logEvictEvent with unknown reason (108), not logging.
2021-06-15T05:29:19.939935680Z condor_shadow[1175]: **** condor_shadow (condor_SHADOW) pid 1175 EXITING WITH STATUS 108
2021-06-15T05:29:19.940043649Z condor_schedd[153]: In DedicatedScheduler::reaper pid 1175 has status 27648
2021-06-15T05:29:19.940129064Z condor_schedd[153]: Shadow pid 1175 exited with status 108
2021-06-15T05:29:19.940137774Z condor_schedd[153]: Dedicated job abnormally ended, releasing claim
2021-06-15T05:29:19.940159496Z condor_schedd[153]: Dedicated job abnormally ended, releasing claim
2021-06-15T05:29:19.941444269Z condor_startd[167]: Can't read ClaimId
2021-06-15T05:29:19.944025188Z condor_startd[122]: ZKM: In Condor_Auth_Passwd::wrap.
2021-06-15T05:29:19.944574506Z condor_schedd[153]: ZKM: In Condor_Auth_Passwd::unwrap.
2021-06-15T05:29:19.945556145Z condor_startd[122]: Can't read ClaimId
2021-06-15T05:29:19.942080482Z condor_startd[167]: condor_write(): Socket closed when trying to write 29 bytes to <10.244.0.161:37487>, fd is 10
2021-06-15T05:29:19.942098016Z condor_startd[167]: Buf::write(): condor_write() failed

It seems the problem is in CLAIM_WORKLIFE = 0 (and maybe in UNUSED_CLAIM_TIMEOUT = 120 as well): the claim was released because of CLAIM_WORKLIFE, probably too quickly, and the schedd didn't know about it.
I want unused claims to be released quickly so that other jobs with different requirements can start sooner. Is there any recommendation for how to choose a value of CLAIM_WORKLIFE?
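
To make the question concrete, something like this is what I have in mind (the numbers are only an example, not values I have tested):

  # Let a claim be reused for at most 10 minutes before it is released,
  # and release a claim that has sat unused for more than 2 minutes.
  CLAIM_WORKLIFE = 600
  UNUSED_CLAIM_TIMEOUT = 120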


Best regards,
Stanislav Markevich


----- Original Message -----
From: "Dudu Handelman" <duduhandelman@xxxxxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx>
Sent: Monday, 14 June, 2021 21:33:30
Subject: Re: [HTCondor-users] job has all necessary slots claimed but still is idle

Stanislav. 

You have slots that are claimed and stay idle, and you have enough resources available for that job? 
If this is the case, I would recommend turning on full debug on the startd daemon. 
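
For example, something like this in the execute node's configuration, followed by a condor_reconfig (just one way to do it):

  # Full debugging output for the startd only.
  STARTD_DEBUG = D_FULLDEBUG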

I have seen misconfigurations in the past that can cause this. 
For example, if you quantize memory or disk requests. 

If I understand correctly, there is a match: the partitionable slot on the execute node has enough resources for the job, so it creates a dynamic slot, but the dynamic slot then does not have enough resources because of something. You need to find that something. 
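
One way to see what each dynamic slot actually got is something like the following (these are the standard machine-ad attribute names):

  condor_status -af:h Name Cpus Memory Disk State Activity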

Good luck 
David 





From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> on behalf of Stanislav V. Markevich via HTCondor-users <htcondor-users@xxxxxxxxxxx> 
Sent: Monday, June 14, 2021 7:43:40 PM 
To: htcondor-users <htcondor-users@xxxxxxxxxxx> 
Cc: Stanislav V. Markevich <stanislav.markevich@xxxxxxxxxxxxxx> 
Subject: Re: [HTCondor-users] job has all necessary slots claimed but still is idle 
Hi, 

I performed additional tests: I tried HTCondor 8.9.11 and 9.0.1 and different combinations of CLAIM_PARTITIONABLE_LEFTOVERS and CONSUMPTION_POLICY (False/True, True/False, False/False), and the problem was always there. 
For some jobs the schedd requests and claims additional slots that it won't use. If there are resources for those additional slots, the job runs; if not, the job stays IDLE forever. 

Any ideas? 


Best regards, 
Stanislav Markevich 


----- Original Message ----- 
From: "htcondor-users" <htcondor-users@xxxxxxxxxxx> 
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx> 
Cc: "Stanislav V. Markevich" <stanislav.markevich@xxxxxxxxxxxxxx> 
Sent: Thursday, 10 June, 2021 23:15:54 
Subject: [HTCondor-users] job has all necessary slots claimed but still is idle 

Hi all, 

My job is stuck in the IDLE state and I cannot understand why. 


It's the only job: 

-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... @ 06/10/21 18:26:37 
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS 
user20002 ID: 49 6/10 17:06 _ _ 2 2 49.0-1 

Total for query: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 
Total for all users: 2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended 


There are suitable slots (dynamic slots created for this job): 

Name OpSys State Activity Cpus Memory 
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 75 1544 
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 25 256 
slot1@xxxxxxxxxxxxxxxxxxxx LINUX Unclaimed Idle 0 8 
slot1_1@xxxxxxxxxxxxxxxxxxxx LINUX Claimed Idle 100 1792 


But "condor_q -better-analyze" says there is no successful match (and last failed match was an hour ago): 

-- Schedd: parallel_schedd@xxxxxxxxxxxxxxxxxx : <10.42.0.253:45173?... 
The Requirements expression for job 49.000 is 
... 

Job 49.000 defines the following attributes: 

DiskUsage = 0 
FileSystemDomain = "p7-htcondor" 
NumJobStarts = 0 
RequestCpus = 25 
RequestDisk = DiskUsage 
RequestMemory = 256 

The Requirements expression for job 49.000 reduces to these conditions: 
... 

No successful match recorded. 
Last failed match: Thu Jun 10 17:26:33 2021 

Reason for last match failure: PREEMPTION_REQUIREMENTS == False 

049.000: Run analysis summary ignoring user priority. Of 2 machines, 
0 are rejected by your job's requirements 
0 reject your job because of their own requirements 
0 match and are already running your jobs 
0 match but are serving other users 
2 are able to run your job 

The Requirements expression for job 49.001 is 
... 

Job 49.001 defines the following attributes: 

DiskUsage = 42 
FileSystemDomain = "p7-htcondor" 
NumJobStarts = 0 
RequestCpus = 100 
RequestDisk = DiskUsage 
RequestMemory = 1792 

The Requirements expression for job 49.001 reduces to these conditions: 
... 

No successful match recorded. 
Last failed match: Thu Jun 10 17:26:33 2021 

Reason for last match failure: no match found 

049.001: Run analysis summary ignoring user priority. Of 2 machines, 
0 are rejected by your job's requirements 
0 reject your job because of their own requirements 
0 match and are already running your jobs 
0 match but are serving other users 
2 are able to run your job 


And there are successful matches in logs: 

17:57:50.766 condor_negotiator[51]: ---------- Started Negotiation Cycle ---------- 
17:57:50.766 condor_negotiator[51]: Phase 1: Obtaining ads from collector ... 
17:57:50.766 condor_negotiator[51]: Getting startd private ads ... 
17:57:50.771 condor_negotiator[51]: Getting Scheduler, Submitter and Machine ads ... 
17:57:50.777 condor_negotiator[51]: Sorting 4 ads ... 
17:57:50.777 condor_negotiator[51]: Got ads: 4 public and 2 private 
17:57:50.777 condor_negotiator[51]: Public ads include 1 submitter, 2 startd 
17:57:50.777 condor_negotiator[51]: Phase 2: Performing accounting ... 
17:57:50.789 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399> 
17:57:50.789 condor_schedd[143]: Using negotiation protocol: NEGOTIATE 
17:57:50.789 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor 
17:57:50.789 condor_negotiator[51]: Phase 3: Sorting submitter ads by priority ... 
17:57:50.789 condor_negotiator[51]: Starting prefetch round; 1 potential prefetches to do. 
17:57:50.789 condor_negotiator[51]: Starting prefetch negotiation for DedicatedScheduler@parallel_schedd@submit.p7-htcondor. 
17:57:50.791 condor_schedd[143]: Finished sending rrls to negotiator 
17:57:50.791 condor_schedd[143]: Finished negotiating for DedicatedScheduler@parallel_schedd in local pool: 0 matched, 0 rejected 
17:57:50.791 condor_schedd[143]: Activity on stashed negotiator socket: <10.42.0.18:35399> 
17:57:50.791 condor_schedd[143]: Using negotiation protocol: NEGOTIATE 
17:57:50.791 condor_schedd[143]: Negotiating for owner: DedicatedScheduler@parallel_schedd@submit.p7-htcondor 
17:57:50.791 condor_negotiator[51]: Got NO_MORE_JOBS; schedd has no more requests 
17:57:50.791 condor_negotiator[51]: Prefetch summary: 1 attempted, 1 successful. 
17:57:50.791 condor_negotiator[51]: Phase 4.1: Negotiating with schedds ... 
17:57:50.791 condor_negotiator[51]: Negotiating with DedicatedScheduler@parallel_schedd@submit.p7-htcondor at <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> 
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this submitter 
17:57:50.791 condor_negotiator[51]: 0 seconds so far for this schedd 
17:57:50.791 condor_negotiator[51]: Request 00049.00000: autocluster -1 (request count 1 of 0) 
17:57:50.791 condor_negotiator[51]: Matched 49.0 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.17:33779?addrs=10.42.0.17-33779&alias=p7-htcondorexecute-gb8z4.p7-htcondor&noUDP&sock=startd_99_cc5f> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
17:57:50.800 condor_startd[135]: slot1_1: New machine resource of type -1 allocated 
17:57:50.801 condor_startd[135]: slot1_1: Request accepted. 
17:57:50.802 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2 
17:57:50.804 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2 
17:57:50.805 condor_schedd[143]: Found 2 potential dedicated resources in 0 seconds 
17:57:50.805 condor_schedd[143]: Adding submitter DedicatedScheduler@parallel_schedd@submit.p7-htcondor to the submitter map for default pool. 
17:57:50.807 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
17:57:50.807 condor_negotiator[51]: Request 00049.00001: autocluster -1 (request count 1 of 0) 
17:57:50.807 condor_negotiator[51]: Matched 49.1 DedicatedScheduler@parallel_schedd@submit.p7-htcondor <10.42.0.253:37745?addrs=10.42.0.253-37745&alias=submit.p7-htcondor&noUDP&sock=schedd_98_1cc6> preempting none <10.42.0.19:37885?addrs=10.42.0.19-37885&alias=p7-htcondorexecute-vxbqp.p7-htcondor&noUDP&sock=startd_87_913c> slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
17:57:50.811 condor_startd[123]: slot1_1: New machine resource of type -1 allocated 
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2 
17:57:50.819 condor_schedd[143]: SetAttribute modifying attribute Scheduler in nonexistent job 49.2 
17:57:50.818 condor_startd[135]: slot1_1: Remote owner is user20002@p7-htcondor 
17:57:50.819 condor_startd[135]: slot1_1: State change: claiming protocol successful 
17:57:50.819 condor_startd[135]: slot1_1: Changing state: Owner -> Claimed 
17:57:50.819 condor_startd[123]: slot1_1: Request accepted. 
17:57:50.826 condor_negotiator[51]: Successfully matched with slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 
17:57:50.826 condor_negotiator[51]: negotiateWithGroup resources used submitterAds length 0 
17:57:50.827 condor_negotiator[51]: ---------- Finished Negotiation Cycle ---------- 

Maybe this is important; in the configuration files I have: 
CLAIM_PARTITIONABLE_LEFTOVERS = FALSE 
CONSUMPTION_POLICY = TRUE 

Am I missing something? Is this a bug? 


PS: after two hours of waiting I submitted another job, and surprisingly both jobs started soon afterwards. 


Best regards, 
Stanislav V. Markevich 
_______________________________________________ 
HTCondor-users mailing list 
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a 
subject: Unsubscribe 
You can also unsubscribe by visiting 
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users 

The archives can be found at: 
https://lists.cs.wisc.edu/archive/htcondor-users/