[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] Matchmaking errors



Dear HTCondor experts,

I don't understand (anymore) how HTCondor matchmaking works. 

Here's what I see happening, which completely stupefies me. 

First, the schedd:
-----------------------
ShadowLog.20180411T175118:04/11/18 13:29:47 (215.13) (515508): Job 215.13 is being evicted from slot1_26@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:38:52 (215.13) (519901): Job 215.13 is being evicted from slot1_36@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:39:30 (215.13) (520207): Job 215.13 is being evicted from slot1_10@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:39:33 (215.13) (520228): Job 215.13 is being evicted from slot1_34@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:39:47 (215.13) (520265): Job 215.13 is being evicted from slot1_41@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:39:55 (215.13) (520315): Job 215.13 is being evicted from slot1_20@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:23 (215.13) (520607): Job 215.13 is being evicted from slot1_26@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:26 (215.13) (520628): Job 215.13 is being evicted from slot1_32@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:28 (215.13) (520635): Job 215.13 is being evicted from slot1_23@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:30 (215.13) (520651): Job 215.13 is being evicted from slot1_20@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:33 (215.13) (520671): Job 215.13 is being evicted from slot1_3@xxxxxxxxxxxxxxxxx
ShadowLog.20180411T175118:04/11/18 13:40:49 (215.13) (520698): Job 215.13 is being removed: The system macro SYSTEM_PERIODIC_REMOVE expression '( RemoteWallClockTime > 80 * 60 * 60 ) || ( RemoteSysCpu + RemoteUserCpu > 80 * 60 * 60 ) || ( ( JobStatus == 5 && ( CurrentTime - EnteredCurrentStatus ) > 30 * 60 ) ) || ( ResidentSetSize_RAW > 1000 * RequestMemory ) || ( JobRunCount > 10 )' evaluated to TRUE
-----------------------
The last line is of course expected since JobRunCount > 10. 

Surrounding these lines, I always see the following context:
-----------------------
708698-04/11/18 13:29:47 (215.13) (515508): Request to run on slot1_26@xxxxxxxxxxxxxxxxx <...> was REFUSED
708699:04/11/18 13:29:47 (215.13) (515508): Job 215.13 is being evicted from slot1_26@xxxxxxxxxxxxxxxxx
708700-04/11/18 13:29:47 (215.13) (515508): logEvictEvent with unknown reason (108), not logging.
708701-04/11/18 13:29:47 (215.13) (515508): **** condor_shadow (condor_SHADOW) pid 515508 EXITING WITH STATUS 108
708702-04/11/18 13:29:47 (213.1612) (495419): WARNING: WriteUserLog::writeEvent user doWriteEvent() failed on normal log /some_user_home/logs/log.213.1612!
708703-04/11/18 13:29:47 (213.1612) (495419): Unable to log ULOG_IMAGE_SIZE event
-----------------------
One the last try reads like:
-----------------------
04/11/18 13:40:39 (215.13) (520698): Request to run on slot1_3@xxxxxxxxxxxxxxxxx <...> was ACCEPTED
04/11/18 13:40:40 (215.13) (520698): File transfer completed successfully.
-----------------------
But of course, the job is killed shortly afterwards, since the JobRunCount was exceeded! 

The startd / worker nodes don't see any of this, at least I checked the Starter-logs for the slots and found nothing. 
Who rejects whom here, and why? 

Close in time to that, the CM does the following:
-----------------------
# grep '215.13 ' NegotiatorLog
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:39:40       Rejected 215.13 myuser@xxxxxxxxxxx <...> : no match found
04/11/18 13:40:39       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
04/11/18 13:40:39       Matched 215.13 myuser@xxxxxxxxxxx <...> preempting none <...> slot1@xxxxxxxxxxxxxxxxx
-----------------------

My question is: What's going on here? 
Do we need to increase JobRunCount? Is there a bad timeout somewhere? 

Questions for my understanding: 
Why is a single job matched so many times? 
Why do shadows start and jobs are evicted, before the Negotiatior has done the actual matchmaking? Who does this evicting? 
Who is actually responsible of matchmaking? I always thought the Negotatior would match up the classads, but it seems Shadows are already talking to Startds before this has happened. 

Any ideas or hints are very welcome. 

Cheers,
	Oliver

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature