[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] fraction of jobs of a cluster behaves weird



Hi,

I see a fraction of jobs of a cluster going wrong within seconds, being rejected by a couple of workernodes that all should be able to run these jobs, will attach logs below.

I am a bit confused by the logs and would like to understand what is going on here, the jobs do not leave any trace on the workernode, and what actually doesthese shadow-log entries mean:

Request to run on slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> was DELAYED (previous job still being vacated)
Request to run on slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> was REFUSED

We do not work with preemption, hence any job that has a claim for a slot should arrive at the claimed free slot as such ?

What could be the reason to delay the run-request on the statd-side and then refuse it and why is this happening up to ten times in a couple of seconds on different wn's ? 

See sched and shadow log extracts below here: 


[root@bird-htc-sched14 condor]# grep 2852592.445 /var/log/condor/SchedLog
11/29/19 14:03:50 (pid:1317609) job_transforms for 2852592.445: 13 considered, 11 applied (T01SysDefaultProject,T02JobDefaults,T03JobValues,T04JobEnhance,T05JobClasses,T06AccountingGroup,T07AccountingStatusHold,T08DefaultToSL6,T10BirdResource,T11ShellEnvironment,T12JobHistory)
11/29/19 14:15:30 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.84.162:36983?addrs=131.169.84.162-36983> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:15:30 (pid:1317609) Shadow pid 177210 switching to job 2852592.445.
11/29/19 14:15:30 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:15:30 (pid:1317609) Shadow pid 177210 for job 2852592.445 exited with status 4
11/29/19 14:15:30 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.84.162:36983?addrs=131.169.84.162-36983> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:27 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:27 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.84.162:36983?addrs=131.169.84.162-36983> for BIRD_belle.lite.denuccio, (shadow pid = 191827)
11/29/19 14:17:28 (pid:1317609) Shadow pid 191827 for job 2852592.445 exited with status 108
11/29/19 14:17:28 (pid:1317609) Match record (slot1@xxxxxxxxxxxxxxx <131.169.84.162:36983?addrs=131.169.84.162-36983> for BIRD_belle.lite.denuccio, 2852592.445) deleted
11/29/19 14:17:29 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.76.196:46386?addrs=131.169.76.196-46386> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:29 (pid:1317609) Shadow pid 191022 switching to job 2852592.445.
11/29/19 14:17:29 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:29 (pid:1317609) Shadow pid 191022 for job 2852592.445 exited with status 4
11/29/19 14:17:29 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.76.196:46386?addrs=131.169.76.196-46386> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:29 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:29 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.76.196:46386?addrs=131.169.76.196-46386> for BIRD_belle.lite.denuccio, (shadow pid = 191963)
11/29/19 14:17:29 (pid:1317609) Shadow pid 191963 for job 2852592.445 exited with status 108
11/29/19 14:17:29 (pid:1317609) Match record (slot1@xxxxxxxxxxxxxxx <131.169.76.196:46386?addrs=131.169.76.196-46386> for BIRD_belle.lite.denuccio, 2852592.445) deleted
11/29/19 14:17:33 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:33 (pid:1317609) Shadow pid 191517 switching to job 2852592.445.
11/29/19 14:17:33 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:33 (pid:1317609) Shadow pid 191517 for job 2852592.445 exited with status 4
11/29/19 14:17:33 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:33 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:33 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> for BIRD_belle.lite.denuccio, (shadow pid = 191987)
11/29/19 14:17:35 (pid:1317609) Shadow pid 191987 for job 2852592.445 exited with status 108
11/29/19 14:17:35 (pid:1317609) Match record (slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> for BIRD_belle.lite.denuccio, 2852592.445) deleted
11/29/19 14:17:38 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:38 (pid:1317609) Shadow pid 190900 switching to job 2852592.445.
11/29/19 14:17:38 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:38 (pid:1317609) Shadow pid 190900 for job 2852592.445 exited with status 4
11/29/19 14:17:38 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:38 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:38 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> for BIRD_belle.lite.denuccio, (shadow pid = 191999)
11/29/19 14:17:40 (pid:1317609) Shadow pid 191999 for job 2852592.445 exited with status 108
11/29/19 14:17:40 (pid:1317609) Match record (slot1@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> for BIRD_belle.lite.denuccio, 2852592.445) deleted
11/29/19 14:17:42 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:42 (pid:1317609) Shadow pid 176263 switching to job 2852592.445.
11/29/19 14:17:42 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:42 (pid:1317609) Shadow pid 176263 for job 2852592.445 exited with status 4
11/29/19 14:17:42 (pid:1317609) match (slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> for BIRD_belle.lite.denuccio) switching to job 2852592.445
11/29/19 14:17:42 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:42 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> for BIRD_belle.lite.denuccio, (shadow pid = 192012)
11/29/19 14:17:44 (pid:1317609) Shadow pid 192012 for job 2852592.445 exited with status 108
11/29/19 14:17:44 (pid:1317609) Match record (slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> for BIRD_belle.lite.denuccio, 2852592.445) deleted
11/29/19 14:17:52 (pid:1317609) Starting add_shadow_birthdate(2852592.445)
11/29/19 14:17:52 (pid:1317609) Started shadow for job 2852592.445 on slot1@xxxxxxxxxxxxxxx <131.169.76.45:44834?addrs=131.169.76.45-44834> for BIRD_belle.lite.denuccio, (shadow pid = 192204)
11/29/19 14:17:53 (pid:1317609) Shadow pid 192204 for job 2852592.445 exited with status 113
11/29/19 14:17:53 (pid:1317609) Removing job 2852592.445

[root@bird-htc-sched14 condor]# grep 2852592.445 /var/log/condor/ShadowLog
11/29/19 14:15:30 (2852592.239) (177210): Switching to new job 2852592.445
11/29/19 14:15:30 (?.?) (177210): Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:27 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:28 (2852592.445) (191827): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.84.162:36983?addrs=131.169.84.162-36983> was REFUSED
11/29/19 14:17:28 (2852592.445) (191827): Job 2852592.445 is being evicted from slot1@xxxxxxxxxxxxxxx
11/29/19 14:17:28 (2852592.445) (191827): logEvictEvent with unknown reason (108), not logging.
11/29/19 14:17:28 (2852592.445) (191827): **** condor_shadow (condor_SHADOW) pid 191827 EXITING WITH STATUS 108
11/29/19 14:17:29 (2852592.396) (191022): Switching to new job 2852592.445
11/29/19 14:17:29 (?.?) (191022): Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:29 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:29 (2852592.445) (191963): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.76.196:46386?addrs=131.169.76.196-46386> was REFUSED
11/29/19 14:17:29 (2852592.445) (191963): Job 2852592.445 is being evicted from slot1@xxxxxxxxxxxxxxx
11/29/19 14:17:29 (2852592.445) (191963): logEvictEvent with unknown reason (108), not logging.
11/29/19 14:17:29 (2852592.445) (191963): **** condor_shadow (condor_SHADOW) pid 191963 EXITING WITH STATUS 108
11/29/19 14:17:33 (2852592.415) (191517): Switching to new job 2852592.445
11/29/19 14:17:33 (?.?) (191517): Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:33 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:33 (2852592.445) (191987): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> was DELAYED (previous job still being vacated)
11/29/19 14:17:35 (2852592.445) (191987): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.84.169:43457?addrs=131.169.84.169-43457> was REFUSED
11/29/19 14:17:35 (2852592.445) (191987): Job 2852592.445 is being evicted from slot1@xxxxxxxxxxxxxxx
11/29/19 14:17:35 (2852592.445) (191987): logEvictEvent with unknown reason (108), not logging.
11/29/19 14:17:35 (2852592.445) (191987): **** condor_shadow (condor_SHADOW) pid 191987 EXITING WITH STATUS 108
11/29/19 14:17:38 (2852592.385) (190900): Switching to new job 2852592.445
11/29/19 14:17:38 (?.?) (190900): Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:38 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:38 (2852592.445) (191999): Request to run on slot1_5@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> was DELAYED (previous job still being vacated)
11/29/19 14:17:40 (2852592.445) (191999): Request to run on slot1_5@xxxxxxxxxxxxxxx <131.169.77.47:38883?addrs=131.169.77.47-38883> was REFUSED
11/29/19 14:17:40 (2852592.445) (191999): Job 2852592.445 is being evicted from slot1_5@xxxxxxxxxxxxxxx
11/29/19 14:17:40 (2852592.445) (191999): logEvictEvent with unknown reason (108), not logging.
11/29/19 14:17:40 (2852592.445) (191999): **** condor_shadow (condor_SHADOW) pid 191999 EXITING WITH STATUS 108
11/29/19 14:17:42 (2852592.121) (176263): Switching to new job 2852592.445
11/29/19 14:17:42 (?.?) (176263): Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:42 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:42 (2852592.445) (192012): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> was DELAYED (previous job still being vacated)
11/29/19 14:17:44 (2852592.445) (192012): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.76.39:45373?addrs=131.169.76.39-45373> was REFUSED
11/29/19 14:17:44 (2852592.445) (192012): Job 2852592.445 is being evicted from slot1@xxxxxxxxxxxxxxx
11/29/19 14:17:44 (2852592.445) (192012): logEvictEvent with unknown reason (108), not logging.
11/29/19 14:17:44 (2852592.445) (192012): **** condor_shadow (condor_SHADOW) pid 192012 EXITING WITH STATUS 108
11/29/19 14:17:52 Initializing a VANILLA shadow for job 2852592.445
11/29/19 14:17:53 (2852592.445) (192204): Request to run on slot1@xxxxxxxxxxxxxxx <131.169.76.45:44834?addrs=131.169.76.45-44834> was ACCEPTED
11/29/19 14:17:53 (2852592.445) (192204): Job 2852592.445 is being removed: The system macro SYSTEM_PERIODIC_REMOVE _expression_ '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount >= 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
11/29/19 14:17:53 (2852592.445) (192204): Job 2852592.445 is being removed: The system macro SYSTEM_PERIODIC_REMOVE _expression_ '((JobStatus == 5 && (CurrentTime - EnteredCurrentStatus) > 14 * 24 * 3600)) || (JobRunCount >= 10) || ((JobStatus == 2) && ((CurrentTime - EnteredCurrentStatus) > MaxJobRetirementTime))' evaluated to TRUE
11/29/19 14:17:53 (2852592.445) (192204): Job 2852592.445 is being evicted from slot1@xxxxxxxxxxxxxxx
11/29/19 14:17:53 (2852592.445) (192204): logEvictEvent with unknown reason (113), not logging.
11/29/19 14:17:53 (2852592.445) (192204): **** condor_shadow (condor_SHADOW) pid 192204 EXITING WITH STATUS 113

Best
Christoph

--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx