[HTCondor-users] Many jobs with condor_shadow EXITING WITH STATUS 108
- Date: Tue, 4 Jun 2013 22:28:49 +0000
- From: <andrew.lahiff@xxxxxxxxxx>
- Subject: [HTCondor-users] Many jobs with condor_shadow EXITING WITH STATUS 108
Hi,
I am having a problem with many jobs being evicted for reasons I don't understand. Extracts from the ShadowLog [1], SchedLog [2], and StartLog [3] are shown below. I am using partitionable slots, and the worker nodes include the following in their configuration:
PREEMPT = FALSE
CLAIM_WORKLIFE = 3600
and for the negotiator:
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False
I'm not sure why the schedd is sending DEACTIVATE_CLAIM_FORCIBLY; I thought that with the above settings jobs wouldn't be preempted.
I tried setting MAXJOBRETIREMENTTIME to 2 days earlier today to see whether it would help, but it doesn't seem to have made a difference.
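For reference, the settings described above can be collected into a single condor_config sketch. The value for MAXJOBRETIREMENTTIME is an assumption (2 days expressed in seconds); the other lines are quoted from the configuration above:

```
# Worker-node configuration (as quoted above)
PREEMPT = FALSE
CLAIM_WORKLIFE = 3600

# Negotiator configuration (as quoted above)
PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

# Attempted workaround -- 2 days in seconds (assumed encoding of "2 days")
MAXJOBRETIREMENTTIME = 172800
```

Note that CLAIM_WORKLIFE = 3600 limits how long a claim may be reused for new jobs; once a claim's age exceeds it, the claim is not eligible for further jobs, which is a separate mechanism from preemption of a running job.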
Many Thanks,
Andrew.
[1]
06/04/13 22:13:12 Initializing a VANILLA shadow for job 10617.0
06/04/13 22:13:12 (10617.0) (16691): Request to run on slot1@lxxxx <x.y.z.c:57764> was REFUSED
06/04/13 22:13:12 (10617.0) (16691): Job 10617.0 is being evicted from slot1@xxxx
06/04/13 22:13:12 (10617.0) (16691): logEvictEvent with unknown reason (108), aborting
06/04/13 22:13:12 (10617.0) (16691): **** condor_shadow (condor_SHADOW) pid 16691 EXITING WITH STATUS 108
[2]
06/04/13 22:13:12 Job 10617.0: is runnable
06/04/13 22:13:12 match (slot1@yyyy <x.y.z.c:57764> for group_ATLAS.prodatls.patls012) switching to job 10617.0
06/04/13 22:13:12 Scheduler::start_std - job=10617.0 on <x.y.z.c:57764>
06/04/13 22:13:12 Cleared dirty attributes for job 10617.0
06/04/13 22:13:12 Queueing job 10617.0 in runnable job queue
06/04/13 22:13:12 Match (slot1@yyyy <x.y.z.c:57764> for group_ATLAS.prodatls.patls012) - running 10617.0
06/04/13 22:13:12 Job prep for 10617.0 will not block, calling aboutToSpawnJobHandler() directly
06/04/13 22:13:12 aboutToSpawnJobHandler() completed for job 10617.0, attempting to spawn job handler
06/04/13 22:13:12 Starting add_shadow_birthdate(10617.0)
06/04/13 22:13:12 Added shadow record for PID 16691, job (10617.0)
06/04/13 22:13:12 Started shadow for job 10617.0 on slot1@lxxxx <x.y.z.c:57764> for group_ATLAS.prodatls.patls012, (shadow pid = 16691)
06/04/13 22:13:12 Shadow pid 16691 for job 10617.0 exited with status 108
06/04/13 22:13:12 Cleared dirty attributes for job 10617.0
06/04/13 22:13:12 Match record (slot1@xxxx <x.y.z.c:57764> for group_ATLAS.prodatls.patls012, 10617.0) deleted
06/04/13 22:13:12 Deleting shadow rec for PID 16691, job (10617.0)
06/04/13 22:13:12 Marked job 10617.0 as IDLE
[3]
06/04/13 22:13:12 Received TCP command 404 (DEACTIVATE_CLAIM_FORCIBLY) from condor_pool@yyyy <x.y.z.c:38998>, access level DAEMON
06/04/13 22:13:12 Calling HandleReq <command_handler> (0) for command 404 (DEACTIVATE_CLAIM_FORCIBLY) from condor_pool@yyyy <x.y.z.b:38998>
06/04/13 22:13:12 slot1_5: Computing claimWorklifeExpired(); ClaimAge=45388, ClaimWorklife=3600
06/04/13 22:13:12 slot1_5: Called deactivate_claim_forcibly()
06/04/13 22:13:12 slot1_5: In Starter::kill() with pid 836, sig 3 (SIGQUIT)
06/04/13 22:13:12 Send_Signal(): Doing kill(836,3) [SIGQUIT]
06/04/13 22:13:12 slot1_5: in starter:killHard starting kill timer
06/04/13 22:13:12 slot1_5: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/04/13 22:13:12 slot1_5: In Starter::kill() with pid 836, sig 15 (SIGTERM)
06/04/13 22:13:12 Send_Signal(): Doing kill(836,15) [SIGTERM]
06/04/13 22:13:12 slot1_5: Using max vacate time of 600s for this job.