
[HTCondor-users] Many jobs with condor_shadow EXITING WITH STATUS 108



Hi,

I am having problems with lots of jobs being evicted for reasons I don't understand. Extracts from ShadowLog [1], SchedLog [2] and StartLog [3] are shown below. I am using partitionable slots. The worker nodes include the following in their configuration:

PREEMPT = FALSE
CLAIM_WORKLIFE = 3600

and for the negotiator:

PREEMPTION_REQUIREMENTS = False
NEGOTIATOR_CONSIDER_PREEMPTION = False

I'm not sure why the schedd is sending DEACTIVATE_CLAIM_FORCIBLY to the startd. I thought that with the above settings jobs wouldn't be preempted?

Earlier today I tried setting MAXJOBRETIREMENTTIME to 2 days to see if that would help, but it doesn't seem to have made any difference.

Many Thanks,
Andrew.

[1]
06/04/13 22:13:12 Initializing a VANILLA shadow for job 10617.0
06/04/13 22:13:12 (10617.0) (16691): Request to run on slot1@lxxxx <x.y.z.c:57764> was REFUSED
06/04/13 22:13:12 (10617.0) (16691): Job 10617.0 is being evicted from slot1@xxxx
06/04/13 22:13:12 (10617.0) (16691): logEvictEvent with unknown reason (108), aborting
06/04/13 22:13:12 (10617.0) (16691): **** condor_shadow (condor_SHADOW) pid 16691 EXITING WITH STATUS 108

[2]
06/04/13 22:13:12 Job 10617.0: is runnable
06/04/13 22:13:12 match (slot1@yyyy <x.y.z.c:57764> for group_ATLAS.prodatls.patls012) switching to job 10617.0
06/04/13 22:13:12 Scheduler::start_std - job=10617.0 on <x.y.z.c:57764>
06/04/13 22:13:12 Cleared dirty attributes for job 10617.0
06/04/13 22:13:12 Queueing job 10617.0 in runnable job queue
06/04/13 22:13:12 Match (slot1@yyyy <x.y.z.c:57764> for group_ATLAS.prodatls.patls012) - running 10617.0
06/04/13 22:13:12 Job prep for 10617.0 will not block, calling aboutToSpawnJobHandler() directly
06/04/13 22:13:12 aboutToSpawnJobHandler() completed for job 10617.0, attempting to spawn job handler
06/04/13 22:13:12 Starting add_shadow_birthdate(10617.0)
06/04/13 22:13:12 Added shadow record for PID 16691, job (10617.0)
06/04/13 22:13:12 Started shadow for job 10617.0 on slot1@lxxxx <x.y.z.c:57764> for group_ATLAS.prodatls.patls012, (shadow pid = 16691)
06/04/13 22:13:12 Shadow pid 16691 for job 10617.0 exited with status 108
06/04/13 22:13:12 Cleared dirty attributes for job 10617.0
06/04/13 22:13:12 Match record (slot1@xxxx <x.y.z.c:57764> for group_ATLAS.prodatls.patls012, 10617.0) deleted
06/04/13 22:13:12 Deleting shadow rec for PID 16691, job (10617.0)
06/04/13 22:13:12 Marked job 10617.0 as IDLE
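
To get a feel for how many jobs are affected, the "exited with status" lines in the SchedLog can be tallied. Below is a minimal sketch in Python, assuming the line format shown in the extract above. The function name count_shadow_exit_statuses is made up for illustration and is not part of HTCondor; the sample lines are copied from the extract.

```python
import re
from collections import Counter

# Matches SchedLog lines such as:
# "06/04/13 22:13:12 Shadow pid 16691 for job 10617.0 exited with status 108"
SHADOW_EXIT = re.compile(
    r"Shadow pid (\d+) for job (\d+\.\d+) exited with status (\d+)"
)


def count_shadow_exit_statuses(lines):
    """Return a Counter mapping shadow exit status -> number of occurrences.

    Hypothetical helper: it only assumes the line format seen in the extract.
    """
    counts = Counter()
    for line in lines:
        match = SHADOW_EXIT.search(line)
        if match:
            counts[int(match.group(3))] += 1
    return counts


# Two lines taken from the SchedLog extract; only the first one matches.
sample = [
    "06/04/13 22:13:12 Shadow pid 16691 for job 10617.0 exited with status 108",
    "06/04/13 22:13:12 Marked job 10617.0 as IDLE",
]
print(count_shadow_exit_statuses(sample))
```

In practice the function could be fed the lines of the real SchedLog file (for example via open(path)) instead of the sample list.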

[3]
06/04/13 22:13:12 Received TCP command 404 (DEACTIVATE_CLAIM_FORCIBLY) from condor_pool@yyyy <x.y.z.c:38998>, access level DAEMON
06/04/13 22:13:12 Calling HandleReq <command_handler> (0) for command 404 (DEACTIVATE_CLAIM_FORCIBLY) from condor_pool@yyyy <x.y.z.b:38998>
06/04/13 22:13:12 slot1_5: Computing claimWorklifeExpired(); ClaimAge=45388, ClaimWorklife=3600
06/04/13 22:13:12 slot1_5: Called deactivate_claim_forcibly()
06/04/13 22:13:12 slot1_5: In Starter::kill() with pid 836, sig 3 (SIGQUIT)
06/04/13 22:13:12 Send_Signal(): Doing kill(836,3) [SIGQUIT]
06/04/13 22:13:12 slot1_5: in starter:killHard starting kill timer
06/04/13 22:13:12 slot1_5: Changing state and activity: Claimed/Busy -> Preempting/Vacating
06/04/13 22:13:12 slot1_5: In Starter::kill() with pid 836, sig 15 (SIGTERM)
06/04/13 22:13:12 Send_Signal(): Doing kill(836,15) [SIGTERM]
06/04/13 22:13:12 slot1_5: Using max vacate time of 600s for this job.
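
One detail visible in the StartLog extract is the claimWorklifeExpired() line, which reports ClaimAge=45388 against ClaimWorklife=3600 (the CLAIM_WORKLIFE value configured above). Whether this is connected to the evictions is not established here. The following Python sketch only pulls the two numbers out of such a line and compares them, assuming the format shown in the extract. The helper name claim_age_vs_worklife is hypothetical.

```python
import re

# Matches the numeric fields in StartLog lines such as:
# "... Computing claimWorklifeExpired(); ClaimAge=45388, ClaimWorklife=3600"
CLAIM_LINE = re.compile(r"ClaimAge=(\d+), ClaimWorklife=(\d+)")


def claim_age_vs_worklife(line):
    """Return (claim_age, claim_worklife, age_exceeds_worklife), or None.

    Hypothetical helper: it only assumes the line format seen in the extract.
    """
    match = CLAIM_LINE.search(line)
    if match is None:
        return None
    age = int(match.group(1))
    worklife = int(match.group(2))
    return age, worklife, age > worklife


# Line copied from the StartLog extract.
line = (
    "06/04/13 22:13:12 slot1_5: Computing claimWorklifeExpired(); "
    "ClaimAge=45388, ClaimWorklife=3600"
)
print(claim_age_vs_worklife(line))  # (45388, 3600, True)
```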
