[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [HTCondor-users] Strange Condor Behavior - Possible Bug



Some follow-up information for this issue.

 

Looking at the Starterlog for one of the problem jobs I see the following:

 

09/30/15 08:34:52 ******************************************************

09/30/15 08:34:52 ** condor_starter (CONDOR_STARTER) STARTING UP

09/30/15 08:34:52 ** /btbal3600/opt/condor-8.2.8/sbin/condor_starter

09/30/15 08:34:52 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)

09/30/15 08:34:52 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON

09/30/15 08:34:52 ** $CondorVersion: 8.2.8 Apr 10 2015 $

09/30/15 08:34:52 ** $CondorPlatform: X86_64-sles_11 $

09/30/15 08:34:52 ** PID = 5593

09/30/15 08:34:52 ** Log last touched 9/30 07:51:10

09/30/15 08:34:52 ******************************************************

09/30/15 08:34:52 Using config source: /btbal3600/opt/condor-8.2.8/etc/condor_config

09/30/15 08:34:52 Using local config sources:

09/30/15 08:34:52    /opt/condor/local.btbal3600/condor_config.local

09/30/15 08:34:52    /opt/condor-8.2.8/local.btbal3600/condor_config.local

09/30/15 08:34:52 config Macros = 210, Sorted = 209, StringBytes = 5592, TablesBytes = 7616

09/30/15 08:34:52 CLASSAD_CACHING is OFF

09/30/15 08:34:52 Daemon Log is logging: D_ALWAYS D_ERROR

09/30/15 08:34:52 SharedPortEndpoint: waiting for connections to named socket 27196_cacb_878

09/30/15 08:34:52 DaemonCore: command socket at <10.11.132.72:51529?sock=27196_cacb_878>

09/30/15 08:34:52 DaemonCore: private command socket at <10.11.132.72:51529?sock=27196_cacb_878>

09/30/15 08:34:52 Communicating with shadow <10.11.129.10:47740?noUDP>

09/30/15 08:34:52 Submitting machine is "ci1makebal808.ds.susq.com"

09/30/15 08:34:52 setting the orig job name in starter

09/30/15 08:34:52 setting the orig job iwd in starter

09/30/15 08:34:52 Chirp config summary: IO false, Updates false, Delayed updates true.

09/30/15 08:34:52 Initialized IO Proxy.

09/30/15 08:34:52 Done setting resource limits

09/30/15 08:34:52 File transfer completed successfully.

09/30/15 08:34:53 Job 4565.0 set to execute immediately

09/30/15 08:34:53 Starting a VANILLA universe job with ID: 4565.0

09/30/15 08:34:53 IWD: /hostname/statbuildbot/PHXSLES11ValgrindNightly/build/SLES11SP2_GCC4.3.4_DEBUG_X86-64

09/30/15 08:34:53 Output file: /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/_condor_stdout

09/30/15 08:34:53 Error file: /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/_condor_stderr

09/30/15 08:34:53 Renice expr "0" evaluated to 0

09/30/15 08:34:53 About to exec /opt/condor-8.2.8/local.btbal3600/execute/dir_5593/condor_exec.exe

09/30/15 08:34:53 Running job as user statbuildbot

09/30/15 08:34:53 Create_Process succeeded, pid=5601

09/30/15 09:37:17 Process exited, pid=5601, status=0

 

And condor_q shows the job is transferring data:

 

ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

4565.0   statbuildbot    9/30 01:33   0+08:45:07 >  0   2197.3 global_yoda-arca_l

 

And then in the StartLog we see:

 

09/30/15 08:34:51 slot1_5: New machine resource of type -1 allocated

09/30/15 08:34:51 slot1: Changing state: Owner -> Unclaimed

09/30/15 08:34:51 slot1: State change: IS_OWNER is TRUE

09/30/15 08:34:51 slot1: Changing state: Unclaimed -> Owner

09/30/15 08:34:51 Setting up slot pairings

09/30/15 08:34:51 slot1_5: Request accepted.

09/30/15 08:34:51 slot1_5: Remote owner is statbuildbot@xxxxxxxxxxx

09/30/15 08:34:51 slot1_5: State change: claiming protocol successful

09/30/15 08:34:51 slot1_5: Changing state: Owner -> Claimed

09/30/15 08:34:52 slot1_5: match_info called

09/30/15 08:34:52 slot1_5: Got activate_claim request from shadow (10.11.129.10)

09/30/15 08:34:52 slot1_5: Remote job ID is 4565.0

09/30/15 08:34:52 slot1_5: Got universe "VANILLA" (5) from request classad

09/30/15 08:34:52 slot1_5: State change: claim-activation protocol successful

09/30/15 08:34:52 slot1_5: Changing activity: Idle -> Busy

 

Followed later by:

 

09/30/15 10:33:23 ERROR: Child pid 5593 appears hung! Killing it hard.

09/30/15 10:33:23 Starter pid 5593 died on signal 9 (signal 9 (Killed))

09/30/15 10:33:23 slot1_5: State change: starter exited

09/30/15 10:33:23 slot1_5: Changing activity: Busy -> Idle

09/30/15 10:33:23 slot1_5: State change: idle claim shutting down due to CLAIM_WORKLIFE

09/30/15 10:33:23 slot1_5: Changing state and activity: Claimed/Idle -> Preempting/Vacating

09/30/15 10:33:23 slot1_5: State change: No preempting claim, returning to owner

09/30/15 10:33:23 slot1_5: Changing state and activity: Preempting/Vacating -> Owner/Idle

09/30/15 10:33:23 slot1_5: Changing state: Owner -> Delete

09/30/15 10:33:23 slot1_5: Resource no longer needed, deleting

 

It appears to me that the job is hung on transferring output for an hour after running the job to completion.  Then after an hour the condor daemon copying the data is determined to be hung and is killed.  However we see the output file transferred to the schedd.  Similar behavior is observed on all the jobs that don’t “finish”.   The behavior only seems to appear in longer running jobs as all of the jobs are setup in the same way.

 

Thanks.

 

--

Will Deck

 




IMPORTANT: The information contained in this email and/or its attachments is confidential. If you are not the intended recipient, please notify the sender immediately by reply and immediately delete this message and all its attachments. Any review, use, reproduction, disclosure or dissemination of this message or any attachment by an unintended recipient is strictly prohibited. Neither this message nor any attachment is intended as or should be construed as an offer, solicitation or recommendation to buy or sell any security or other financial instrument. Neither the sender, his or her employer nor any of their respective affiliates makes any warranties as to the completeness or accuracy of any of the information contained herein or that this message or any of its attachments is free of viruses.