
[HTCondor-users] jobs state switch from R to I in a cycle



Hello,

I'm using Condor 7.8.5 on a Rocks 6 cluster. I changed the Condor configuration to give priority to jobs from the "production" user by setting Rank = (Owner == "production"). I also changed a few other settings, because the UID of the submitter did not appear to be used while a job was executing.
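
For reference, this is roughly what I added to the configuration on the execute nodes (reconstructed from memory, so the exact macro names and values may be slightly off):

    # Prefer jobs owned by the "production" user on every slot
    RANK = (Owner == "production")

    # Settings I touched so jobs would run under the submitting user's UID
    # (approximate; my actual lines may differ)
    UID_DOMAIN = local
    TRUST_UID_DOMAIN = True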

The issue now is that the jobs keep switching state from R to I, over and over, and most of the time they sit in I. All of these jobs simply sleep for 2 minutes, yet they never finish because their state keeps changing.
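
In case it matters, the test jobs are submitted with a description roughly like this (the file names here are placeholders, not the real ones):

    universe   = vanilla
    executable = /bin/sleep
    arguments  = 120
    output     = sleep.out
    error      = sleep.err
    log        = sleep.log
    queue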

I can't figure out what is causing this. The best clues I have are probably the log messages from ShadowLog and SchedLog:

==> /opt/condor/log/ShadowLog <==
10/02/15 21:45:23 (19.0) (5324): Calling Handler <HandleSyscalls> (1)
10/02/15 21:45:23 (19.0) (5324): Resource slot1@xxxxxxxxxx changing state from STARTUP to EXECUTING
10/02/15 21:45:23 (19.0) (5324): FileLock::obtain(1) - @1443833123.878981 lock on /tmp/condorLocks/24/15/011566703281310.lockc now WRITE
10/02/15 21:45:23 (19.0) (5324): scheddname = webback2.local
10/02/15 21:45:23 (19.0) (5324): executeHost = <10.1.255.246:42482>
10/02/15 21:45:23 (19.0) (5324): start = <10.1.255.246:42482>
10/02/15 21:45:23 (19.0) (5324): end = :42482>
10/02/15 21:45:23 (19.0) (5324): tmpaddr = 10.1.255.246
10/02/15 21:45:23 (19.0) (5324): Executehost name = nc09.local (hp->h_name)
10/02/15 21:45:23 (19.0) (5324): FileLock::obtain(2) - @1443833123.882742 lock on /tmp/condorLocks/24/15/011566703281310.lockc now UNLOCKED
10/02/15 21:45:23 (19.0) (5324): Started timer to evaluate periodic user policy expressions every 300 seconds
10/02/15 21:45:23 (19.0) (5324): QmgrJobUpdater: started timer to update queue every 900 seconds (tid=10)
10/02/15 21:45:23 (19.0) (5324): Set NumJobStarts to 37
10/02/15 21:45:23 (19.0) (5324): Return from Handler <HandleSyscalls> 0.0049s
10/02/15 21:45:23 (19.0) (5324): Calling Handler <HandleSyscalls> (1)
10/02/15 21:45:23 (19.0) (5324): Inside RemoteResource::updateFromStarter()
10/02/15 21:45:23 (19.0) (5324): Return from Handler <HandleSyscalls> 0.0008s
10/02/15 21:45:23 (19.0) (5324): Calling Handler <HandleSyscalls> (1)
10/02/15 21:45:23 (19.0) (5324): Inside RemoteResource::updateFromStarter()
10/02/15 21:45:23 (19.0) (5324): Inside RemoteResource::resourceExit()
10/02/15 21:45:23 (19.0) (5324): setting exit reason on slot1@xxxxxxxxxx to 107
10/02/15 21:45:23 (19.0) (5324): Resource slot1@xxxxxxxxxx changing state from EXECUTING to FINISHED
10/02/15 21:45:23 (19.0) (5324): Job 19.0 is being evicted from slot1@xxxxxxxxxx
10/02/15 21:45:23 (19.0) (5324): Entering DCStartd::deactivateClaim(forceful)
10/02/15 21:45:23 (19.0) (5324): condor_read(): Socket closed when trying to read 5 bytes from <10.1.255.246:42482>
10/02/15 21:45:23 (19.0) (5324): IO: EOF reading packet header
10/02/15 21:45:23 (19.0) (5324): DCStartd::deactivateClaim: failed to read response ad.
10/02/15 21:45:23 (19.0) (5324): DCStartd::deactivateClaim: successfully sent command
10/02/15 21:45:23 (19.0) (5324): Killed starter (fast) at <10.1.255.246:42482>
10/02/15 21:45:23 (19.0) (5324): FileLock::obtain(1) - @1443833123.950046 lock on /tmp/condorLocks/24/15/011566703281310.lockc now WRITE
10/02/15 21:45:23 (19.0) (5324): FileLock::obtain(2) - @1443833123.951502 lock on /tmp/condorLocks/24/15/011566703281310.lockc now UNLOCKED



==> /opt/condor/log/SchedLog <==
10/02/15 21:45:23 (pid:3247) QMGR Connection closed
10/02/15 21:45:23 (pid:3247) Return from HandleReq <handle_q> (handler: 0.014s, sec: 0.002s, payload: 0.001s)
10/02/15 21:45:23 (pid:3247) Return from Handler <DaemonCore::HandleReqPayloadReady> 0.0140s
10/02/15 21:45:23 (pid:3247) DaemonCore: No more children processes to reap.
10/02/15 21:45:23 (pid:3247) DaemonCore: pid 5324 exited with status 27392, invoking reaper 3 <child_exit>
10/02/15 21:45:23 (pid:3247) Shadow pid 5324 for job 19.0 exited with status 107
10/02/15 21:45:23 (pid:3247) Cleared dirty attributes for job 19.0
10/02/15 21:45:23 (pid:3247) Called send_vacate( <10.1.255.246:42482>, 443 ) via UDP
10/02/15 21:45:23 (pid:3247) Completed RELEASE_CLAIM to startd slot1@xxxxxxxxxx <10.1.255.246:42482>#1443830374#74#... for pbegeland
10/02/15 21:45:23 (pid:3247) Match record (slot1@xxxxxxxxxx <10.1.255.246:42482>#1443830374#74#... for pbegeland, 19.0) deleted
10/02/15 21:45:23 (pid:3247) Deleting shadow rec for PID 5324, job (19.0)
10/02/15 21:45:23 (pid:3247) Entered check_zombie( 5324, 0x0x1a2f024, st=2 )
10/02/15 21:45:23 (pid:3247) Prioritized runnable job list will be rebuilt, because ClassAd attribute JobStatus=1 changed
10/02/15 21:45:23 (pid:3247) Marked job 19.0 as IDLE
10/02/15 21:45:23 (pid:3247) Exited check_zombie( 5324, 0x0x1a2f024 )
10/02/15 21:45:23 (pid:3247) Exited delete_shadow_rec( 5324 )
10/02/15 21:45:23 (pid:3247) DaemonCore: return from reaper for pid 5324
10/02/15 21:45:23 (pid:3247) Calling Handler <DaemonCommandProtocol::WaitForSocketData> (5)
10/02/15 21:45:23 (pid:3247) Received TCP command 443 (RELEASE_CLAIM) from unauthenticated@unmapped <10.1.255.246:44818>, access level READ
10/02/15 21:45:23 (pid:3247) Calling HandleReq <release_claim> (0) for command 443 (RELEASE_CLAIM) from unauthenticated@unmapped <10.1.255.246:44818>
10/02/15 21:45:23 (pid:3247) Got RELEASE_CLAIM from <10.1.255.246:44818>
10/02/15 21:45:23 (pid:3247) mrec for "<10.1.255.246:42482>#1443830374#74#..." not found -- match not deleted (but perhaps it was deleted previously)
10/02/15 21:45:23 (pid:3247) Return from HandleReq <release_claim> (handler: 0.000s, sec: 0.001s, payload: 0.000s)
10/02/15 21:45:23 (pid:3247) Return from Handler <DaemonCommandProtocol::WaitForSocketData> 0.0009s

Does anyone know what might be happening?

Thanks,
Patricia