
[Condor-users] Jobs stuck in running state after completion



Hello,

We are having a very strange problem with our condor installation.

We have a pool of ~100 nodes running Condor 7.4.1, plus a submit node and a central manager node running the same version.

We have found that some jobs complete without error, but for some reason their status is never updated and they still show as running in condor_q. The slot shows as Claimed/Idle, and the user keeps being charged for that time. The only unusual thing about these jobs is that they show:

TerminationPending = True

in their ClassAds.
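
For what it's worth, this is roughly how we pick these jobs and slots out (the exact constraints are just an illustration of what we look for; JobStatus == 2 means "Running"):

condor_q -constraint 'JobStatus == 2 && TerminationPending =?= True'
condor_status -constraint 'State == "Claimed" && Activity == "Idle"'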

In fact, on the execute node there are no processes left from that user (we have a shared UID space), and the StarterLog shows:

06/02 12:54:19 Create_Process succeeded, pid=11864
06/02 12:54:59 Process exited, pid=11864, status=0
06/02 12:55:02 Got SIGQUIT.  Performing fast shutdown.
06/02 12:55:02 ShutdownFast all jobs.
06/02 12:55:02 **** condor_starter (condor_STARTER) pid 11862 EXITING WITH STATUS 0

Furthermore, on the submit node the ShadowLog shows:

ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(NumJobStarts = 1)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(DiskUsage = 1978)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1275476102)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteUserCpu = 11.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ImageSize = 5973628)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitCode = 0)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(TerminationPending = TRUE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(CommittedTime = 52)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesSent = 457629.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesRecvd = 1152098.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Job 38963.2 terminated: exited with status 0
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(1) - @1275476113.058234 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now WRITE
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(2) - @1275476113.060709 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now UNLOCKED
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Forking Mailer process...
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): **** condor_shadow (condor_SHADOW) pid 26318 EXITING WITH STATUS 100

So the job seems to finish cleanly.
However, condor_q still shows it as running.

The only time such a job actually leaves the queue is when, during some kind of cleanup, lines like this appear in the CollectorLog:

06/02 11:49:43 **** Removing stale ad: "< slot1.8@xxxxxxx , 172.16.3.36 >"

However, this doesn't always happen, and when it does it seems to be at random.

Some jobs can't even be removed with "condor_rm" and stay forever in the X state, forcing us to use "condor_rm -forcex" to really remove them. The strange thing is that the job's log file shows it as completed, and the result files are returned correctly.
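
In case it helps, the removal attempts look roughly like this (38963.2 is the job from the logs above):

condor_rm 38963.2          # job moves to the X state but never leaves the queue
condor_rm -forcex 38963.2  # only this actually gets rid of it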

Since this looks like a communication problem, we have tried disabling file locking for the event log and ignoring NFS lock errors (IGNORE_NFS_LOCK_ERRORS = True and EVENT_LOG_LOCKING = False), and switching to TCP for the communication with the collector daemon, all without success, so we are mostly out of ideas.
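
For reference, these are more or less the settings we added to the condor_config, followed by a condor_reconfig (the first two names are taken straight from what we tried; UPDATE_COLLECTOR_WITH_TCP is the knob we assume is the right one for the TCP switch, so please correct me if a different setting is intended):

# ignore NFS lock errors on the user/job log
IGNORE_NFS_LOCK_ERRORS = True
# disable locking of the event log
EVENT_LOG_LOCKING = False
# send ClassAd updates to the collector over TCP instead of UDP
UPDATE_COLLECTOR_WITH_TCP = True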

Has anybody experienced similar problems or knows which could be the cause?

Thanks,

Joan