
[Condor-users] Jobs stuck in running state after completion



Hello,

We are having a very strange problem with our condor installation.

We have a pool of ~100 nodes running Condor 7.4.1, plus a submit node and a central manager node running the same version.

We have found that some jobs complete without error, but for some reason their status is never updated and they still show as running in condor_q. The slot shows as Claimed/Idle, and the user keeps being charged for that time. The only unusual thing about these jobs is that they show:

TerminationPending = True

in their ClassAds.
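
For what it's worth, this is roughly how we pick these jobs and slots out (the exact constraints are just an illustration of what we look for; JobStatus == 2 means "Running"):

condor_q -constraint 'JobStatus == 2 && TerminationPending =?= True'
condor_status -constraint 'State == "Claimed" && Activity == "Idle"'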

In fact, on the execute node there are no processes left from that user (we have a shared UID space), and the StarterLog shows:

06/02 12:54:19 Create_Process succeeded, pid=11864
06/02 12:54:59 Process exited, pid=11864, status=0
06/02 12:55:02 Got SIGQUIT.  Performing fast shutdown.
06/02 12:55:02 ShutdownFast all jobs.
06/02 12:55:02 **** condor_starter (condor_STARTER) pid 11862 EXITING WITH STATUS 0

Furthermore, on the submit node the ShadowLog shows:

ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(NumJobStarts = 1)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(DiskUsage = 1978)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(LastJobLeaseRenewal = 1275476102)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteSysCpu = 0.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(RemoteUserCpu = 11.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ImageSize = 5973628)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitBySignal = FALSE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(ExitCode = 0)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(TerminationPending = TRUE)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(CommittedTime = 52)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesSent = 457629.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Updating Job Queue: SetAttribute(BytesRecvd = 1152098.000000)
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Job 38963.2 terminated: exited with status 0
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(1) - @1275476113.058234 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now WRITE
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): FileLock::obtain(2) - @1275476113.060709 lock on /xxxxxxxxxxxxxxxxxxxxxxx.log now UNLOCKED
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): Forking Mailer process...
ShadowLog.old:06/02 12:55:13 (38963.2) (26318): **** condor_shadow (condor_SHADOW) pid 26318 EXITING WITH STATUS 100

So the job seems to finish cleanly.
However, condor_q still shows it as running.

The only time such a job actually leaves the queue is when, during some kind of cleanup, lines like this appear in the CollectorLog:

06/02 11:49:43 **** Removing stale ad: "< slot1.8@xxxxxxx , 172.16.3.36 >"

However, this doesn't always happen, and when it does it seems to be at random.

Some jobs can't even be removed with "condor_rm" and stay forever in the X state, forcing us to use "condor_rm -forcex" to really remove them. The strange thing is that the job's log file shows it as completed, and the result files are returned correctly.
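
In case it helps, the removal attempts look roughly like this (38963.2 is the job from the logs above):

condor_rm 38963.2          # job moves to the X state but never leaves the queue
condor_rm -forcex 38963.2  # only this actually gets rid of it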

Since this looks like a communication problem, we have tried disabling file locking for the event log and ignoring NFS lock errors (IGNORE_NFS_LOCK_ERRORS = True and EVENT_LOG_LOCKING = False), and switching to TCP for the communication with the collector daemon, all without success, so we are mostly out of ideas.
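
For reference, these are more or less the settings we added to the condor_config, followed by a condor_reconfig (the first two names are taken straight from what we tried; UPDATE_COLLECTOR_WITH_TCP is the knob we assume is the right one for the TCP switch, so please correct me if a different setting is intended):

# ignore NFS lock errors on the user/job log
IGNORE_NFS_LOCK_ERRORS = True
# disable locking of the event log
EVENT_LOG_LOCKING = False
# send ClassAd updates to the collector over TCP instead of UDP
UPDATE_COLLECTOR_WITH_TCP = True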

Has anybody experienced similar problems or knows which could be the cause?

Thanks,

Joan