
Re: [Condor-users] Inconsistent status in condor_q



On Tue, 7 Oct 2008, Tung-Han Hsieh wrote:

Hello,

I encountered the following situation. A job is submitted through
"condor_submit" and is waiting to run. At the moment the job is about
to start, root runs the "condor_hold" command to hold it. The end
result: condor_q shows that the job is held, but the job is actually
running on one of our computing nodes.
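
For illustration, the sequence of events was roughly as follows (the
submit file name here is hypothetical; the cluster id is the one from
the listing below):

   condor_submit deH-TS2.sub   # job 18599.0 enters the queue, idle
   # ... just as the job is matched and about to start ...
   condor_hold 18599.0         # run by root, racing with the startup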

I was going to say that you should do a condor_release of the
job, but from the output below it looks like you already tried that.
Can you tell from the logs whether the condor_hold is being done
automatically for some reason?  Maybe there is something in the
ShadowLog that will give you a clue.  There might also be something
in the UserLog.

You can't correct the condor_q output until you know what is putting
the job on hold.
In the worst case you could just let the job run to completion
and then condor_rm the job.
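
For example, something along these lines (the ShadowLog location
depends on your LOG directory; the UserLog path is taken from the
condor_q -l output below):

   grep 18599 `condor_config_val LOG`/ShadowLog   # why was the job held?
   tail -n 50 /work1/hkw/Guanidine/deH-TS2.log    # the job's UserLog
   condor_release 18599.0                         # retry once the cause is known
   condor_rm 18599.0                              # worst case, after the job finishes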

Steve Timm



My question is: how can this inconsistent result in condor_q be
corrected? One way is to restart the Condor daemons on the computing
node, but then the running job will be killed and has to be
resubmitted. Is there a way to keep that job running, but correct the
output of "condor_q" so that it also shows the job as running?
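
(We also wondered whether condor_qedit, which can modify attributes of
a queued job's ClassAd, could patch the status directly, e.g.:

   condor_qedit 18599.0 JobStatus 2   # 2 = Running; speculative, the
                                      # schedd may reject edits of JobStatus

but we do not know whether this is safe or even accepted.)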

The detailed "condor_q -l" of that job is in the following. When look
carefully into the list, we find
that "condor_q" shows that the job is held, but it also runs in the
computing node "w05". There is no
CPU time statistics. Actually the job is really running. And the computing
node "w05" is occupied by
that user when checking "condor_status -run" command on that node.
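
For reference, the commands used for this cross-check were along these
lines (job id and node name as in the listing below):

   condor_q -l 18599.0      # full ClassAd; JobStatus = 5 means Held
   condor_status -run w05   # yet w05 reports that user's job running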

Thanks very much for your suggestions. :)

T.H.Hsieh


-- Submitter: w00.xxx.xxx.xxx.tw : <192.168.10.1:53890> : w00.xxx.xxx.xxx.tw
MyType = "Job"
TargetType = "Machine"
ClusterId = 18599
QDate = 1223349567
CompletionDate = 0
Owner = "kwXXXX"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.2 Oct 12 2006 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/work1/hkw/Guanidine"
JobUniverse = 11
Cmd = "/opt/bin/gaussian"
WantIOProxy = TRUE
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
MinHosts = 4
MaxHosts = 4
JobPrio = 0
User = "kwXXXX@xxxxxxxxxxxxxx"
NiceUser = FALSE
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/work1/hkw/Guanidine/deH-TS2.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "deH-TS2.output"
StreamOut = FALSE
Err = "deH-TS2.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 1
ImageSize = 10000
ExecutableSize_RAW = 1
ExecutableSize = 10000
DiskUsage_RAW = 1
DiskUsage = 10000
Requirements = (JTYPE == "long") && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "rcas.sinica.edu.tw"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
OnExitHold = FALSE
OnExitRemove = TRUE
LeaveJobInQueue = FALSE
Args = "deH-TS2.input"
WantParallelSchedulingGroups = TRUE
GlobalJobId = "w00.rcas.sinica.edu.tw#1223349567#18599.0"
ProcId = 0
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
ClaimIds = "<192.168.10.6:50699>#1222327061#1,<192.168.10.6:50699
#1222327061#2,
<192.168.10.6:50699>#1222327061#3,<192.168.10.6:50699>#1222327061#4"
RemoteHosts = "vm1@w05,vm2@w05,vm3@w05,vm4@w05"
CurrentHosts = 4
ClaimId = "<192.168.10.6:50699>#1222327061#1"
LastJobLeaseRenewal = 1223349602
RemoteHost = "vm1@w05"
RemoteVirtualMachineID = 1
ShadowBday = 1223349602
JobStartDate = 1223349602
JobCurrentStartDate = 1223349602
JobRunCount = 1
OrigMaxHosts = 4
LastSuspensionTime = 0
LastHoldReason = "via condor_hold (by user root)"
JobStatus = 5
HoldReason = "via condor_hold (by user root)"
EnteredCurrentStatus = 1223349803
LastReleaseReason = "via condor_release (by user root)"
ServerTime = 1223363491


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.