
[Condor-users] Inconsistent status in condor_q



Hello,

I encountered the following situation. A job was submitted through "condor_submit" and was waiting
to run. At the moment the job was about to start, root ran the "condor_hold" command
to hold it. The end result: condor_q shows the job as held, but the job is actually running
on one of our compute nodes.

My question is: how can I correct this inconsistent result in condor_q? One way is to restart the Condor
daemons on the compute node, but then the running job would be killed and would have to be resubmitted. Is
there a way to keep that job running, but correct the result of "condor_q" so that it also shows
that the job is running?
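The only two approaches I can think of look roughly like the sketch below (the job id 18599.0
is taken from the "condor_q -l" dump at the end of this mail; I am not sure whether a plain
condor_release is actually safe while the shadow is still alive):

    # Option 1 (heavy-handed): restart the Condor daemons on w05.
    # This would reconcile the state, but kills the running job.
    condor_restart w05

    # Option 2 (what I would prefer, if it works): release the held job
    # and hope the schedd re-syncs with the still-running shadow/starter.
    condor_release 18599.0
    condor_q -l 18599.0 | grep JobStatus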

The detailed "condor_q -l" output for that job is below. Looking carefully at the list, we find
that "condor_q" shows the job as held, yet it is also listed as running on the compute node "w05". There are no
CPU time statistics, but the job really is running, and the compute node "w05" shows as occupied by
that user when checking with the "condor_status -run" command on that node.
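In case it helps, these are roughly the commands I used for the checks above (cluster 18599
from the dump below):

    condor_q -l 18599.0        # full ClassAd; JobStatus = 5 means Held
    condor_status -run w05     # the slots on w05 are claimed by that user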

Thanks very much for your suggestions. :)

T.H.Hsieh


-- Submitter: w00.xxx.xxx.xxx.tw : <192.168.10.1:53890> : w00.xxx.xxx.xxx.tw
MyType = "Job"
TargetType = "Machine"
ClusterId = 18599
QDate = 1223349567
CompletionDate = 0
Owner = "kwXXXX"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.2 Oct 12 2006 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/work1/hkw/Guanidine"
JobUniverse = 11
Cmd = "/opt/bin/gaussian"
WantIOProxy = TRUE
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
MinHosts = 4
MaxHosts = 4
JobPrio = 0
User = "kwXXXX@xxxxxxxxxxxxxx"
NiceUser = FALSE
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/work1/hkw/Guanidine/deH-TS2.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "deH-TS2.output"
StreamOut = FALSE
Err = "deH-TS2.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 1
ImageSize = 10000
ExecutableSize_RAW = 1
ExecutableSize = 10000
DiskUsage_RAW = 1
DiskUsage = 10000
Requirements = (JTYPE == "long") && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "rcas.sinica.edu.tw"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
LeaveJobInQueue = FALSE
Args = "deH-TS2.input"
WantParallelSchedulingGroups = TRUE
GlobalJobId = "w00.rcas.sinica.edu.tw#1223349567#18599.0"
ProcId = 0
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
ClaimIds = "<192.168.10.6:50699>#1222327061#1,<192.168.10.6:50699>#1222327061#2,<192.168.10.6:50699>#1222327061#3,<192.168.10.6:50699>#1222327061#4"
RemoteHosts = "vm1@w05,vm2@w05,vm3@w05,vm4@w05"
CurrentHosts = 4
ClaimId = "<192.168.10.6:50699>#1222327061#1"
LastJobLeaseRenewal = 1223349602
RemoteHost = "vm1@w05"
RemoteVirtualMachineID = 1
ShadowBday = 1223349602
JobStartDate = 1223349602
JobCurrentStartDate = 1223349602
JobRunCount = 1
OrigMaxHosts = 4
LastSuspensionTime = 0
LastHoldReason = "via condor_hold (by user root)"
JobStatus = 5
HoldReason = "via condor_hold (by user root)"
EnteredCurrentStatus = 1223349803
LastReleaseReason = "via condor_release (by user root)"
ServerTime = 1223363491
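
P.S. For reference, the JobStatus codes as I understand them, which is why the dump above
reads as held even though RemoteHosts and ShadowBday show the job running:

    # JobStatus values in the job ClassAd:
    #   1 = Idle, 2 = Running, 3 = Removed, 4 = Completed, 5 = Held
    # A quick way to print just the code (assuming -format works in 6.8.2):
    condor_q -format "%d\n" JobStatus 18599.0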