
Re: [Condor-users] Inconsistent status in condor_q



T.H.Hsieh,

I would look at the ShadowLog on w00 and the StarterLog on w05 for this job. Your ClassAd suggests that you have a shared filesystem, so a file, directory, or permissions problem may be involved. Search both logs for messages about 18599.0; they may shed more light on what is occurring.
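
As a rough illustration of that log search, here is a minimal Python sketch that keeps only the lines mentioning a given job ID. The function name is hypothetical, and where the ShadowLog and StarterLog live depends on your configuration (running "condor_config_val LOG" on each machine should print the log directory); nothing here is specific to a particular Condor version.

```python
import re
from typing import Iterable, Iterator


def lines_mentioning_job(lines: Iterable[str], job_id: str) -> Iterator[str]:
    """Yield the lines that mention job_id (e.g. "18599.0") as a whole token.

    The digit guards on both sides keep "18599.0" from matching inside
    longer numbers such as "118599.0" or "18599.01".
    """
    pattern = re.compile(r"(?<!\d)" + re.escape(job_id) + r"(?!\d)")
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")
```

To use it, open the ShadowLog on w00 (or the StarterLog on w05) with open(path, errors="replace") and pass the file object together with "18599.0"; the path itself is whatever your installation uses.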

Hope this helps and good luck,

Regards,
Rob
-- 

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com


Tung-Han Hsieh wrote:
Hello,

I encountered the following situation. A job was submitted with "condor_submit" and was waiting to
run. Just as the job was about to start, root ran the "condor_hold" command
to hold it. The end result is that condor_q shows the job as held, but the job is actually running
on one of our computing nodes.

My question is: how can I correct this inconsistent result in condor_q? One way is to restart the condor
daemons on the computing node, but then the running job will be killed and will have to be resubmitted. Is
there a way to keep the job running but correct the output of "condor_q" so that it also shows
the job as running?

The detailed "condor_q -l" output for that job follows. Looking carefully at the listing, we find
that "condor_q" shows the job as held, yet it also lists the computing node "w05" as the remote host. There are no
CPU time statistics, but the job really is running, and the "condor_status -run" command shows that
"w05" is occupied by that user.

Thanks very much for your suggestions. :)

T.H.Hsieh


-- Submitter: w00.xxx.xxx.xxx.tw : <192.168.10.1:53890> : w00.xxx.xxx.xxx.tw
MyType = "Job"
TargetType = "Machine"
ClusterId = 18599
QDate = 1223349567
CompletionDate = 0
Owner = "kwXXXX"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.2 Oct 12 2006 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/work1/hkw/Guanidine"
JobUniverse = 11
Cmd = "/opt/bin/gaussian"
WantIOProxy = TRUE
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
MinHosts = 4
MaxHosts = 4
JobPrio = 0
User = "kwXXXX@xxxxxxxxxxxxxx"
NiceUser = FALSE
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/work1/hkw/Guanidine/deH-TS2.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "deH-TS2.output"
StreamOut = FALSE
Err = "deH-TS2.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 1
ImageSize = 10000
ExecutableSize_RAW = 1
ExecutableSize = 10000
DiskUsage_RAW = 1
DiskUsage = 10000
Requirements = (JTYPE == "long") && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "rcas.sinica.edu.tw"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
LeaveJobInQueue = FALSE
Args = "deH-TS2.input"
WantParallelSchedulingGroups = TRUE
GlobalJobId = "w00.rcas.sinica.edu.tw#1223349567#18599.0"
ProcId = 0
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
ClaimIds = "<192.168.10.6:50699>#1222327061#1,<192.168.10.6:50699>#1222327061#2,<192.168.10.6:50699>#1222327061#3,<192.168.10.6:50699>#1222327061#4"
RemoteHosts = "vm1@w05,vm2@w05,vm3@w05,vm4@w05"
CurrentHosts = 4
ClaimId = "<192.168.10.6:50699>#1222327061#1"
LastJobLeaseRenewal = 1223349602
RemoteHost = "vm1@w05"
RemoteVirtualMachineID = 1
ShadowBday = 1223349602
JobStartDate = 1223349602
JobCurrentStartDate = 1223349602
JobRunCount = 1
OrigMaxHosts = 4
LastSuspensionTime = 0
LastHoldReason = "via condor_hold (by user root)"
JobStatus = 5
HoldReason = "via condor_hold (by user root)"
EnteredCurrentStatus = 1223349803
LastReleaseReason = "via condor_release (by user root)"
ServerTime = 1223363491
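
To make the inconsistency in the listing above easier to spot, here is a small Python sketch that parses a few of those attributes and flags a job that is marked held while still listing claimed hosts. The function names are hypothetical, the check is only a heuristic based on this one dump, and it diagnoses the mismatch rather than correcting it. JobStatus 5 is the code for a held job (2 is running).

```python
# Excerpt copied from the "condor_q -l" listing above.
SAMPLE = '''\
ClusterId = 18599
ProcId = 0
CurrentHosts = 4
RemoteHost = "vm1@w05"
JobCurrentStartDate = 1223349602
JobStatus = 5
HoldReason = "via condor_hold (by user root)"
EnteredCurrentStatus = 1223349803
'''

JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


def parse_classad(text):
    """Parse 'Name = value' lines into a dict of strings (quotes stripped)."""
    ad = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip lines without ' = ', such as the '-- Submitter' banner
            ad[name.strip()] = value.strip().strip('"')
    return ad


def held_while_claimed(ad):
    """Heuristic: the queue says Held, yet the ad still lists claimed hosts."""
    return int(ad["JobStatus"]) == 5 and int(ad.get("CurrentHosts", "0")) > 0


def seconds_from_start_to_status_change(ad):
    """Seconds between the job's current start and its last status change."""
    return int(ad["EnteredCurrentStatus"]) - int(ad["JobCurrentStartDate"])


ad = parse_classad(SAMPLE)
print(JOB_STATUS[int(ad["JobStatus"])], "on", ad.get("RemoteHost"))
print("held while claimed:", held_while_claimed(ad))
print("seconds from start to hold:", seconds_from_start_to_status_change(ad))
```

For this job the sketch reports a held job that still has four claimed hosts, with the hold recorded 201 seconds after JobCurrentStartDate.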


_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/