
Re: [Condor-users] Inconsistent status in condor_q



Hello,

Thanks very much to Steven and Robert for your kind responses.

Actually, I did run the command "condor_release" on that job, but the problem
remains. We use a shared network filesystem on the cluster, and a NIS server
runs on the control node "w00", so file ownership and permissions are
consistent between w00 and w05.

The reason that "root" runs the "condor_hold" command automatically is that
we want additional control over resource management. For example, if a user
has already submitted many jobs, he gets a lower priority for new jobs. Or if
a user submits a job that requires many CPUs while the system does not have
enough free CPUs at the moment, we want to hold that job so that other jobs
requiring fewer CPUs can run without waiting. Since we don't know how to
configure these policies in Condor, we wrote some programs that run
"condor_hold" or "condor_release" on jobs to implement this control.
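
(For reference, part of this may be expressible in Condor itself: fair-share
between users is what the negotiator's user priorities already provide (see
condor_userprio), and per-job holds can sometimes be written as config policy.
This is only a sketch; it assumes our 6.8.x build supports the
SYSTEM_PERIODIC_HOLD expression, and the 16-host cap is an arbitrary example:)

    # In the schedd's local config on w00 (e.g. /opt/condor/etc/w00.local).
    # Sketch only: hold any job that asks for more hosts than an example
    # cap of 16. Assumes SYSTEM_PERIODIC_HOLD exists in this 6.8.x build.
    SYSTEM_PERIODIC_HOLD = (MinHosts =!= UNDEFINED) && (MinHosts > 16)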

Here is the "ShadowLog" on w00.

10/7 11:20:02 ******************************************************
10/7 11:20:02 ** condor_shadow (CONDOR_SHADOW) STARTING UP
10/7 11:20:02 ** /opt/condor/sbin/condor_shadow
10/7 11:20:02 ** $CondorVersion: 6.8.2 Oct 12 2006 $
10/7 11:20:02 ** $CondorPlatform: X86_64-LINUX_RHEL3 $
10/7 11:20:02 ** PID = 26927
10/7 11:20:02 ** Log last touched 10/7 11:15:02
10/7 11:20:02 ******************************************************
10/7 11:20:02 Using config source: /opt/condor/etc/condor_config
10/7 11:20:02 Using local config sources:
10/7 11:20:02    /opt/condor/etc/condor_config.common
10/7 11:20:02    /opt/condor/etc/w00.local
10/7 11:20:02 DaemonCore: Command Socket at <192.168.10.1:58808>
10/7 11:20:02 Initializing a PARALLEL shadow for job 18599.0
10/7 11:20:03 (18599.0) (26927): Request to run on <192.168.10.6:50699> was ACCEPTED
10/7 11:20:03 (18599.0) (26927): Request to run on <192.168.10.6:50699> was ACCEPTED
10/7 11:20:03 (18599.0) (26927): Request to run on <192.168.10.6:50699> was ACCEPTED
10/7 11:20:03 (18599.0) (26927): Request to run on <192.168.10.6:50699> was ACCEPTED

================================================

Here is the "StarterLog" on w05.

10/7 11:19:59 DaemonCore: Command received via TCP from host <192.168.10.1:33081>
10/7 11:19:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
10/7 11:19:59 vm1: Request accepted.
10/7 11:19:59 vm1: Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx
10/7 11:19:59 vm1: State change: claiming protocol successful
10/7 11:19:59 vm1: Changing state: Unclaimed -> Claimed
10/7 11:19:59 DaemonCore: Command received via TCP from host <192.168.10.1:52451>
10/7 11:19:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
10/7 11:19:59 vm2: Request accepted.
10/7 11:19:59 vm2: Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx
10/7 11:19:59 vm2: State change: claiming protocol successful
10/7 11:19:59 vm2: Changing state: Unclaimed -> Claimed
10/7 11:19:59 DaemonCore: Command received via TCP from host <192.168.10.1:38019>
10/7 11:19:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
10/7 11:19:59 vm3: Request accepted.
10/7 11:19:59 vm3: Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx
10/7 11:19:59 vm3: State change: claiming protocol successful
10/7 11:19:59 vm3: Changing state: Unclaimed -> Claimed
10/7 11:19:59 DaemonCore: Command received via TCP from host <192.168.10.1:34693>
10/7 11:19:59 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
10/7 11:19:59 vm4: Request accepted.
10/7 11:19:59 vm4: Remote owner is DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx
10/7 11:19:59 vm4: State change: claiming protocol successful
10/7 11:19:59 vm4: Changing state: Unclaimed -> Claimed
10/7 11:19:59 DaemonCore: Command received via UDP from host <192.168.10.1:54857>
10/7 11:19:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
10/7 11:19:59 vm1: match_info called
10/7 11:19:59 DaemonCore: Command received via UDP from host <192.168.10.1:54858>
10/7 11:19:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
10/7 11:19:59 vm2: match_info called
10/7 11:19:59 DaemonCore: Command received via UDP from host <192.168.10.1:54859>
10/7 11:19:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
10/7 11:19:59 vm3: match_info called
10/7 11:19:59 DaemonCore: Command received via UDP from host <192.168.10.1:54860>
10/7 11:19:59 DaemonCore: received command 440 (MATCH_INFO), calling handler (command_match_info)
10/7 11:19:59 vm4: match_info called
10/7 11:20:02 DaemonCore: Command received via TCP from host <192.168.10.1:50010>
10/7 11:20:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
10/7 11:20:02 vm1: Got activate_claim request from shadow (<192.168.10.1:50010>)
10/7 11:20:02 vm1: Remote job ID is 18599.0
10/7 11:20:02 vm1: Got universe "PARALLEL" (11) from request classad
10/7 11:20:02 vm1: State change: claim-activation protocol successful
10/7 11:20:02 vm1: Changing activity: Idle -> Busy
10/7 11:20:02 DaemonCore: Command received via TCP from host <192.168.10.1:52492>
10/7 11:20:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
10/7 11:20:02 vm2: Got activate_claim request from shadow (<192.168.10.1:52492>)
10/7 11:20:02 vm2: Remote job ID is 18599.0
10/7 11:20:02 vm2: Got universe "PARALLEL" (11) from request classad
10/7 11:20:02 vm2: State change: claim-activation protocol successful
10/7 11:20:02 vm2: Changing activity: Idle -> Busy
10/7 11:20:02 DaemonCore: Command received via TCP from host <192.168.10.1:53963>
10/7 11:20:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
10/7 11:20:02 vm3: Got activate_claim request from shadow (<192.168.10.1:53963>)
10/7 11:20:02 vm3: Remote job ID is 18599.0
10/7 11:20:02 vm3: Got universe "PARALLEL" (11) from request classad
10/7 11:20:02 vm3: State change: claim-activation protocol successful
10/7 11:20:02 vm3: Changing activity: Idle -> Busy
10/7 11:20:02 DaemonCore: Command received via TCP from host <192.168.10.1:51850>
10/7 11:20:02 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
10/7 11:20:02 vm4: Got activate_claim request from shadow (<192.168.10.1:51850>)
10/7 11:20:02 vm4: Remote job ID is 18599.0
10/7 11:20:02 vm4: Got universe "PARALLEL" (11) from request classad
10/7 11:20:02 vm4: State change: claim-activation protocol successful
10/7 11:20:02 vm4: Changing activity: Idle -> Busy
10/7 11:27:09 DaemonCore: Command received via UDP from host <192.168.10.6:51154>
10/7 11:27:09 DaemonCore: received command 60000 (DC_RAISESIGNAL), calling handler (HandleSigCommand())
10/7 11:27:09 Got SIGHUP.  Re-reading config files.
10/7 11:27:09 "/opt/condor/sbin/condor_starter.pvm -classad" did not produce any
 output, ignoring


Everything looks normal. Are there any other suggestions?

Thanks again for your kind responses.


T.H.Hsieh


2008/10/7 Robert Futrick <rfutrick@xxxxxxxxxxxxxxxxxx>
T.H.Hsieh,

I would look at the ShadowLog on w00 and the StarterLog on w05 for this job. It may be that a file/directory/permissions problem exists, since it appears from your classad that you have a shared filesystem. Hopefully these logs will shed some more light on what is occurring if you look for messages about 18599.0.
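
(Something along these lines should find them; the log directory varies by
installation, so condor_config_val can locate it first:)

    condor_config_val LOG                            # print the log directory
    grep 18599 `condor_config_val LOG`/ShadowLog     # run this on w00
    grep 18599 `condor_config_val LOG`/StarterLog*   # run this on w05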

Hope this helps and good luck,

Regards,
Rob
-- 

===================================
Rob Futrick
main: 888.292.5320

Cycle Computing, LLC
Leader in Condor Grid Solutions
Enterprise Condor Support and CycleServer Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com


Tung-Han Hsieh wrote:
Hello,

I encountered the following situation. A job was submitted through "condor_submit" and was waiting to
run. At the moment the job was about to start, root ran the "condor_hold" command to hold it. The end
result: condor_q shows that the job is held, but the job is actually running on one of our computing nodes.

My question is: how can I correct this inconsistent result in condor_q? One way is to restart the Condor
daemons on the computing node, but then the running job will be killed and has to be resubmitted. Is
there a way to keep that job running and correct the output of "condor_q" so that it also shows the job
as running?
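
(For concreteness, the two options as I understand them; I would much prefer
the first, if it is sufficient:)

    condor_release 18599.0      # clear the hold on the job
    condor_restart -name w05    # restart Condor on w05, but this kills the job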

The detailed "condor_q -l" output for that job follows. Looking carefully at the list, we find that
"condor_q" shows the job as held, yet it is also running on the computing node "w05". There are no
CPU time statistics, but the job really is running, and "condor_status -run" on that node shows that
"w05" is occupied by that user.
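
(That is, we checked along these lines:)

    condor_q -l 18599.0              # the full classad, shown below
    condor_status -run | grep w05    # the slots on w05 show that user's job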

Thanks very much for your suggestions. :)

T.H.Hsieh


-- Submitter: w00.xxx.xxx.xxx.tw : <192.168.10.1:53890> : w00.xxx.xxx.xxx.tw
MyType = "Job"
TargetType = "Machine"
ClusterId = 18599
QDate = 1223349567
CompletionDate = 0
Owner = "kwXXXX"
RemoteWallClockTime = 0.000000
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
RemoteUserCpu = 0.000000
RemoteSysCpu = 0.000000
ExitStatus = 0
NumCkpts_RAW = 0
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
TotalSuspensions = 0
CumulativeSuspensionTime = 0
ExitBySignal = FALSE
CondorVersion = "$CondorVersion: 6.8.2 Oct 12 2006 $"
CondorPlatform = "$CondorPlatform: X86_64-LINUX_RHEL3 $"
RootDir = "/"
Iwd = "/work1/hkw/Guanidine"
JobUniverse = 11
Cmd = "/opt/bin/gaussian"
WantIOProxy = TRUE
WantRemoteSyscalls = FALSE
WantCheckpoint = FALSE
MinHosts = 4
MaxHosts = 4
JobPrio = 0
User = "kwXXXX@xxxxxxxxxxxxxx"
NiceUser = FALSE
Environment = ""
JobNotification = 2
WantRemoteIO = TRUE
UserLog = "/work1/hkw/Guanidine/deH-TS2.log"
CoreSize = 0
KillSig = "SIGTERM"
Rank = 0.000000
In = "/dev/null"
TransferIn = FALSE
Out = "deH-TS2.output"
StreamOut = FALSE
Err = "deH-TS2.error"
StreamErr = FALSE
BufferSize = 524288
BufferBlockSize = 32768
ShouldTransferFiles = "NO"
TransferFiles = "NEVER"
ImageSize_RAW = 1
ImageSize = 10000
ExecutableSize_RAW = 1
ExecutableSize = 10000
DiskUsage_RAW = 1
DiskUsage = 10000
Requirements = (JTYPE == "long") && (Arch == "X86_64") && (OpSys == "LINUX") && (Disk >= DiskUsage) && ((Memory * 1024) >= ImageSize) && (TARGET.FileSystemDomain == MY.FileSystemDomain)
FileSystemDomain = "rcas.sinica.edu.tw"
JobLeaseDuration = 1200
PeriodicHold = FALSE
PeriodicRelease = FALSE
PeriodicRemove = FALSE
LeaveJobInQueue = FALSE
Args = "deH-TS2.input"
WantParallelSchedulingGroups = TRUE
GlobalJobId = "w00.rcas.sinica.edu.tw#1223349567#18599.0"
ProcId = 0
Scheduler = "DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx"
ClaimIds = "<192.168.10.6:50699>#1222327061#1,<192.168.10.6:50699>#1222327061#2,<192.168.10.6:50699>#1222327061#3,<192.168.10.6:50699>#1222327061#4"
RemoteHosts = "vm1@w05,vm2@w05,vm3@w05,vm4@w05"
CurrentHosts = 4
ClaimId = "<192.168.10.6:50699>#1222327061#1"
LastJobLeaseRenewal = 1223349602
RemoteHost = "vm1@w05"
RemoteVirtualMachineID = 1
ShadowBday = 1223349602
JobStartDate = 1223349602
JobCurrentStartDate = 1223349602
JobRunCount = 1
OrigMaxHosts = 4
LastSuspensionTime = 0
LastHoldReason = "via condor_hold (by user root)"
JobStatus = 5
HoldReason = "via condor_hold (by user root)"
EnteredCurrentStatus = 1223349803
LastReleaseReason = "via condor_release (by user root)"
ServerTime = 1223363491






_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/