
[Condor-users] CLAIMED but IDLE on Windows XP machines



Hi all,
We are running Condor v6.6.5 on about 270 Windows XP machines.
Our system runs jobs on all of these machines; however, after completing
a job, a machine stays in the Claimed/Idle state for a long time. Some of
the worker nodes sat in the IDLE state for about 17 minutes each.

The only odd thing I could find was this error message in the ShadowLog on
the submit machine:

5/6 16:52:33 (583907.0) (3016): getpeername failed so connect must have failed

and this a bit later on:

5/6 16:53:32 (583907.0) (3016): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

In fact, every job listed in the ShadowLog on the submit machine seems to
start and end with these messages; however, the jobs still run.
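
One way to check that claim across the whole ShadowLog is to tally the
messages per job. The following is only a minimal sketch: it assumes each log
entry sits on a single line (the entries quoted above were wrapped by the
mail client), and the function name summarize_shadow_log is made up for
illustration.

```python
import re
from collections import Counter

# Patterns modelled on the two ShadowLog lines quoted above.
# Assumption: each log entry is on one line in the real file.
EXIT = re.compile(
    r"\((\d+\.\d+)\) \(\d+\): \*\*\*\* condor_shadow \(condor_SHADOW\) "
    r"EXITING WITH STATUS (\d+)"
)
PEER = re.compile(r"\((\d+\.\d+)\) \(\d+\): getpeername failed")


def summarize_shadow_log(lines):
    """Return (Counter of shadow exit statuses, set of job IDs that
    logged a getpeername failure)."""
    statuses = Counter()
    peer_failures = set()
    for line in lines:
        m = EXIT.search(line)
        if m:
            statuses[int(m.group(2))] += 1
            continue
        m = PEER.search(line)
        if m:
            peer_failures.add(m.group(1))
    return statuses, peer_failures


SAMPLE = [
    "5/6 16:52:33 (583907.0) (3016): getpeername failed so connect must have failed",
    "5/6 16:53:32 (583907.0) (3016): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100",
]

statuses, peer_failures = summarize_shadow_log(SAMPLE)
print(dict(statuses), sorted(peer_failures))
```

To run it on a real log, pass an open file object instead of SAMPLE.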



Running condor_q -analyze did not reveal anything interesting.
For example, this is the ClassAd of a typical machine in the Claimed/Idle
state:

MyType = "Machine"
TargetType = "Job"
Name = "A1"
Machine = "A1"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.6.5 May  4 2004 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 1
ImageSize = 1
ExecutableSize = 1
JobUniverse = 5
NiceUser = FALSE
VirtualMemory = 2351940
Disk = 32668588
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 3302
ConsoleIdle = 3302
Memory = 1024
Cpus = 1
StartdIpAddr = "<129.127.237.51:1055>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "A1"
FileSystemDomain = "A1"
Subnet = "129.127.237"
HasIOProxy = TRUE
TotalVirtualMemory = 2351940
TotalDisk = 32668588
KFlops = 698186
Mips = 2140
LastBenchmark = 1115276842
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 1022
ClockDay = 5
TotalVirtualMachines = 1
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
JavaVendor = "Sun Microsystems Inc."
JavaVersion = "1.4.2_07"
JavaMFlops = 192.357315
HasJava = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin,HasJava"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1115362926
Activity = "Idle"
EnteredCurrentActivity = 1115364152
Start = ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000) || (State != "Unclaimed" && State != "Owner")))
Requirements = START
CurrentRank = 0.000000
RemoteUser = "cqhoward@xxxxxxxxxxxxxxx"
RemoteOwner = "cqhoward@xxxxxxxxxxxxxxxx"
ClientMachine = "mechcats1.mecheng.adelaide.edu.au"
DaemonStartTime = 1115276832
UpdateSequenceNumber = 338
MyAddress = "<129.127.237.51:1055>"
LastHeardFrom = 1115364750
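
The timestamps in this ad give a rough idea of how long the machine had been
sitting idle when the ad was sent. The sketch below assumes that
EnteredCurrentState, EnteredCurrentActivity and LastHeardFrom are all Unix
epoch seconds and that LastHeardFrom is close to the time the ad was
generated; the helper parse_classad_ints is made up for illustration.

```python
import re


def parse_classad_ints(text):
    """Return {attribute: int} for lines of the form 'Name = 123'."""
    values = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+)\s*=\s*(-?\d+)\s*$", line)
        if m:
            values[m.group(1)] = int(m.group(2))
    return values


# The relevant attributes, copied from the machine ad above.
AD = """\
State = "Claimed"
EnteredCurrentState = 1115362926
Activity = "Idle"
EnteredCurrentActivity = 1115364152
LastHeardFrom = 1115364750
"""

ad = parse_classad_ints(AD)

# Assumption: all three values are epoch seconds, so plain subtraction
# gives elapsed seconds.
claimed_for = ad["LastHeardFrom"] - ad["EnteredCurrentState"]
idle_for = ad["LastHeardFrom"] - ad["EnteredCurrentActivity"]
print(f"claimed for {claimed_for} s, idle for {idle_for} s")
```

Under those assumptions, this machine had been Claimed for 1824 seconds and
Idle for 598 seconds (about 10 minutes) at the time of the ad.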




The StartLog on the machine shows a typical sequence: the machine goes from
Idle to Busy, runs the job successfully, and then waits about 17 minutes
(16:52:32 to 17:09:00) before starting another job, even though there are
100 identical jobs waiting in the queue.




5/6 16:49:32 Got activate_claim request from shadow (<129.127.14.42:2540>)
5/6 16:49:32 Remote job ID is 583907.0
5/6 16:49:32 Got universe "VANILLA" (5) from request classad
5/6 16:49:32 State change: claim-activation protocol successful
5/6 16:49:32 Changing activity: Idle -> Busy
5/6 16:52:31 DaemonCore: Command received via TCP from host <129.127.14.42:3043>
5/6 16:52:31 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
5/6 16:52:31 Called deactivate_claim_forcibly()
5/6 16:52:31 DaemonCore: Command received via UDP from host <129.127.237.51:4419>
5/6 16:52:31 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
5/6 16:52:31 Starter pid 1604 exited with status 0
5/6 16:52:32 State change: starter exited
5/6 16:52:32 Changing activity: Busy -> Idle
5/6 17:09:00 DaemonCore: Command received via TCP from host <129.127.14.42:4874>
5/6 17:09:00 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
5/6 17:09:00 Got activate_claim request from shadow (<129.127.14.42:4874>)
5/6 17:09:00 Remote job ID is 584041.0
5/6 17:09:00 Got universe "VANILLA" (5) from request classad
5/6 17:09:00 State change: claim-activation protocol successful
5/6 17:09:00 Changing activity: Idle -> Busy
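
The idle gap in the excerpt above can be measured for every job by pairing
each "Busy -> Idle" line with the next "Idle -> Busy" line. This is only a
sketch: the function name idle_gaps is made up, and because the log
timestamps carry no year, the year is passed in as an assumption (2005
here).

```python
import re
from datetime import datetime

# Matches lines such as "5/6 16:52:32 Changing activity: Busy -> Idle".
ACTIVITY = re.compile(
    r"^(\d+/\d+ \d+:\d+:\d+) Changing activity: (\w+) -> (\w+)"
)


def idle_gaps(lines, year=2005):
    """Return a list of (idle_start, idle_end, seconds) tuples.

    Assumption: the log has no year in its timestamps, so the caller
    supplies one.
    """
    gaps = []
    idle_since = None
    for line in lines:
        m = ACTIVITY.match(line)
        if not m:
            continue
        when = datetime.strptime(f"{year}/{m.group(1)}", "%Y/%m/%d %H:%M:%S")
        old, new = m.group(2), m.group(3)
        if new == "Idle":
            idle_since = when
        elif old == "Idle" and idle_since is not None:
            gaps.append((idle_since, when, (when - idle_since).total_seconds()))
            idle_since = None
    return gaps


SAMPLE = [
    "5/6 16:49:32 Changing activity: Idle -> Busy",
    "5/6 16:52:31 Starter pid 1604 exited with status 0",
    "5/6 16:52:32 Changing activity: Busy -> Idle",
    "5/6 17:09:00 Changing activity: Idle -> Busy",
]

gaps = idle_gaps(SAMPLE)
for start, end, seconds in gaps:
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: idle for {seconds:.0f} s")
```

For the excerpt above this reports a single gap of 988 seconds (16 minutes
28 seconds).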


The StarterLog on the same machine shows the same gap:

5/6 16:49:33 ******************************************************
5/6 16:49:33 ** condor_starter (CONDOR_STARTER) STARTING UP
5/6 16:49:33 ** $CondorVersion: 6.6.5 May  4 2004 $
5/6 16:49:33 ** $CondorPlatform: INTEL-WINNT40 $
5/6 16:49:33 ** PID = 1604
5/6 16:49:33 ******************************************************
5/6 16:49:33 Using config file: C:\Condor\condor_config
5/6 16:49:33 Using local config files: C:\Condor/condor_config.local
5/6 16:49:33 DaemonCore: Command Socket at <129.127.237.51:4407>
5/6 16:49:33 Setting resource limits not implemented!
5/6 16:49:33 Starter communicating with condor_shadow <129.127.14.42:2522>
5/6 16:49:33 Submitting machine is "mechcats1.mecheng.adelaide.edu.au"
5/6 16:49:36 File transfer completed successfully.
5/6 16:49:37 Starting a VANILLA universe job with ID: 583907.0
5/6 16:49:37 IWD: C:\Condor/execute\dir_1604
5/6 16:49:37 Output file: C:\Condor/execute\dir_1604\evaluate_condor_int583907.out
5/6 16:49:37 Error file: C:\Condor/execute\dir_1604\evaluate_condor_int583907.err
5/6 16:49:37 Renice expr "10" evaluated to 10
5/6 16:49:37 About to exec C:\WINDOWS\System32\cmd.exe /Q /C condor_exec.bat

5/6 16:49:37 Create_Process succeeded, pid=2772
5/6 16:52:30 Process exited, pid=2772, status=0
5/6 16:52:31 Got SIGQUIT.  Performing fast shutdown.
5/6 16:52:31 ShutdownFast all jobs.
5/6 16:52:31 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
5/6 17:09:00 ******************************************************
5/6 17:09:00 ** condor_starter (CONDOR_STARTER) STARTING UP
5/6 17:09:00 ** $CondorVersion: 6.6.5 May  4 2004 $
5/6 17:09:00 ** $CondorPlatform: INTEL-WINNT40 $
5/6 17:09:00 ** PID = 2992
5/6 17:09:00 ******************************************************
5/6 17:09:00 Using config file: C:\Condor\condor_config
5/6 17:09:00 Using local config files: C:\Condor/condor_config.local
5/6 17:09:00 DaemonCore: Command Socket at <129.127.237.51:4428>
5/6 17:09:00 Setting resource limits not implemented!
5/6 17:09:00 Starter communicating with condor_shadow <129.127.14.42:4866>
5/6 17:09:00 Submitting machine is "mechcats1.mecheng.adelaide.edu.au"
5/



Any clues as to why these machines wait so long before accepting another
job?


Thanks

Carl