
[Condor-users] idle + claimed, Ian Chesel?



I saw in the posts from March that Ian Chesal had this same problem. Does anyone know if it was resolved off the user list? At first, every machine would get matched but then go back to the Unclaimed state. Then I changed the match timeout from the default of 300 to 600, and suddenly they all became Claimed, but Idle. That is where I am stuck now. Only a select few machines ever go to Busy and run jobs. They are usually the machines with faster CPUs, and it usually happens when there are fewer machines in the pool: if there are only 8 VMs in the pool, they almost always just run with no problem. Does this mean the submitter is being overloaded?

 

In condor_q, it SAYS as many jobs are running as there are machines claimed, but doing a condor_status -verbose on any idle machine shows something like:

 

M:\>condor_status -l bsanchez

MyType = "Machine"

TargetType = "Job"

Name = "vm2@xxxxxxxxxxxxxxxxx"

Machine = "BSANCHEZ.BHI.CORP"

Rank = 0.000000

CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)

COLLECTOR_HOST_STRING = "a-abq-lic.bhi.corp"

CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"

CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"

VirtualMachineID = 2

ImageSize = 1

ExecutableSize = 1

JobUniverse = 5

NiceUser = FALSE

VirtualMemory = 1186088

Disk = 22151548

CondorLoadAvg = 0.000000

LoadAvg = 0.000000

KeyboardIdle = 40204

ConsoleIdle = 40204

Memory = 511

Cpus = 1

StartdIpAddr = "<192.168.100.190:2394>"

Arch = "INTEL"

OpSys = "WINNT51"

UidDomain = "bhi.corp"

FileSystemDomain = "bhi.corp"

Subnet = "192.168.100"

HasIOProxy = TRUE

TotalVirtualMemory = 2372176

TotalDisk = 44303096

KFlops = 879778

Mips = 2804

LastBenchmark = 1111828400

TotalLoadAvg = 0.000000

TotalCondorLoadAvg = 0.000000

ClockMin = 803

ClockDay = 6

TotalVirtualMachines = 2

HasFileTransfer = TRUE

HasMPI = TRUE

HasJICLocalConfig = TRUE

HasJICLocalStdin = TRUE

StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"

 

CpuBusyTime = 0

CpuIsBusy = FALSE

State = "Claimed"

EnteredCurrentState = 1111829916

Activity = "Idle"

EnteredCurrentActivity = 1111866159

Start = KeyboardIdle > 5 * 60

Requirements = START

CurrentRank = 0.000000

RemoteUser = "jnipper@xxxxxxxx"

RemoteOwner = "jnipper@xxxxxxxx"

ClientMachine = "cy2-conferece"

DaemonStartTime = 1111828391

UpdateSequenceNumber = 149

MyAddress = "<192.168.100.190:2394>"

LastHeardFrom = 1111868605

UpdatesTotal = 148

UpdatesSequenced = 148

UpdatesLost = 6

UpdatesHistory = "0x005000a0006000000000000000000000"
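The timestamps in the ad above are Unix epoch seconds, so subtracting them shows how long the slot has been sitting in its current state. The following is a minimal sketch, not part of the original post: it assumes the ad is available as plain "Name = value" text, copies only the attributes it needs, and uses hypothetical helper names (parse_ad, seconds_in).

```python
import re

# Excerpt of the ClassAd quoted above; only the attributes used here.
AD_TEXT = '''
State = "Claimed"
EnteredCurrentState = 1111829916
Activity = "Idle"
EnteredCurrentActivity = 1111866159
LastHeardFrom = 1111868605
'''


def parse_ad(text):
    """Parse simple 'Name = value' lines into a dict.

    Quoted values become strings without quotes, plain integers become
    ints, and anything else (expressions) is kept as raw text.
    """
    ad = {}
    for line in text.splitlines():
        m = re.match(r'^\s*(\w+)\s*=\s*(.+?)\s*$', line)
        if not m:
            continue  # skip blank or unparseable lines
        key, value = m.groups()
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            ad[key] = value[1:-1]
        elif re.fullmatch(r'-?\d+', value):
            ad[key] = int(value)
        else:
            ad[key] = value
    return ad


def seconds_in(ad, entered_attr):
    """Seconds between the given 'Entered...' time and the last update."""
    return ad["LastHeardFrom"] - ad[entered_attr]


ad = parse_ad(AD_TEXT)
claimed_for = seconds_in(ad, "EnteredCurrentState")
idle_for = seconds_in(ad, "EnteredCurrentActivity")

print(f'{ad["State"]} for {claimed_for} s ({claimed_for / 3600:.1f} h)')
print(f'{ad["Activity"]} for {idle_for} s ({idle_for / 60:.1f} min)')
```

For the values quoted above this works out to 38689 seconds Claimed (about 10.7 hours) and 2446 seconds Idle (about 41 minutes) as of the last update the collector received.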

 

I’m running everything on Windows XP, a mix of SP1 and SP2. I changed the condor_config on the submitting machine so it can run up to 2000 jobs, and I changed the value in the registry to 1280 as suggested under “Windows specific issues” in the manual. The submitter has a gigabit Ethernet card, so its network utilization never goes over about 10%. Worker nodes are 100Base-T, the jobs are only about 50 MB, and nobody is on these machines at night, so I don’t think network bandwidth is a problem. A fetchlog on the above machine for STARTD will typically look like this:

 

3/26 12:08:05 DaemonCore: Command received via UDP from host <192.168.100.190:3600>

3/26 12:08:05 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

3/26 12:08:05 Starter pid 1020 exited with status 0

3/26 12:08:05 vm1: State change: starter exited

3/26 12:08:05 vm1: Changing activity: Busy -> Idle

3/26 12:38:12 DaemonCore: Command received via TCP from host <192.168.101.116:4695>

3/26 12:38:12 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)

3/26 12:38:12 vm2: Got activate_claim request from shadow (<192.168.101.116:4695>)

3/26 12:38:12 vm2: Remote job ID is 5772.0

3/26 12:38:12 vm2: Got universe "VANILLA" (5) from request classad

3/26 12:38:12 vm2: State change: claim-activation protocol successful

3/26 12:38:12 vm2: Changing activity: Idle -> Busy

3/26 12:42:39 DaemonCore: Command received via TCP from host <192.168.101.116:4893>

3/26 12:42:39 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)

3/26 12:42:39 vm2: Called deactivate_claim_forcibly()

3/26 12:42:39 DaemonCore: Command received via UDP from host <192.168.100.190:3675>

3/26 12:42:39 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

3/26 12:42:39 Starter pid 4052 exited with status 0

3/26 12:42:39 vm2: State change: starter exited

3/26 12:42:39 vm2: Changing activity: Busy -> Idle

3/26 12:55:18 DaemonCore: Command received via TCP from host <192.168.101.116:1543>

3/26 12:55:18 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)

3/26 12:55:18 vm1: Got activate_claim request from shadow (<192.168.101.116:1543>)

3/26 12:55:18 vm1: Remote job ID is 5807.0

3/26 12:55:18 vm1: Got universe "VANILLA" (5) from request classad

3/26 12:55:18 vm1: State change: claim-activation protocol successful

3/26 12:55:18 vm1: Changing activity: Idle -> Busy

3/26 12:59:17 DaemonCore: Command received via TCP from host <192.168.101.116:1608>

3/26 12:59:17 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)

3/26 12:59:17 vm1: Called deactivate_claim_forcibly()

3/26 12:59:17 DaemonCore: Command received via UDP from host <192.168.100.190:3725>

3/26 12:59:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

3/26 12:59:17 Starter pid 3164 exited with status 0

3/26 12:59:17 vm1: State change: starter exited

3/26 12:59:17 vm1: Changing activity: Busy -> Idle
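To see how long each job actually held a slot before the claim was forcibly deactivated, the "Changing activity" lines in the log above can be paired up per VM. This is a rough sketch rather than a Condor tool: the function name busy_periods is made up for illustration, the log lines are copied from the excerpt above, and the year 2005 is an assumption, since the log timestamps carry no year.

```python
import re
from datetime import datetime

# Activity-change lines copied from the startd log excerpt above.
LOG = """\
3/26 12:08:05 vm1: Changing activity: Busy -> Idle
3/26 12:38:12 vm2: Changing activity: Idle -> Busy
3/26 12:42:39 vm2: Changing activity: Busy -> Idle
3/26 12:55:18 vm1: Changing activity: Idle -> Busy
3/26 12:59:17 vm1: Changing activity: Busy -> Idle
"""

LINE = re.compile(
    r'^(\d+/\d+ \d+:\d+:\d+) (vm\d+): Changing activity: (\w+) -> (\w+)'
)


def busy_periods(log_text, year=2005):
    """Return (vm, seconds) for each complete Idle -> Busy -> Idle cycle.

    The year is an assumption; the log lines themselves do not include one.
    """
    started = {}   # vm name -> time it last went Busy
    periods = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        stamp, vm, old, new = m.groups()
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        if new == "Busy":
            started[vm] = when
        elif old == "Busy" and vm in started:
            elapsed = when - started.pop(vm)
            periods.append((vm, int(elapsed.total_seconds())))
        # A Busy -> Idle line with no recorded start is skipped.
    return periods


for vm, seconds in busy_periods(LOG):
    print(f"{vm}: busy for {seconds} s")
```

On the excerpt above this gives 267 seconds for vm2 and 239 seconds for vm1, so in both cases the shadow sent DEACTIVATE_CLAIM_FORCIBLY roughly four minutes after activating the claim.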

 

And a fetchlog on the submitting machine for SCHEDD will look like:

 

3/26 13:35:45 Started shadow for job 5870.0 on "<192.168.100.190:2394>", (shadow pid = 1276)

3/26 13:35:45 DaemonCore: Command received via UDP from host <192.168.101.116:3154>

3/26 13:35:45 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

 

The machine 192.168.100.190 is bsanchez, so I included only the part of the log relevant to that machine. I don’t know whether bsanchez isn’t waiting long enough to start the job (or how to change that setting), or whether the submitter, cy2-conf, is timing out before sending the job out. It does seem to be a timing or load issue, since if there are only 2 machines in the pool with 4 processors each, they usually run fine.

 

Zachary L. Stauber

Systems Analyst

Spatial Data

BohannanHuston

Courtyard One, 7500 Jefferson N.E.

Albuquerque, New Mexico  87109

Office: 505-823-1000

Direct: 505-798-7970

Fax: 505-798-7932

Email: zstauber@xxxxxxxxx