
RE: [Condor-users] idle + claimed, Ian Chesal?



The problem is usually the result of a schedd that's starved for CPU time. Your change simply let your system tolerate a longer disconnect between the schedd and startd. The schedd is a single-threaded daemon, so it has to deal with negotiating jobs, preempting startds, spawning shadows, dumping state for condor_q calls, etc. It's a lot of work.
 
Are you using dedicated schedds or are you installing schedds on user desktops? If it's desktops, switch to dedicated schedds -- your users may be occupying the CPU on their desktops with non-Condor work such that the schedd doesn't get a long enough slice of compute time to complete all its tasks in a timely fashion. If it's a dedicated schedd you can help the situation by partitioning your jobs among several dedicated schedds. They can even run on the same machine if you like (assuming you're using 6.7.x binaries). This pseudo-multithreads the schedd. If you're running all vanilla jobs, consider dropping schedd preemption altogether in your system (set 'PREEMPTION_REQUIREMENTS = False' on your negotiator and make sure RANK on all your startds is the same for all jobs, e.g. 'RANK = 0') and using MaxJobRetirementTime with a PREEMPT statement on your startds that auto-preempts jobs after a period of time, instead of using the schedds to preempt the machines.
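A minimal sketch of the startd/negotiator settings described above; the macro names are from the Condor configuration reference, but the 24-hour and 1-hour figures are illustrative assumptions, so check them against your own policy before deploying:

```
## negotiator's condor_config: never preempt claims for user priority
PREEMPTION_REQUIREMENTS = False

## startds' condor_config: rank all jobs equally so job RANK
## never triggers a preemption
RANK = 0

## startds' condor_config: preempt on our own schedule instead.
## Example policy: kick any job that has been running longer than
## 24 hours, but give it up to 1 hour to finish gracefully first.
PREEMPT = (Activity == "Busy") && \
          ((CurrentTime - EnteredCurrentActivity) > (24 * 60 * 60))
MAXJOBRETIREMENTTIME = 60 * 60
```

With this arrangement the execute machines enforce run-time limits themselves, so a starved schedd never has to get involved in preemption.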
 
Hope that helps.
 
- Ian


From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Zachary Stauber
Sent: March 26, 2005 3:50 PM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] idle + claimed, Ian Chesel?

I saw in the posts from March that Ian Chesal had this same problem.  Does anyone know if it was resolved off the user list?  I had every machine matched, but then they would go back to the unclaimed state.  Then I changed the match timeout from the default of 300 to 600, and suddenly they all became claimed, but idle.  I'm still stuck at this problem.  Only a select few ever go to Busy and run jobs; they are usually the faster-CPU machines, and it usually happens when there are fewer machines in the pool (if there are only 8 vms in the pool, they almost always run with no problem; does this mean the submitter is being overloaded?).
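For reference, the "match timeout" change described above presumably corresponds to the startd's MATCH_TIMEOUT setting; the macro name is an assumption inferred from the description, so verify it against your version's manual:

```
## condor_config on the execute machines (assumed setting):
## how long a startd stays in the Matched state waiting for the
## schedd to claim it before reverting to Unclaimed (seconds)
MATCH_TIMEOUT = 600
```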

 

In condor_q, it says as many jobs are running as there are machines claimed, but doing a condor_status -l on any idle machine shows something like:

 

M:\>condor_status -l bsanchez
MyType = "Machine"
TargetType = "Job"
Name = "vm2@xxxxxxxxxxxxxxxxx"
Machine = "BSANCHEZ.BHI.CORP"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
COLLECTOR_HOST_STRING = "a-abq-lic.bhi.corp"
CondorVersion = "$CondorVersion: 6.6.9 Mar 10 2005 $"
CondorPlatform = "$CondorPlatform: INTEL-WINNT40 $"
VirtualMachineID = 2
ImageSize = 1
ExecutableSize = 1
JobUniverse = 5
NiceUser = FALSE
VirtualMemory = 1186088
Disk = 22151548
CondorLoadAvg = 0.000000
LoadAvg = 0.000000
KeyboardIdle = 40204
ConsoleIdle = 40204
Memory = 511
Cpus = 1
StartdIpAddr = "<192.168.100.190:2394>"
Arch = "INTEL"
OpSys = "WINNT51"
UidDomain = "bhi.corp"
FileSystemDomain = "bhi.corp"
Subnet = "192.168.100"
HasIOProxy = TRUE
TotalVirtualMemory = 2372176
TotalDisk = 44303096
KFlops = 879778
Mips = 2804
LastBenchmark = 1111828400
TotalLoadAvg = 0.000000
TotalCondorLoadAvg = 0.000000
ClockMin = 803
ClockDay = 6
TotalVirtualMachines = 2
HasFileTransfer = TRUE
HasMPI = TRUE
HasJICLocalConfig = TRUE
HasJICLocalStdin = TRUE
StarterAbilityList = "HasFileTransfer,HasMPI,HasJICLocalConfig,HasJICLocalStdin"
CpuBusyTime = 0
CpuIsBusy = FALSE
State = "Claimed"
EnteredCurrentState = 1111829916
Activity = "Idle"
EnteredCurrentActivity = 1111866159
Start = KeyboardIdle > 5 * 60
Requirements = START
CurrentRank = 0.000000
RemoteUser = "jnipper@xxxxxxxx"
RemoteOwner = "jnipper@xxxxxxxx"
ClientMachine = "cy2-conferece"
DaemonStartTime = 1111828391
UpdateSequenceNumber = 149
MyAddress = "<192.168.100.190:2394>"
LastHeardFrom = 1111868605
UpdatesTotal = 148
UpdatesSequenced = 148
UpdatesLost = 6
UpdatesHistory = "0x005000a0006000000000000000000000"

 

I'm running everything on Windows XP, a mix of SP1 and SP2.  I changed the condor_config on the submitting machine so it could run up to 2000 jobs, I changed the value in the registry to 1280 as suggested in the "Windows specific issues" section of the manual, and the submitter has a gigabit Ethernet card, so it never goes over about 10%.  Worker nodes are 100Base-T, the jobs are only about 50 MB, and nobody is on these machines at night, so I don't think network bandwidth is a problem.  A fetchlog on the above machine for STARTD typically looks like this:
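The submit-side job limit mentioned above is most likely the schedd's MAX_JOBS_RUNNING macro; that name is an assumption based on the description (the registry change is a separate Windows-specific tweak and is not reproduced here):

```
## condor_config on the submitting machine (assumed macro):
## cap on how many shadows, and thus running jobs, this schedd
## will manage at once
MAX_JOBS_RUNNING = 2000
```

Note that each running vanilla job spawns a shadow process on the submit machine, so a limit this high puts real memory and CPU pressure on a single schedd host.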

 

3/26 12:08:05 DaemonCore: Command received via UDP from host <192.168.100.190:3600>
3/26 12:08:05 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:08:05 Starter pid 1020 exited with status 0
3/26 12:08:05 vm1: State change: starter exited
3/26 12:08:05 vm1: Changing activity: Busy -> Idle
3/26 12:38:12 DaemonCore: Command received via TCP from host <192.168.101.116:4695>
3/26 12:38:12 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/26 12:38:12 vm2: Got activate_claim request from shadow (<192.168.101.116:4695>)
3/26 12:38:12 vm2: Remote job ID is 5772.0
3/26 12:38:12 vm2: Got universe "VANILLA" (5) from request classad
3/26 12:38:12 vm2: State change: claim-activation protocol successful
3/26 12:38:12 vm2: Changing activity: Idle -> Busy
3/26 12:42:39 DaemonCore: Command received via TCP from host <192.168.101.116:4893>
3/26 12:42:39 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/26 12:42:39 vm2: Called deactivate_claim_forcibly()
3/26 12:42:39 DaemonCore: Command received via UDP from host <192.168.100.190:3675>
3/26 12:42:39 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:42:39 Starter pid 4052 exited with status 0
3/26 12:42:39 vm2: State change: starter exited
3/26 12:42:39 vm2: Changing activity: Busy -> Idle
3/26 12:55:18 DaemonCore: Command received via TCP from host <192.168.101.116:1543>
3/26 12:55:18 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
3/26 12:55:18 vm1: Got activate_claim request from shadow (<192.168.101.116:1543>)
3/26 12:55:18 vm1: Remote job ID is 5807.0
3/26 12:55:18 vm1: Got universe "VANILLA" (5) from request classad
3/26 12:55:18 vm1: State change: claim-activation protocol successful
3/26 12:55:18 vm1: Changing activity: Idle -> Busy
3/26 12:59:17 DaemonCore: Command received via TCP from host <192.168.101.116:1608>
3/26 12:59:17 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
3/26 12:59:17 vm1: Called deactivate_claim_forcibly()
3/26 12:59:17 DaemonCore: Command received via UDP from host <192.168.100.190:3725>
3/26 12:59:17 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
3/26 12:59:17 Starter pid 3164 exited with status 0
3/26 12:59:17 vm1: State change: starter exited
3/26 12:59:17 vm1: Changing activity: Busy -> Idle

 

And a fetchlog on the submitting machine for SCHEDD will look like:

 

3/26 13:35:45 Started shadow for job 5870.0 on "<192.168.100.190:2394>", (shadow pid = 1276)
3/26 13:35:45 DaemonCore: Command received via UDP from host <192.168.101.116:3154>
3/26 13:35:45 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())

 

The machine 192.168.100.190 is bsanchez, so I just included the relevant part of the log for that machine.  I don't know whether bsanchez somehow isn't waiting long enough to start the job (or how to change that setting), or whether the submitter, cy2-conf, is timing out before sending it out, but it seems to be a timing/load issue, since if there are only 2 machines in the pool with 4 processors each, they usually run fine.

 

Zachary L. Stauber
Systems Analyst
Spatial Data
BohannanHuston
Courtyard One, 7500 Jefferson N.E.
Albuquerque, New Mexico  87109
Office: 505-823-1000
Direct: 505-798-7970
Fax: 505-798-7932
Email: zstauber@xxxxxxxxx