
Re: [Condor-users] Problems with jobs



Basically I have a pool with a shared file system and 25 machines, all very powerful. I think the weakest link in my chain is my submitting machine, which is just a lone server with its own configuration: a 1 GHz, 512 MB Mini-ITX box. Not the fastest in the world, and it has a few
other (required) applications running.

Is this the machine I should set JOB_START_COUNT on, or should it be set on the machines that
actually run the jobs?
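For reference, this is the kind of setting I mean. My guess is that it belongs in the submit machine's local config file, since that is where the schedd runs, but please correct me if that's wrong:

```
# condor_config.local -- guessing this goes on the submit machine,
# since JOB_START_COUNT is read by the condor_schedd
JOB_START_COUNT = 50

# JOB_START_DELAY (seconds between successive job starts) may also
# be relevant here
JOB_START_DELAY = 2
```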

My submitting machine is the one where I see the condor_shadow daemons firing up.


chris 18253 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.9 <146.191.100.202:46251> -
chris 18325 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.12 <146.191.100.202:46251> -
chris 18362 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.13 <146.191.100.202:46251> -
chris 18396 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.14 <146.191.100.202:46251> -
chris 18454 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.15 <146.191.100.202:46251> -
chris 18464 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.16 <146.191.100.202:46251> -
chris 18499 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.17 <146.191.100.202:46251> -
chris 18533 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.18 <146.191.100.202:46251> -
chris 18570 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.19 <146.191.100.202:46251> -
chris 18579 5552 0 14:01 ? 00:00:00 condor_shadow -f 80.20 <146.191.100.202:46251> -

How many of these is normal? I am submitting a 1000-job cluster to the pool of 25 machines (50 VMs).

Looks like I may be running low on memory on my submitting machine as well.

top - 14:01:52 up 20 days, 21:24,  3 users,  load average: 0.79, 1.21, 0.92
Tasks:  71 total,   1 running,  70 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.3% us,  1.7% sy,  0.7% ni, 97.0% id,  0.0% wa,  0.3% hi,  0.0% si
Mem:    484284k total,   476868k used,     7416k free,    14216k buffers
Swap:   999928k total,        0k used,   999928k free,   249768k cached

I'm still unsure about some of this... where exactly does the problem lie, the submitter or the execute machines?

thanks again

Chris

----- Original Message ----- From: "Matt Hope" <matthew.hope@xxxxxxxxx>
To: "Condor-Users Mail List" <condor-users@xxxxxxxxxxx>
Cc: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
Sent: Thursday, December 08, 2005 8:21 AM
Subject: Re: [Condor-users] Problems with jobs


On 12/7/05, Chris Miles <chrismiles@xxxxxxxxxxxxxxxx> wrote:
I have managed to get that number up as high as 20 and even 50 with
little difference. I am seeing
more running jobs, but not many more. Only 7 VMs max so far.

How many (non held) clusters and jobs* are in your queue and how often
do you negotiate?

Since the schedd can only do one of the two tasks at a time (starting
shadows and serving queue info requests), it can fail to keep up.

A similar situation can occur if something/someone is running condor_q
against your schedd repeatedly.

* If NEGOTIATE_ALL_JOBS_IN_CLUSTER is true then jobs matter; if not,
then clusters matter.
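That is, something like this in your pool configuration (False is, I believe, the default):

```
# Pool configuration: when False, the negotiator considers only one
# idle job per cluster per cycle; when True, it considers every job
NEGOTIATE_ALL_JOBS_IN_CLUSTER = False
```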

Matt

_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users