
Re: [Condor-users] Problems with jobs



> Basically I have a pool with a shared file system and 25 machines.
> These are all very powerful. I think the weakest link in my chain is
> my submitting machine, which is just a lone server with its own
> configuration. It's a 1 GHz, 512 MB Mini-ITX box. Not the fastest in
> the world, and it has a few other (required) applications running.
> 
> Is this the machine I should set JOB_START_COUNT on? Or should it be
> set on the machines that actually run the jobs?

The submitting machine (the machine running the condor_schedd) needs
this setting; it is a schedd setting.
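
For example, a minimal sketch (the values here are illustrative, not
tuned recommendations) would be to add something like this to the
condor_config.local on the submit machine and then run condor_reconfig:

  # Start at most this many jobs per burst, so the schedd doesn't
  # fork too many condor_shadow processes at once
  JOB_START_COUNT = 5
  # Seconds to wait between bursts of job starts
  JOB_START_DELAY = 2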
 
> How many of these are normal? I am submitting a 1000-job cluster to
> the pool with 25 machines (50 VMs).

You'll get one condor_shadow process spawned on your submitting machine
for every job that is running in your system. So if you have 50 VMs and
all of them end up occupied, you'll have 50 condor_shadow processes on
your submitting machine.
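
If you want to see how many shadows are alive at a given moment, on a
Linux submit machine something like this should do (assuming a
procps-style ps):

  ps -C condor_shadow --no-headers | wc -l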
 
> Looks like I may be running low on memory on my submitting machine as
> well.

That can hurt.
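
If the submit machine really is memory-starved, one option (just a
sketch, not a tuned value) is to cap how many jobs the schedd will run
simultaneously, which in turn caps the number of shadows:

  # In the schedd's configuration on the submit machine
  MAX_JOBS_RUNNING = 30

Each running job costs you a condor_shadow on the submit side, so
lowering this trades throughput for memory headroom.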
 
> I'm still unsure about some of this... where exactly is the problem
> lying, with the submitter or the executors?

It's hard to say. Are all your execute nodes the same type of machine
(memory, disk space, architecture, processor, OS)? If you take one
cluster of jobs where a few jobs are running but the majority are not,
and you run:

condor_q -analyze <cluster>.<proc>

where <cluster> is the cluster ID and <proc> is the process ID of one
of the jobs that is NOT running, what does the output look like?
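
For example, if your cluster ID were 1042 and job 7 were one of the
idle ones (both numbers hypothetical):

  condor_q -analyze 1042.7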

- Ian