Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Problems with jobs

Date: Mon, 5 Dec 2005 21:18:41 -0000
From: "Chris Miles" <chrismiles@xxxxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Problems with jobs

Well, a schedd daemon runs on every node in the cluster,
if that's what you mean?

Chris

Ahh, but do you only have the one schedd? It looks like you have three jobs running (Claimed + Busy) according to your output -- are they all from the same schedd? Maybe there's another schedd in your system that's not able to respond to its claims in time?

- Ian

----- Original Message -----
From: Ian Chesal

To: Chris Miles
Sent: Monday, December 05, 2005 4:26 PM
Subject: RE: [Condor-users] Problems with jobs

The dreaded Claimed+Idle. It generally happens to us when our schedd can't keep up with the processing required to start our jobs. Check the resources on your schedd machine: can your machine handle spawning all the necessary shadows? Or is it running out of CPU, memory, disk, etc.?

- Ian

From: Chris Miles [mailto:chrismiles@xxxxxxxxxxxxxxxx]
Sent: December 5, 2005 11:11 AM
To: Condor-Users Mail List; Ian Chesal
Subject: Re: [Condor-users] Problems with jobs

Hi, thanks for the response.

SUBMIT_SEND_RESCHEDULE is not specified in any of my config files, which
means that it's automatically set to true, does it not?

condor_q -ana says the jobs are being serviced.

It seems a lot of machines go into the claimed state but stay idle.

tux.neuralgri LINUX INTEL  Unclaimed Idle 0.250  512 0+00:34:48
vm1@xxxxxxxxx LINUX X86_64 Owner     Idle 0.750 2048 0+00:00:02
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.120 2048 0+00:01:20
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:01:21
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.270 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.180 2048 0+00:00:07
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.050 2048 0+00:00:07
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:10
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.100 2048 0+00:00:07
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.100 2048 0+00:00:07
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:11
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:07
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.210 2048 0+00:00:11
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.080 2048 0+00:00:08
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:03
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.050 2048 0+00:00:02
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:03
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.160 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Owner     Idle 1.000 2048 0+20:15:59
vm2@xxxxxxxxx LINUX X86_64 Owner     Idle 0.310 2048 0+00:00:02
vm1@xxxxxxxxx LINUX X86_64 Owner     Idle 1.000 2048 0+20:16:02
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.220 2048 0+00:00:09
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.130 2048 0+00:00:05
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.460 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.200 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Unclaimed Idle 0.130 2048 0+00:00:05
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.170 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed   Busy 0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX X86_64 Claimed   Busy 0.020 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed   Busy 0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.110 2048 0+00:00:05
vm2@xxxxxxxxx LINUX X86_64 Claimed   Idle 0.000 2048 0+00:00:05
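
A long listing like this is easier to judge as a tally of State/Activity pairs. The following is only a rough sketch in Python, assuming each slot sits on its own line with State and Activity as the fourth and fifth whitespace-separated fields and the load average as the sixth; the function name tally_states and the three sample rows are illustrative, not part of Condor.

```python
from collections import Counter


def tally_states(status_text):
    """Count (State, Activity) pairs in condor_status-style rows.

    Assumes the field order Name OpSys Arch State Activity LoadAv ...
    Lines that do not fit that shape (headers, blanks, totals) are skipped.
    """
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue
        try:
            float(fields[5])  # a real slot row has a numeric load average here
        except ValueError:
            continue
        counts[(fields[3], fields[4])] += 1
    return counts


# Three rows copied from the listing, one slot per line.
sample = """\
vm1@xxxxxxxxx LINUX X86_64 Claimed Idle 0.170 2048 0+00:00:04
vm2@xxxxxxxxx LINUX X86_64 Claimed Busy 0.000 2048 0+00:00:06
vm1@xxxxxxxxx LINUX X86_64 Owner Idle 1.000 2048 0+20:15:59
"""

for (state, activity), n in sorted(tally_states(sample).items()):
    print(f"{state:10s} {activity:6s} {n}")
```

A large Claimed/Idle count next to a small Claimed/Busy count is the pattern being described in this thread.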


Chris

----- Original Message -----
From: Ian Chesal

To: Condor-Users Mail List
Sent: Monday, December 05, 2005 2:39 PM
Subject: Re: [Condor-users] Problems with jobs



Hi.

When I'm submitting jobs into my pool, it seems to take ages to start running the jobs unless I run condor_reschedule. Is there a way to speed the process up without me running this command?

[Ian Chesal] See: http://www.cs.wisc.edu/condor/manual/v6.7/3_3Configuration.html#11494 -- make sure you have that set to True in the config file on the machine you're calling condor_submit from. It will automatically issue a reschedule after submission.

My second problem is that job results are not returning to me any quicker than if I ran my jobs on a one-machine pool. I.e. I'm checking condor_q and the queue is only going down 1 at a time, at roughly the same speed as if there was only one machine in that pool. It is also slower than if I actually ran my jobs sequentially on one machine using a batch file or shell script.

[Ian Chesal] What does condor_q -ana say? Are you setting your job requirements such that only one VM in the system is able to match with all your jobs in your cluster? What about the MAX_JOBS_RUNNING setting on your schedd? Make sure that isn't set to 1.
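
One way to check the two settings mentioned in this thread is to scan the config text for them. The sketch below is a simplified illustration in Python, not how Condor itself reads its configuration: it only understands plain NAME = value lines, treats names case-insensitively (an assumption), and ignores macros and includes. The function name read_settings and the sample config are hypothetical. On a live machine, condor_config_val reports the value Condor actually uses.

```python
def read_settings(config_text, names):
    """Return the last value assigned to each name, or None if never set.

    Only handles simple 'NAME = value' lines; comments and blank lines
    are skipped, and a later assignment overrides an earlier one.
    """
    wanted = {name.upper() for name in names}
    found = {name: None for name in wanted}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip().upper()
        if key in wanted:
            found[key] = value.strip()
    return found


# Hypothetical config text for the submit machine, for illustration only.
sample_config = """\
# local overrides
MAX_JOBS_RUNNING = 1
"""

settings = read_settings(
    sample_config, ["SUBMIT_SEND_RESCHEDULE", "MAX_JOBS_RUNNING"]
)
for name, value in sorted(settings.items()):
    shown = value if value is not None else "(not set here; default applies)"
    print(name, "=", shown)
```

In this made-up sample, MAX_JOBS_RUNNING is set to 1, which is the case warned about above, and SUBMIT_SEND_RESCHEDULE is left unset.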




_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx

https://lists.cs.wisc.edu/mailman/listinfo/condor-users
