RE: [Condor-users] Condor job submission delayed
- Date: Wed, 1 Sep 2004 11:59:40 -0700
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: RE: [Condor-users] Condor job submission delayed
I have one client (Windows XP) for running jobs, one master (Linux, RH9) for control, and one machine for submitting jobs from (Windows XP), and I'm seeing this long delay as well, so I don't think it's a load issue. Hopefully one of the Condor team members has some insight into the issue...
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of John Wheez
Sent: September 1, 2004 2:04 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] Condor job submission delayed
Funny, I was just going to post about this same problem. I have the same
setup as Marc Saric described, and I have the same problem he
described. Namely, sometimes it takes a long time for the pool controller
to send out a new job to a CPU that has recently finished. I see this
often on my dual-processor machines as well.
I have Windows processing nodes and a Linux master controller. I do have
lots of cluster-based node submissions, all being submitted by my pool
master. I wonder if this is the problem. The fact that the pool master
is also the one that submits the jobs might be overloading it
somehow, since it is also keeping track of all the jobs. I wonder
if making my client machines submit the jobs will help clear up the
problem. I'll try this today.
Ian Chesal wrote:
>I never saw an answer to this question. Did one get proffered off the
>list? Could you please cross post it if that is the case. I too am
>curious about this delay as I'm seeing this in my flock of Windows XP
>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Marc Saric
>Sent: August 31, 2004 10:26 AM
>Subject: [Condor-users] Condor job submission delayed
>-----BEGIN PGP SIGNED MESSAGE-----
>I am experimenting with a small Condor cluster (Condor 6.6.6, mostly on
>Windows boxes, unfortunately), as you can see from my various beginner's
>mails popping up in the forum.
>I have set up a bunch of Windows-machines (Win2k SP6 and WinXP Pro SP1)
>and a central Linux-Master-Server.
>Submission of jobs works in principle (I tested it with the
>hello-world examples from http://www.liv.ac.uk/e-science/condor/hello.html),
>but sometimes I observe strange behaviour: certain jobs take a very long time until they are executed.
>This happens while most of the machines are not busy and are listed as
>available (15 min no user + low CPU-utilization).
>"condor_status" gives something like:
>Name          OpSys   Arch  State     Activity LoadAv  Mem
>u-191-srv2.pr LINUX   INTEL Unclaimed Idle      0.010 1004
>u-099-cpc-esi WINNT50 INTEL Owner     Idle      0.240  512
>vm1@u-099-csr WINNT50 INTEL Claimed   Busy      0.000 1024
>vm2@u-099-csr WINNT50 INTEL Unclaimed Idle      0.000 1024
>u-099-cbb1    WINNT51 INTEL Unclaimed Idle      0.000  511
>u-099-cnb2    WINNT51 INTEL Owner     Idle      0.020  511
>u-099-cpc-sek WINNT51 INTEL Owner     Idle      0.040  512
>u-099-cpc1    WINNT51 INTEL Owner     Idle      0.000  512
>u-099-cpc2    WINNT51 INTEL Owner     Idle      0.030  512
>u-099-cpc3    WINNT51 INTEL Unclaimed Idle      0.000  512
>u-099-cpc4    WINNT51 INTEL Owner     Idle     -0.010  512
>u-099-cpc5    WINNT51 INTEL Unclaimed Idle      0.000  512
>so there are at least 4 unclaimed machines in the pool which should
>match the requirements ((OpSys == "WINNT50") || (OpSys == "WINNT51")).
>The result of a "condor_q -analyze" takes quite a long time and gives
>back something like:
>045.000: Run analysis summary. Of 12 machines,
>~ 1 are rejected by your job's requirements
>~ 6 reject your job because of their own requirements
>~ 0 match, but are serving users with a better priority in the pool
>~ 4 match, match, but reject the job for unknown reasons
>~ 1 match, but will not currently preempt their existing job
>~ 0 are available to run your job
>I can't see why these 4 should reject the job for unknown reasons. Is there
>any place where I could look to find out what these unknown reasons are
>(system log, local Condor log on the machines)?
>Thanks in advance!
>Dr. Marc Saric, Bioinformatik, Proteom Centrum Tübingen,
>Auf der Morgenstelle 15, D-72076 Tübingen, Germany,
>Tel: +49 (0)7071 29 70557, marc.saric@xxxxxxxxxxxxxxxx
>-----BEGIN PGP SIGNATURE-----
>Version: GnuPG v1.2.4 (GNU/Linux)
>Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>-----END PGP SIGNATURE-----
>Condor-users mailing list
Condor-users mailing list
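
As a sanity check on the "at least 4 unclaimed machines" count in Marc's message, here is a minimal Python sketch that filters the condor_status rows quoted above. The rows are copied from the thread; the function name unclaimed_matches is made up for illustration and is not part of Condor. It only mirrors the OpSys clause of the job's requirements and the State column, and does not model anything else the matchmaker takes into account.

```python
# Rows copied from the condor_status output quoted in the thread.
STATUS = """\
u-191-srv2.pr LINUX INTEL Unclaimed Idle 0.010 1004
u-099-cpc-esi WINNT50 INTEL Owner Idle 0.240 512
vm1@u-099-csr WINNT50 INTEL Claimed Busy 0.000 1024
vm2@u-099-csr WINNT50 INTEL Unclaimed Idle 0.000 1024
u-099-cbb1 WINNT51 INTEL Unclaimed Idle 0.000 511
u-099-cnb2 WINNT51 INTEL Owner Idle 0.020 511
u-099-cpc-sek WINNT51 INTEL Owner Idle 0.040 512
u-099-cpc1 WINNT51 INTEL Owner Idle 0.000 512
u-099-cpc2 WINNT51 INTEL Owner Idle 0.030 512
u-099-cpc3 WINNT51 INTEL Unclaimed Idle 0.000 512
u-099-cpc4 WINNT51 INTEL Owner Idle -0.010 512
u-099-cpc5 WINNT51 INTEL Unclaimed Idle 0.000 512
"""


def unclaimed_matches(status_text, wanted_opsys=("WINNT50", "WINNT51")):
    """Return names of Unclaimed machines whose OpSys is in wanted_opsys.

    Hypothetical helper for illustration only; not a Condor API.
    """
    names = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank or malformed rows
        name, opsys, _arch, state, _activity, _load, _mem = fields
        if state == "Unclaimed" and opsys in wanted_opsys:
            names.append(name)
    return names


matches = unclaimed_matches(STATUS)
print(len(matches), matches)
# prints: 4 ['vm2@u-099-csr', 'u-099-cbb1', 'u-099-cpc3', 'u-099-cpc5']
```

The fifth Unclaimed machine, u-191-srv2.pr, runs Linux and is filtered out by the OpSys test, which agrees with the figure of 4 given in the original message.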