
Re: [Condor-users] Condor job submission delayed



Hi,

Funny, I was just about to post about this same problem. I have the same setup that Marc Saric described, and I see the same problem. Mainly, it sometimes takes a long time for the pool controller to send out a new job to a CPU that has recently finished. I see this often on my dual-processor machines as well.

I have Windows processing nodes and a Linux master controller. I do have lots of cluster-based node submissions, all being submitted by my pool master. I wonder if this is the problem: since the pool master is also the machine that submits the jobs and keeps track of all of them, that might be overloading it somehow. I wonder if having my client machines submit the jobs would help clear up the problem. I'll try this today.

JW



Ian Chesal wrote:

I never saw an answer to this question. Did one get proffered off the list? Could you please cross-post it if that is the case? I too am curious about this delay, as I'm seeing it in my flock of Windows XP machines.


Thanks! Ian

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Marc Saric
Sent: August 31, 2004 10:26 AM
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Condor job submission delayed


-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1

Hi all,

I am experimenting with a small Condor cluster (Condor 6.6.6, mostly on Windows boxes, unfortunately), as you can see from my various beginner's mails popping up in the forum.

I have set up a bunch of Windows-machines (Win2k SP6 and WinXP Pro SP1) and a central Linux-Master-Server.

Submission of jobs works in principle (tested with the hello-world examples from http://www.liv.ac.uk/e-science/condor/hello.html),
but sometimes I observe strange behaviour: certain jobs take a very long time before they are executed.

This happens while most of the machines are not busy and are listed as available (15 min no user + low CPU utilization).

"condor_status" gives something like:

saric@u-191-srv2:~/tmp> condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

u-191-srv2.pr LINUX       INTEL  Unclaimed  Idle       0.010  1004  0+01:52:13
u-099-cpc-esi WINNT50     INTEL  Owner      Idle       0.240   512  0+01:16:34
vm1@u-099-csr WINNT50     INTEL  Claimed    Busy       0.000  1024  0+00:10:56
vm2@u-099-csr WINNT50     INTEL  Unclaimed  Idle       0.000  1024  0+01:43:03
u-099-cbb1    WINNT51     INTEL  Unclaimed  Idle       0.000   511  0+01:46:27
u-099-cnb2    WINNT51     INTEL  Owner      Idle       0.020   511  0+04:31:59
u-099-cpc-sek WINNT51     INTEL  Owner      Idle       0.040   512  0+00:10:14
u-099-cpc1    WINNT51     INTEL  Owner      Idle       0.000   512  0+00:06:20
u-099-cpc2    WINNT51     INTEL  Owner      Idle       0.030   512  0+00:01:20
u-099-cpc3    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:06:21
u-099-cpc4    WINNT51     INTEL  Owner      Idle       -0.010   512  0+04:57:30
u-099-cpc5    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:31:21

So there are at least 4 unclaimed machines in the pool that should match the requirements ((OpSys == "WINNT50") || (OpSys == "WINNT51")).
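
The count can be cross-checked with a minimal Python sketch. It assumes the first four whitespace-separated fields of each condor_status row are Name, OpSys, Arch and State; the rows are copied from the listing above with the remaining columns dropped, and the helper name count_matching is purely illustrative, not part of Condor. It only compares OpSys and State, so it says nothing about the machines' own requirements.

```python
# Rows copied from the condor_status listing (first four columns only).
STATUS_ROWS = """\
u-191-srv2.pr LINUX   INTEL Unclaimed
u-099-cpc-esi WINNT50 INTEL Owner
vm1@u-099-csr WINNT50 INTEL Claimed
vm2@u-099-csr WINNT50 INTEL Unclaimed
u-099-cbb1    WINNT51 INTEL Unclaimed
u-099-cnb2    WINNT51 INTEL Owner
u-099-cpc-sek WINNT51 INTEL Owner
u-099-cpc1    WINNT51 INTEL Owner
u-099-cpc2    WINNT51 INTEL Owner
u-099-cpc3    WINNT51 INTEL Unclaimed
u-099-cpc4    WINNT51 INTEL Owner
u-099-cpc5    WINNT51 INTEL Unclaimed
"""

# Mirrors the job requirement (OpSys == "WINNT50") || (OpSys == "WINNT51").
WANTED_OPSYS = {"WINNT50", "WINNT51"}


def count_matching(rows, wanted_opsys=WANTED_OPSYS, state="Unclaimed"):
    """Return the names of machines in `state` whose OpSys is wanted.

    Hypothetical helper: assumes Name, OpSys, Arch, State are the first
    four whitespace-separated fields of every row.
    """
    names = []
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        name, opsys, _arch, machine_state = fields[:4]
        if opsys in wanted_opsys and machine_state == state:
            names.append(name)
    return names


if __name__ == "__main__":
    matches = count_matching(STATUS_ROWS)
    print(len(matches), matches)
```

For the listing above this yields the four unclaimed Windows machines; the unclaimed Linux master is excluded by the OpSys requirement.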

The result of a "condor_q -analyze" takes quite a long time and gives back something like:

045.000:  Run analysis summary.  Of 12 machines,
~      1 are rejected by your job's requirements
~      6 reject your job because of their own requirements
~      0 match, but are serving users with a better priority in the pool
~      4 match, match, but reject the job for unknown reasons
~      1 match, but will not currently preempt their existing job
~      0 are available to run your job
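
To make the summary easier to compare across jobs, it can be turned into a small table of counts. The following is only a sketch: it assumes each category line has the form "<count> <description>", optionally preceded by the "~" characters and spaces seen in this mail, and the function name summarize is hypothetical. It also checks that the categories add up to the machine total reported in the header.

```python
import re

# Summary text copied from the condor_q -analyze output quoted in this mail.
ANALYSIS = """\
045.000:  Run analysis summary.  Of 12 machines,
~      1 are rejected by your job's requirements
~      6 reject your job because of their own requirements
~      0 match, but are serving users with a better priority in the pool
~      4 match, match, but reject the job for unknown reasons
~      1 match, but will not currently preempt their existing job
~      0 are available to run your job
"""


def summarize(text):
    """Return (total_machines, {description: count}) for one summary.

    Hypothetical helper: tolerates a leading '~' and spaces on each
    category line; lines that do not start with a count are ignored.
    """
    total = int(re.search(r"Of (\d+) machines", text).group(1))
    counts = {}
    for line in text.splitlines():
        m = re.match(r"^[~\s]*(\d+)\s+(\S.*)$", line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return total, counts


if __name__ == "__main__":
    total, counts = summarize(ANALYSIS)
    for description, n in counts.items():
        print(f"{n:3d}  {description}")
    print("categories sum to total:", sum(counts.values()) == total)
```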

I can't see why those 4 should reject the job for unknown reasons. Is there anywhere I could look to find out what these unknown reasons are (system log, local Condor log on the machines)?

Thanks in advance!

- --
Bye,
Marc Saric

Dr. Marc Saric, Bioinformatik, Proteom Centrum Tübingen,
Auf der Morgenstelle 15, D-72076 Tübingen, Germany,
Tel: +49 (0)7071 29 70557, marc.saric@xxxxxxxxxxxxxxxx http://www.proteom-centrum-tuebingen.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFBNIqQBLD6PjSWyL4RAlKLAJ4l64RE870+vfqESQJL5Cz5oMSGjQCbBmA6
WLrzxNGTr1sGB3oJv4bDW48=
=nKWt
-----END PGP SIGNATURE-----
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx http://lists.cs.wisc.edu/mailman/listinfo/condor-users