
[Condor-users] Condor job submission delayed



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,

I am experimenting with a small Condor cluster (Condor 6.6.6,
unfortunately mostly on Windows boxes), as you can see from my various
beginner's mails popping up in the forum.

I have set up a bunch of Windows machines (Win2k SP6 and WinXP Pro SP1)
and a central Linux master server.

Submission of jobs works in principle (tested with the hello-world
examples from http://www.liv.ac.uk/e-science/condor/hello.html), but
sometimes I observe strange behaviour: certain jobs take a very long
time before they are executed.

This happens while most of the machines are not busy and are listed as
available (15 min without a user and low CPU utilization).

"condor_status" gives something like:

saric@u-191-srv2:~/tmp> condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

u-191-srv2.pr LINUX       INTEL  Unclaimed  Idle       0.010  1004  0+01:52:13
u-099-cpc-esi WINNT50     INTEL  Owner      Idle       0.240   512  0+01:16:34
vm1@u-099-csr WINNT50     INTEL  Claimed    Busy       0.000  1024  0+00:10:56
vm2@u-099-csr WINNT50     INTEL  Unclaimed  Idle       0.000  1024  0+01:43:03
u-099-cbb1    WINNT51     INTEL  Unclaimed  Idle       0.000   511  0+01:46:27
u-099-cnb2    WINNT51     INTEL  Owner      Idle       0.020   511  0+04:31:59
u-099-cpc-sek WINNT51     INTEL  Owner      Idle       0.040   512  0+00:10:14
u-099-cpc1    WINNT51     INTEL  Owner      Idle       0.000   512  0+00:06:20
u-099-cpc2    WINNT51     INTEL  Owner      Idle       0.030   512  0+00:01:20
u-099-cpc3    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:06:21
u-099-cpc4    WINNT51     INTEL  Owner      Idle       -0.010  512  0+04:57:30
u-099-cpc5    WINNT51     INTEL  Unclaimed  Idle       0.000   512  0+00:31:21

So there are at least 4 unclaimed machines in the pool that should
match the job's requirements ((OpSys == "WINNT50") || (OpSys == "WINNT51")).
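
The count of 4 can be double-checked with a minimal Python sketch. It is
only an illustration, not part of Condor: the rows are the Name, OpSys
and State columns copied by hand from the listing above, and the helper
name unclaimed_matches is made up for this example. It applies only the
OpSys part of the requirements and ignores everything else a machine
might evaluate.

```python
# Name, OpSys and State columns copied by hand from the condor_status
# listing above (the other columns are not needed for this check).
ROWS = """\
u-191-srv2.pr LINUX   Unclaimed
u-099-cpc-esi WINNT50 Owner
vm1@u-099-csr WINNT50 Claimed
vm2@u-099-csr WINNT50 Unclaimed
u-099-cbb1    WINNT51 Unclaimed
u-099-cnb2    WINNT51 Owner
u-099-cpc-sek WINNT51 Owner
u-099-cpc1    WINNT51 Owner
u-099-cpc2    WINNT51 Owner
u-099-cpc3    WINNT51 Unclaimed
u-099-cpc4    WINNT51 Owner
u-099-cpc5    WINNT51 Unclaimed
"""

# Mirrors (OpSys == "WINNT50") || (OpSys == "WINNT51").
WANTED_OPSYS = {"WINNT50", "WINNT51"}


def unclaimed_matches(text):
    """Return names of Unclaimed machines whose OpSys is in WANTED_OPSYS."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        name, opsys, state = fields
        if state == "Unclaimed" and opsys in WANTED_OPSYS:
            names.append(name)
    return names


matches = unclaimed_matches(ROWS)
print(len(matches), matches)
```

The Linux master is also Unclaimed, but it is filtered out by the OpSys
test, which leaves the four Windows machines.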

Running "condor_q -analyze" takes quite a long time and returns
something like:

045.000:  Run analysis summary.  Of 12 machines,
      1 are rejected by your job's requirements
      6 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      1 match, but will not currently preempt their existing job
      0 are available to run your job
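
As a sanity check on this summary, here is a small Python sketch that
reads the counts and confirms they add up to the 12 machines reported.
The helper name parse_summary is hypothetical, and the text is the
summary as pasted above with the per-line counts kept unchanged. It
assumes the layout stays exactly as shown here.

```python
import re

# Summary text as pasted above; only the layout shown here is assumed.
SUMMARY = """\
045.000:  Run analysis summary.  Of 12 machines,
      1 are rejected by your job's requirements
      6 reject your job because of their own requirements
      0 match, but are serving users with a better priority in the pool
      4 match, but reject the job for unknown reasons
      1 match, but will not currently preempt their existing job
      0 are available to run your job
"""


def parse_summary(text):
    """Return (total machines, {category text: count}) from the summary."""
    lines = text.splitlines()
    # The header line carries the total: "... Of 12 machines,"
    total = int(re.search(r"Of (\d+) machines", lines[0]).group(1))
    counts = {}
    for line in lines[1:]:
        m = re.match(r"\s*(\d+)\s+(.*)", line)
        if m:
            counts[m.group(2)] = int(m.group(1))
    return total, counts


total, counts = parse_summary(SUMMARY)
print(total, sum(counts.values()))
```

The categories also line up with the condor_status listing: 6 machines
are in the Owner state, 1 is Claimed, 1 is the Linux master, and the
remaining 4 are the unclaimed Windows machines.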

I can't see why the 4 should reject for unknown reasons. Is there any
place I could look to find out what these unknown reasons are (system
log, local Condor log on the machines)?

Thanks in advance!

- --
Bye,
Marc Saric

Dr. Marc Saric, Bioinformatik, Proteom Centrum Tübingen,
Auf der Morgenstelle 15, D-72076 Tübingen, Germany,
Tel: +49 (0)7071 29 70557, marc.saric@xxxxxxxxxxxxxxxx
http://www.proteom-centrum-tuebingen.de
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFBNIqQBLD6PjSWyL4RAlKLAJ4l64RE870+vfqESQJL5Cz5oMSGjQCbBmA6
WLrzxNGTr1sGB3oJv4bDW48=
=nKWt
-----END PGP SIGNATURE-----