
[Condor-users] CondorG and "Globus error 121: the job state file doesn't exist"



Hi,
I am using Condor to submit a large number of jobs to a set of CREAM and LCG CE execution hosts. With a freshly started Condor everything works fine, but after a few days of sustained submission, held jobs begin to pile up with the HoldReason given in the subject. Looking at the Globus logs on one execution machine, we found that the GRAM two-phase submission for these jobs is never completed, so the Globus state file is removed. Condor nevertheless seems to keep asking for the status of these jobs even though they were never actually submitted, and Globus answers with "error 121".
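In case it helps anyone reproduce the numbers, here is a minimal sketch of how the held jobs could be tallied by HoldReason. It assumes Python 3.7 or later and a condor_q in the PATH; the function names are made up for illustration, and the constraint relies on JobStatus == 5 meaning "held".

```python
import subprocess
from collections import Counter


def count_hold_reasons(lines):
    """Tally HoldReason strings (one per line), ignoring blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


def held_job_reasons():
    """Query condor_q for the HoldReason of every held job.

    Assumption: held jobs are the ones with JobStatus == 5.
    """
    result = subprocess.run(
        ["condor_q", "-constraint", "JobStatus == 5",
         "-format", "%s\n", "HoldReason"],
        capture_output=True, text=True, check=True,
    )
    return count_hold_reasons(result.stdout.splitlines())
```

Calling held_job_reasons().most_common() on the submit machine should list the hold reasons from most to least frequent, which would show whether "error 121" is the only one piling up.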

The machine where Condor runs is a virtual machine with one CPU and 2 GB of RAM. The load average is always over 3, and most of the CPU is taken by the following processes (the percentages change, but these are always the top 4):

top - 11:15:20 up 5 days, 22:06,  1 user,  load average: 3.77, 3.36, 3.13
Tasks: 106 total,   4 running, 101 sleeping,   0 stopped,   1 zombie
Cpu(s): 55.5%us, 44.5%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   2058304k total,  1945344k used,   112960k free,    92836k buffers
Swap:  2000368k total,       64k used,  2000304k free,   434464k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21591 rebatto   17   0 52384  15m 3244 R 80.4  0.8 638:38.04 gahp_server
21571 rebatto   18   0  154m 126m 3696 R 14.0  6.3 538:42.25 condor_gridmana
21607 rebatto   18   0  244m  63m 2928 S  5.0  3.2  26:26.13 cream_gahp
21600 rebatto   15   0 54400  17m 3248 S  0.3  0.9 786:30.39 gahp_server
[...]

The average number of jobs managed by Condor is about 5000.
My only guess at the moment is that the gahp_server (or the condor_gridmanager) cannot cope with all the submissions, because of either CPU or network limitations. Still, I'd like to have an opinion from more experienced users before asking the system managers for a bigger machine...

Thanks for any hint you can give me.

--
David Rebatto
I.N.F.N. - Sezione di Milano
Via Celoria, 16 - 20133 Milano ITALY
tel: +39 02503.17623 e-mail: David.Rebatto@xxxxxxxxxx
URL: http://www.mi.infn.it/~rebatto

"There are 10 kinds of people in the world:
those who understand binary and those who don't..."

