Subject: [Condor-users] eviction of all jobs after a period of successful runs
Hi folks,
I have some parametric problems, where I run as many as 40,000 small
(1.5 minute) jobs on a condor cluster.
The binaries are compiled with checkpointing. Condor version is 6.6.6
on RH 9.0 intel.
What I find is that, besides many 1.5-minute jobs being less than
optimal time-wise, the Condor cluster stops running the jobs after a
while. The Negotiator matches them with run nodes, but the jobs are
evicted immediately after being started. This begins spontaneously
after about two hours of successful runs.
The cluster then sits idle even with these jobs still in the queue.
The Negotiator log just shows successful matchups that schedule jobs on
all available resources. The SchedLog on the submitting node shows that
a shadow is started for each job, and then the job is VACATED. The
StarterLog on the run node shows that the job receives a CHILD_EXIT
event before it really runs at all.
The logs are below.
Any thoughts on why this is happening? I've seen something like this
before.
10/12 07:47:19 Started shadow for job 9.11517 on "<10.101.1.4:40709>",
(shadow pid = 21560)
10/12 07:47:21 Started shadow for job 9.11518 on "<10.101.1.4:40709>",
(shadow pid = 21563)
10/12 07:47:21 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709>
10/12 07:47:21 Match record (<10.101.1.4:40709>, 9, 11517) deleted
10/12 07:47:21 DaemonCore: Command received via TCP from host
<10.101.1.4:43557>
10/12 07:47:21 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/12 07:47:21 Got VACATE_SERVICE from <10.101.1.4:43557>
10/12 07:47:22 Sent RELEASE_CLAIM to startd on <10.101.1.4:40709>
10/12 07:47:22 Match record (<10.101.1.4:40709>, 9, 11518) deleted
10/12 07:47:22 DaemonCore: Command received via TCP from host
<10.101.1.4:43561>
10/12 07:47:22 DaemonCore: received command 443 (VACATE_SERVICE),
calling handler (vacate_service)
10/12 07:47:22 Got VACATE_SERVICE from <10.101.1.4:43561>
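
To put a number on how quickly each claim is torn down, a rough sketch like the one below could be run over the SchedLog. It is only an illustration: the function names (join_wrapped, shadow_lifetimes) are made up here, the regular expressions are written against the exact message wording in the excerpt above and may not match other Condor versions, and since the log timestamps carry no year a placeholder year is assumed.

```python
import re
from datetime import datetime

# Patterns modelled on the SchedLog excerpt above (assumed wording, not a Condor API).
STAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")
STARTED = re.compile(r"Started shadow for job (\d+)\.(\d+) on")
DELETED = re.compile(r"Match record \(<[^>]+>, (\d+), (\d+)\) deleted")

# The log has no year; this placeholder only makes the timestamps parseable.
ASSUMED_YEAR = "2000"


def join_wrapped(lines):
    """Re-join entries that were wrapped onto continuation lines (e.g. by a mailer)."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if STAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line.strip()
    return entries


def shadow_lifetimes(lines):
    """Map 'cluster.proc' to seconds between shadow start and match record deletion."""
    started = {}
    lifetimes = {}
    for entry in join_wrapped(lines):
        m = STAMP.match(entry)
        if not m:
            continue
        when = datetime.strptime(ASSUMED_YEAR + "/" + m.group(1), "%Y/%m/%d %H:%M:%S")
        text = m.group(2)
        s = STARTED.search(text)
        if s:
            started[(s.group(1), s.group(2))] = when
            continue
        d = DELETED.search(text)
        if d:
            key = (d.group(1), d.group(2))
            if key in started:
                delta = when - started.pop(key)
                lifetimes[".".join(key)] = delta.total_seconds()
    return lifetimes
```

Calling shadow_lifetimes(open(path)) on the log file (path being wherever the SchedLog lives on the submit node) returns a dictionary keyed by job id. For the two jobs in the excerpt it reports 2 seconds for 9.11517 and 1 second for 9.11518, far short of the roughly 90 seconds a normal run takes.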