
[Condor-users] Jobs being evicted and replaced by other jobs in the same cluster



Hi all,

I have a Condor pool that is used only for Java Universe runs. These jobs cannot checkpoint and should never be preempted. If a job is terminated early, it should not restart, because the output database needs to be cleared first.

I have attempted to set up Condor to operate this way, but I sometimes see execute nodes evict one job only to replace it with another job from the same user and cluster. According to the log (the Starter log, I think), the first job is preempted based on user priority. This makes no sense, because both jobs are from the same user.

I have temporarily solved the problem by setting MAXJOBRETIREMENTTIME to 5184000 seconds (60 days), which is plenty of time to finish even the longest jobs. However, I'm not sure this is the best fix. I'm afraid the second job might wait for the first job to finish instead of being sent to another execute node that becomes available in the meantime.

I also wondered whether I needed to change the START expression to something other than TRUE, so I tried State == "Unclaimed". With this setting, no jobs ever started, even though condor_status reported all execute nodes as Unclaimed and Idle.

Any ideas?

Chris

P.S.
Here are a few of the configuration settings I think are relevant:

RANK = 0
START = TRUE
SUSPEND = FALSE
CONTINUE = TRUE
PREEMPT = FALSE
KILL = FALSE
PERIODIC_CHECKPOINT = FALSE
PREEMPTION_REQUIREMENTS = FALSE
PREEMPTION_RANK = 0
CLAIM_WORKLIFE = 0
NEGOTIATOR_CONSIDER_PREEMPTION = FALSE
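
For reference, the temporary workaround and the START experiment described above would look roughly like this in the same configuration file. This is only a sketch: the comments and the parentheses around the START expression are assumptions, not copied from the actual file.

```
# Temporary workaround: give a running job up to 60 days (in seconds)
# to finish before it can be evicted.
MAXJOBRETIREMENTTIME = 5184000

# Experiment that resulted in no jobs starting at all (left disabled):
# START = (State == "Unclaimed")
```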