
Re: [HTCondor-users] Jobs on Windows Pool are being preempted for no obvious reason



On 11/7/2013 4:45 PM, Ralph Finch wrote:
Bump. Still have this problem, and it's become more serious with a new
calibration program we're running that doesn't like its job being killed
and restarted.


Some pithy thoughts inline below...


On Thu, Aug 29, 2013 at 9:47 AM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:

HTCondor 8.0.2, pool is entirely Windows 7x64.

Being a Windows pool, there is no checkpointing and we do not want
eviction or preemption. Therefore in the global config file I have (copied
from the manual):

#Disable preemption by machine activity.
PREEMPT = False
#Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
#Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
#Since we are disabling claim preemption, we
# may as well optimize negotiation for this case:
NEGOTIATOR_CONSIDER_PREEMPTION = False
# Without preemption, it is advisable to limit the time during
# which the submit node may keep reusing the same slot for
# more jobs.
CLAIM_WORKLIFE = 3600
UPDATE_INTERVAL  = 180
WANT_SUSPEND  = TRUE
KILL = FALSE


When you say "global" config file, do you mean the above settings are set not only on all your execute machines, but also on your central manager? I ask because some of these settings are read by the condor_startd running on your execute nodes, while others (like PREEMPTION_REQUIREMENTS) are read by the condor_negotiator running on the central manager. Also, did you remember to run condor_reconfig -all after setting the above?
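A quick way to check which daemon actually sees each knob is to query the running daemons directly with condor_config_val. A sketch, assuming a central manager named cm.example.com and an execute node named exec1.example.com (substitute your own hostnames; these commands must run inside your pool):

```shell
# PREEMPTION_REQUIREMENTS is read by the negotiator on the central manager:
condor_config_val -negotiator -name cm.example.com PREEMPTION_REQUIREMENTS

# PREEMPT, WANT_SUSPEND, and KILL are read by the startd on each execute node:
condor_config_val -startd -name exec1.example.com PREEMPT

# After editing config files, push the change out to every daemon in the pool:
condor_reconfig -all
```

If the values printed back differ from what you put in the global config file, the file is not being read by that daemon (or the reconfig never happened).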

Side note: in HTCondor v8.0 and above, you can disable preemption just by setting a single config knob, MaxJobRetirementTime (see http://goo.gl/thLqTh ).
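As a config sketch (the one-week value below is just an illustration, pick whatever retirement window suits your jobs):

```
# Execute-node config: when a preemption decision is made, let the
# running job "retire" (keep running) instead of being killed.
# The expression is evaluated in seconds; here, up to one week.
MaxJobRetirementTime = 3600 * 24 * 7
```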

However, jobs continue to be stopped on one machine, and restarted (from
new, since no checkpointing) on the same or another machine [from a job
.log file]:

000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
...
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
...
006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
     1  -  MemoryUsage of job (MB)
     400  -  ResidentSetSize of job (KB)

001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>


Are you "editing" the above .log file? It is pretty strange that there is no event saying the job was evicted from .246 before an execute event for .102 appears.

The job started on host .246, ran 20 minutes, then started over on .102.


Pretty suspicious that it runs for almost exactly 20 minutes, as 20 minutes is the default job lease time... see
  http://goo.gl/ce4Lyg
The idea of the job lease is that if the execute machine fails to communicate with the submit machine for 20 minutes, the job gets killed. So perhaps the job is being killed off on the execute machine because it cannot communicate with the condor_schedd on the submit machine... maybe a firewall is preventing your execute machines from connecting to your submit machine? To test my wild guess, here is something to try. Let's say your submit machine is my.submit.com (running condor_status -schedd will show all your submit machine names). Can you log in to an execute machine that kicked off the job, like .246, and run:

condor_ping -type schedd -name my.submit.com read

This command will report "read...succeeded" or "read...failed" depending on whether it could successfully contact the schedd on your submit machine. If it says "failed", then we know what is happening, and you'll need to fix your firewall/network issue.

So finally, my question: how can I examine the details of why HTC is doing
this machine switching? I've poked around in various log files but don't
see anything obvious. Or, what condor_status or condor_q commands would
reveal the motive for the switching?


I would want to see the StartLog on a machine like .246 from the time a job starts until it leaves. In the log you will see the slot go to Claimed/Busy; then look at the messages around the point where it changes away from Claimed/Busy...
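To filter the StartLog down to just those state transitions, something like the following works. The sample log lines below are hypothetical placeholders only, showing the general shape of startd state-change messages; your real file lives under the daemon log directory on the execute node (e.g. C:\condor\log\StartLog):

```shell
# Hypothetical StartLog excerpt, for illustration only:
cat > StartLog.sample <<'EOF'
11/07/13 08:06:29 slot1: Changing state: Unclaimed -> Claimed
11/07/13 08:06:29 slot1: Changing activity: Idle -> Busy
11/07/13 08:27:10 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
EOF

# Show only the state/activity transitions, to see why the claim ended:
grep -E 'Changing (state|activity)' StartLog.sample
```

On Windows you can do the same with findstr, e.g. findstr /C:"Changing state" C:\condor\log\StartLog. The reason logged right around the transition away from Claimed/Busy is what you are after.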

Hope the above helps,
Todd