
Re: [HTCondor-users] Jobs on Windows Pool are being preempted for no obvious reason



On 11/7/2013 4:45 PM, Ralph Finch wrote:
Bump. Still have this problem, and it's become more serious with a new
calibration program we're running that doesn't like its job being killed
and restarted.


Some pithy thoughts inline below...


On Thu, Aug 29, 2013 at 9:47 AM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:

HTCondor 8.0.2, pool is entirely Windows 7x64.

Being a Windows pool, there is no checkpointing and we do not want
eviction or preemption. Therefore in the global config file I have (copied
from the manual):

#Disable preemption by machine activity.
PREEMPT = False
#Disable preemption by user priority.
PREEMPTION_REQUIREMENTS = False
#Disable preemption by machine RANK by ranking all jobs equally.
RANK = 0
#Since we are disabling claim preemption, we
# may as well optimize negotiation for this case:
NEGOTIATOR_CONSIDER_PREEMPTION = False
# Without preemption, it is advisable to limit the time during
# which the submit node may keep reusing the same slot for
# more jobs.
CLAIM_WORKLIFE = 3600
UPDATE_INTERVAL  = 180
WANT_SUSPEND  = TRUE
KILL = FALSE


When you say "global" config file, do you mean the above settings are set not only on all your execute machines, but also on your central manager? I ask because some of these settings are read by the condor_startd running on your execute nodes, while others (like PREEMPTION_REQUIREMENTS) are read by the condor_negotiator running on the central manager. Also, did you remember to run condor_reconfig -all after setting the above?
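A quick way to check which daemon actually sees each knob is to query the running daemons directly with condor_config_val. A sketch, assuming a central manager named cm.example.com and an execute node named exec1.example.com (substitute your own hostnames; these commands must run inside your pool):

```shell
# PREEMPTION_REQUIREMENTS is read by the negotiator on the central manager:
condor_config_val -negotiator -name cm.example.com PREEMPTION_REQUIREMENTS

# PREEMPT, WANT_SUSPEND, and KILL are read by the startd on each execute node:
condor_config_val -startd -name exec1.example.com PREEMPT

# After editing config files, push the change out to every daemon in the pool:
condor_reconfig -all
```

If the values printed back differ from what you put in the global config file, the file is not being read by that daemon (or the reconfig never happened).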

Side note: in HTCondor v8.0 and above, you can disable preemption just by setting a single config knob, MaxJobRetirementTime (see http://goo.gl/thLqTh ).
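As a config sketch (the one-week value below is just an illustration, pick whatever retirement window suits your jobs):

```
# Execute-node config: when a preemption decision is made, let the
# running job "retire" (keep running) instead of being killed.
# The expression is evaluated in seconds; here, up to one week.
MaxJobRetirementTime = 3600 * 24 * 7
```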

However, jobs continue to be stopped on one machine, and restarted (from
new, since no checkpointing) on the same or another machine [from a job
.log file]:

000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
...
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
...
006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
     1  -  MemoryUsage of job (MB)
     400  -  ResidentSetSize of job (KB)

001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>


Are you "editing" the above .log file? It is pretty strange that there is no event saying the job was evicted from .246 before an execute event for .102 appears.

The job started on host .246, ran 20 minutes, then started over on .102.


Pretty suspicious that it runs for almost exactly 20 minutes, as 20 minutes is the default job lease time... see
  http://goo.gl/ce4Lyg
The idea of the job lease is that if the execute machine fails to communicate with the submit machine for 20 minutes, the job gets killed. So perhaps the job is being killed off on the execute machine because it cannot communicate with the condor_schedd on the submit machine... maybe a firewall is preventing your execute machines from connecting to your submit machine? To test my wild guess, here is something to try. Let's say your submit machine is my.submit.com (running condor_status -schedd will show all your submit machine names). Can you log in to an execute machine that kicked off the job, like .246, and run:

condor_ping -type schedd -name my.submit.com read

This command will report "read...succeeded" or "read...failed" depending on whether it could successfully contact the schedd on your submit machine. If it says "failed", then we know what is happening, and you'll need to fix your firewall/network issue.

So finally, my question: how can I examine the details of why HTC is doing
this machine switching? I've poked around in various log files but don't
see anything obvious. Or, what condor_status or condor_q commands would
reveal the motive for the switching?


I would want to see the StartLog on a machine like .246 from the time a job starts until it leaves. In the log you will see the slot go to Claimed/Busy; then look at the messages around the point where it changes away from Claimed/Busy...
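To filter the StartLog down to just those state transitions, something like the following works. The sample log lines below are hypothetical placeholders only, showing the general shape of startd state-change messages; your real file lives under the daemon log directory on the execute node (e.g. C:\condor\log\StartLog):

```shell
# Hypothetical StartLog excerpt, for illustration only:
cat > StartLog.sample <<'EOF'
11/07/13 08:06:29 slot1: Changing state: Unclaimed -> Claimed
11/07/13 08:06:29 slot1: Changing activity: Idle -> Busy
11/07/13 08:27:10 slot1: Changing state and activity: Claimed/Busy -> Preempting/Killing
EOF

# Show only the state/activity transitions, to see why the claim ended:
grep -E 'Changing (state|activity)' StartLog.sample
```

On Windows you can do the same with findstr, e.g. findstr /C:"Changing state" C:\condor\log\StartLog. The reason logged right around the transition away from Claimed/Busy is what you are after.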

Hope the above helps,
Todd