
Re: [HTCondor-users] Jobs on Windows Pool are being preempted for no obvious reason



Pithy thoughts most certainly welcome and appreciated! Hopefully pithy responses follow:

When you say "global" config file, does that mean the above config settings are set not only on all your execute machines, but also upon your central manager?

Yes, the condor_config file, which we make sure is identical on all machines in the pool.
 
 I ask because some of the above settings are read by the condor_startd running on your execute nodes, but some (like PREEMPTION_REQUIREMENTS) are read by the condor_negotiator running on the central manager.  

Did I do a Bad Thing by having these lines on execute nodes?
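As a sanity check on which daemon sees what, I assume I can query the running daemons directly with condor_config_val; something along these lines, where the host names are placeholders for ours:

    condor_config_val -name our.central.manager -negotiator PREEMPTION_REQUIREMENTS
    condor_config_val -name one.execute.node -startd PREEMPT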
 
Also, did you remember to do a condor_reconfig -all after setting the above?

Yes, and condor_restart -all. Several times, in the hope something would finally "take".

Side note: in HTCondor v8.0 and above, you can disable preemption just by setting one config knob, MaxJobRetirementTime (see http://goo.gl/thLqTh).

I shall take a look, thanks.
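If I am reading the manual correctly, that would be a single line in condor_config, something like the following (my guess at a value, in seconds; untested on our pool):

    # let running jobs retire (i.e. run to completion) rather than be preempted
    MaxJobRetirementTime = 3600 * 24 * 7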

However, jobs continue to be stopped on one machine and restarted (from
scratch, since there is no checkpointing) on the same or another machine [from a job
.log file]:

000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
...
001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
...
006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
     1  -  MemoryUsage of job (MB)
     400  -  ResidentSetSize of job (KB)

001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>


 
Are you "editing" the above .log file?  It is pretty strange that there is no event saying the job was evicted from .246 before an execute event for .102 appears.

Only to obscure our actual IP addresses. I didn't delete relevant lines; the ellipses indicate removal of unimportant lines about Image size changes, etc.

The job started on host .246, ran 20 minutes, then started over on .102.

Pretty suspicious that it runs for almost exactly 20 minutes, as 20 minutes is the default job lease duration (the job_lease_duration submit command)... see
  http://goo.gl/ce4Lyg

OK, again I shall take a look.
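From a quick skim, it looks as though the lease can also be lengthened per job in the submit description file; something like this, if I have the syntax right (value in seconds, untested here):

    # give the job a 40-minute lease instead of the 20-minute default
    job_lease_duration = 2400

though fixing the underlying connectivity problem is obviously the real answer.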
 
The idea of the job lease is that if the execute machine fails to communicate with the submit machine for 20 minutes, the job will get killed. So perhaps the job is being killed off on the execute machine because it cannot communicate with the condor_schedd on the submit machine... maybe there is a firewall preventing your execute machines from connecting to your submit machine? To test my wild guess, here is something to try: let's say your submit machine is my.submit.com (running condor_status -schedd will show all your submit machine names). Can you log in to an execute machine that kicked off the job, like .246, and run:

condor_ping -type schedd -name my.submit.com read

This command will report "read...succeeded" or "read...failed" depending on whether it could successfully contact the schedd on your submit machine.  If it says "failed", then we know what is happening, and you'll need to fix your firewall/network issue.
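If it does fail, one common approach (sketched from memory; the values are placeholders you would adapt to your firewall rules) is to pin HTCondor's ephemeral ports to a range the firewall allows, or to funnel daemon traffic through the shared port daemon:

    # condor_config: restrict daemon ports to a known range the firewall permits
    LOWPORT = 9600
    HIGHPORT = 9700
    # or route incoming connections through a single port
    USE_SHARED_PORT = TRUE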


So finally, my question: how can I examine the details of why HTCondor is doing
this machine switching? I've poked around in various log files but don't
see anything obvious. Or, what condor_status or condor_q commands would
reveal the motive for the switching?


I would want to see the StartLog on a machine like .246 from the time a job starts until it leaves.  In the log you will see the slot going to Claimed->Busy; the interesting part is the messages around the point where it changes away from Claimed->Busy...
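For instance (hedged: the log directory and exact message wording vary by install; on Windows it is often C:\condor\log), something along these lines on .246 would locate the log and pull out the state changes:

    condor_config_val LOG
    findstr /C:"Changing state" /C:"PREEMPT" C:\condor\log\StartLog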

Hope the above helps,

Quite a bit, actually. We do have somewhat strict security measures, perhaps not perfectly implemented. I'll get back with the results of the assignments you provided; it will be a few days. Thanks again--

Ralph Finch
Calif. Dept. of Water Resources