
Re: [HTCondor-users] Jobs on Windows Pool are being preempted for no obvious reason



Why is WANT_SUSPEND set to true?
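
On the question of how to see what happened: the job log quoted below can be checked mechanically for restarts. What follows is only a rough sketch in Python, assuming the event lines have exactly the layout shown in the excerpt (three-digit event code, job id in parentheses, date, time, text). The function name find_restarts is made up for illustration and is not an HTCondor tool. It lists every job with more than one 001 ("Job executing") event, together with the event codes logged between consecutive starts. In the excerpt as quoted, only a 006 event sits between the two starts; an 004 (evicted) event would normally be expected there, so its absence may be worth checking against the full log.

```python
import re
from collections import defaultdict

# Layout assumed from the quoted excerpt:
#   001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
EVENT_RE = re.compile(
    r"^(?P<code>\d{3}) \((?P<job>\d+\.\d+\.\d+)\) "
    r"(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (?P<text>.*)$"
)
HOST_RE = re.compile(r"<([^:>]+):\d+>")


def find_restarts(log_text):
    """Map job id -> list of (timestamp, host, codes seen since the previous
    start), keeping only jobs that have more than one 001 event."""
    starts = defaultdict(list)
    between = defaultdict(list)
    for raw in log_text.splitlines():
        # Tolerate mail quoting such as ">> " in front of each line.
        line = raw.lstrip("> ").rstrip()
        m = EVENT_RE.match(line)
        if m is None:
            continue  # "..." separators and indented detail lines
        job = m.group("job")
        if m.group("code") == "001":
            host = HOST_RE.search(m.group("text"))
            starts[job].append(
                (m.group("stamp"),
                 host.group(1) if host else None,
                 tuple(between[job]))
            )
            between[job] = []
        else:
            between[job].append(m.group("code"))
    return {job: runs for job, runs in starts.items() if len(runs) > 1}


if __name__ == "__main__":
    import sys
    # Usage: python find_restarts.py path/to/job.log
    with open(sys.argv[1]) as fh:
        for job, runs in find_restarts(fh.read()).items():
            for stamp, host, codes in runs:
                print(job, stamp, host, "events before this start:", codes)
```

This only shows where and when the restarts happened; the reason for them would still have to come from the daemon logs on the machines involved.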

On Thu, Nov 7, 2013 at 4:45 PM, Ralph Finch <ralphmariafinch@xxxxxxxxx> wrote:
> Bump. Still have this problem, and it's become more serious with a new
> calibration program we're running that doesn't like its job being killed and
> restarted.
>
>
> On Thu, Aug 29, 2013 at 9:47 AM, Ralph Finch <ralphmariafinch@xxxxxxxxx>
> wrote:
>>
>> HTCondor 8.0.2, pool is entirely Windows 7 x64.
>>
>> Being a Windows pool, there is no checkpointing and we do not want
>> eviction or preemption. Therefore in the global config file I have (copied
>> from the manual):
>>
>> #Disable preemption by machine activity.
>> PREEMPT = False
>> #Disable preemption by user priority.
>> PREEMPTION_REQUIREMENTS = False
>> #Disable preemption by machine RANK by ranking all jobs equally.
>> RANK = 0
>> #Since we are disabling claim preemption, we
>> # may as well optimize negotiation for this case:
>> NEGOTIATOR_CONSIDER_PREEMPTION = False
>> # Without preemption, it is advisable to limit the time during
>> # which the submit node may keep reusing the same slot for
>> # more jobs.
>> CLAIM_WORKLIFE = 3600
>> UPDATE_INTERVAL = 180
>> WANT_SUSPEND = TRUE
>> KILL = FALSE
>>
>> However, jobs continue to be stopped on one machine and restarted (from
>> scratch, since there is no checkpointing) on the same or another machine.
>> From a job .log file:
>>
>> 000 (231.001.000) 08/29 08:06:11 Job submitted from host: <1.2.3.189:9685>
>> ...
>> 001 (231.001.000) 08/29 08:06:29 Job executing on host: <1.2.3.246:9651>
>> ...
>> 006 (231.001.000) 08/29 08:06:37 Image size of job updated: 2500
>>     1  -  MemoryUsage of job (MB)
>>     400  -  ResidentSetSize of job (KB)
>>
>> 001 (231.001.000) 08/29 08:27:30 Job executing on host: <1.2.3.102:9619>
>>
>> The job started on host .246, ran for about 20 minutes, then started over
>> on host .102.
>>
>> So finally, my question: how can I examine the details of why HTCondor is
>> doing this machine switching? I've poked around in various log files but
>> don't see anything obvious. Or, what condor_status or condor_q commands
>> would reveal the motive for the switching?
>>
>> Thanks,
>> Ralph Finch
>> Calif. Dept. of Water Resources
>>



-- 
HTCondor Project Windows Developer / NEOS Maintainer