Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?

Date: Sat, 25 Jan 2014 08:44:46 -0600
From: Todd Tannenbaum <tannenba@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?

On 1/25/2014 2:38 AM, 钱晓明 wrote:

Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each
takes 25 to 90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are silently
transferred to another machine after executing for a while on one machine.
This is the event sequence for one such job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing
between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also
checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.

I have not configured a RANK expression for the startd. The job's rank is:
-SlotId + HasGPU*1000+GPUCores. But I do not think this is the reason.
HasGPU is true and GPUCores is 2496 now.

Thanks. I need to figure out why the jobs are transferred to another machine.

First guess is your job was preempted, i.e. it was running on 10.1.1.254 and then kicked off to make room for either a higher priority job or because "owner" activity was detected. To see how to disable preemption, see the Manual or the HOWTO recipes on the wiki, specifically http://goo.gl/kFf9O7
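
As a rough illustration of what disabling preemption can look like, here is a minimal configuration sketch. It assumes the standard HTCondor configuration macros PREEMPT, RANK, PREEMPTION_REQUIREMENTS and NEGOTIATOR_CONSIDER_PREEMPTION; it is not copied from the linked recipe, so check the names and defaults against the Manual for your version before using it.

```
# Sketch only: verify each macro against the Manual for your HTCondor version.

# On the execute machines (read by the startd):
# do not preempt jobs because of machine ("owner") activity.
PREEMPT = False
# Rank all jobs equally, so a job is never kicked off for one the machine prefers.
RANK = 0

# On the central manager (read by the negotiator):
# do not preempt jobs because another user has better priority.
PREEMPTION_REQUIREMENTS = False
# With the settings above in place, the negotiator need not consider
# already-claimed machines at all.
NEGOTIATOR_CONSIDER_PREEMPTION = False
```

The first two settings belong on the execute machines and the last two on the central manager; after editing the configuration, running condor_reconfig makes the daemons reread it.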


regards,
Todd

