
Re: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?



On 1/25/2014 2:38 AM, éææ wrote:
Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each
takes 25 to 90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are
silently moved to another machine after executing for a while on one
machine. This is the event sequence for such a job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing
between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also
checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.

I have not configured the RANK expression for the startd. The rank for the job is:
-SlotId + HasGPU*1000 + GPUCores. But I do not think this is the reason.
HasGPU is true and GPUCores is 2496 now.

Thanks. I need to figure out why the jobs were transferred to another machine.

My first guess is that your job was preempted, i.e. it was running on 10.1.1.254 and was then kicked off, either to make room for a higher-priority job or because "owner" activity was detected on the machine. To see how to disable preemption, see the Manual or the HOWTO recipes on the wiki, specifically http://goo.gl/kFf9O7
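
As a rough illustration of what disabling preemption can look like, here is a minimal configuration sketch. The knob names are the ones I recall from the HTCondor manual and the wiki recipe; the values and comments are assumptions about a pool where jobs should never be kicked off, so please check them against the manual for your version (7.8.5) before relying on them.

```
# Sketch only: verify each knob against the manual for your HTCondor version.

# On the execute machines (startd): never preempt a running job because of
# machine policy such as owner activity.
PREEMPT = False

# On the execute machines: give every job the same machine rank, so that a
# "better ranked" job cannot displace a running one.
RANK = 0

# On the central manager (negotiator): never preempt based on user priority.
PREEMPTION_REQUIREMENTS = False

# On the central manager: skip preemption matchmaking altogether
# (an optimization once the settings above are in place).
NEGOTIATOR_CONSIDER_PREEMPTION = False
```

After changing the configuration, running condor_reconfig on the affected machines should make the daemons pick up the new values. Note that this trades away the pool's ability to reclaim machines, so long-running jobs can hold a slot until they finish.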

regards,
Todd