
[HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?



Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each takes 25 to 90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are silently moved to another machine after executing for a while on the first one. This is the event sequence for one such job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. Nothing is logged between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.
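To double-check the sequence above against the raw user log, a small script along these lines could list every job that logged more than one EXECUTE event, together with everything recorded for it. This is only a sketch: it assumes each event starts with a header line of the form "001 (cluster.proc.subproc) MM/DD HH:MM:SS text", and that code 001 is the EXECUTE event. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# Assumed header layout of one user-log event, e.g.
#   "001 (12.000.000) 07/10 12:01:00 Job executing on host: <10.1.1.254:9001>"
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) \S+ \S+ (.*)$")
EXECUTE_CODE = "001"  # assumed numeric code of the EXECUTE event


def events_by_job(log_text):
    """Group (event code, event text) pairs by "cluster.proc" job id."""
    jobs = defaultdict(list)
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match:  # body lines and "..." separators are skipped
            code, cluster, proc, text = match.groups()
            jobs[f"{int(cluster)}.{int(proc)}"].append((code, text))
    return dict(jobs)


def restarted_jobs(log_text):
    """Return only the jobs that logged more than one EXECUTE event."""
    restarted = {}
    for job, events in events_by_job(log_text).items():
        executes = sum(1 for code, _ in events if code == EXECUTE_CODE)
        if executes > 1:
            restarted[job] = events
    return restarted
```

Printing the result of restarted_jobs(open(path).read()) for the job's log file would show whether any other event code sits between the two EXECUTE events.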

I have not configured a RANK expression for the startd. The job's rank is: -SlotId + HasGPU*1000 + GPUCores. I do not think this is the cause, though. HasGPU is currently true and GPUCores is 2496.
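For reference, the job rank can be worked out by hand with the values above. The sketch below assumes that a true HasGPU counts as 1 in the arithmetic and uses slot ids 1 and 2 purely as illustrative values.

```python
def job_rank(slot_id, has_gpu, gpu_cores):
    """Evaluate -SlotId + HasGPU*1000 + GPUCores for one slot."""
    # Assumption: a boolean HasGPU contributes 1 (true) or 0 (false).
    return -slot_id + int(has_gpu) * 1000 + gpu_cores


# Hypothetical slot ids; HasGPU and GPUCores as stated in the post.
ranks = {slot: job_rank(slot, True, 2496) for slot in (1, 2)}
print(ranks)  # {1: 3495, 2: 3494}
```

Under these assumptions, two machines with the same HasGPU and GPUCores values get ranks that differ only by SlotId.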

Thanks. I need to figure out why the jobs are being moved to another machine.