
Re: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?



I will try to confirm that. Could you tell me which signal is used to stop a job during preemption? Is it just SIGKILL?

On 2014-1-25 at 10:45 PM, "Todd Tannenbaum" <tannenba@xxxxxxxxxxx> wrote:
On 1/25/2014 2:38 AM, 钱晓明 wrote:
Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each
takes 25~90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are silently
moved to another machine after executing for a while on one machine. This is
the event sequence for such a job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing
between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also
checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.

I have not configured the RANK expression for the startd. The rank for the job is:
-SlotId + HasGPU*1000+GPUCores. But I do not think this is the reason.
HasGPU is true and GPUCores is 2496 now.

I need to figure out why the jobs are moved to another machine.
Thanks.

My first guess is that your job was preempted, i.e. it was running on 10.1.1.254 and was then kicked off, either to make room for a higher-priority job or because "owner" activity was detected.  To see how to disable preemption, see the Manual or the HOWTO recipes on the wiki, specifically http://goo.gl/kFf9O7

regards,
Todd
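
For readers following the thread, the kind of configuration that disables preemption might look roughly like the sketch below. It is not taken from Todd's message. It assumes the usual HTCondor configuration macros (PREEMPT, PREEMPTION_REQUIREMENTS, RANK, NEGOTIATOR_CONSIDER_PREEMPTION) and a pool where no job should ever be kicked off a slot. Check the manual for your HTCondor version before using it.

```
# Sketch only: assumes a dedicated pool where running jobs must never be evicted.

# Do not preempt because of machine ("owner") activity.
PREEMPT = False

# Do not preempt in favor of a user with better priority.
PREEMPTION_REQUIREMENTS = False

# Rank all jobs equally on the startd, so a "better" job cannot displace a running one.
RANK = 0

# With claim preemption off, the negotiator need not consider it at all.
NEGOTIATOR_CONSIDER_PREEMPTION = False
```

The first and third settings belong in the execute machines' configuration and the other two in the central manager's, followed by a condor_reconfig. With preemption off, a claim can be held for a long time, so fair sharing between users then depends on jobs finishing on their own.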
