
Re: [HTCondor-users] Why jobs transferred to another machine silently after executing long time in one machine?



Thank you!
How could I have forgotten to check the log file on the execute machine! I will do this after the Spring Festival.

On 2014-1-25 11:52 PM, "Todd Tannenbaum" <tannenba@xxxxxxxxxxx> wrote:
On 1/25/2014 9:34 AM, 钱晓明 wrote:
I will try to confirm. Could you tell me which signal is used to stop job
in preemption? Is it just SIGKILL?

I cannot tell you for certain as this is all configurable, both by the admin in the condor_config file and also via end-users with per-job options in the job submit file.  IIRC, by default, the parent process for the job is first sent a SIGTERM.  If the job is still around after X seconds (I think X defaults to ~30), then all processes in the job are sent a SIGKILL.
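As Todd notes, the signal is configurable per job. A minimal submit-file sketch of that per-job option (kill_sig is a real condor_submit command; the job name and script here are made up for illustration, and the exact grace-period default should be checked against your version's manual):

```
# job.submit -- sketch of overriding the eviction signal per job
universe   = vanilla
executable = my_gpu_job.sh   # hypothetical wrapper script

# Send SIGUSR1 instead of the default SIGTERM when the job is
# preempted/evicted. Processes that survive the grace period
# are still sent SIGKILL.
kill_sig   = SIGUSR1

queue
```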

You could also look in the StarterLog.slotX in HTCondor's log directory (where X is the slot number which ran the job) on the execute machine; the log will state what happened (including what signal HTCondor sent) at the time the job was kicked off.  You can find the log directory via
  condor_config_val log
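Putting those steps together on the execute machine might look like the following (a sketch: the grep pattern is just a guess at useful keywords, and the slot number must match the slot that ran the job):

```shell
# On the execute machine: locate HTCondor's log directory,
# then look for the eviction record in the per-slot starter log.
LOG_DIR=$(condor_config_val LOG)
ls "$LOG_DIR"/StarterLog.slot*                      # one log per slot
grep -i -E 'signal|kill|vacate' "$LOG_DIR"/StarterLog.slot1
```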

Hope the above helps
Todd

On 2014-1-25 10:45 PM, "Todd Tannenbaum" <tannenba@xxxxxxxxxxx> wrote:

On 1/25/2014 2:38 AM, 钱晓明 wrote:

Hi, I am using HTCondor 7.8.5 on CentOS 6.3. I have GPU jobs to run, and each
takes 25~90 minutes. Each machine has 2 GPUs.

All GPU jobs are in one node of a DAG job. I find that some jobs are silently
transferred to another machine after executing for a while on one machine.
This is the event sequence for such a job:
SUBMIT
EXECUTE on 10.1.1.254
IMAGE_SIZE_UPDATE
IMAGE_SIZE_UPDATE
EXECUTE on 10.1.1.251
......
I want to know why the second EXECUTE event occurred. There is nothing
between the last IMAGE_SIZE_UPDATE event and the second EXECUTE event. I also
checked the *.dag.nodes.log and *.dagman.out files and found nothing helpful.

I did not configure the RANK expression for the startd. The rank for the job is:
-SlotId + HasGPU*1000 + GPUCores. But I do not think this is the reason:
HasGPU is true and GPUCores is 2496 now.

I have to figure out why jobs are transferred to another machine.
Thanks.


First guess is your job was preempted, i.e. it was running on 10.1.1.254
and then kicked off to make room for either a higher-priority job or
because "owner" activity was detected.  To see how to disable preemption,
see the Manual or the HOWTO recipes on the wiki, specifically
http://goo.gl/kFf9O7
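The wiki recipe boils down to a handful of configuration knobs on the negotiator and the startd. A hedged sketch of those additions (knob names taken from the "disable preemption" HOWTO; verify against the 7.8 manual before deploying):

```
# condor_config additions -- disable preemption entirely
# (sketch based on the "how to disable preemption" wiki recipe)

# Negotiator: never consider kicking a running job off its slot
NEGOTIATOR_CONSIDER_PREEMPTION = False
PREEMPTION_REQUIREMENTS = False

# Startd: never prefer a new job over the running one, never evict
RANK    = 0
PREEMPT = False
KILL    = False
```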

regards,
Todd

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/






--
Todd Tannenbaum <tannenba@xxxxxxxxxxx> University of Wisconsin-Madison
Center for High Throughput Computing   Department of Computer Sciences
HTCondor Technical Lead                1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                  Madison, WI 53706-1685