
Re: [Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM



Rob,

> On Windows, when a local user evicts a job from their system using
> condor_vacate, the job gets a SIGTERM, shuts down gracefully, and is
> re-queued:
> 
> 
> 12/18 13:54:20 Create_Process succeeded, pid=2332
> 12/18 13:56:28 Got SIGTERM. Performing graceful shutdown.
> 12/18 13:56:28 ShutdownGraceful all jobs.
> 12/18 13:56:28 Process exited, pid=2332, status=-1073741510
> 12/18 13:56:28 Last process exited, now Starter is exiting
> 12/18 13:56:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

Your analysis here is wrong.  It is the STARTER that gets the SIGTERM,
not the job.  The STARTER then kills the job.
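
As an aside, the exit status in that log is easier to read in hex.  A
minimal Python sketch, assuming the status is a signed 32-bit rendering
of a Windows NTSTATUS value (the helper name is made up for
illustration):

```python
def to_ntstatus(status: int) -> str:
    """Render a signed 32-bit exit status as an unsigned hex NTSTATUS code."""
    # Mask to 32 bits so a negative value maps to its unsigned equivalent.
    return f"0x{status & 0xFFFFFFFF:08X}"

print(to_ntstatus(-1073741510))  # 0xC000013A
```

0xC000013A is STATUS_CONTROL_C_EXIT, which fits a process that was
terminated from outside rather than one that exited on its own.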

> In the job .log, this action is described as:
> 
> 004 (036.000.000) 12/18 15:51:55 Job was evicted.
>          (0) Job was not checkpointed.
> 
> The job will be run anew on another client, as expected.

This is because the STARTER knows it killed the job itself.

> But when a local administrator uses the task manager to end the
> condor_exec process, the job gets a SIGQUIT, shuts down quickly and is
> not re-queued:
> 
> 12/18 14:04:20 Create_Process succeeded, pid=2628
> 12/18 14:12:33 Process exited, pid=2628, status=1
> 12/18 14:12:33 Got SIGQUIT.  Performing fast shutdown.
> 12/18 14:12:33 ShutdownFast all jobs.
> 12/18 14:12:33 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

Note that the SIGQUIT comes after the job has exited.  This is part of
the normal termination of the STARTER by the STARTD after the job has
finished.  The STARTER doesn't know why the job exited, only that it did.
 
> The job .log still shows "normal termination", as if the job had run to
> completion, but with return value 1 instead of 0:
> 
> 005 (037.000.000) 12/18 15:53:37 Job terminated.
>          (1) Normal termination (return value 1)

This is correct: the STARTER saw the job exit normally (the STARTER
didn't kill it).  The exit status is from the job, not the STARTER.
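
If you want to find such jobs after the fact, the user log is enough to
go on.  A rough sketch in Python, assuming the log contains event lines
in the format you quoted (the function and regex names are mine, not
part of Condor):

```python
import re

# Matches event header lines such as:
#   005 (037.000.000) 12/18 15:53:37 Job terminated.
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) \S+ \S+ (.*)$")
# Matches the detail line: (1) Normal termination (return value 1)
RETVAL_RE = re.compile(r"\(return value (-?\d+)\)")

def nonzero_exits(log_text):
    """Return (cluster, proc, return_value) for 005 events with a non-zero value."""
    results = []
    current = None  # (cluster, proc) of the 005 event being read, if any
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            code = header.group(1)
            job = (int(header.group(2)), int(header.group(3)))
            # Only "Job terminated" (005) events carry a return value.
            current = job if code == "005" else None
            continue
        found = RETVAL_RE.search(line)
        if current and found and int(found.group(1)) != 0:
            results.append((*current, int(found.group(1))))
            current = None
    return results

sample = """004 (036.000.000) 12/18 15:51:55 Job was evicted.
         (0) Job was not checkpointed.
005 (037.000.000) 12/18 15:53:37 Job terminated.
         (1) Normal termination (return value 1)
"""
print(nonzero_exits(sample))  # [(37, 0, 1)]
```

Whether a non-zero return value means "killed by an administrator" or
"the job really failed" is something only you can decide; the log
cannot distinguish the two.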

> Condor apparently knows something is wrong, and sets exit status 1
> accordingly, but doesn't reschedule, so I've now "lost" a job. What is
> the reasoning behind this behavior, and how can I change it so I don't
> lose jobs when administrators send them SIGQUIT?

Again, it is the job, not Condor, that sets the exit status.

Why are administrators killing Condor jobs?  Note that I don't say
"sending them SIGQUIT", because that isn't what is happening; they are
killing the jobs outside of Condor.  Why aren't they using condor_vacate
or condor_vacate_job for this purpose?  Otherwise, there is no way for
Condor to know why the job exited.

-- 
Daniel K. Forrest	Laboratory for Molecular and
forrest@xxxxxxxxxxxxx	Computational Genomics
(608) 262 - 9479	University of Wisconsin, Madison