
[Condor-users] administrator SIGQUIT vs condor_vacate SIGTERM



Hello,

On Windows, when a local user evicts a job from their system using
condor_vacate, the job gets a SIGTERM, shuts down gracefully, and is
re-queued:


12/18 13:54:20 Create_Process succeeded, pid=2332
12/18 13:56:28 Got SIGTERM. Performing graceful shutdown.
12/18 13:56:28 ShutdownGraceful all jobs.
12/18 13:56:28 Process exited, pid=2332, status=-1073741510
12/18 13:56:28 Last process exited, now Starter is exiting
12/18 13:56:28 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0
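
The negative status on the "Process exited" line is easier to read as an
unsigned 32-bit value. A minimal Python sketch for that conversion follows;
it is not part of Condor, and the helper names to_ntstatus and describe are
made up for illustration. Only the one code seen in this log is listed, with
the name taken from Microsoft's NTSTATUS table.

```python
# Reinterpret a signed 32-bit Windows exit status as an unsigned hex code.
# Helper names are hypothetical; this is not a Condor API.

# Only the code that appears in the log above is listed here.
KNOWN = {0xC000013A: "STATUS_CONTROL_C_EXIT"}


def to_ntstatus(status: int) -> str:
    """Return the status as an 8-digit unsigned hex string."""
    return f"0x{status & 0xFFFFFFFF:08X}"


def describe(status: int) -> str:
    """Return the hex code plus a symbolic name if one is known."""
    code = status & 0xFFFFFFFF
    return f"{to_ntstatus(status)} {KNOWN.get(code, 'unknown')}"


print(describe(-1073741510))  # prints "0xC000013A STATUS_CONTROL_C_EXIT"
```

If that reading is right, the job process was ended by a Ctrl-C style
request, which would fit the graceful shutdown path shown in the log.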

In the job .log, this action is described as:

004 (036.000.000) 12/18 15:51:55 Job was evicted.
        (0) Job was not checkpointed.

The job will be run anew on another client, as expected.

But when a local administrator uses the Windows Task Manager to end the
condor_exec process, the job gets a SIGQUIT, shuts down quickly, and is
not re-queued:

12/18 14:04:20 Create_Process succeeded, pid=2628
12/18 14:12:33 Process exited, pid=2628, status=1
12/18 14:12:33 Got SIGQUIT.  Performing fast shutdown.
12/18 14:12:33 ShutdownFast all jobs.
12/18 14:12:33 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

The job .log still shows "normal termination", as if the job had run to
completion, but with return value 1 instead of 0:

005 (037.000.000) 12/18 15:53:37 Job terminated.
        (1) Normal termination (return value 1)
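
To find jobs that ended this way, the job .log can be scanned for 005
events with a nonzero return value. A rough Python sketch follows. It
assumes the event layout matches the two excerpts in this message; the
function name failed_jobs and the regular expressions are my own and not
part of Condor.

```python
import re

# Event header, e.g. "005 (037.000.000) 12/18 15:53:37 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\S+) (\S+) (.*)$")
# Detail line, e.g. "(1) Normal termination (return value 1)"
RETVAL_RE = re.compile(r"Normal termination \(return value (-?\d+)\)")


def failed_jobs(log_text):
    """Return (cluster, proc, return_value) for 005 events with a nonzero value."""
    failed = []
    current = None  # (event code, cluster, proc) of the most recent header
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            current = (header.group(1), int(header.group(2)), int(header.group(3)))
            continue
        detail = RETVAL_RE.search(line)
        if detail and current and current[0] == "005":
            value = int(detail.group(1))
            if value != 0:
                failed.append((current[1], current[2], value))
    return failed


sample = """004 (036.000.000) 12/18 15:51:55 Job was evicted.
        (0) Job was not checkpointed.
005 (037.000.000) 12/18 15:53:37 Job terminated.
        (1) Normal termination (return value 1)
"""
print(failed_jobs(sample))  # prints [(37, 0, 1)]
```

The resulting list could be used to decide which jobs to resubmit by hand.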

Condor apparently knows something is wrong, and sets exit status 1
accordingly, but doesn't reschedule, so I've now "lost" a job. What is
the reasoning behind this behavior, and how can I change it so I don't
lose jobs when administrators send them SIGQUIT?

Thanks,

Rob de Graaf