
Re: [HTCondor-users] Job Suspended - Stuck



On 1/16/2014 12:02 PM, Andrey Kuznetsov wrote:
Hi,

Here's the log file from a job that appears to be suspended, and I cannot
resume it.
Short of removing the job and resubmitting it, is there another way to
force it to restart or continue?


The story here is that your job landed on a machine configured to suspend the jobs running on it when some condition becomes true (e.g. keyboard activity or an increased non-condor load average), and then to unsuspend or restart the job after some amount of time. This sort of policy is common when running jobs on non-dedicated desktop machines.

As a user submitting jobs, if you never want your jobs to suspend, your only recourse is to add a requirement to your submit file that avoids machines with such a policy (if there are any such machines in your pool).

If you are also the administrator of the machines in your pool, you could put
  SUSPEND =  FALSE
into your condor_config file...

Todd

001 (1321.003.000) 01/15 15:57:23 Job executing on host: <128.114.*.*:9944>
...
006 (1321.003.000) 01/15 15:57:32 Image size of job updated: 24704
     2  -  MemoryUsage of job (MB)
     1572  -  ResidentSetSize of job (KB)
...
010 (1321.003.000) 01/15 15:58:56 Job was suspended.
     Number of processes actually suspended: 2
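
To check from a job log like the one above whether a job is still suspended, here is a minimal Python sketch. It is an illustration, not an HTCondor tool: the function name find_suspended and the regular expression are made up for this example. It assumes that event 010 marks a suspension (as in the log above) and that events 011 (unsuspended), 004 (evicted), 005 (terminated) and 009 (aborted) end it.

```python
import re

# Header line of a job-log event, e.g.
# "010 (1321.003.000) 01/15 15:58:56 Job was suspended."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")

SUSPENDED = "010"
# Assumption: these events mean the job is no longer sitting suspended
# (011 unsuspended, 004 evicted, 005 terminated, 009 aborted).
CLEARING = {"011", "004", "005", "009"}


def find_suspended(log_text):
    """Return {job_id: timestamp} for jobs whose latest state is suspended."""
    suspended = {}
    for line in log_text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match is None:
            continue  # detail lines and "..." separators are skipped
        code, cluster, proc, when, _text = match.groups()
        job_id = f"{int(cluster)}.{int(proc)}"
        if code == SUSPENDED:
            suspended[job_id] = when
        elif code in CLEARING:
            suspended.pop(job_id, None)
    return suspended


SAMPLE_LOG = """\
001 (1321.003.000) 01/15 15:57:23 Job executing on host: <128.114.*.*:9944>
...
006 (1321.003.000) 01/15 15:57:32 Image size of job updated: 24704
     2  -  MemoryUsage of job (MB)
     1572  -  ResidentSetSize of job (KB)
...
010 (1321.003.000) 01/15 15:58:56 Job was suspended.
     Number of processes actually suspended: 2
"""

if __name__ == "__main__":
    for job_id, when in find_suspended(SAMPLE_LOG).items():
        print(f"job {job_id} suspended since {when}")
```

Run against the log above, this reports job 1321.3 as suspended since 01/15 15:58:56. A later 011 event for the same job would remove it from the result.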