
Re: [HTCondor-users] Job Suspended - Stuck



I get:
condor_vacate_job 1321.3
Job 1321.3 not running to be vacated
So it didn't seem to work.

Thanks. This is a desktop machine, so it must reserve one slot for the desktop user.

I was wondering why the job wasn't resumed after the user left the machine, since the slot should have freed up at that point. The relevant policy settings are:

DESKTOP = TRUE
SLOTS_CONNECTED_TO_CONSOLE = 1

WANT_VACATE  = ($(ActivationTimer) > 10 * $(MINUTE))

SUSPEND = ifThenElse($(DESKTOP),(($(KeyboardBusy) || $(ConsoleBusy)) && ((SlotID <= $(SLOTS_CONNECTED_TO_CONSOLE)) || (SlotID <= $(SLOTS_CONNECTED_TO_CONSOLE))) || ((CpuBusyTime > 2 * $(MINUTE)) && $(ActivationTimer) > 90)),FALSE)

CONTINUE = ($(CPUIdle) && ($(ActivityTimer) > 10) && (KeyboardIdle > $(ContinueIdleTime)))

PREEMPT = ifThenElse($(DESKTOP),((Activity == "Suspended" && ($(ActivityTimer) > $(MaxSuspendTime))) || $(SUSPEND)),FALSE)
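
To make the resume condition concrete, here is a rough Python sketch of how I read the CONTINUE expression above. It is only a model, not how the startd evaluates ClassAds. The values for $(BackgroundLoad) and $(ContinueIdleTime) and the definition of $(CPUIdle) are not shown in this thread, so the ones below are assumptions, and the names SlotState, cpu_idle and should_continue are made up for illustration.

```python
from dataclasses import dataclass

MINUTE = 60

# Assumed values: these macros are not shown in the thread.
BACKGROUND_LOAD = 0.3             # assumed $(BackgroundLoad)
CONTINUE_IDLE_TIME = 5 * MINUTE   # assumed $(ContinueIdleTime)


@dataclass
class SlotState:
    load_avg: float         # LoadAvg: total load on the machine
    condor_load_avg: float  # CondorLoadAvg: load caused by HTCondor jobs
    keyboard_idle: int      # KeyboardIdle: seconds since last keyboard activity
    activity_timer: int     # seconds spent in the current activity


def cpu_idle(s: SlotState) -> bool:
    # Assumed definition of $(CPUIdle): the load that does not come
    # from HTCondor is at or below the background threshold.
    return (s.load_avg - s.condor_load_avg) <= BACKGROUND_LOAD


def should_continue(s: SlotState) -> bool:
    # Mirrors: $(CPUIdle) && ($(ActivityTimer) > 10)
    #          && (KeyboardIdle > $(ContinueIdleTime))
    return (
        cpu_idle(s)
        and s.activity_timer > 10
        and s.keyboard_idle > CONTINUE_IDLE_TIME
    )


# The user has been away for 10 minutes and the machine is quiet.
print(should_continue(SlotState(0.1, 0.0, 10 * MINUTE, 60)))
# Same, but some non-HTCondor process keeps the load high.
print(should_continue(SlotState(1.0, 0.0, 10 * MINUTE, 60)))
```

Under these assumptions, all three terms have to be true at the same time. So even after the user leaves, the job would stay suspended for as long as other processes keep the non-HTCondor load above the threshold.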


On Thu, Jan 16, 2014 at 11:29 AM, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:
On 1/16/2014 12:02 PM, Andrey Kuznetsov wrote:
Hi,

Here's the log file from a job that appears to be suspended, and I cannot
resume it.
Short of removing the job and resubmitting it, is there another way to
force it to restart or continue?


What happened here is that your job landed on a machine that is configured to suspend jobs when some condition becomes true (e.g., keyboard activity or an increased non-HTCondor load average) and then to unsuspend or restart the job after some amount of time. This sort of policy is common when running jobs on non-dedicated desktop machines.

As a user submitting jobs, if you never want your jobs to be suspended, your only recourse is to add a requirement to your submit file to avoid machines with such a policy (if there are any such machines in your pool).

If you are also the administrator of the machines in your pool, you could put
  SUSPEND =  FALSE
into your condor_config file...

Todd


001 (1321.003.000) 01/15 15:57:23 Job executing on host: <128.114.*.*:9944>
...
006 (1321.003.000) 01/15 15:57:32 Image size of job updated: 24704
     2  -  MemoryUsage of job (MB)
     1572  -  ResidentSetSize of job (KB)
...
010 (1321.003.000) 01/15 15:58:56 Job was suspended.
     Number of processes actually suspended: 2
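
One way to confirm from a user log like the one above that the job is still suspended is to look for a 010 (suspended) event with no later event that ends the suspension, such as 011 (unsuspended). Below is a minimal Python sketch, assuming the log format shown above. The function name is_suspended is made up for illustration, and treating events 001, 004, 005 and 009 as also ending a suspension is my simplification.

```python
import re

# Event header lines look like: "010 (1321.003.000) 01/15 15:58:56 ..."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")

SUSPENDED = 10
# Events after which the job is no longer suspended:
# 001 executing, 004 evicted, 005 terminated, 009 aborted, 011 unsuspended
ENDS_SUSPENSION = {1, 4, 5, 9, 11}


def is_suspended(log_text: str, cluster: int, proc: int) -> bool:
    """Return True if the last relevant event for cluster.proc is a suspend."""
    suspended = False
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # detail line or "..." separator
        code, c, p = (int(match.group(i)) for i in (1, 2, 3))
        if (c, p) != (cluster, proc):
            continue  # event belongs to another job
        if code == SUSPENDED:
            suspended = True
        elif code in ENDS_SUSPENSION:
            suspended = False
    return suspended


sample = """\
001 (1321.003.000) 01/15 15:57:23 Job executing on host: <128.114.*.*:9944>
...
010 (1321.003.000) 01/15 15:58:56 Job was suspended.
     Number of processes actually suspended: 2
"""
print(is_suspended(sample, 1321, 3))
```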



--
Andrey Kuznetsov <akuznet1@xxxxxxxx>