Re: [Condor-users] Suspend on Windows
- Date: Fri, 05 May 2006 14:26:06 -0400
- From: Jess Cannata <jac67@xxxxxxxxxxxxxx>
- Subject: Re: [Condor-users] Suspend on Windows
Thank you for your help.
I am using unmodified UWCS expressions. I will also look at my startd
I changed MaxSuspendTime back to 10 minutes, since SUSPEND on Windows
does not do what I would like it to do because of the lack of
check-pointing. I think I would like a suspended job to be
evicted and returned to the queue so it can be launched again on a free
machine. Having it restart on the same machine doesn't save me much
time, since only a couple of MB of files are transferred.
I am a bit unclear as to the purpose of the CONTINUE setting under
Windows. With no check-pointing, there is no such thing as continuing a
job, right? It would simply restart the job. Would this job restart show
up in the job's log file?
One more thing, I'm launching 12,000 to 20,000 jobs via one Linux submit
box (it is the portal to the Windows pool). It takes several hours to
submit all of the jobs. I've seen postings suggesting that one schedd
cannot reliably handle thousands of jobs. Should I plan on using more
than one schedd? I don't see anything about this in the manual. Do you
know of any sample configurations for building a pool with multiple
schedds but only one portal piece? Or should I just use a DNS
round-robin approach and have the web portal machine send submission
requests to various submission machines?
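
As a rough illustration of the round-robin idea above, the portal could split its job list across several submit machines before handing each batch over. This is only a sketch: the host names are hypothetical placeholders, and the actual submission step on each machine is left out.

```python
from itertools import cycle

# Hypothetical submit machines (assumption); replace with real schedd hosts.
SUBMIT_HOSTS = [
    "submit1.example.org",
    "submit2.example.org",
    "submit3.example.org",
]


def assign_round_robin(job_ids, hosts):
    """Return a dict mapping each host to the job ids assigned to it."""
    assignment = {host: [] for host in hosts}
    # cycle() repeats the host list endlessly, so job i goes to host i % len(hosts).
    for job_id, host in zip(job_ids, cycle(hosts)):
        assignment[host].append(job_id)
    return assignment


if __name__ == "__main__":
    plan = assign_round_robin(range(12000), SUBMIT_HOSTS)
    for host, jobs in plan.items():
        print(host, len(jobs))
```

With three hosts, 12,000 jobs come out as 4,000 per schedd. Whether that is a comfortable load for one schedd is not something this sketch answers.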
Dan Bradley wrote:
The StartLog on the machine where the job was running is a good place
to look to see what policy expressions were responsible for the fate of
a job. My guess is that your PREEMPT expression is evaluating to true
for some reason. Are you using modified UWCS expressions or are they
On May 4, 2006, at 10:33 PM, Jess Cannata wrote:
I am running Condor 6.6.10 on Windows on a set of lab machines. I am
seeing problems with jobs never finishing once they are suspended due to
someone physically using the computer. I believe that I have the
machines set to suspend the jobs, but keep the job on the machine so
they can continue when the machine returns to unclaimed and idle.
However, the suspended jobs never seem to unsuspend and continue
(even if they have to start from scratch). Instead, they get evicted
seconds after the job supposedly unsuspends. Is this how it should
I've included a snippet from my condor_config file and the job log
Any help would be appreciated.
StartIdleTime = 15 * $(MINUTE)
ContinueIdleTime = 5 * $(MINUTE)
MaxSuspendTime = 300 * $(MINUTE)
MaxVacateTime = 10 * $(MINUTE)
WANT_SUSPEND = TRUE
WANT_VACATE = FALSE
START = $(UWCS_START)
SUSPEND = $(UWCS_SUSPEND)
CONTINUE = $(UWCS_CONTINUE)
PREEMPT = $(UWCS_PREEMPT)
KILL = $(UWCS_KILL)
PERIODIC_CHECKPOINT = $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS = $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK = $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)
000 (619.000.000) 05/03 15:27:29 Job submitted from host:
001 (619.000.000) 05/04 09:46:54 Job executing on host:
010 (619.000.000) 05/04 09:53:08 Job was suspended.
Number of processes actually suspended: 1
006 (619.000.000) 05/04 09:53:08 Image size of job updated: 986304
011 (619.000.000) 05/04 09:53:10 Job was unsuspended.
004 (619.000.000) 05/04 09:53:11 Job was evicted.
(0) Job was not checkpointed.
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
0 - Run Bytes Sent By Job
4480328 - Run Bytes Received By Job
001 (619.000.000) 05/04 22:08:37 Job executing on host:
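
To see how quickly the eviction follows the unsuspend in a log like the one above, the event lines can be parsed and the gaps measured. The sketch below is a minimal example, not a Condor tool. The regular expression is based only on the lines shown here, and the year 2006 is an assumption, since the log timestamps carry no year.

```python
import re
from datetime import datetime

LOG_YEAR = 2006  # assumption: the log timestamps omit the year

# Matches event header lines such as:
# 010 (619.000.000) 05/04 09:53:08 Job was suspended.
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

SAMPLE_LOG = """\
000 (619.000.000) 05/03 15:27:29 Job submitted from host:
001 (619.000.000) 05/04 09:46:54 Job executing on host:
010 (619.000.000) 05/04 09:53:08 Job was suspended.
Number of processes actually suspended: 1
006 (619.000.000) 05/04 09:53:08 Image size of job updated: 986304
011 (619.000.000) 05/04 09:53:10 Job was unsuspended.
004 (619.000.000) 05/04 09:53:11 Job was evicted.
(0) Job was not checkpointed.
001 (619.000.000) 05/04 22:08:37 Job executing on host:
"""


def parse_events(text):
    """Return (event_code, timestamp, message) tuples for each event header line."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # detail line belonging to the previous event
        stamp = datetime.strptime(
            f"{LOG_YEAR} {match.group(5)}", "%Y %m/%d %H:%M:%S"
        )
        events.append((match.group(1), stamp, match.group(6)))
    return events


def seconds_between(events, first_code, second_code):
    """Seconds from the first first_code event to the next second_code event."""
    start = None
    for code, when, _ in events:
        if start is None and code == first_code:
            start = when
        elif start is not None and code == second_code:
            return (when - start).total_seconds()
    return None


if __name__ == "__main__":
    events = parse_events(SAMPLE_LOG)
    print("suspended -> unsuspended:", seconds_between(events, "010", "011"))
    print("unsuspended -> evicted:", seconds_between(events, "011", "004"))
```

For this log, the job stays suspended for 2 seconds and is evicted 1 second after it is unsuspended.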
Condor-users mailing list