Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [Condor-users] Suspend on Windows

Date: Fri, 05 May 2006 14:26:06 -0400
From: Jess Cannata <jac67@xxxxxxxxxxxxxx>
Subject: Re: [Condor-users] Suspend on Windows

Dan,

Thank you for your help.

I am using unmodified UWCS expressions. I will also look at my StartLog files.

I changed MaxSuspendTime back to 10 minutes, since SUSPEND on Windows does not seem to do what I would like it to do due to the lack of check-pointing. I think that I would like a suspended job to be evicted and returned to the queue so it can be launched again on a free machine. Having it restart on the same machine doesn't save me much time, since only a couple of MB of files are transferred.

I am a bit unclear as to the purpose of the CONTINUE setting under Windows. With no check-pointing, there is no such thing as continuing a job, right? It would simply restart the job. Would this job restart show up in the job's log file?

One more thing: I'm launching 12,000 to 20,000 jobs via one Linux submit box (it is the portal to the Windows pool). It takes several hours to submit all of the jobs. I've seen postings suggesting that one schedd cannot reliably handle thousands of jobs. Should I plan on using more than one schedd? I don't see anything about this in the manual. Do you know of any sample configurations for building a pool with multiple schedds but only one portal piece? Or should I just use a DNS round-robin approach and have the web portal machine send submission requests to various submission machines?
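To make the round-robin idea concrete, here is a minimal sketch of dealing a batch of jobs out across several submit machines in turn. The host names and the `assign_round_robin` helper are hypothetical, and the sketch only plans the split; it does not talk to Condor or to any schedd.

```python
from itertools import cycle

# Hypothetical submit machines, each assumed to run its own schedd.
SUBMIT_HOSTS = [
    "submit1.example.edu",
    "submit2.example.edu",
    "submit3.example.edu",
]


def assign_round_robin(job_ids, hosts):
    """Map each host to the list of jobs it should receive, dealt out in turn."""
    assignment = {host: [] for host in hosts}
    for job_id, host in zip(job_ids, cycle(hosts)):
        assignment[host].append(job_id)
    return assignment


# Plan the split for a 12,000-job batch (the low end of the range above).
plan = assign_round_robin(range(12000), SUBMIT_HOSTS)
for host, jobs in plan.items():
    print(host, len(jobs))
```

With three hosts, each one ends up with a third of the batch, so no single schedd would hold the whole queue.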


Jess

Dan Bradley wrote:

The StartLog on the machine where the job was running is a good place to look to see what policy expressions were responsible for the fate of a job. My guess is that your PREEMPT expression is evaluating to true for some reason. Are you using modified UWCS expressions or are they unchanged?


--Dan

On May 4, 2006, at 10:33 PM, Jess Cannata wrote:

I am running Condor 6.6.10 on Windows on a set of lab machines. I am seeing problems with jobs never finishing once they are suspended due to someone physically using the computer. I believe that I have the execute machines set to suspend the jobs, but keep the job on the machine so they can continue when the machine returns to unclaimed and idle. However, the suspended jobs never seem to unsuspend and continue working (even if they have to start from scratch). Instead, they get evicted seconds after the job supposedly unsuspends. Is this how it should work? I've included a snippet from my condor_config file and the job log file.

Any help would be appreciated.

Jess Cannata

condor_config

StartIdleTime		= 15 * $(MINUTE)
ContinueIdleTime	=  5 * $(MINUTE)
MaxSuspendTime		= 300 * $(MINUTE)
MaxVacateTime		= 10 * $(MINUTE)

WANT_SUSPEND = TRUE
WANT_VACATE = FALSE
START			= $(UWCS_START)
SUSPEND			= $(UWCS_SUSPEND)
CONTINUE		= $(UWCS_CONTINUE)
PREEMPT			= $(UWCS_PREEMPT)
KILL			= $(UWCS_KILL)
PERIODIC_CHECKPOINT	= $(UWCS_PERIODIC_CHECKPOINT)
PREEMPTION_REQUIREMENTS	= $(UWCS_PREEMPTION_REQUIREMENTS)
PREEMPTION_RANK		= $(UWCS_PREEMPTION_RANK)
NEGOTIATOR_PRE_JOB_RANK = $(UWCS_NEGOTIATOR_PRE_JOB_RANK)
NEGOTIATOR_POST_JOB_RANK = $(UWCS_NEGOTIATOR_POST_JOB_RANK)


log file

000 (619.000.000) 05/03 15:27:29 Job submitted from host: <141.161.x.156:17835>
...
001 (619.000.000) 05/04 09:46:54 Job executing on host: <141.161.x.246:1217>
...
010 (619.000.000) 05/04 09:53:08 Job was suspended.
        Number of processes actually suspended: 1
...
006 (619.000.000) 05/04 09:53:08 Image size of job updated: 986304
...
011 (619.000.000) 05/04 09:53:10 Job was unsuspended.
...
004 (619.000.000) 05/04 09:53:11 Job was evicted.
        (0) Job was not checkpointed.
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        0  -  Run Bytes Sent By Job
        4480328  -  Run Bytes Received By Job
...
001 (619.000.000) 05/04 22:08:37 Job executing on host: <141.161.x.233:1223>
...
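The pattern in this log (suspended, unsuspended, then evicted a second later) can be picked out of a longer job log automatically. Below is a minimal sketch that assumes event header lines look exactly like the ones in the excerpt above; the function names, the 5-second threshold, and the assumed year are illustrative choices, not part of Condor.

```python
import re
from datetime import datetime

# Event header as seen in the excerpt:
# "<3-digit code> (<cluster>.<proc>.<subproc>) MM/DD HH:MM:SS <message>"
EVENT_RE = re.compile(
    r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\) (\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$"
)

SAMPLE = """\
010 (619.000.000) 05/04 09:53:08 Job was suspended.
        Number of processes actually suspended: 1
...
011 (619.000.000) 05/04 09:53:10 Job was unsuspended.
...
004 (619.000.000) 05/04 09:53:11 Job was evicted.
        (0) Job was not checkpointed.
...
"""


def parse_events(log_text, year=2006):
    """Return (code, job_id, timestamp, message) for each event header line."""
    events = []
    for line in log_text.splitlines():
        match = EVENT_RE.match(line)
        if match is None:
            continue  # indented detail lines and "..." separators
        code, cluster, proc, _subproc, stamp, message = match.groups()
        # The log lines carry no year, so one is assumed here.
        when = datetime.strptime(f"{year}/{stamp}", "%Y/%m/%d %H:%M:%S")
        events.append((code, f"{cluster}.{proc}", when, message))
    return events


def quick_evictions(events, max_gap_seconds=5):
    """List (job_id, gap) where event 004 (evicted) follows 011 (unsuspended) closely."""
    hits = []
    last_unsuspend = {}
    for code, job_id, when, _message in events:
        if code == "011":
            last_unsuspend[job_id] = when
        elif code == "004" and job_id in last_unsuspend:
            gap = (when - last_unsuspend.pop(job_id)).total_seconds()
            if gap <= max_gap_seconds:
                hits.append((job_id, gap))
    return hits


print(quick_evictions(parse_events(SAMPLE)))  # [('619.000', 1.0)]
```

Running the same two functions over a whole job log would show whether every suspended job is being evicted right after it unsuspends, or only some of them.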
_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
https://lists.cs.wisc.edu/mailman/listinfo/condor-users



