Re: [Condor-users] New user question: Eviction of long jobs
- Date: Sun, 18 Apr 2010 09:29:52 -0500 (CDT)
- From: Laura Balzano <sunbeam@xxxxxxxxxxxx>
- Subject: Re: [Condor-users] New user question: Eviction of long jobs
Thank you Alan, this was very helpful. In the end my job ran in 7 hours,
55 minutes, and the overall sum of time from all the runs was 9 hours and
10 minutes-- not too bad.
I only wanted to offload such jobs away from my machine, so this works
just fine. If I had much longer experiments, do you have a suggested
experiment duration (say > 24 hours?) for which I should follow your
suggestion of getting dedicated CAE resources?
Alan De Smet wrote: --------------------------------------
Laura Balzano <sunbeam@xxxxxxxxxxxx> wrote:
Hi all, I have just started using Condor for the first time. I submitted
a matlab job that is going to take a long while (I estimate 8 hours). I
saw that it already got evicted four times. I believe that means it had
to start over on another machine, is that correct?
You understand correctly.
Is there any way to tell condor to save my job for a machine/a time when
it can run to completion?
Generally speaking, no. A particular installation might be able to make
commitments, but exactly how you would access those resources would depend
on your local configuration. I'm guessing you're working on UW-Madison
CAE resources, and I would recommend asking the CAE if they can help. I
don't actually know their specific policies, but a common solution is to
have some dedicated computers that only run Condor jobs and generally
won't evict them. Those computers would be marked somehow, perhaps with a
setting like "IsCluster=TRUE". Then in your submit file you might put
something like "Requirements = IsCluster". But the specifics will depend
on the CAE's configuration, and I don't know if they offer such a service.
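To make the idea concrete, here is a minimal sketch of both halves, assuming the site really does advertise an attribute called IsCluster (the attribute name comes from the example above; the file names, the executable, and whether the CAE does anything like this are all assumptions). The first fragment is what an administrator might put in a dedicated machine's local Condor configuration; the second is a submit file that asks for such machines only.

```
# Machine side (local Condor config on a dedicated node) -- hypothetical
IsCluster = TRUE
STARTD_ATTRS = $(STARTD_ATTRS) IsCluster

# User side (submit description file) -- names below are placeholders
universe     = vanilla
executable   = run_experiment.sh
output       = experiment.out
error        = experiment.err
log          = experiment.log
# =?= is false (not undefined) on machines that do not define IsCluster
requirements = (IsCluster =?= TRUE)
queue
```

If no machine in the pool advertises the attribute, a job submitted this way will simply stay idle, so it is worth confirming the attribute name with the local administrators first.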
I'm guessing the machines in question are more idle at night, so if you
haven't already, you might try leaving it alone overnight. I would expect
(but can't promise) that eventually you'll land on a machine that's
unoccupied and it will finish.
If none of the above helps, we may have additional Condor computing
options for UW-Madison users; contact us at condor-admin@xxxxxxxxxxx
As a complex option, you might be able to use DMTCP
(http://dmtcp.sourceforge.net/) to checkpoint your job at regular
intervals, allowing it to restart when interrupted. Setting this up is
fiddly, but possible.
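DMTCP checkpoints a whole process transparently; the sketch below is not DMTCP but a plain application-level version of the same idea, shown in Python only to illustrate it: save progress at regular intervals, and on restart resume from the last saved state rather than from zero. The file name, the function names, and the toy "sum the integers" workload are all hypothetical; the stop_after parameter merely simulates an eviction.

```python
import json
import os
import tempfile

CHECKPOINT = "state.json"  # hypothetical checkpoint file name


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the checkpoint via a temp file and rename, so an eviction
    in the middle of a write cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, every=100, stop_after=None):
    """Sum 0..n_steps-1, checkpointing every `every` steps.

    stop_after simulates an eviction after that many steps in this run;
    in that case the function returns None without finishing.
    """
    state = load_state(path)
    done_this_run = 0
    while state["next_step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # simulated eviction
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]


if __name__ == "__main__":
    print(run(CHECKPOINT, 1000))
```

Work done since the last checkpoint is still lost on eviction, so the interval trades checkpoint overhead against redone work. Under Condor the checkpoint file also has to reach the next machine, for example through a shared filesystem or by setting when_to_transfer_output = ON_EXIT_OR_EVICT in the submit file; which of these applies depends on the site.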
Also, is there any one place where documentation on these various things
can be found?
Our manual is here: http://www.cs.wisc.edu/condor/manual/v7.4/
Unfortunately what you really need to know is about the CAE site specific
configuration. I don't know how the CAE manages or documents that.
Alan De Smet Condor Project Research