
Re: [Condor-users] New user question: Eviction of long jobs



Laura Balzano <sunbeam@xxxxxxxxxxxx> wrote:
> Hi all, I have just started using Condor for the first time. I submitted a 
> matlab job that is going to take a long while (I estimate 8 hours). I saw 
> that it already got evicted four times. I believe that means it had to 
> start over on another machine, is that correct?

You understand correctly.

> Is there any way to tell condor to save my job for a machine/a
> time when it can run to completion?

Generally speaking, no.  A particular installation might be able
to make commitments, but exactly how you would access those
resources would depend on your local configuration.  I'm guessing
you're working on UW-Madison CAE resources, and I would recommend
asking the CAE if they can help.  I don't actually know their
specific policies, but a common solution is to have some
dedicated computers that only run Condor jobs and generally won't
evict them.  Those computers would be marked somehow, perhaps
with a setting like "IsCluster=TRUE".  Then in your submit file
you might put something like "Requirements = IsCluster".  But the
specifics will depend on the CAE's configuration, and I don't
know if they offer such a service. 
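
To make that concrete, here is a rough sketch of what such a submit
file might look like.  Everything site-specific in it is an
assumption: "IsCluster" is only the example attribute name from
above, and "run_matlab_job.sh" is a hypothetical wrapper script
that starts your MATLAB job.  Check with the CAE for the real
attribute name, if one exists.

```
# Hypothetical submit description file.  "IsCluster" is an
# illustrative attribute name, not a standard Condor attribute.
universe     = vanilla
# Hypothetical wrapper script that launches the MATLAB job.
executable   = run_matlab_job.sh
output       = job.out
error        = job.err
log          = job.log
# Only match machines that advertise IsCluster as TRUE.
requirements = (IsCluster == TRUE)
queue
```

If the attribute does exist, something like
condor_status -constraint 'IsCluster == TRUE'
should list the machines that advertise it.  If that prints
nothing, a job with this requirement will sit idle and never
start.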

I'm guessing the machines in question are more idle at night, so
if you haven't already, you might try leaving it alone overnight.
I would expect (but can't promise) that eventually you'll land on
a machine that's unoccupied and it will finish.

If none of the above helps, we may have additional Condor
computing options for UW-Madison users; contact us at
condor-admin@xxxxxxxxxxx

As a more complex option, you might be able to use DMTCP
(http://dmtcp.sourceforge.net/) to checkpoint your job at regular
intervals, allowing it to resume from the last checkpoint when
interrupted instead of starting over.  Setting this up is fiddly,
but possible.

> Also is there any one place where documentation on these
> various things resides?

Our manual is here: http://www.cs.wisc.edu/condor/manual/v7.4/
Unfortunately, what you really need to know about is the CAE's
site-specific configuration.  I don't know how the CAE manages or
documents that configuration.

-- 
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                http://www.cs.wisc.edu/condor/