Re: [Condor-users] New user question: Eviction of long jobs



Thank you, Alan, this was very helpful. In the end my job ran in 7 hours, 55 minutes, and the total time across all the runs was 9 hours and 10 minutes, which is not too bad.

I only wanted to offload such jobs from my own machine, so this works just fine. If I had much longer experiments, is there a duration (say, more than 24 hours) beyond which I should follow your suggestion of asking about dedicated CAE resources?

Thanks again,
Laura



Alan De Smet wrote: --------------------------------------

Laura Balzano <sunbeam@xxxxxxxxxxxx> wrote:
Hi all, I have just started using Condor for the first time. I submitted a matlab job that is going to take a long while (I estimate 8 hours). I saw that it already got evicted four times. I believe that means it had to start over on another machine, is that correct?

You understand correctly.

Is there any way to tell condor to save my job for a machine/a time when it can run to completion?

Generally speaking, no. A particular installation might be able to make commitments, but exactly how you would access those resources would depend on your local configuration. I'm guessing you're working on UW-Madison CAE resources, and I would recommend asking the CAE if they can help. I don't actually know their specific policies, but a common solution is to have some dedicated computers that only run Condor jobs and generally won't evict them. Those computers would be marked somehow, perhaps with a configuration setting like "IsCluster = TRUE". Then in your submit file you might put something like "Requirements = (IsCluster == TRUE)". But the specifics will depend on the CAE's configuration, and I don't know if they offer such a service.
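
As a rough sketch of that idea only: the attribute name IsCluster and the file names below are made up for illustration, and the CAE may use a different attribute or none at all. The two pieces might look something like this:

```
# On each dedicated machine, in its local Condor configuration.
# IsCluster is a hypothetical attribute name; your site may use another.
IsCluster = TRUE
STARTD_ATTRS = $(STARTD_ATTRS), IsCluster

# In your submit description file. All file names are placeholders.
universe     = vanilla
executable   = run_matlab_job.sh
requirements = (IsCluster == TRUE)
output       = job.out
error        = job.err
log          = job.log
queue
```

The first part has to be set by whoever administers the machines; only the second part is something you would write yourself. With a requirement like this, the job stays idle in the queue until a machine advertising that attribute is available, so it is only useful if such machines actually exist in your pool.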

I'm guessing the machines in question are more idle at night, so if you haven't already, you might try leaving it alone overnight. I would expect (but can't promise) that eventually you'll land on a machine that's unoccupied and it will finish.

If none of the above helps, we may have additional Condor computing options for UW-Madison users; contact us at condor-admin@xxxxxxxxxxx

As a more complex option, you might be able to use DMTCP (http://dmtcp.sourceforge.net/) to checkpoint your job at regular intervals, allowing it to resume from the last checkpoint when interrupted instead of starting over. Setting this up is fiddly, but possible.

Also is there any one place where documentation on these various things resides?

Our manual is here: http://www.cs.wisc.edu/condor/manual/v7.4/ Unfortunately what you really need to know is about the CAE site specific configuration. I don't know how the CAE manages or documents that configuration.

--
Alan De Smet                              Condor Project Research
adesmet@xxxxxxxxxxx                http://www.cs.wisc.edu/condor/