
Re: [Condor-users] Running long jobs




On Wed, 30 Nov 2005, Erik Paulson wrote:

On Wed, Nov 30, 2005 at 07:28:40PM +0100, Daniel R Figueiredo wrote:
Hello,

Thanks for your reply. Unfortunately, I am a user of a Condor cluster (and
not the administrator). From what I understood, the condor config file you
sent below is meant to configure the behavior of the entire cluster.
Although I could ask the system administrator to alter the
configuration, I still wonder whether a user can do something about this
problem. Can a user request that a job be suspended instead of terminated?
Any thoughts are welcome.

No, they can't. The administrator has to configure Condor to do this.
Even suspended, the job uses some resources on the execute machine (if
nothing else, the disk space used to transfer it there) so the blessing
must come from the administrator.

In our department, we set up special VMs that suspend the running job
when a job starts running on another VM on the same machine. You need to
use the 6.7 series to enable it, but the details are in the report
'Condor and The Bologna Batch System':

http://www.cs.wisc.edu/condor/technical.html
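
For illustration only, here is a rough, untested sketch of the kind of
startd policy described above, for a machine that already advertises two
VMs. It assumes the 6.7-series STARTD_VM_EXPRS macro and the
VirtualMachineID attribute. The choice of vm2 as the VM for long jobs is
arbitrary, and the START expressions that would steer long jobs onto vm2
are not shown. Check it against the report and the manual for the
installed version before use:

```
# Assumption: share each VM's State with the other VM's ClassAd,
# so that vm2 can see vm1_State.
STARTD_VM_EXPRS = State

# Suspend rather than evict.
WANT_SUSPEND = True

# Assumption: vm2 holds the long job. Suspend it while vm1 is claimed.
SUSPEND  = (VirtualMachineID == 2) && (vm1_State == "Claimed")

# Resume the long job once vm1 is no longer claimed.
CONTINUE = (VirtualMachineID == 2) && (vm1_State != "Claimed")

# Never escalate from suspension to eviction under this policy.
PREEMPT = False
KILL    = False
```

A suspended job stays on the execute machine, so when CONTINUE becomes
true it carries on from where it stopped. Other settings in the pool
(for example, priority-based preemption by the negotiator) may still
evict the job, so this fragment is probably not sufficient on its own.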

Hi,

Thanks for your message. It's now clear that I'll need support from the Condor administrator. However, I looked through the report "Condor and The Bologna Batch System" as you suggested, and it was not clear to me how to configure Condor to run long jobs with preemption implemented via suspension (as opposed to preemption via termination). In particular, what is the minimal set of configuration settings that must be changed to achieve this? Recall that I would like long jobs to be suspended when preempted (rather than terminated through a signal) and to resume later from where they stopped (rather than restarting from the beginning). Any ideas on how to do this? I could then suggest something concrete to our local Condor administrator.

Thanks,
Daniel