
Re: [Condor-users] Multiple job queues



On Saturday, 28 January, 2012 at 12:18 AM, Raman Sehgal wrote:
I still have some doubts.
    I don't want to use the MaxRunHours parameter in the submit file, because
that gives control to the users. I want this control to stay with Condor.
Well, one option is to make the default for MaxRunHours one hour and let people know how to override it. That way only the power users who need longer runs have to learn the extra setting. Condor is less about rule enforcement than it is about just getting work through a system.

Based on how you've phrased your question, I'd guess you were an LSF user in the past? The queues-to-enforce-policy approach doesn't work well in Condor. You'll find it can quickly fall apart (though the specific case you're trying to solve now is okay).
        So is there any way to resolve this issue? I don't know why a job
stays in the running state for several days when it should actually
complete in 8-9 hours.
Without knowing more about your jobs it's impossible to answer this. Log in to the machine where the job is running and start inspecting the process tree. It should look something like:

condor_master
 - condor_startd
   - condor_starter
     - condor_exe (which is what you told Condor your cmd is in the submit file)
       - all the sub-processes from your job…

Could be your job is hung waiting on user input. Or it can't find input data and it doesn't deal with that situation gracefully. There are just too many reasons to say. If you post more specific information about your processes we might be able to give you a better answer on this.     
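
To make the inspection described above a bit more concrete, here is a minimal Python sketch that prints the process tree under each condor_master on a machine. It assumes a Linux execute node where `ps -eo pid,ppid,args` is available. The function names (parse_ps, subtree, condor_trees, live_condor_trees) are made up for this illustration and are not part of Condor.

```python
import subprocess
from collections import defaultdict


def parse_ps(text):
    """Parse 'pid ppid args' lines into {pid: (ppid, command)}."""
    procs = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip the header row and malformed lines
        procs[int(parts[0])] = (int(parts[1]), parts[2].rstrip())
    return procs


def subtree(procs, root_pid):
    """Return indented 'pid command' lines for root_pid and its descendants."""
    children = defaultdict(list)
    for pid, (ppid, _) in procs.items():
        children[ppid].append(pid)

    lines = []

    def walk(pid, depth):
        lines.append("  " * depth + f"{pid} {procs[pid][1]}")
        for child in sorted(children[pid]):
            walk(child, depth + 1)

    walk(root_pid, 0)
    return lines


def condor_trees(procs):
    """Render one tree for every condor_master process found."""
    out = []
    for pid, (_, cmd) in sorted(procs.items()):
        if cmd.split()[0].endswith("condor_master"):
            out.extend(subtree(procs, pid))
    return out


def live_condor_trees():
    """Run ps on this machine and return the rendered trees (Linux assumed)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,args"],
        capture_output=True, text=True, check=True,
    )
    return condor_trees(parse_ps(result.stdout))
```

Calling `print("\n".join(live_condor_trees()))` on the execute machine shows what sits under condor_starter. From there, tools such as strace or lsof on the job's PIDs can hint at whether it is blocked on input or a missing file.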
         One more thing I would like to know: is there any way to specify
a time parameter in the submit file, maybe MaxRunHours, after which the
particular job gets resubmitted automatically? I hope there is something for
this in Condor that I am missing, or that there are helper scripts which
check the job status against MaxRunHours and then resubmit the job. It would
be very helpful to me.
Yes, see SYSTEM_PERIODIC_HOLD: http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#18746
And SYSTEM_PERIODIC_RELEASE: http://research.cs.wisc.edu/condor/manual/v7.6/3_3Configuration.html#18757

You want to hold the job if it's been running for longer than MaxRunHours allows:

SYSTEM_PERIODIC_HOLD = JobStatus == 2 && (CurrentTime - EnteredCurrentStatus > 3600*MaxRunHours) 

And then release it if it's on hold and has started running fewer than five times:

SYSTEM_PERIODIC_RELEASE = JobStatus == 5 && JobRunCount < 5

But remove it if it goes through this cycle too many times:

SYSTEM_PERIODIC_REMOVE = JobStatus == 5 && JobRunCount >= 5
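
To see how these three expressions interact, here is a rough Python sketch that mimics them against a toy job ad. It is only a model, not how Condor evaluates ClassAds: the `simulate` function, the one-minute time step, and the assumption that a released job restarts from scratch are all simplifications made for this illustration. A job ad with no MaxRunHours attribute is modeled as never being held, on the assumption that an undefined attribute keeps the hold expression from evaluating to true.

```python
IDLE, RUNNING, HELD = 1, 2, 5  # JobStatus codes; 2 and 5 appear in the expressions above
MAX_RUNS = 5                   # threshold used in the release/remove expressions


def periodic_hold(ad, now):
    """Mimics SYSTEM_PERIODIC_HOLD for a dict-based job ad."""
    limit = ad.get("MaxRunHours")
    if limit is None:
        return False  # simplification: undefined attribute, so never held
    return (ad["JobStatus"] == RUNNING
            and now - ad["EnteredCurrentStatus"] > 3600 * limit)


def periodic_release(ad):
    """Mimics SYSTEM_PERIODIC_RELEASE."""
    return ad["JobStatus"] == HELD and ad["JobRunCount"] < MAX_RUNS


def periodic_remove(ad):
    """Mimics SYSTEM_PERIODIC_REMOVE."""
    return ad["JobStatus"] == HELD and ad["JobRunCount"] >= MAX_RUNS


def simulate(max_run_hours, job_hours):
    """Run a job that needs job_hours per attempt; return (runs, outcome)."""
    ad = {"JobStatus": IDLE, "JobRunCount": 0,
          "EnteredCurrentStatus": 0, "MaxRunHours": max_run_hours}
    now = 0
    while True:
        # The job starts (or restarts) running.
        ad.update(JobStatus=RUNNING,
                  JobRunCount=ad["JobRunCount"] + 1,
                  EnteredCurrentStatus=now)
        finish = now + 3600 * job_hours
        # Advance one minute at a time until the job finishes or is held.
        while now < finish and not periodic_hold(ad, now):
            now += 60
        if not periodic_hold(ad, now):
            return ad["JobRunCount"], "completed"
        ad.update(JobStatus=HELD, EnteredCurrentStatus=now)
        if periodic_remove(ad):
            return ad["JobRunCount"], "removed"
        if periodic_release(ad):
            ad["JobStatus"] = IDLE  # back in the queue for another attempt
```

In this model, `simulate(1, 9)` cycles through hold and release and ends in removal on the fifth run, while `simulate(10, 9)` completes on the first run. A job that is genuinely hung will most likely hang again on every restart, so the remove rule is what stops the loop.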

Regards,
- Ian

---
Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools

http://www.cyclecomputing.com
http://www.cyclecloud.com
http://twitter.com/cyclecomputing