
Re: [Condor-users] Job rescheduling



Matthew Farrellee wrote:


If you're seeing a 2 hour timeout that sounds fairly familiar. I
believe Todd answered it previously. I'd assume his answer was to
reverse the direction on the alive messages. I'll ping him to include
details.


Here is what we think can be done with the current Condor binaries to address the problem:

In the condor_config on **both** the submit machines (running the
condor_schedds) AND the execute machines (running the condor_startds),
add the following setting:

    STARTD_SENDS_ALIVES = True

Then do a condor_reconfig as usual on both the submit and execute machines (or a condor_reconfig -all). Note that the default for this parameter is False, so if it is not specified in the config it is False. Unfortunately, Condor will not (yet) gracefully handle the situation where the value differs between the submit and execute machines.

Upon doing the above, your job ClassAds will contain an attribute "LastJobLeaseRenewal", an integer giving the epoch time (number of seconds since 1/1/1970) at which the schedd last heard from the startd on the execute machine.
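
If you want to eyeball that value, here is a small Python sketch (not part of Condor) that turns such an epoch integer into a readable UTC timestamp and an age in seconds. The function names are made up for illustration, and it assumes you have already pulled the LastJobLeaseRenewal value out of the job ClassAd by whatever means you prefer.

```python
import time
from datetime import datetime, timezone
from typing import Optional


def renewal_time_utc(last_job_lease_renewal: int) -> str:
    """Format an epoch-seconds value (e.g. LastJobLeaseRenewal) as ISO 8601 UTC."""
    return datetime.fromtimestamp(last_job_lease_renewal, tz=timezone.utc).isoformat()


def seconds_since_renewal(last_job_lease_renewal: int, now: Optional[int] = None) -> int:
    """Seconds elapsed since the last lease renewal; 'now' defaults to the current time."""
    if now is None:
        now = int(time.time())
    return now - last_job_lease_renewal
```

A large and growing age here would suggest the alive messages are no longer arriving.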

So in your job submit description file (which you give to condor_submit), you could add the following:

  PeriodicHold = JobLeaseDuration =!= UNDEFINED &&  \
      ((JobLeaseDuration - (CurrentTime - LastJobLeaseRenewal)) <= 0 )
  PeriodicRelease = PeriodicHold =?= True

The above says that if the job has a job lease and the lease has expired, put the job on hold, thereby moving it from the Running state to the Hold state. The periodic release expression then says that if the lease is expired (ergo the PeriodicHold expression is true), release the job from the Hold state back to the Idle state -- at which point it will be rescheduled someplace else. Note you can use SUBMIT_EXPRS (see the Manual) to have condor_submit automatically add the above policy to every job submitted.
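
To make the arithmetic in the hold expression concrete, here is a rough Python sketch of the same test. It is only an illustration, not how Condor evaluates ClassAds: the function name is hypothetical, None stands in for UNDEFINED, and it assumes LastJobLeaseRenewal is always defined.

```python
from typing import Optional


def lease_expired(job_lease_duration: Optional[int],
                  last_job_lease_renewal: int,
                  current_time: int) -> bool:
    """Mirror of the PeriodicHold expression above (illustrative only)."""
    # JobLeaseDuration =!= UNDEFINED: a job without a lease is never held.
    if job_lease_duration is None:
        return False
    # Seconds of lease remaining; zero or negative means the lease has expired.
    remaining = job_lease_duration - (current_time - last_job_lease_renewal)
    return remaining <= 0
```

For example, with a 1200 second lease last renewed at time 1000, the test is still false at time 2199 and becomes true at time 2200.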

Let us know how the above suggestions go.

In a future release of Condor, we wish to do the following:
  a) make STARTD_SENDS_ALIVES default to True
  b) have the schedd automatically move a job with an expired lease from Running back to Idle the moment the lease expires. This would remove the need for the periodic hold/release expressions, and with it the polling delay they introduce (the schedd only evaluates the periodic expressions periodically).


--
Todd Tannenbaum                       University of Wisconsin-Madison
Condor Project Research               Department of Computer Sciences
tannenba@xxxxxxxxxxx                  1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                 Madison, WI 53706-1685