
[Condor-users] determining if a job has failed to get a license



Hello -

We've been working on a couple of different things that will help in 
only running a job if there is a license available for it to run with.

Regardless of what the final solution looks like, it's still probably
going to be possible that Condor will make a mistake and run a job 
that it won't have a license for - so the first thing we want to have
in place is a way to make sure we can clean up any mess that we might
create. Once we're sure we can clean up, then we can move on to making 
the mess. 

Figuring out whether a job exited because it was denied a license can
be tricky. Depending on the job, you may not be able to depend on the
job exit code to mean anything if there was a license failure. You
can't always use a post script to read the output and look for
"license denied" errors, and even if you do you have to change that
script for every different type of job.

Another place that has information about denied licenses is the log
of FLEXlm itself. We've written a simple server that allows you to ask
if a machine was denied a license in some time interval. The simple
usage scenario we see for this is:

  1. Note the job start time.
  2. Run the job.
  3. Note the end time.
  4. Connect to the FLEXlm monitor and ask, "did you deny a license to
     myusername@mycurrenthost between starttime and endtime?"
  5. If yes, exit with a well-known status, and have Condor requeue this
     job. If no, exit with a regular status, and have Condor remove this
     job from the queue as normal.
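
As a rough illustration of the steps above (not the scripts linked below),
here is a minimal Python sketch of such a wrapper. The names
run_with_license_check, was_denied, and REQUEUE_STATUS are hypothetical, and
the query to the FLEXlm monitor is passed in as a callable because its wire
protocol isn't described in this message.

```python
import subprocess
import sys
import time

# Hypothetical "well-known" status; pick a value your jobs never return.
REQUEUE_STATUS = 85


def run_with_license_check(cmd, was_denied, clock=time.time):
    """Run cmd, then ask was_denied(start, end) about the run interval.

    was_denied stands in for the query to the FLEXlm monitor (assumed
    interface: two timestamps in, True/False out).
    """
    start = clock()               # note the job start time
    result = subprocess.run(cmd)  # run the job
    end = clock()                 # note the end time
    if was_denied(start, end):
        return REQUEUE_STATUS     # tell Condor to requeue
    return result.returncode      # otherwise pass the job's own status through


if __name__ == "__main__" and len(sys.argv) > 1:
    # Placeholder query that never reports a denial; replace with a real one.
    sys.exit(run_with_license_check(sys.argv[1:], lambda start, end: False))
```

On the Condor side, a submit-file expression along the lines of
on_exit_remove = (ExitCode =!= 85) is one way to keep a job that exits with
that status in the queue, but check it against your Condor version.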

(For simplicity, we're assuming mostly-synchronized clocks - NTP is
pretty universal now, but we could do some sort of NTP-like thing if 
we needed to)

It's not perfect, but most of the problems come from it being too
conservative. On an SMP machine, the job that was denied might have been
the one on the other processor, and there's no way to correlate a specific
job with anything FLEXlm is tracking. FLEXlm also only writes to its
logfiles periodically (it seems to write every 15 or 20 seconds), so you
have to wait some "slop time" before connecting to the FLEXlm monitor and
asking it a question.
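
To make the conservative matching and the slop time concrete, here is a small
Python sketch. The Denial record, the function names, and the 20-second
default are assumptions for illustration only; they are not taken from the
scripts or from FLEXlm's actual log format.

```python
import time
from collections import namedtuple

# Hypothetical record for one denial already parsed out of the FLEXlm log.
Denial = namedtuple("Denial", "timestamp user host")

SLOP_SECONDS = 20  # assumed upper bound on how long FLEXlm takes to flush


def wait_for_flush(end_time, slop=SLOP_SECONDS, now=time.time, sleep=time.sleep):
    """Sleep until at least `slop` seconds have passed since end_time.

    Returns the number of seconds slept (0 if enough time already passed).
    """
    remaining = end_time + slop - now()
    if remaining > 0:
        sleep(remaining)
    return max(remaining, 0)


def denied_in_window(denials, user, host, start, end):
    """True if any denial for user@host falls within [start, end].

    Conservative: on an SMP machine this also matches a denial that hit a
    different job run by the same user on the same host.
    """
    return any(
        d.user == user and d.host == host and start <= d.timestamp <= end
        for d in denials
    )
```

A caller would run wait_for_flush(end) first and only then ask the monitor,
so that a denial logged just before the job exited is not missed.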

The scripts are available here:
http://www.cs.wisc.edu/~epaulson/license/

Feel free to use them or modify them for whatever - if you add anything
fun, please consider sending me the changes. (I also don't think much
of my Perl, so I apologize in advance)

Thanks,

-Erik