
Re: [Condor-users] debugging STARTD_CRON jobs



> I am using the STARTD_CRON facility to keep track of how many
> licenses are available for my software. I have a routine "fad"
> that produces the following output to stdout
>
>       software1_available = 5
>       software2_available = 0
>       ...
>
>       then in my ClassAd I have the line
>
>       Requirements = ( software1_available > 0 )
>
>
>       1. Is this the correct way to do this?

This is a bit like Perl programming: there's more than one way to do it.
:)

It'll definitely reduce your number of failed tool starts due to
out-of-license errors. But will it stop them altogether? I don't think
so. There's a propagation delay from when you update a startd's classad
to when it's seen at the collector and used by the negotiator to
determine if a machine can run a job. Someone from the Condor team can
chime in here and let us know if a job's requirements are also evaluated
directly against a machine's classad before the job is actually kicked
off on the machine (so after negotiation and assignment, but before the
starter daemon is forked to run the actual job I guess). If that's the
case then this is pretty good. You'll obviously not catch all changes
because the cron job is only polling the license information, but it's not
bad (especially if your jobs run for a long time and you've got a small
number of licenses). If you can figure out how to requeue jobs when they
fail to start because of out-of-license errors you can have the whole
system heal itself and you're probably done. Nice.

The downside to this approach is if you add a new license-limited tool
you have to update all the startd cron jobs in your system to advertise
the new tool on your execute nodes.
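
For reference, here's a rough sketch (in Python) of what a startd cron script along the lines of your "fad" routine might look like. It's only an illustration: query_license_counts() is a hypothetical placeholder for however you actually ask your license server, and the counts in it are just the ones from your example output.

```python
#!/usr/bin/env python3
"""Sketch of a STARTD_CRON script that prints license counts as ClassAd attributes."""


def query_license_counts():
    # HYPOTHETICAL placeholder: replace with a real query of your license
    # server. These values are just the ones from the example output.
    return {"software1": 5, "software2": 0}


def format_classad_lines(counts):
    # One "<tool>_available = N" line per tool, in a stable (sorted) order.
    return [f"{tool}_available = {int(n)}" for tool, n in sorted(counts.items())]


if __name__ == "__main__":
    # The startd cron facility reads these lines from stdout.
    for line in format_classad_lines(query_license_counts()):
        print(line)
```

Adding a new license-limited tool then means adding one entry to whatever query_license_counts() returns, but you still have to push the updated script to every execute node.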

If it's important that you *never* have jobs fail to start because of
out-of-license errors a different approach would be to have jobs
submitted held and a cron job that releases jobs from the held state
such that: I+R jobs (idle plus running) <= number of available licenses.
This only works if you have a dedicated scheduler for the
license-limited jobs, really.
Otherwise you have to get into load-balancing between your schedds. No
fun.

But this approach is centralized so it's easy to add license-limited
tools. And it's a little easier (I think) to predict the behaviour of
and to debug. It does require that you've got tight control over how
users submit their jobs so you can hold everything that uses
license-limited tools.
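
The arithmetic the releasing cron job has to do is small. Here's a sketch in Python, assuming "number of available licenses" means the total number of licenses the pool is allowed to use (license_limit below) and that you already have the idle, running, and held job counts from the schedd. The function name and parameters are mine, not anything Condor provides.

```python
def jobs_to_release(license_limit, idle, running, held):
    """Return how many held jobs may be released while keeping
    idle + running <= license_limit.

    ASSUMPTION: license_limit is the total number of licenses the pool may
    use, and every idle or running job needs exactly one license.
    """
    headroom = license_limit - (idle + running)
    # Never release more jobs than are held, and never a negative number.
    return max(0, min(held, headroom))


if __name__ == "__main__":
    # 5 licenses, 1 idle + 2 running, 10 held: room for 2 more jobs.
    print(jobs_to_release(5, 1, 2, 10))
```

The cron job would then run condor_release on that many held job ids.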

>       2. How do I see the value of software_available to make sure it
>       is getting set correctly from the STARTD_CRON job?

You need to tell the startd to advertise the classad attribute. You do
this with STARTD_EXPRS. Hmm...I was going to point you to
documentation on STARTD_EXPRS but the index only lists two entries and
neither of them explains what STARTD_EXPRS is used for. Just...weird.

STARTD_EXPRS is a list of classad attributes the startd should advertise
to the world. There's a bunch of defaults (this is the stuff you see
when you do condor_status -long). You can advertise any arbitrary
classad attribute in a startd's configuration by adding it to
STARTD_EXPRS. You can even advertise stuff on a per-slot basis. But for
your purposes it should suffice to advertise the same attribute for all
the slots on a machine. So set:

        STARTD_EXPRS = software1_available, software2_available, etc...

in your startd's configuration file, and then tell the startd to
reconfigure itself so it starts advertising the attributes:

        condor_reconfig -full -startd
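
Once the attributes are being advertised they should show up in the condor_status -long output for the machine. If you want to check them from a script, here's a small Python sketch that picks named attributes out of that kind of "Name = value" text. The SAMPLE text is made up for illustration (it is not real output), and find_attrs is my own helper name.

```python
def find_attrs(status_long_text, wanted):
    """Pick selected attributes out of `condor_status -long` style text.

    Returns a list of (attribute, value) string pairs, one per matching
    line. Attribute names are compared case-insensitively.
    """
    wanted_lower = {name.lower() for name in wanted}
    found = []
    for line in status_long_text.splitlines():
        # Split on the first "=" only; the value may itself contain "=".
        name, sep, value = line.partition("=")
        if sep and name.strip().lower() in wanted_lower:
            found.append((name.strip(), value.strip()))
    return found


# ILLUSTRATIVE sample only, not real condor_status output.
SAMPLE = 'Name = "slot1@node1"\nsoftware1_available = 5\nsoftware2_available = 0\n'

if __name__ == "__main__":
    for attr, value in find_attrs(SAMPLE, ["software1_available", "software2_available"]):
        print(attr, "=", value)
```

To run it against a live pool you would feed it the stdout of condor_status -long for one of your execute nodes (for example via subprocess.run) instead of SAMPLE.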

- Ian
