
Re: [Condor-users] Vacating job and attaching meta data for the next host to take over the vacated job

Thanks Matt, please see below.

A few more clarifications: I don't use checkpointing, and the universe is
Vanilla, as you might have already guessed ;)

From what I understand I must use preemption: if a render job with higher
priority shows up, I want the machines to be informed that a higher-priority
job has been matched to them (preempting, right?), and those machines then
need to vacate the currently running job, but not at any expense. I want
flexible preemption rules that give the currently rendering frame a chance
to finish before the vacate (due to preemption) is accepted.
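One config knob that might help here (an assumption on my part that it
exists in your Condor version; check the manual): the startd's
MAXJOBRETIREMENTTIME, which gives a preempted job a grace period to keep
running before it is actually evicted. A sketch:

```
# condor_config on the execute machines -- sketch only.
# Let a preempted job "retire" for up to 20 minutes, i.e. keep running
# long enough to finish the frame it is currently rendering, before it
# is actually evicted.  The 20-minute value is an arbitrary example.
MAXJOBRETIREMENTTIME = 20 * 60
```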

> So in answer to your question the job itself should respond to the
> preemption notification (note it isn't a request!) by writing
> any saved state it requires as well as some means of flagging its
> success such as a flag file to the working directory then exiting.

That is what I thought I would do at the beginning. I am seeking ways to
publish the information without polluting the file server with temporary
files which eventually need to be cleaned up afterwards.
I guess I will have to pollute ;)
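For what it's worth, a minimal sketch of such a wrapper (in Python rather
than Perl for brevity; the file names and the current_frame bookkeeping are
assumptions of mine, and SIGTERM is Condor's default vacate signal for
vanilla-universe jobs):

```python
# Sketch of a render wrapper that responds to Condor's preemption
# notification (SIGTERM for a vanilla-universe job) by saving resume
# state and dropping a flag file, then exiting.  STATE_FILE, FLAG_FILE
# and current_frame are illustrative assumptions, not Condor API.
import signal
import sys

STATE_FILE = "render.state"          # hypothetical resume-info file
FLAG_FILE = "render.checkpoint.ok"   # hypothetical "state is complete" flag

current_frame = 0                    # updated by the real render loop

def on_preempt(signum, frame):
    # Persist whatever the next host needs to resume, then create the
    # flag file last, so a half-written state file is never trusted.
    with open(STATE_FILE, "w") as f:
        f.write(str(current_frame))
    open(FLAG_FILE, "w").close()
    sys.exit(0)

signal.signal(signal.SIGTERM, on_preempt)
```

The next host's wrapper can then refuse to resume unless the flag file is
present next to the state file.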

> you could hack round this and use condor_qedit on a user set classad on
> the job but be careful.

Condor will be installed on ALL machines, so my Perl wrapper will have
access to condor_qedit. On second thought it might be a bad idea, since it
will not be real time: the collector gets the machine-specific attributes
every 5 minutes or so (I think, right?). I may instead implement a
timer/alarm in my Perl wrapper so that the latest render progression is
published in a text file in the node's log directory (on the network),
which can then be parsed by another script (web page / CGI / PHP). I think
I will keep it simple so it is reliable and predictable ;)
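The timer idea could stay very small; a sketch (again Python for brevity;
PROGRESS_FILE, the interval, and the progress dict are my assumptions):

```python
# Sketch: periodically publish the latest render progression to a
# small text file that a web page / CGI / PHP script can parse.
import threading

PROGRESS_FILE = "progress.txt"       # hypothetical file in the log dir
INTERVAL = 60.0                      # seconds between updates; adjust

progress = {"frame": 0, "percent": 0}  # updated by the real render loop

def publish():
    # Overwrite the file with the latest snapshot, then re-arm the
    # timer; daemon=True so a pending timer never blocks job exit.
    with open(PROGRESS_FILE, "w") as f:
        f.write("frame=%(frame)d percent=%(percent)d\n" % progress)
    t = threading.Timer(INTERVAL, publish)
    t.daemon = True
    t.start()

publish()
```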

> The reconfigurability would be somewhat risky though...I wouldn't
> recommend it and would just suggest you stop preemption if the
> throughput losses become too big (your usage may well trend to a steady
> state where this is not required much, so don't totally discard the
> idea of just seeing what happens).

I fear I can't walk away from preemption in this setup, since the
renderfarm must remain responsive whenever higher-priority jobs are made
available.

> Note that the default preemption rank is to preempt the longest
> running jobs - while fine for standard as a default this is sorely
> lacking for a vanilla only farm - I recommend inverting this logic to
> start with
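For reference, I believe the inversion Matt suggests would look roughly
like this in the negotiator config (a sketch; verify the attribute name
against your Condor version):

```
# Default-style rank: TotalJobRunTime makes the longest-running job
# the most attractive preemption target.  Negating it inverts that,
# so the youngest job is preempted first.
PREEMPTION_RANK = -TotalJobRunTime
```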

Good to know

If PREEMPTION_RANK is based on how long a job has been active, I think I
may run into a problem where a hung application could run for several
hours, thus lowering its probability of being preempted (as you
mentioned).

That is the reason why I was interested in "injecting" a new custom job
attribute which would be considered by PREEMPTION_RANK.

Example: if JOB_RENDER_PROGRESSION > 80, then tune the ranking to lower
the probability of being preempted.
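Sketched as negotiator config (assuming the ifThenElse ClassAd function is
available in your Condor version, and with arbitrary numbers;
JOB_RENDER_PROGRESSION is our own attribute, kept current by the wrapper
via condor_qedit):

```
# Jobs more than 80% through their frame rank very low as preemption
# targets; everything else falls back to preempting the
# shortest-running job first.
PREEMPTION_RANK = ifThenElse(JOB_RENDER_PROGRESSION > 80, -1000000, -TotalJobRunTime)

# The wrapper would refresh the attribute with something like:
#   condor_qedit <cluster>.<proc> JOB_RENDER_PROGRESSION 85
```

Jobs that never define the attribute would make the condition UNDEFINED,
so in practice each job should probably start with a default, e.g.
+JOB_RENDER_PROGRESSION = 0 in the submit file.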

Thanks again for your time.