
Re: [Condor-users] When do machine RANK settings apply?



On Wed, 5 Jan 2005 14:34:46 -0500, Ian Chesal <ICHESAL@xxxxxxxxxx> wrote:
> Hmm. Actually, it isn't working out at all. I had users with jobs in the
> system with JobPrio's of 10 and 11 respectively. I sent a single job in
> with a JobPrio of 16 and expected my job to run on the next available
> machine. Not the case. My job is still sitting there while the two
> lower-JobPrio users are passing jobs through the system. I have
> attached some cumulative logging output that I'm collecting every 10
> minutes. I have a custom command called abc_who that groups users' jobs
> by priority and shows running and queued. You can see that I (ichesal)
> have a priority 16 job queued, but that bchan's priority 12 jobs keep
> getting resources.

> Looking at the NegotiatorLog it wants to preempt bchan's jobs for mine
> but it can't because PREEMPTION_REQUIREMENTS are false. I think what I'm
> observing here is that bchan's schedd is holding on to the startd machine
> after a job finishes and just running the next job in her list. Why is
> my higher-ranking job not taking over this machine?

That is an issue: essentially, if a user retains a claim on the
machine then they can keep sending lower-priority jobs to it. It
seems the negotiator <annoyingly> decides that it has already tried
preemption based on the user priority being higher, was told no, and
so won't bother checking whether the machine rank makes a difference...
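
To make the claim-reuse behaviour concrete, here is a toy Python model of
it. This is only a sketch of what we appear to be observing, not Condor's
actual code: the names Job, Machine and next_job are made up for
illustration, and the job list is loosely based on the one in this thread.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    owner: str
    job_prio: int


@dataclass
class Machine:
    # Owner whose schedd currently holds the claim (hypothetical field).
    claimed_by: Optional[str] = None


def next_job(machine: Machine, queue: List[Job]) -> Optional[Job]:
    """Pick the next job for a machine under a simplified claim-reuse rule.

    While the claiming user still has queued jobs, only those jobs are
    considered, whatever JobPrio other users' jobs have. Once that user's
    queue is empty the claim is treated as released.
    """
    own = [j for j in queue if j.owner == machine.claimed_by]
    pool = own if own else queue
    if not pool:
        return None
    job = max(pool, key=lambda j: j.job_prio)
    queue.remove(job)
    machine.claimed_by = job.owner
    return job


machine = Machine(claimed_by="bchan")
queue = [Job("bchan", 12), Job("ichesal", 16), Job("bchan", 12)]
order = [next_job(machine, queue).owner for _ in range(3)]
print(order)
```

In this model the priority 16 job only runs after both priority 12 jobs,
which is the pattern in the logs.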

Just to check: if you release all those jobs at the same time (with
only 2 machines to execute the three of them), so that a single
negotiation cycle happens, does the right allocation occur?

I was aware of the problem you describe on 6.6 (I very occasionally
have to run condor_vacate to force things to realign when two
users have identically tiered jobs but one got a 'head start' and
is therefore holding on to the machine), but the 6.7 retirement
should, in theory, have allowed me to enable user preemption where a
slight disparity exists, coupled with a max job retirement time to
avoid thrashing.

All is not lost though. I think you may have forgotten about your 2
day retirement time... the negotiator does recheck while a "preemption
pending retirement" exists, in case the preempting job goes away; this
lets the retirement be withdrawn.

If the retirement is present but the schedd is still accepting jobs
then that's a BUG (didn't someone else mention this a while back? Did
it get identified/resolved?)...

If anyone at cs.wisc can see why this might be happening, please do
chip in here, because I'm hitting a brick wall now.

Clearly more than one group would like to use Condor in a "Job then
User" setting. Condor, for all its vaunted flexibility, does not make
this easy (the jury is still out on whether it is possible at all). I
can see the reasons it doesn't, since considerable optimization of the
startd/negotiator comms overhead can be performed this way.
However, these optimizations make what we are attempting to do
excruciatingly unpleasant.
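
For anyone unsure what I mean by "Job then User", this is the ordering we
are after, as a small Python sketch. It is not something Condor does out
of the box; the function name, the (owner, JobPrio) tuples and the EUP
numbers are all made up for illustration. It assumes a higher JobPrio is
better and a numerically lower EUP is better.

```python
from typing import Dict, List, Tuple

# A queued job as (owner, JobPrio); purely illustrative.
QueuedJob = Tuple[str, int]


def job_then_user_order(jobs: List[QueuedJob],
                        eup: Dict[str, float]) -> List[QueuedJob]:
    """Order jobs across all users by JobPrio first, then by user priority.

    Highest JobPrio comes first. Ties are broken by the owner's effective
    user priority, lowest value first. Owners missing from the table sort
    last within their JobPrio tier.
    """
    return sorted(jobs, key=lambda j: (-j[1], eup.get(j[0], float("inf"))))


jobs = [("bchan", 12), ("ichesal", 16), ("other", 12)]
eup = {"bchan": 14.3, "ichesal": 20.0, "other": 11.2}  # invented values
print(job_then_user_order(jobs, eup))
```

Here the priority 16 job goes first even though its owner has the worst
user priority, and user priority only decides between the two priority 12
jobs.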

> I've been tracking user priority, a custom condor_status view and another
> view of running and queued jobs every ten minutes. Here is the data that
> is relevant to this time frame in the negotiator log. You can see every ten
> minutes the EUPs are dumped for the active users. Then custom
> condor_status output that lists the jobid, the rank of the job, the
> owner, and the activity as well as the vmem in the machine and the
> imagesize of the job. Then there's the output from my custom tool that
> shows running and queued jobs sorted by priority and by user.
> 
> What's weird about this is that even though I (ichesal) haven't
> actually gotten to run any jobs my EUP has increased from 13:30 to
> 13:50! What's up with that?

Because of your pending retirement claim, I'm guessing...

> 14.29   35443.000000    bchan@xxxxxxxxxx        Retiring        2097151 754568
> 14.45   35464.000000    bchan@xxxxxxxxxx        Busy    2097151 134016
> 11.16   34644.000000    bchan@xxxxxxxxxx        Retiring        2097151

Well, they seem to be Retiring OK. Try pushing your retirement time
down temporarily and see if the proper job gets the machine. I'm
feeling slightly more optimistic now.

> As for your question about our setup, we have the following
> configuration:
> 
> Central Manager: RH 9 machines
> Dedicated Startd Machines: Mostly Win2k machines, dual proc
> Client Schedd Machines: Mix of Win2k/WinXP machines, dual proc
> 
> There are 9 dedicated startd machines running jobs for users. This is
> our test pool. The plan was to expand this to over 100 on Monday, but
> with things the way they are right now, with this scheduling stuff
> causing me so much trouble I'm probably going to push that off. All in
> all, if Condor works out it will be sitting on upwards of 300 dedicated
> startd machines by the end of the month at our site. With another 1000
> or so machines at other sites around the world. There are about 4
> submission nodes in use right now. Each user has their desktop machine
> set up as a submission node. There are only really 4 active users of our
> test installation right now.

Sounds similar to ours, which is a good sign, since I could do with 6.7
heading towards stability and (hopefully) 6.8.
 
> With 6.7.3 all
> crashing has disappeared. And we've really been trying to pound on it.
> We run vanilla jobs. With minimal file transfer done using Condor.
> Basically a small startup script to kick off the job and that's it.

Yep, sounds good (rubs hands)

thanks for the info,
Matt