
RE: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job



It looks like it was my use of condor_rm that threw off my
prediction. I continued the experiment, but this time I made sure the
running 44.1 process finished normally instead of being prematurely
terminated by condor_rm.

I had two queued jobs with their EnteredCurrentStatus times:

44.2 1098912677
44.3 1098910808

I expected 44.2 to rank lower than 44.3 by ~31 (the two times are 1869
seconds apart, at one rank point per minute). So 44.3 should be the
next job picked up.

And this was the case. My rank expression worked this time. Excellent.
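
For anyone who wants to check the arithmetic, here is a minimal Python sketch of how the ~31 figure falls out of the RANK expression. It is only a model of the expression, not Condor code: the idle_rank helper and the `now` value are made up for illustration, and it uses floating-point division where the ClassAd evaluator may well truncate integers.

```python
IDLE = 1  # JobStatus value for an idle (queued) job


def idle_rank(job_status, entered_current_status, current_time, seconds_per_point=60):
    """Rough model of:
    RANK = (TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus)/60)
    Hypothetical helper; real evaluation happens inside the ClassAd engine.
    """
    is_idle = 1 if job_status == IDLE else 0
    return is_idle * (current_time - entered_current_status) / seconds_per_point


# EnteredCurrentStatus values reported for the two queued jobs
entered = {"44.2": 1098912677, "44.3": 1098910808}

# Assumed evaluation time; the gap between two idle jobs does not depend on it.
now = 1098913000

ranks = {job: idle_rank(IDLE, t, now) for job, t in entered.items()}
gap = ranks["44.3"] - ranks["44.2"]
next_job = max(ranks, key=ranks.get)
print(f"rank gap: {gap:.2f}, expected next job: {next_job}")
```

A job that is not idle gets a rank of 0 under this model, so only queued jobs accumulate rank.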

So here's a question for the Condor team: if I were a "sneaky user", I
could write a job that, once processing was complete, sent me an email
and then went to sleep for a long, long time. Upon receiving that email,
I could use condor_rm to terminate the job, hang on to the resource it
was using, and run another of my jobs on it, even if a job from another
user had a higher rank, because condor_rm seems to prevent the machine
from renegotiating. This would give me indefinite access to a
resource. Can this happen?


Ian




> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> Sent: October 27, 2004 5:18 PM
> To: Condor-Users Mail List
> Subject: RE: [Condor-users] Adjusting machine RANK classad
> expr based on total queue time for a job
> 
> Hmm. So I went with the RANK expression:
> 
> RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> TARGET.EnteredCurrentStatus)/60))
> 
> My plan was to make sure jobs that are queued rank higher the 
> longer they've been in the queued state. In this case, +1 for 
> every minute they've been sitting idle.
> 
> To test this I submitted some jobs in the held state. Jobs are simple:
> go to the machine and sleep for an hour.
> 
> I released three of the held jobs. My machine immediately 
> picked up 44.0 from the cluster and started running. 
> 
> I let the other two released jobs build up some queue time 
> while 44.0 slept on a machine. At one point I did see 
> condor_status show my 44.0 as being in the "Retiring" state 
> instead of the "Busy" state -- that is good news. We have a 
> long  MaxJobRetirementTime so this is expected.
> 
> I let about 8 minutes elapse and then issued the commands:
> 
> condor_hold 44.1
> condor_release 44.1
> 
> This reset the EnteredCurrentStatus time on 44.1. I now
> have 44.0 running, but retiring, and the remaining two jobs
> have EnteredCurrentStatus values as follows:
> 
> 44.1 1098910859
> 44.2 1098910279
> 
> By this output I expect 44.2 to have the higher rank. 44.0 is 
> still running so I removed it with:
> 
> condor_rm 44.0
> 
> I expected the machine to pick up 44.2 as the next job
> because its rank is higher, having been queued for a longer
> time than 44.1.
> 
> Not so. The machine picked up 44.1. I'm the only user in the
> system so it's not a matter of EUP. What's up? Why didn't 44.2
> rank higher?
> Can anyone see how I messed up my prediction for the next job to
> run? I'm stumped. I thought I had it all figured out.
> 
> Thanks!
> 
> Ian
> 
> > -----Original Message-----
> > From: condor-users-bounces@xxxxxxxxxxx 
> > [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> > Sent: October 27, 2004 11:34 AM
> > To: Condor-Users Mail List
> > Subject: [Condor-users] Adjusting machine RANK classad expr based on
> > total queue time for a job
> > 
> > I'm toying with adjusting the RANK expression to achieve a more 
> > FIFO-like consideration when condor runs jobs. The idea is to rank 
> > jobs on machines based on their time in the queue.
> > I wanted to bounce the rank expression and idea off the list. 
> > The rank expression for machines I'm thinking of using is:
> > 
> > RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> > TARGET.EnteredCurrentStatus)/600))
> > 
> > This would give a job queued 10 minutes longer than another job a 
> > higher rank on the machine.
> > 
> > The other option is:
> > 
> > RANK = ((CurrentTime - TARGET.QDate)/600)
> > 
> > But this would track cumulative queue time (that is, it would keep
> > counting if the job queued, ran for a bit, then got sent back to the
> > queue), right? Or is QDate reset every time a job returns to the
> > queue, not just the first time it's queued up by condor_submit?
> > 
> > Comments? Opinions? Much appreciated.
> > 
> > Ian
> > 
> > _______________________________________________
> > Condor-users mailing list
> > Condor-users@xxxxxxxxxxx
> > http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> > 
> 