
RE: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job



All the jobs in the system belonged to me. It was just the one cluster
of jobs present at the time I saw this happening. We are doing pretty
much what you suggested to force re-negotiation of the claim after every
job. Indeed, the job was in the retiring state when I issued the
condor_rm command to remove it.

If the negotiator is simply connecting a startd with a schedd, then is
there something amiss with the schedd when condor_rm is invoked? I would
have expected 44.2 to run after 44.0 whether the negotiator or the
schedd was deciding which job to hand to the startd next.
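
To restate my expectation concretely, here's a small sketch (in Python)
that evaluates the rank expression from this thread over the two idle
jobs. The EnteredCurrentStatus values are the ones from my earlier
message; the CurrentTime value is illustrative:

```python
# Sketch of the machine RANK evaluation I expected: rank is the
# number of minutes a job has been idle (JobStatus == 1).
# EnteredCurrentStatus values are from my earlier message; the
# CurrentTime value below is illustrative.
CURRENT_TIME = 1098913000

jobs = {
    "44.1": {"JobStatus": 1, "EnteredCurrentStatus": 1098910859},
    "44.2": {"JobStatus": 1, "EnteredCurrentStatus": 1098910279},
}

def rank(ad):
    # Mirrors (TARGET.JobStatus =?= 1) * ((CurrentTime -
    # TARGET.EnteredCurrentStatus)/60), with integer division.
    return (ad["JobStatus"] == 1) * ((CURRENT_TIME - ad["EnteredCurrentStatus"]) // 60)

next_job = max(jobs, key=lambda j: rank(jobs[j]))
print(next_job)  # 44.2 has been idle longer, so it should rank higher
```

Under this evaluation 44.2 wins, which is why the machine picking up
44.1 first surprised me.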

Ian

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: October 27, 2004 6:16 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Adjusting machine RANK classad expr 
> based on total queue time for a job
> 
> 
> Ian,
> 
> Could you specify which of the jobs in your various tests are 
> being run by different users, if any?  One potential point of 
> confusion is that, by design, the Condor negotiator does not 
> micromanage what the schedd does with a claim.  Once the 
> schedd gets a claim on behalf of a user, it will continue to 
> run jobs on that claim until the claim is taken away or the 
> user runs out of jobs.  The negotiator doesn't tell the 
> schedd which job to run next on the claim.
> 
> You can force renegotiation of claims after every job if you want.  
> Something like the following policy will do this:
> 
> MaxJobRetirementTime = 1000000
> WANT_SUSPEND = FALSE
> PREEMPT = TRUE
> 
> --Dan
> 
> Ian Chesal wrote:
> 
> >It looks like it was my use of condor_rm that messed up my 
> >predictability. I continued the experiment, but this time I made sure 
> >the running 44.1 process finished normally instead of being 
> >prematurely terminated by condor_rm.
> >
> >I had two queued jobs with their EnteredCurrentStatus times:
> >
> >44.2 1098912677
> >44.3 1098910808
> >
> >I expected 44.2 to rank lower than 44.3 by ~31 (minutes of queue 
> >time), so 44.3 should be the next job picked up.
> >
> >And this was the case. My rank expression worked this time. Excellent.
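
For the record, the ~31 figure is just the difference between the two
EnteredCurrentStatus values above, converted to minutes:

```python
# Queue-entry times for 44.2 and 44.3, copied from above.
# 44.3 entered ~31 minutes earlier, so it carries ~31 more rank
# points under the /60 expression and should run first.
entered_44_2 = 1098912677
entered_44_3 = 1098910808
diff_minutes = (entered_44_2 - entered_44_3) / 60
print(round(diff_minutes))  # prints 31
```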
> >
> >So here's a question for the Condor team: if I were a "sneaky user" I 
> >could write a job that, after its processing was complete, sent me an 
> >email and then slept for a long, long time. Upon receiving that email, 
> >if I used condor_rm to terminate the job, I'd be able to hang on to 
> >the resource it was using and run another of my jobs on it, even if a 
> >job from another user had a higher rank, because condor_rm seems to 
> >prevent the machine from re-negotiating. This would give me 
> >indefinite access to a resource. Can this happen?
> >
> >
> >Ian
> >
> >>-----Original Message-----
> >>From: condor-users-bounces@xxxxxxxxxxx 
> >>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> >>Sent: October 27, 2004 5:18 PM
> >>To: Condor-Users Mail List
> >>Subject: RE: [Condor-users] Adjusting machine RANK classad expr 
> >>based on total queue time for a job
> >>
> >>Hmm. So I went with the RANK expression:
> >>
> >>RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> >>TARGET.EnteredCurrentStatus)/60))
> >>
> >>My plan was to make sure jobs that are queued rank higher the longer 
> >>they've been in the queued state. In this case, +1 for every minute 
> >>they've been sitting idle.
> >>
> >>To test this I submitted some jobs in the held state. Jobs are 
> >>simple: go to the machine and sleep for an hour.
> >>
> >>I released three of the held jobs. My machine immediately picked up 
> >>44.0 from the cluster and started running it.
> >>
> >>I let the other two released jobs build up some queue time while 
> >>44.0 slept on a machine. At one point I did see condor_status show 
> >>my 44.0 as being in the "Retiring" state instead of the "Busy" 
> >>state -- that is good news. We have a long MaxJobRetirementTime, so 
> >>this is expected.
> >>
> >>I let about 8 minutes elapse and then issued the commands:
> >>
> >>condor_hold 44.1
> >>condor_release 44.1
> >>
> >>So this reset the EnteredCurrentStatus time on 44.1. I now have 44.0 
> >>running, but retiring, and the remaining two jobs each have 
> >>EnteredCurrentStatus as follows:
> >>
> >>44.1 1098910859
> >>44.2 1098910279
> >>
> >>Based on this output I expect 44.2 to have the higher rank. 44.0 is 
> >>still running, so I removed it with:
> >>
> >>condor_rm 44.0
> >>
> >>I expected the machine to pick up 44.2 as the next job because its 
> >>rank is higher, having been queued for a longer time than 44.1.
> >>
> >>Not so. The machine picked up 44.1. I'm the only user in the system, 
> >>so it's not a matter of EUP. What's up? Why didn't 44.2 rank higher? 
> >>Can anyone see how I messed up my prediction for the next job to 
> >>run? I'm stumped. I thought I had it all figured out.
> >>
> >>Thanks!
> >>
> >>Ian
> >>
> >>>-----Original Message-----
> >>>From: condor-users-bounces@xxxxxxxxxxx 
> >>>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> >>>Sent: October 27, 2004 11:34 AM
> >>>To: Condor-Users Mail List
> >>>Subject: [Condor-users] Adjusting machine RANK classad expr based 
> >>>on total queue time for a job
> >>>
> >>>I'm toying with adjusting the RANK expression to achieve a more 
> >>>FIFO-like ordering when Condor runs jobs. The idea is to rank 
> >>>jobs on machines based on their time in the queue. I wanted to 
> >>>bounce the rank expression and the idea off the list. The machine 
> >>>rank expression I'm thinking of using is:
> >>>
> >>>RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> >>>TARGET.EnteredCurrentStatus)/600))
> >>>
> >>>This would give a job queued 10 minutes longer than another job a 
> >>>higher rank on the machine.
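
To make the granularity concrete, here is a small Python sketch of how
that /600 arithmetic behaves, assuming the division is integral as in
the ClassAd expression above (the timestamps are made up):

```python
# With /600, rank grows by 1 for every full 600 seconds (10 minutes)
# of queue time. Timestamps below are illustrative.
def rank(current_time, entered_current_status, job_status=1):
    # Mirrors (JobStatus =?= 1) * ((CurrentTime - EnteredCurrentStatus)/600)
    return (job_status == 1) * ((current_time - entered_current_status) // 600)

now = 1098913000
print(rank(now, now - 25 * 60))  # queued 25 minutes -> rank 2
print(rank(now, now - 35 * 60))  # 10 minutes longer -> rank 3
```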
> >>>
> >>>The other option is:
> >>>
> >>>RANK = ((CurrentTime - TARGET.QDate)/600)
> >>>
> >>>But this would track cumulative queue time (so if the job queued, 
> >>>ran for a bit, then got sent back to the queue, it would keep 
> >>>counting from the original submission), right? Or is QDate reset 
> >>>every time a job returns to the queue, not just the first time 
> >>>it's queued up by condor_submit?
> >>>
> >>>Comments? Opinions? Much appreciated.
> >>>
> >>>Ian
> >>>
> >>>_______________________________________________
> >>>Condor-users mailing list
> >>>Condor-users@xxxxxxxxxxx
> >>>http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>>
> >
> 