
RE: [Condor-users] Trouble with job priority and job retirement



No, the jobs aren't nice'd. The only "strange" thing in our setup is the
RANK expression on our startd machines:

RANK = (TARGET.JobPrio * 2880) + ( (TARGET.JobStatus =?= 1) *
((CurrentTime - TARGET.EnteredCurrentStatus) / 60) ) 

which promotes a FIFO/priority-like ranking among user jobs. I don't
think this is the culprit though.
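
To make the intent concrete (my own worked numbers, not from our logs):
for an idle job (JobStatus == 1) the rank grows by one point per minute
of queue wait, and each JobPrio step is worth 2880 points, i.e. two
days of waiting:

    JobPrio = 1, idle    0 min -> rank = 1*2880 +    0 = 2880
    JobPrio = 0, idle 2880 min -> rank = 0*2880 + 2880 = 2880

So a job one priority level higher outranks any lower priority job
that has been idle for less than two days.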

I'm going to try reproducing this problem tomorrow by submitting some
low priority jobs, artificially jacking up my real user priority with
condor_userprio, and then submitting the high priority job to see if it
again gets ignored. As I mentioned below, a colleague and I, both with
0.5 real user priorities, were unable to recreate this -- our higher
priority jobs were taken before our lower priority jobs. But it'll have
to wait until the morning.
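
Roughly the recipe I have in mind (switches quoted from memory, so
check them against the condor_userprio man page; the user name is just
a placeholder):

    # inflate my real user priority so I look like a heavy user
    condor_userprio -setprio ichesal@<our-domain> 1000

    # confirm it took, then submit low priority jobs followed by one
    # high priority job, and watch the negotiator's log
    condor_userprio -all
    tail -f $(condor_config_val NEGOTIATOR_LOG)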

-Ian

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: December 14, 2004 5:11 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Trouble with job priority and job 
> retirement
> 
> 
> The rounding off of resource share is known to cause a 
> resource to go unused under certain circumstances.  I don't 
> understand how this could happen with only one submitter, 
> however.  Is she also submitting nice-user jobs?
> 
> --Dan
> 
> Ian Chesal wrote:
> 
> >I really think this has to do with the fact that my one user had 
> >received 0 resources from the system during the negotiation cycle. 
> >Even though there were no other users vying for resources, her 
> >effective user priority was high and netted her 0 resources, so the 
> >negotiator ignored her new job that had higher priority than her old 
> >jobs. Does this seem plausible?
> >
> >- Ian
> >
> >>-----Original Message-----
> >>From: condor-users-bounces@xxxxxxxxxxx 
> >>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> >>Sent: December 14, 2004 12:49 PM
> >>To: Condor-Users Mail List
> >>Subject: Re: [Condor-users] Trouble with job priority and job 
> >>retirement
> >>
> >>
> >>I cannot reproduce any problems with a match record not getting 
> >>deleted when a claim timeout happens.  If you are still having a 
> >>problem, please send the relevant StartLog, NegotiatorLog, and 
> >>SchedLog to condor-admin and I'll try to see what is going on.
> >>
> >>--Dan
> >>
> >>Dan Bradley wrote:
> >>
> >>>Ian,
> >>>
> >>>In a case such as the one you describe, where job 2.0 preempts job 
> >>>1.0 and has to wait around for 1.0 to finish, there are two possible 
> >>>cases.  One is that 1.0 finishes and 2.0 claims the machine.  The 
> >>>other is that the schedd times out waiting for 2.0 to get an active 
> >>>claim (controlled by REQUEST_CLAIM_TIMEOUT), and it tries getting a 
> >>>new match for 2.0.  From your description of what is happening, I am 
> >>>concerned that when the timeout happens, the previous match is not 
> >>>getting correctly removed.  I will double-check this case and get 
> >>>back to you.  If you set REQUEST_CLAIM_TIMEOUT to a very large 
> >>>number, you should be able to remove this case from even being a 
> >>>possibility.
> >>>
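For anyone following along: if I'm reading the manual correctly,
REQUEST_CLAIM_TIMEOUT takes a value in seconds, so ruling this case out
should be a schedd config entry along these lines (the one-year value
is just an arbitrary "effectively never"):

    # schedd config sketch: effectively disable the claim-request
    # timeout; the value is in seconds (31536000 = 365 days)
    REQUEST_CLAIM_TIMEOUT = 31536000
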
> >>>You also asked about the meaning of, "Over submitter resource limit
> >>>(0) ... only consider startd ranks."  This means that when Condor 
> >>>sliced up the resource pie between job submitters, this user got a 
> >>>slice of size 0.
> >>>
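My mental model of that pie slicing, which may well be wrong: each
submitter's share is inversely proportional to their effective user
priority, and share times pool size is rounded to whole machines.
Rough numbers for a hypothetical 2-machine pool:

    my prio    =   0.5 -> share = (1/0.5) / (1/0.5 + 1/500) = ~0.999 -> 2 machines
    bchan prio = 500   -> share = (1/500) / (1/0.5 + 1/500) = ~0.001 -> 0 machines

That would explain a limit of 0 whenever her effective priority is
huge -- though, as Dan says, it shouldn't happen with only one
submitter in the pool.
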
> >>>--Dan
> >>>
> >>>Ian Chesal wrote:
> >>>
> >>>>I'm trying to get a better handle on job retirement. I'm observing 
> >>>>a strange situation in our current 6.7.2 system, which uses the 
> >>>>retirement feature. We have a fairly long retirement time set (2 
> >>>>days). I have a user who has 100 jobs queued as cluster 1. Two of 
> >>>>the jobs are running on the available resources. She queues up a 
> >>>>101st job, at a higher priority than the previously queued 100 
> >>>>jobs, as cluster 2.
> >>>>
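For context, the two-day retirement is, if I recall our config
correctly, a startd setting along these lines (the value is an
expression evaluated in seconds):

    # startd config sketch: let preempted jobs run for up to two days
    # before they are evicted
    MAXJOBRETIREMENTTIME = 2 * 24 * 3600
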
> >>>>The negotiator log at time t indicates that it has matched her 2.0 
> >>>>job and is preempting job 1.0 running on machine-A. At negotiation 
> >>>>cycle t+1, job 1.1 finishes running on machine-B. Rather than 
> >>>>assign the high priority job, 2.0, to the now-free machine-B at 
> >>>>negotiation cycle t+2, I'm seeing a lower priority job, 1.11, get 
> >>>>assigned to the machine.
> >>>>
> >>>>My question is this: once a job is moved to retirement on behalf 
> >>>>of a queued, higher priority job, is that waiting job bound to be 
> >>>>assigned to that particular machine? Can it not use the next 
> >>>>available resource? I get the feeling that the job is exempted 
> >>>>from future negotiation cycles, because once I see a message 
> >>>>saying job 1.0 is being preempted for job 2.0 I don't see any more 
> >>>>negotiator messages for job 2.0 in subsequent negotiation cycles. 
> >>>>Is there a point in time when the 2.0 job will give up waiting for 
> >>>>the 1.0 job to retire and be renegotiated?
> >>>>
> >>>>I am also seeing this very odd message in my NegotiatorLog, 
> >>>>printed at the start of her portion of the negotiation cycle:
> >>>>
> >>>>12/13 16:00:02     Over submitter resource limit (0) ... only 
> >>>>consider startd ranks
> >>>>
> >>>>This is printed for the user "bchan", who is experiencing the 
> >>>>inability to get her higher priority job running before her lower 
> >>>>priority jobs. What does this message mean? I couldn't find an 
> >>>>answer searching the archives, unfortunately, although I did 
> >>>>notice this question has been asked a few times.
> >>>>
> >>>>Another user and I tested that priority works, and for us it 
> >>>>wasn't a problem. But in the NegotiatorLog file there were no 
> >>>>"Over submitter" messages for our sections of the negotiation 
> >>>>cycle. I suspect her problems relate to this message.
> >>>>
> >>>>Thanks!
> >>>>
> >>>>- Ian Chesal
> >>>>
> >>>>--
> >>>>Ian R. Chesal <ichesal@xxxxxxxxxx>
> >>>>Senior Software Engineer
> >>>>
> >>>>Altera Corporation
> >>>>Toronto Technology Center
> >>>>Tel: (416) 926-8300
> >>>>
> >>>>
> _______________________________________________
> Condor-users mailing list
> Condor-users@xxxxxxxxxxx
> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
>