Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

RE: [Condor-users] Trouble with job priority and job retirement

Date: Tue, 14 Dec 2004 16:30:13 -0500
From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
Subject: RE: [Condor-users] Trouble with job priority and job retirement

I really think this has to do with the fact that my one user had
received 0 resources from the system during the negotiation cycle. Even
though there were no other users vying for resources, her effective user
priority was high and netted her 0 resources, so the negotiator ignored
her new job that had higher priority than her old jobs. Does this seem
plausible?

- Ian 
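
A toy model of the hypothesis above may help readers following the thread. It is only a sketch under a stated assumption: that each submitter's share of the pool is proportional to the inverse of their effective user priority (a numerically larger value being worse) and is rounded down to whole machines. This is not Condor's negotiator code. The helper name, the submitters other than bchan, and all of the numbers are invented for illustration.

```python
import math


def submitter_limits(pie_size, effective_priority):
    """Split pie_size machines among submitters (hypothetical helper).

    Assumption: a submitter's share is proportional to
    1 / effective priority, rounded down to whole machines.
    """
    weights = {user: 1.0 / prio for user, prio in effective_priority.items()}
    total = sum(weights.values())
    return {
        user: math.floor(pie_size * weight / total)
        for user, weight in weights.items()
    }


# Made-up numbers: a 10-machine pie and one submitter whose effective
# priority value is much larger (worse) than everyone else's.
limits = submitter_limits(10, {"bchan": 50.0, "ian": 0.5, "other": 0.5})
print(limits)
```

With these invented inputs, the limit for bchan rounds down to 0 while the other two submitters each get 4 machines. Whether that is what actually happened in this pool is not established in this message.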

> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx 
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: December 14, 2004 12:49 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Trouble with job priority and job retirement
> 
> 
> I cannot reproduce any problems with a match record not 
> getting deleted when a claim timeout happens.  If you are 
> still having a problem, please send the relevant StartLog, 
> NegotiatorLog, and SchedLog to condor-admin and I'll try to 
> see what is going on.
> 
> --Dan
> 
> Dan Bradley wrote:
> 
> > Ian,
> >
> > In a case such as the one you describe, where job 2.0 preempts job 1.0
> > and has to wait around for 1.0 to finish, there are two possible
> > cases.  One is that 1.0 finishes and 2.0 claims the machine.  The
> > other is that the schedd times out waiting for 2.0 to get an active
> > claim (controlled by REQUEST_CLAIM_TIMEOUT), and it tries getting a
> > new match for 2.0.  From your description of what is happening, I am
> > concerned that when the timeout happens, the previous match is not
> > getting correctly removed.  I will double-check this case and get back
> > to you.  If you set REQUEST_CLAIM_TIMEOUT to a very large number, you
> > should be able to remove this case from even being a possibility.
> >
> > You also asked about the meaning of, "Over submitter resource limit
> > (0) ... only consider startd ranks."  This means that when Condor
> > sliced up the resource pie between job submitters, this user got a
> > slice of size 0.
> >
> > --Dan
> >
> > Ian Chesal wrote:
> >
> >> I'm trying to get a better handle on job retirement. I'm observing a
> >> strange situation in our current 6.7.2 system, which uses the
> >> retirement feature. We have a fairly long retirement time set (2
> >> days). I have a user that has 100 jobs queued as cluster 1. Two of the
> >> jobs are running on the available resources. She queues up a 101st
> >> job as cluster 2, at a higher priority than the 100 previously
> >> queued jobs.
> >>
> >> The negotiator log at time t indicates that it has matched her 2.0
> >> job and is preempting job 1.0 running on machine-A. At negotiation
> >> cycle t+1, job 1.1 finishes running on machine-B. Rather than
> >> assigning the high priority job, 2.0, to the now free machine-B at
> >> negotiation cycle t+2, I'm seeing a lower priority job, 1.11, get
> >> assigned to the machine.
> >>
> >> My question is this: once a job is moved to retirement on behalf of a
> >> queued, higher priority job, is that waiting job bound to be assigned
> >> to that particular machine? Can it not use the next available
> >> resource? I get the feeling that the job is exempted from future
> >> negotiation cycles because once I see a message saying job 1.0 is
> >> being preempted for job 2.0 I don't see any more negotiator messages
> >> for job 2.0 in subsequent negotiation cycles. Is there a point in
> >> time when the 2.0 job will give up waiting for the 1.0 job to retire
> >> and be renegotiated?
> >>
> >> I am also seeing this very odd message in my NegotiatorLog, printed at
> >> the start of her portion of the negotiation cycle:
> >>
> >> 12/13 16:00:02     Over submitter resource limit (0) ... only consider startd ranks
> >>
> >> This is printed for the user "bchan", who is unable to get her
> >> higher priority job running before her lower priority jobs.
> >> What does this message mean? I couldn't find an answer searching the
> >> archives, unfortunately, although I did notice this question has been
> >> asked a few times.
> >>
> >> Another user and I tested that priority works, and for us it
> >> wasn't a problem. But in the NegotiatorLog file there were no
> >> "Over submitter" messages for our sections of the negotiation cycle.
> >> I suspect her problems relate to this message.
> >>
> >> Thanks!
> >>
> >> - Ian Chesal
> >>
> >>
> >>
> >>
> >> --
> >> Ian R. Chesal <ichesal@xxxxxxxxxx>
> >> Senior Software Engineer
> >>
> >> Altera Corporation
> >> Toronto Technology Center
> >> Tel: (416) 926-8300
> >>
> >>
> >> _______________________________________________
> >> Condor-users mailing list
> >> Condor-users@xxxxxxxxxxx
> >> http://lists.cs.wisc.edu/mailman/listinfo/condor-users
> >>  
> >>
> >
>
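
The two cases Dan describes in the quoted message for the waiting job 2.0 can be sketched as a small decision function. This is only an illustrative model of his description, not schedd code. The helper name is hypothetical, the 30-minute timeout is an invented value (the actual setting in this pool is not given in the thread), and the model assumes job 1.0 uses its full 2-day retirement time.

```python
def outcome_for_waiting_job(seconds_until_old_job_exits, request_claim_timeout):
    """Model which of the two described cases applies (hypothetical helper).

    Case 1: the preempted job exits before the timeout, so the waiting
            job claims the machine.
    Case 2: the timeout expires first, so the schedd tries to get a new
            match for the waiting job.
    """
    if seconds_until_old_job_exits <= request_claim_timeout:
        return "claims the machine"
    return "times out and asks for a new match"


TWO_DAYS = 2 * 24 * 60 * 60  # retirement time mentioned in the thread, in seconds

# Invented timeout of 30 minutes: the timeout expires long before retirement ends.
print(outcome_for_waiting_job(TWO_DAYS, 30 * 60))

# A "very large" timeout, as Dan suggests, removes the second case.
print(outcome_for_waiting_job(TWO_DAYS, 10 * TWO_DAYS))
```

Under these assumptions, a timeout shorter than the remaining retirement time always leads to the rematch case, which is the case Dan was concerned about.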

Follow-Ups:
- Re: [Condor-users] Trouble with job priority and job retirement
  - From: Dan Bradley
