
Re: [Condor-users] Trouble with job priority and job retirement




I cannot reproduce any problems with a match record not getting deleted when a claim timeout happens. If you are still having a problem, please send the relevant StartLog, NegotiatorLog, and SchedLog to condor-admin and I'll try to see what is going on.
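If you're not sure where those logs live on a given machine, condor_config_val will tell you, e.g.:

    % condor_config_val LOG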


--Dan

Dan Bradley wrote:

Ian,

In a case such as the one you describe, where job 2.0 preempts job 1.0 and has to wait around for 1.0 to finish, there are two possible outcomes. One is that 1.0 finishes and 2.0 claims the machine. The other is that the schedd times out waiting for 2.0 to get an active claim (controlled by REQUEST_CLAIM_TIMEOUT) and tries to get a new match for 2.0. From your description of what is happening, I am concerned that when the timeout happens, the previous match is not being correctly removed. I will double-check this case and get back to you. If you set REQUEST_CLAIM_TIMEOUT to a very large number, you should be able to rule this case out entirely.
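For example, in the schedd's config file (the value below is just an arbitrarily large illustration):

    # Seconds the schedd waits for a matched claim to become active
    # before throwing the match away and requesting a new one
    REQUEST_CLAIM_TIMEOUT = 2592000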

You also asked about the meaning of "Over submitter resource limit (0) ... only consider startd ranks". This means that when Condor sliced up the resource pie between job submitters, this user got a slice of size 0.
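You can get a sense of how the pie is being divided by looking at each submitter's effective priority and usage, e.g.:

    % condor_userprio -all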

--Dan

Ian Chesal wrote:

I'm trying to get a better handle on job retirement. I'm observing a
strange situation in our current 6.7.2 system, which uses the retirement
feature with a fairly long retirement time (2 days). I have a user who
has 100 jobs queued as cluster 1. Two of the jobs are running on the
available resources. She then queues a 101st job, as cluster 2, at a
higher priority than the 100 previously queued jobs.
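For reference, cluster 2 was submitted with a higher job priority,
roughly along these lines (the executable name here is made up):

    # submit file sketch for cluster 2
    universe   = vanilla
    executable = her_job
    priority   = 10    # cluster 1 was queued at the default priority of 0
    queue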

The negotiator log at time t indicates that it has matched her 2.0 job
and is preempting job 1.0 running on machine-A. At negotiation cycle
t+1, job 1.1 finishes running on machine-B. But rather than assign the
high priority job, 2.0, to the now-free machine-B at negotiation cycle
t+2, I'm seeing a lower priority job, 1.11, get assigned to the machine.

My question is this: once a job is moved to retirement on behalf of a
queued, higher priority job, is that waiting job bound to be assigned to
that particular machine? Can it not use the next available resource? I
get the feeling that the job is excluded from future negotiation cycles:
once I see a message saying job 1.0 is being preempted for job 2.0, I
don't see any more negotiator messages for job 2.0 in subsequent cycles.
Is there a point at which the 2.0 job will give up waiting for the 1.0
job to retire and be renegotiated?
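For context, the 2-day retirement window is configured on our startds
with something like the following (from memory; the exact expression we
use may differ slightly):

    # let a preempted job run up to 2 days (in seconds) before eviction
    MAXJOBRETIREMENTTIME = 2 * 24 * 60 * 60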

I am also seeing this very odd message in my NegotiatorLog printed at
the start of her portion of the negotiation cycle:

12/13 16:00:02     Over submitter resource limit (0) ... only consider startd ranks

This is printed for the user "bchan", who is unable to get her higher
priority job running before her lower priority jobs. What does this
message mean? I couldn't find an answer searching the archives,
unfortunately, although I did notice this question has been asked a few
times.

Another user and I tested that priority works, and for us it wasn't a
problem. But in the NegotiatorLog there were no "Over submitter"
messages for our sections of the negotiation cycle, so I suspect her
problems relate to this message.

Thanks!

- Ian Chesal




--
Ian R. Chesal <ichesal@xxxxxxxxxx>
Senior Software Engineer
Altera Corporation
Toronto Technology Center
Tel: (416) 926-8300


_______________________________________________
Condor-users mailing list
Condor-users@xxxxxxxxxxx
http://lists.cs.wisc.edu/mailman/listinfo/condor-users


