RE: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job
- Date: Thu, 28 Oct 2004 10:28:27 -0400
- From: "Ian Chesal" <ICHESAL@xxxxxxxxxx>
- Subject: RE: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job
All the jobs in the system belonged to me. It was just the one cluster
of jobs present at the time I saw this happening. We are doing pretty
much what you suggested to force re-negotiation of the claim after every
job. Indeed, the job was in the retiring state when I issued the
condor_rm command to remove it.
If the negotiator is simply connecting a startd with a schedd, then is
there something amiss with the schedd when condor_rm is invoked? I would
have expected 44.2 to run after 44.0 whether the negotiator or the
schedd was deciding which job to hand to the startd next.
Ian
> -----Original Message-----
> From: condor-users-bounces@xxxxxxxxxxx
> [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Dan Bradley
> Sent: October 27, 2004 6:16 PM
> To: Condor-Users Mail List
> Subject: Re: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job
>
>
> Ian,
>
> Could you specify which of the jobs in your various tests are
> being run by different users, if any? One potential point of
> confusion is that, by design, the Condor negotiator does not
> micromanage what the schedd does with a claim. Once the
> schedd gets a claim on behalf of a user, it will continue to
> run jobs on that claim until the claim is taken away or the
> user runs out of jobs. The negotiator doesn't tell the
> schedd which job to run next on the claim.
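The claim behaviour described above can be pictured with a toy model. This is only a sketch of the description in this message, not of Condor's actual code: the function name `run_on_claim`, the rank values, and the assumption that the schedd takes its idle jobs in simple queue order are all hypothetical.

```python
def run_on_claim(idle_jobs, renegotiate_each_job, rank):
    """Return the order in which jobs start on one machine (toy model).

    idle_jobs: job ids in the schedd's own queue order (an assumption).
    rank: dict of job id -> machine RANK value (higher is preferred).
    """
    remaining = list(idle_jobs)
    order = []
    while remaining:
        if renegotiate_each_job:
            # The claim is given up after each job, so the machine's
            # RANK gets a say in which job is matched next.
            nxt = max(remaining, key=lambda job: rank[job])
        else:
            # The schedd keeps the claim and takes its next idle job;
            # the machine's RANK is not consulted again.
            nxt = remaining[0]
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical rank values: 44.2 has waited longer than 44.1.
rank = {"44.1": 0, "44.2": 9}
print(run_on_claim(["44.1", "44.2"], False, rank))  # claim reused
print(run_on_claim(["44.1", "44.2"], True, rank))   # renegotiated each job
```

With the claim reused, the toy schedd runs 44.1 first regardless of rank; with renegotiation after every job, the higher-ranked 44.2 goes first.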
>
> You can force renegotiation of claims after every job if you want.
> Something like the following policy will do this:
>
> MaxJobRetirementTime = 1000000
> WANT_SUSPEND = FALSE
> PREEMPT = TRUE
>
> --Dan
>
> Ian Chesal wrote:
>
> >It looks like it was my use of condor_rm that messed up my
> >predictability. I continued the experiment but this time I made sure
> >the running 44.1 process finished normally instead of being prematurely
> >terminated by condor_rm.
> >
> >I had two queued jobs with their EnteredCurrentStatus times:
> >
> >44.2 1098912677
> >44.3 1098910808
> >
> >I expected 44.2 to rank lower than 44.3 by ~31. So 44.3 should be the
> >next job picked up.
> >
> >And this was the case. My rank expression worked this time. Excellent.
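The "~31" figure can be checked from the two EnteredCurrentStatus values with a few lines of Python. This is a sketch that mirrors the RANK expression with the /60 divisor quoted further down this thread; the helper name `queue_rank`, the value of `now`, and the use of integer division (assumed here to match ClassAd arithmetic on integer operands) are assumptions for illustration.

```python
def queue_rank(current_time, entered_current_status, job_status=1, per=60):
    """Mirror of: (JobStatus =?= 1) * ((CurrentTime - EnteredCurrentStatus) / per)."""
    is_idle = 1 if job_status == 1 else 0
    # Integer division is an assumption about how the expression evaluates.
    return is_idle * ((current_time - entered_current_status) // per)


# EnteredCurrentStatus values reported in the message above.
entered = {"44.2": 1098912677, "44.3": 1098910808}

gap_seconds = entered["44.2"] - entered["44.3"]
print(gap_seconds, gap_seconds / 60)  # seconds and minutes between the two

now = 1098913000  # hypothetical evaluation time, shortly after both were queued
rank_gap = queue_rank(now, entered["44.3"]) - queue_rank(now, entered["44.2"])
print(rank_gap)
```

The two timestamps are 1869 seconds apart, a little over 31 minutes, so at one rank point per minute 44.3 leads 44.2 by about 31.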
> >
> >So here's a question for the Condor team: if I were a "sneaky user" I
> >could write a job that, after processing was complete, sent me an email
> >and then went to sleep for a long, long time. Upon receiving that
> >email, if I used condor_rm to terminate the job, I'd be able to hang on
> >to the resource it was using and run another job on it, even if another
> >job, from another user, had a higher rank, because condor_rm seems to
> >prevent the machine from re-negotiating. This would give me infinite
> >access to a resource. Can this happen?
> >
> >
> >Ian
> >
> >>-----Original Message-----
> >>From: condor-users-bounces@xxxxxxxxxxx
> >>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> >>Sent: October 27, 2004 5:18 PM
> >>To: Condor-Users Mail List
> >>Subject: RE: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job
> >>
> >>Hmm. So I went with the RANK expression:
> >>
> >>RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> >>TARGET.EnteredCurrentStatus)/60))
> >>
> >>My plan was to make sure jobs that are queued rank higher the longer
> >>they've been in the queued state. In this case, +1 for every minute
> >>they've been sitting idle.
> >>
> >>To test this I submitted some jobs in the held state. Jobs are simple:
> >>go to the machine and sleep for an hour.
> >>
> >>I released three of the held jobs. My machine immediately picked up
> >>44.0 from the cluster and started running.
> >>
> >>I let the other two released jobs build up some queue time while 44.0
> >>slept on a machine. At one point I did see condor_status show my 44.0
> >>as being in the "Retiring" state instead of the "Busy" state -- that
> >>is good news. We have a long MaxJobRetirementTime so this is
> >>expected.
> >>
> >>I let about 8 minutes elapse and then issued the commands:
> >>
> >>condor_hold 44.1
> >>condor_release 44.1
> >>
> >>So this reset the EnteredCurrentStatus time on 44.1. I now have 44.0
> >>running, but retiring, and the remaining two jobs each have
> >>EnteredCurrentStatus as follows:
> >>
> >>44.1 1098910859
> >>44.2 1098910279
> >>
> >>By this output I expect 44.2 to have the higher rank. 44.0 is still
> >>running so I removed it with:
> >>
> >>condor_rm 44.0
> >>
> >>I expected the machine to pick up 44.2 as the next job because its
> >>rank is higher, having been queued for a longer time than 44.1.
> >>
> >>Not so. The machine picked up 44.1. I'm the only user in the system so
> >>it's not a matter of EUP. What's up? Why didn't 44.2 rank
> >>higher?
> >>Can anyone see how I messed up my prediction for the next job to run? I'm
> >>stumped. I thought I had it all figured out.
> >>
> >>Thanks!
> >>
> >>Ian
> >>
> >>
> >>
> >>>-----Original Message-----
> >>>From: condor-users-bounces@xxxxxxxxxxx
> >>>[mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Ian Chesal
> >>>Sent: October 27, 2004 11:34 AM
> >>>To: Condor-Users Mail List
> >>>Subject: [Condor-users] Adjusting machine RANK classad expr based on total queue time for a job
> >>>
> >>>I'm toying with adjusting the RANK expression to achieve a more
> >>>FIFO-like consideration when condor runs jobs. The idea is to rank
> >>>jobs on machines based on their time in the queue.
> >>>I wanted to bounce the rank expression and idea off the list.
> >>>The rank expression for machines I'm thinking of using is:
> >>>
> >>>RANK = ((TARGET.JobStatus =?= 1) * ((CurrentTime -
> >>>TARGET.EnteredCurrentStatus)/600))
> >>>
> >>>This would give a job queued 10 minutes longer than another job a
> >>>higher rank on the machine.
> >>>
> >>>The other option is:
> >>>
> >>>RANK = ((CurrentTime - TARGET.QDate)/600)
> >>>
> >>>But this would track cumulative queue time (for example, if the job
> >>>queued, ran for a bit, then got sent back to the queue), right? Or is
> >>>QDate reset every time a job returns to the queue, not just the first
> >>>time it's queued up by condor_submit?
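The difference between the two candidate expressions comes down to which timestamp is reset when a job changes state. Here is a minimal sketch, assuming QDate is stamped once at submission and never updated, while EnteredCurrentStatus is updated on every status change (the hold/release behaviour is confirmed elsewhere in this thread). The `ToyJob` class and both helper functions are hypothetical and only model those two attributes.

```python
IDLE, HELD = 1, 5  # JobStatus codes; 1 is the idle value used in the RANK expression


class ToyJob:
    def __init__(self, submit_time):
        self.qdate = submit_time                   # assumed: set once at submit
        self.entered_current_status = submit_time  # updated on every status change
        self.status = IDLE

    def set_status(self, status, when):
        self.status = status
        self.entered_current_status = when


def rank_by_status_time(job, now):
    # (TARGET.JobStatus =?= 1) * ((CurrentTime - TARGET.EnteredCurrentStatus) / 600)
    return (1 if job.status == IDLE else 0) * ((now - job.entered_current_status) // 600)


def rank_by_qdate(job, now):
    # (CurrentTime - TARGET.QDate) / 600
    return (now - job.qdate) // 600


job = ToyJob(submit_time=0)
job.set_status(HELD, when=1200)   # e.g. condor_hold
job.set_status(IDLE, when=1800)   # e.g. condor_release
print(rank_by_status_time(job, now=3600), rank_by_qdate(job, now=3600))
```

Under these assumptions the first expression only credits the 30 minutes since the job last became idle, while the QDate expression credits the full hour since submission.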
> >>>
> >>>Comments? Opinions? Much appreciated.
> >>>
> >>>Ian
> >>>
> >>>_______________________________________________
> >>>Condor-users mailing list
> >>>Condor-users@xxxxxxxxxxx
> >>>http://lists.cs.wisc.edu/mailman/listinfo/condor-users