
Re: [Condor-users] how to troubleshoot the scheduling process



Rob,

The parallel scheduler in Condor does not pay any attention to user priorities. By default, it schedules jobs in the order given by the job priority (i.e., the priority the user assigns to the job in the submit file). Unlike other job universes, the parallel universe applies this priority across jobs from all users, so users are expected to coordinate how they set these priority values. If priorities are equal, jobs run in first-in-first-out order. You can select an alternate best-fit algorithm as documented here:

http://www.cs.wisc.edu/condor/manual/v7.0/3_3Configuration.html#14115
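
To make the default ordering described above concrete, here is a rough Python sketch. It is only an illustration, not Condor's actual code: the Job class, its field names, and the sample owners are all hypothetical, and it assumes that a larger job priority value runs first. The point is that the owner (and therefore the owner's user priority) never enters the sort key.

```python
from dataclasses import dataclass


@dataclass
class Job:
    owner: str        # hypothetical field; deliberately unused when ordering
    job_prio: int     # job priority from the submit file (assumed: higher runs first)
    queue_order: int  # order in which the job entered the queue (lower = earlier)


def parallel_schedule_order(jobs):
    """Order jobs by job priority (descending), then first-in-first-out.

    The owner is ignored, mirroring the statement that the parallel
    scheduler pays no attention to user priorities.
    """
    return sorted(jobs, key=lambda j: (-j.job_prio, j.queue_order))


# Hypothetical queue: two users, all jobs at the same job priority except one.
jobs = [
    Job(owner="heavy_user", job_prio=0, queue_order=1),
    Job(owner="light_user", job_prio=0, queue_order=2),
    Job(owner="light_user", job_prio=5, queue_order=3),
]

order = parallel_schedule_order(jobs)
for job in order:
    print(job.owner, job.job_prio, job.queue_order)
```

Under these assumptions, with equal job priorities the earlier-queued job of heavy_user goes ahead of light_user's job regardless of either user's EUP; only raising the job priority moves a job forward.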

--Dan

Robert E. Parrott wrote:
As a followup to this, I've enabled verbose negotiator logging, and see the following behavior.

Users do seem to come up for negotiation in EUP order, as expected. However, the data about user jobs seem to be incorrect.

When a user with EUP 0.5 (the lowest) and a parallel job in the idle state comes up for negotiation, the message I'm seeing is

"Negotiating with [user]@seas.harvard.edu skipped because no idle jobs."

Thus there's something amiss here with the info from the schedd.

Any thoughts on this? The problem seems to not be present for serial or other jobs, just parallel universe jobs.

thanks,
rob


On Feb 7, 2008, at 4:23 PM, Robert E. Parrott wrote:

Hi Folks,

We have a situation where a user with very high EUP, and a large
number of jobs in the queue, is always scheduled ahead of users with
much lower (100 times or more) EUP, and thus much higher priority.  All
these jobs are parallel (MPI) jobs, which is likely relevant.

To begin, can anyone suggest a method to diagnose the problem here,
and explain how these evaluations take place? My understanding from the
manual is that user jobs are considered in order of priority (from
lowest EUP to highest).  But the opposite seems to be occurring.

As an example, this user, using 156/200 resources, has a 12-process
parallel job complete. His EUP is 156. Immediately a new 12-process
job of his is started, despite the fact that there's a user with EUP
0.5 and an 8-node job waiting in the queue.

Thanks for any initial insight or input on how to address this.

rob


==========================
Robert E. Parrott, Ph.D. (Phys. '06)
Project Manager, CrimsonGrid Initiative and
Program Manager, CyberInfrastructure Lab
Harvard University Sch. of Eng. and App. Sci.
Maxwell-Dworkin  211,
33 Oxford St.
Cambridge, MA 02138
(617)-495-5045




_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
