
Re: [Condor-users] Processing jobs in parallel-universe-queue



Hello,

 

I have some updates on the issue.

 

In “Condor Parallel Universe” from Greg Thain

(www.cs.wisc.edu/condor/CondorWeek2005/presentations/thain_parallel_universe.ppt)

I found a note that “DedicatedScheduler schedules First-Fit, sorted by UserJobPrio”.

 

I took a look at the detailed logs. Here is a short excerpt from SchedLog:

11/4 14:34:49 Found 2 idle dedicated job(s)

11/4 14:34:49 DedicatedScheduler: Listing all dedicated jobs -

11/4 14:34:49 Dedicated job: 440.0 user1 #need to run on node2

11/4 14:34:49 Dedicated job: 441.0 user2 #need to run on node1

11/4 14:34:49 Will use UDP to update collector HOSTNAME <IP:9618>

11/4 14:34:49 Trying to query collector <IP:9618>

11/4 14:34:49 Found 18 potential dedicated resources

11/4 14:34:49 idle resource list

11/4 14:34:49  ************ empty ************

11/4 14:34:49 limbo resource list

11/4 14:34:49  ************ empty ************

11/4 14:34:49 unclaimed resource list

11/4 14:34:49    LINUX  X86_64  vm2@node1

11/4 14:34:49    LINUX  X86_64  vm1@node1

11/4 14:34:49 busy resource list

11/4 14:34:49    LINUX  X86_64  vm2@node2

11/4 14:34:49    LINUX  X86_64  vm1@node3

11/4 14:34:49    LINUX  X86_64  vm3@node2

11/4 14:34:49    LINUX  X86_64  vm2@node3

11/4 14:34:49    LINUX  X86_64  vm4@node2

11/4 14:34:49    LINUX  X86_64  vm4@node3

11/4 14:34:49    LINUX  X86_64  vm2@node4

11/4 14:34:49    LINUX  X86_64  vm1@node4

11/4 14:34:49    LINUX  X86_64  vm3@node3

11/4 14:34:49    LINUX  X86_64  vm1@node5

11/4 14:34:49    LINUX  X86_64  vm3@node4

11/4 14:34:49    LINUX  X86_64  vm2@node5

11/4 14:34:49    LINUX  X86_64  vm4@node4

11/4 14:34:49    LINUX  X86_64  vm3@node5

11/4 14:34:49    LINUX  X86_64  vm4@node5

11/4 14:34:49    LINUX  X86_64  vm1@node2

11/4 14:34:49 Trying to find 2 resource(s) for dedicated job 440.0

11/4 14:34:49 Trying to satisfy job with all possible resources

11/4 14:34:49 Could satisfy job 440 in the future, done computing schedule

11/4 14:34:49 In DedicatedScheduler::publishRequestAd()

11/4 14:34:49 Trying to update collector <IP:9618>

11/4 14:34:49 Attempting to send update via UDP to collector HOSTNAME <IP:9618>

11/4 14:34:49 Entering DedicatedScheduler::checkSanity()

11/4 14:34:49 Finished DedicatedScheduler::handleDedicatedJobs

 

As you can see from the log file, Condor says

“11/4 14:34:49 Could satisfy job 440 in the future, done computing schedule”.

Combined with the information from the ppt, this means that Condor treats the state

“could satisfy in the future” the same as a “Fit” condition.

Because Condor’s DedicatedScheduler stops at the first job that “fits”,

it will not get to the next job(s), job 441 in my case, as long as job 440 is in the queue.
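
To make this reading concrete, here is a minimal Python sketch of the behavior as I understand it from the log. It is a toy model, not Condor's actual code: the function name first_fit_pass, the tuple layout of a job and the per-node slot counts are all my own assumptions. The slot counts are taken from the resource lists above; that job 441 needs 2 slots is also an assumption, since the log only states the count for job 440.

```python
# Toy model of the behavior described above; NOT Condor's actual code.
def first_fit_pass(jobs, unclaimed_slots, busy_slots):
    """Return the ids of the jobs started in one scheduling pass.

    jobs: list of (job_id, node, slots_needed), already in priority order.
    unclaimed_slots / busy_slots: dict mapping node name -> slot count.
    """
    free = dict(unclaimed_slots)  # work on a copy, keep the input intact
    started = []
    for job_id, node, needed in jobs:
        if free.get(node, 0) >= needed:
            # The job fits right now: claim the slots and start it.
            free[node] -= needed
            started.append(job_id)
        elif free.get(node, 0) + busy_slots.get(node, 0) >= needed:
            # "Could satisfy job in the future": the pass ends here,
            # so the jobs behind this one are never examined.
            break
        # Otherwise the job can never fit; skip it and try the next one.
    return started


# State taken from the SchedLog excerpt (slot count of job 441 is assumed).
jobs = [(440, "node2", 2), (441, "node1", 2)]
unclaimed = {"node1": 2}
busy = {"node2": 4, "node3": 4, "node4": 4, "node5": 4}

print(first_fit_pass(jobs, unclaimed, busy))  # []
```

In this model nothing is started, although node1 has two unclaimed slots, which matches what I see in the queue.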

 

The running job (439 in my case) may run for several days. During this time

job 441 will be waiting in the queue even though the resources it needs (node1) are free.

This is, of course, not a good situation, because it wastes available CPU time.

 

At least I now understand the behavior of Condor’s “DedicatedScheduler”.

 

I am using Condor 6.8.6. Can someone tell me whether the behavior of “DedicatedScheduler”

is different in Condor 6.9.x?

 

If the behavior is the same in Condor 6.8.6 and 6.9.x, is there some trick

to tweak “DedicatedScheduler”?

 

If, in the above example, job 440 needed to run on node6, which does not

exist, the state of the job would be “not fit” and the scheduler would move on to the next job,

as expected. So maybe all I need is to tell “DedicatedScheduler” to treat the state

“Could satisfy job in the future” as a “not fit” condition.

Do you have any idea whether, and how, this can be done?
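
As a sketch of what I am asking for, the three states and the changed policy could look like this in Python. Again this is only a toy model under my own assumptions, not Condor code and not an existing configuration option: classify and skipping_pass are hypothetical names, and every job is assumed to be pinned to a single node.

```python
# Toy model of the requested policy; NOT Condor's actual code.
def classify(node, needed, unclaimed, busy):
    """Return "fit", "future" or "not fit" for a job pinned to one node."""
    free = unclaimed.get(node, 0)
    if free >= needed:
        return "fit"       # enough unclaimed slots right now
    if free + busy.get(node, 0) >= needed:
        return "future"    # enough slots once running jobs finish
    return "not fit"       # the node can never provide enough slots


def skipping_pass(jobs, unclaimed, busy):
    """One pass that skips "future" jobs instead of stopping at them."""
    free = dict(unclaimed)  # work on a copy, keep the input intact
    started = []
    for job_id, node, needed in jobs:
        if classify(node, needed, free, busy) == "fit":
            free[node] -= needed
            started.append(job_id)
    return started


unclaimed = {"node1": 2}
busy = {"node2": 4, "node3": 4, "node4": 4, "node5": 4}

print(classify("node2", 2, unclaimed, busy))  # future
print(classify("node6", 2, unclaimed, busy))  # not fit
print(skipping_pass([(440, "node2", 2), (441, "node1", 2)], unclaimed, busy))  # [441]
```

With this policy job 441 would start on node1 while job 440 keeps waiting for node2. I realize that the price may be that a job needing many slots can wait longer, because smaller jobs keep getting in front of it.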

 

Thank you for any advice or tip.

 

Cheers,

Martin

 

PS: In the above ppt there is also a strange statement: “Condor_q –analyze mystery!”.

       Do you have any idea what it means?

 

 

From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Martin Galis
Sent: Saturday, 03 November, 2007 21:53
To: condor-users@xxxxxxxxxxx
Subject: [Condor-users] Processing jobs in parallel-universe-queue

 

Hello,

 

I installed Condor 6.8.6 a few weeks ago

(so I am still new to Condor).

 

We are running Condor on a small pool of 6 machines.

One of them is the central manager, submit, scheduler (it also

acts as the dedicated scheduler) and execute machine.

The rest of the pool are execute machines (configured as

dedicated resources). The execute machines are 4-core machines

(2 x dual-core CPUs).

 

We are experiencing 2 problems with parallel job submissions.

 

1)

I submit job1, which requires 4 CPUs on, say, node1.

After some time it is executed.

Then I submit job2, which again requires 4 CPUs on node1.

This one stays in the idle state, because no more CPUs are available on node1.

Finally I submit job3 to node2.

The strange thing is that this job stays idle until job2 is executed.

But because node2 is free, I do not see a reason why

it should stay idle and wait for job2.

It looks like the job queue for the parallel universe is processed

in strict FIFO order. Is this normal behavior for the parallel

universe or am I missing something?

 

Note: In the vanilla universe, job management works

as expected: job3 is executed right after submission.

 

2)

After a parallel universe job is submitted to the queue,

it stays idle for some time. Sometimes it is executed within tens

of seconds, sometimes within a few minutes. We usually use

condor_reschedule, which helps to get the job executed

(at least we think it helps). Vanilla universe jobs

are executed right after they are submitted (assuming

there are free CPUs to run the job).

Is this normal behavior for the parallel universe

or is it just due to the configuration of Condor?

If it is the configuration, how can I change it?

 

 

If you need any configuration files, log files or anything else,

just tell me and I will send them.

 

Thanks in advance for any help or suggestion.

 

Cheers,

Martin