Hello, I have some updates on the issue. In "Condor Parallel Universe" by Greg Thain (www.cs.wisc.edu/condor/CondorWeek2005/presentations/thain_parallel_universe.ppt) I found a note that the "DedicatedScheduler schedules First-Fit, sorted by UserJobPrio". I took a look at the detailed logs.
Here is a short part of the SchedLog:

11/4 14:34:49 Found 2 idle dedicated job(s)
11/4 14:34:49 DedicatedScheduler: Listing all dedicated jobs -
11/4 14:34:49 Dedicated job: 440.0 user1    # needs to run on node2
11/4 14:34:49 Dedicated job: 441.0 user2    # needs to run on node1
11/4 14:34:49 Will use UDP to update collector HOSTNAME <IP:9618>
11/4 14:34:49 Trying to query collector <IP:9618>
11/4 14:34:49 Found 18 potential dedicated resources
11/4 14:34:49 idle resource list
11/4 14:34:49 ************ empty ************
11/4 14:34:49 limbo resource list
11/4 14:34:49 ************ empty ************
11/4 14:34:49 unclaimed resource list
11/4 14:34:49 LINUX X86_64 vm2@node1
11/4 14:34:49 LINUX X86_64 vm1@node1
11/4 14:34:49 busy resource list
11/4 14:34:49 LINUX X86_64 vm2@node2
11/4 14:34:49 LINUX X86_64 vm1@node3
11/4 14:34:49 LINUX X86_64 vm3@node2
11/4 14:34:49 LINUX X86_64 vm2@node3
11/4 14:34:49 LINUX X86_64 vm4@node2
11/4 14:34:49 LINUX X86_64 vm4@node3
11/4 14:34:49 LINUX X86_64 vm2@node4
11/4 14:34:49 LINUX X86_64 vm1@node4
11/4 14:34:49 LINUX X86_64 vm3@node3
11/4 14:34:49 LINUX X86_64 vm1@node5
11/4 14:34:49 LINUX X86_64 vm3@node4
11/4 14:34:49 LINUX X86_64 vm2@node5
11/4 14:34:49 LINUX X86_64 vm4@node4
11/4 14:34:49 LINUX X86_64 vm3@node5
11/4 14:34:49 LINUX X86_64 vm4@node5
11/4 14:34:49 LINUX X86_64 vm1@node2
11/4 14:34:49 Trying to find 2 resource(s) for dedicated job 440.0
11/4 14:34:49 Trying to satisfy job with all possible resources
11/4 14:34:49 Could satisfy job 440 in the future, done computing schedule
11/4 14:34:49 In DedicatedScheduler::publishRequestAd()
11/4 14:34:49 Trying to update collector <IP:9618>
11/4 14:34:49 Attempting to send update via UDP to collector HOSTNAME <IP:9618>
11/4 14:34:49 Entering DedicatedScheduler::checkSanity()
11/4 14:34:49 Finished DedicatedScheduler::handleDedicatedJobs

As you can see from the log file,
Condor says: "11/4 14:34:49 Could satisfy job 440 in the future, done computing schedule". Combined with the information from the ppt, this means that for Condor the state "could satisfy in the future" is equivalent to the "Fit" condition. Because Condor's DedicatedScheduler schedules only the first "Fit" job, it will not get to the next job(s), job 441 in my case, as long as job 440 is in the queue. The running job (439 in my case) may run for several days. During this time job 441 will be waiting in the queue even though the resources for it (on node1) are free. This is, of course, not a good situation, because it wastes available CPU time. At least I understand the DedicatedScheduler's behavior now. I am using Condor 6.8.6.
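Since the dedicated jobs are sorted by UserJobPrio, I suppose a partial workaround (it only reorders the queue; it does not change the First-Fit behavior itself) might be to raise the priority of the job that can run right now, so that it is considered before the blocked one. A minimal sketch using the job IDs from my example; condor_prio and the submit-file priority command are standard Condor features, but I have not verified that this actually unblocks job 441:

    # Raise job 441's priority so that it sorts ahead of job 440
    # (jobs with higher priority values are considered first):
    condor_prio -p 10 441.0

The same thing can be set up front in the submit description file:

    # In the submit file, before the queue statement:
    priority = 10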
Can someone tell me whether the behavior of the DedicatedScheduler is different in Condor 6.9.x? If the behavior is the same in Condor 6.8.6 and 6.9.x, is there some trick to tweak the DedicatedScheduler? If, in the example above, job 440 needed to run on node6, which does not exist, the job's state would be "not fit" and the scheduler would move on to the next job as expected. So maybe all I need is to tell the DedicatedScheduler to treat the state "Could satisfy job in the future" as a "not fit" condition. Do you have any idea whether, and how, this can be done?

Thank you for any advice or tip.

Cheers,
Martin

PS: In the above ppt there is also a strange statement: "Condor_q -analyze mystery!". Do you have any idea what it means?
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Martin Galis

Hello, I installed Condor 6.8.6 a few weeks ago (so I am still new to Condor). We are running Condor on a small pool of 6 machines. One of them is the central manager, submit machine, and scheduler (it also acts as the dedicated scheduler), as well as an execute machine. The rest of the pool are execute machines (configured as dedicated resources). The execute machines are 4-core machines (2 x dual-core CPUs). We are experiencing two problems with parallel job submissions.
1) I submit job1, which requires 4 CPUs on, say, node1. After some time it is executed. Then I submit job2, which again requires 4 CPUs on node1. This one stays in the idle state, because no more CPUs are available on node1. Last, I submit job3 to node2. The strange thing is that this job stays idle until job2 is executed. But because node2 is free, I see no reason why it should stay idle and wait for job2. It looks like the job queue for the parallel universe is processed in a strictly FIFO manner. Is this normal behavior for the parallel universe, or am I missing something? Note: in the vanilla universe job management works as expected: job3 would be executed right after submission.
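For reference, my submit description files look roughly like this (the executable path is a placeholder, and "node1" stands for the machine's full hostname as reported by condor_status):

    universe      = parallel
    executable    = /path/to/my_program    # placeholder
    machine_count = 4                      # request 4 slots (CPUs)
    requirements  = (Machine == "node1")   # pin the job to node1
    queue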
2) After a parallel-universe job is submitted to the queue, it stays idle for some time: sometimes it is executed within tens of seconds, sometimes only after a few minutes. We usually run condor_reschedule, which helps to get the job executed (at least we think it helps). Vanilla-universe jobs are executed right after they are submitted (assuming there are free CPUs to run them). Is this normal behavior of the parallel universe, or is it just due to our Condor configuration? If it is configuration, how can I change it? (See also the note below my signature.)

If you need some configuration files, log files or whatever, just tell me and I will send them. Thanks in advance for any help or suggestion.

Cheers,
Martin
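PS: I suspect the delay is just the daemons' periodic scheduling cycles; running condor_reschedule starts a new negotiation cycle immediately instead of waiting for the next periodic one, which would explain why it seems to help. If that is the cause, I guess lowering the relevant intervals in the local configuration file could shorten the wait. SCHEDD_INTERVAL and NEGOTIATOR_INTERVAL are documented configuration macros, but the values below are only illustrative guesses, not tested recommendations:

    # How often the condor_schedd re-evaluates its job queue (seconds).
    SCHEDD_INTERVAL = 60
    # How often the negotiator begins a matchmaking cycle (seconds).
    NEGOTIATOR_INTERVAL = 60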