Re: [Condor-users] Heterogeneous 4-machine parallel job
- Date: Tue, 10 Jul 2007 16:57:52 -0600
- From: "Denis Bueno" <denbuen@xxxxxxxxxx>
- Subject: Re: [Condor-users] Heterogeneous 4-machine parallel job
On 07/09/2007 19:51, "Claudiu Udrea" <claudiu@xxxxxxx> wrote:
> I remember having the same problem on my pool and the fix was to put all the
> files I wanted to transfer with Condor's transfer mechanism (aka
> transfer_input_files=..) in the same directory and reference it with the
> macro:
> initialdir = /home/condor/jobs/myjob (or whatever path you are using)
>
> Then, just include the names of the files under transfer_input_files instead
> of the entire path.
>
Thank you for your suggestion. I did this as you can see [1], but it didn't
fix my problem. (Although why it would is not clear -- how does the value
of initialdir affect anything if you don't use it?) After the machines
match, the ShadowLog still complains of not having
"/Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/run_condor_parallel_java.$$(OpSys).$$(Arch).bat"
on one of the execute machines.
From the manual, it seems you need as many "arguments"-"queue" pairs of
settings in your submit file as there are matching OpSys+Arch combinations
(Manual section 2.5). Uncommenting the last two lines of my submit file
causes the scheduler not to match:
,----SchedLog excerpt
| 7/10 16:44:18 (pid:18967) unclaimed resource list
| 7/10 16:44:18 (pid:18967) OSX PPC vm1@xxxxxxxxxxxxxxxxxxxxxx
| 7/10 16:44:18 (pid:18967) OSX PPC vm2@xxxxxxxxxxxxxxxxxxxxxx
| 7/10 16:44:18 (pid:18967) WINNT51 INTEL vm1@xxxxxxxxxxxxxxxxxxxxxx
| 7/10 16:44:18 (pid:18967) WINNT51 INTEL vm2@xxxxxxxxxxxxxxxxxxxxxx
| 7/10 16:44:18 (pid:18967) busy resource list
| 7/10 16:44:18 (pid:18967) ************ empty ************
| 7/10 16:44:18 (pid:18967) Trying to find 4 resource(s) for dedicated job 87.0
| 7/10 16:44:18 (pid:18967) Trying to find 4 resource(s) for dedicated job 87.1
| 7/10 16:44:18 (pid:18967) Trying to satisfy job with all possible resources
| 7/10 16:44:18 (pid:18967) Can't satisfy job 87 with all possible resources... trying next job
`----
This suggests that the two Queue directives actually cause Condor to look
for eight machines (4+4) instead of four: the log shows one request for 4
resources for job 87.0 and another for job 87.1.
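The arithmetic behind that reading can be sketched as follows. This is an
illustration only, under the assumption that each queue statement creates
its own proc (87.0, 87.1) and that machine_count is counted once per proc.
The value machine_count = 2 is a hypothetical, untested variation, not
something taken from the manual; it is shown only because 2+2 would add up
to the four machines actually available.
,----Submit file sketch (hypothetical, untested)
| # Assumption: machine_count applies to each queue statement separately.
| universe = parallel
| executable = run_condor_parallel_java.$$(OpSys).$$(Arch).bat
|
| # Proc 0: as posted this asks for 4 machines; 2 is the hypothetical change.
| machine_count = 2
| arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
| queue
|
| # Proc 1: a second request for 2 machines, so 2+2 = 4 in total.
| machine_count = 2
| arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
| queue
`----
With machine_count = 4 left in place above both queue statements, the same
reasoning gives 4+4 = 8, which matches the two "Trying to find 4
resource(s)" lines in the SchedLog excerpt.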
Does anyone have any more suggestions for heterogeneous parallel job
submission?
Thank you.
-Denis
[1]
# -*- mode: conf -*-
### Condor job description file for condor-test
### My variables
initialdir = /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/
### Job universe & ancillary settings.
universe = parallel
machine_count = 4
#getenv = true
# Where to direct output
output = condor-test.out$(Node)
error = condor-test.out$(Node)
log = log.$(Node)
### The files we need to use.
transfer_input_files = condor-test.jar,java-getopt-1.0.13.jar,run_condor_parallel_java.OSX.PPC.bat,run_condor_parallel_java.WINNT51.INTEL.bat
should_transfer_files = YES
TRANSFER_FILES = ALWAYS
when_to_transfer_output = ON_EXIT
### Notify me on finish of job?
Notification = always
### Where should I run?
Requirements = (OpSys == "OSX" && Arch == "PPC") \
|| (OpSys == "WINNT51" && Arch == "INTEL")
# Requirements = (OpSys == "WINNT51" && Arch == "INTEL")
### What to run
executable = run_condor_parallel_java.$$(OpSys).$$(Arch).bat
# executable = run_condor_parallel_java.OSX.PPC
# executable = run_condor_parallel_java.WINNT51.INTEL.bat
arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
queue
# arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
# queue