
Re: [Condor-users] Heterogeneous 4-machine parallel job



On 07/09/2007 19:51, "Claudiu Udrea" <claudiu@xxxxxxx> wrote:
> I remember having the same problem on my pool, and the fix was to put all the
> files I wanted to transfer with Condor's transfer mechanism (i.e.
> transfer_input_files=...) in the same directory and reference it with the
> macro:
> initialdir = /home/condor/jobs/myjob (or whatever path you are using)
> 
> Then, just include the names of the files under transfer_input_files instead
> of the entire path.
> 

Thank you for your suggestion.  I did this, as you can see in [1], but it
didn't fix my problem.  (Although why it would is not clear -- how does the
value of initialdir affect anything if you don't use it?)  After the machines
match, the ShadowLog still complains that one of the execute machines does
not have
"/Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/run_condor_parallel_java.$$(OpSys).$$(Arch).bat".
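
For reference, here is a rough Python sketch of what I expect the $$(OpSys)
and $$(Arch) substitution to produce for each matched machine.  It only
mimics the documented behaviour (each $$(Attr) is replaced by that attribute
from the matched machine's ad); the helper name expand_match_macros and the
dictionaries are made up for illustration and are not part of Condor.

```python
import re

def expand_match_macros(template, machine_ad):
    """Replace each $$(Attr) with the value of Attr from a machine ad (a dict here)."""
    return re.sub(r"\$\$\((\w+)\)", lambda m: machine_ad[m.group(1)], template)

# Executable template taken from the submit file in [1].
template = "run_condor_parallel_java.$$(OpSys).$$(Arch).bat"

# The two platform combinations present in the pool (illustrative machine ads).
machines = [
    {"OpSys": "OSX", "Arch": "PPC"},
    {"OpSys": "WINNT51", "Arch": "INTEL"},
]

names = [expand_match_macros(template, ad) for ad in machines]
for name in names:
    print(name)
```

Both resulting names are the ones I list in transfer_input_files, so what the
ShadowLog reports as missing is the unexpanded name, not either of these.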

From the manual, it seems you need as many "arguments"-"queue" pairs of
settings in your submit file as there are matching OpSys+Arch combinations
(Manual section 2.5).  Uncommenting the last two lines of my submit file
causes the scheduler not to match:

    ,----SchedLog excerpt
    | 7/10 16:44:18 (pid:18967) unclaimed resource list
    | 7/10 16:44:18 (pid:18967)    OSX    PPC    vm1@xxxxxxxxxxxxxxxxxxxxxx
    | 7/10 16:44:18 (pid:18967)    OSX    PPC    vm2@xxxxxxxxxxxxxxxxxxxxxx
    | 7/10 16:44:18 (pid:18967)    WINNT51    INTEL    vm1@xxxxxxxxxxxxxxxxxxxxxx
    | 7/10 16:44:18 (pid:18967)    WINNT51    INTEL    vm2@xxxxxxxxxxxxxxxxxxxxxx
    | 7/10 16:44:18 (pid:18967) busy resource list
    | 7/10 16:44:18 (pid:18967)  ************ empty ************
    | 7/10 16:44:18 (pid:18967) Trying to find 4 resource(s) for dedicated job 87.0
    | 7/10 16:44:18 (pid:18967) Trying to find 4 resource(s) for dedicated job 87.1
    | 7/10 16:44:18 (pid:18967) Trying to satisfy job with all possible resources
    | 7/10 16:44:18 (pid:18967) Can't satisfy job 87 with all possible resources... trying next job
    `----

This suggests that the two queue directives actually cause Condor to look
for 8 machines (4+4) instead of 4.
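
To make the arithmetic explicit, here is a small Python sketch that tallies
the excerpt above.  The log text is copied from the SchedLog with the
timestamp and pid prefixes dropped; the host names are placeholders (the
real ones are masked), and the function name tally is only for illustration.

```python
import re

# Lines from the SchedLog excerpt, prefixes removed, host names are placeholders.
SCHEDLOG = """\
unclaimed resource list
   OSX    PPC    vm1@mac.example
   OSX    PPC    vm2@mac.example
   WINNT51    INTEL    vm1@pc.example
   WINNT51    INTEL    vm2@pc.example
busy resource list
 ************ empty ************
Trying to find 4 resource(s) for dedicated job 87.0
Trying to find 4 resource(s) for dedicated job 87.1
"""

def tally(log_text):
    """Return (unclaimed resources listed, total resources requested)."""
    unclaimed = 0
    requested = 0
    in_unclaimed = False
    for line in log_text.splitlines():
        text = line.strip()
        if text == "unclaimed resource list":
            in_unclaimed = True
        elif text == "busy resource list":
            in_unclaimed = False
        elif in_unclaimed and "@" in text:
            # One line per unclaimed resource, e.g. "OSX PPC vm1@host".
            unclaimed += 1
        else:
            m = re.match(r"Trying to find (\d+) resource\(s\) for dedicated job", text)
            if m:
                requested += int(m.group(1))
    return unclaimed, requested

unclaimed, requested = tally(SCHEDLOG)
print(f"unclaimed={unclaimed} requested={requested}")
```

With two queue statements there are two procs (87.0 and 87.1), each asking
for 4 resources, while only 4 are unclaimed in the pool.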

Does anyone have any more suggestions for heterogeneous parallel job
submission?

Thank you.

-Denis

[1]
# -*- mode: conf -*-
### Condor job description file for condor-test

### My variables
initialdir = /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/


### Job universe & ancillary settings.
universe = parallel
machine_count = 4
#getenv = true

# Where to direct output
output = condor-test.out$(Node)
error = condor-test.out$(Node)
log = log.$(Node)

### The files we need to use.
transfer_input_files = condor-test.jar,java-getopt-1.0.13.jar,run_condor_parallel_java.OSX.PPC.bat,run_condor_parallel_java.WINNT51.INTEL.bat
should_transfer_files = YES
TRANSFER_FILES = ALWAYS
when_to_transfer_output = ON_EXIT

### Notify me on finish of job?
Notification = always

### Where should I run?
Requirements = (OpSys == "OSX" && Arch == "PPC") \
                || (OpSys == "WINNT51" && Arch == "INTEL")
# Requirements = (OpSys == "WINNT51" && Arch == "INTEL")

### What to run
executable = run_condor_parallel_java.$$(OpSys).$$(Arch).bat
# executable = run_condor_parallel_java.OSX.PPC
# executable = run_condor_parallel_java.WINNT51.INTEL.bat


arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
queue

# arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
# queue