
[Condor-users] Heterogeneous 4-machine parallel job


I'm trying to submit a heterogeneous four-machine parallel job.  I have a
four-machine condor pool -- 2 OSX machines and 2 Windows boxen.  (There are
really only 2 physical machines, each dual-proc/core, but I think that
detail doesn't matter). I have included what I think is the correct job
description [1], in which I use the $$(OpSys) and $$(Arch) macros to specify
the executables to run.  Those executables exist (as
run_condor_parallel_java.OSX.PPC and run_condor_parallel_java.INTEL.WINNT51)
and are simply scripts (bash and batch, respectively) that invoke java.  The
OSX machine is the dedicated scheduler.

There are two symptoms, and I'm not sure which is the more significant:

1) In my log.#pArAlLeLnOdE# (I don't know why it's weirdly named), I see:

      000 (062.000.000) 07/09 17:48:13 Job submitted from host:
      007 (062.000.000) 07/09 17:53:45 Shadow exception!
              Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxx: STARTER failed to
              receive file(s) from <>; SHADOW at failed to send file(s) to <>:
              error reading from
              (OpSys).$$(Arch): (errno 2) No such file or directory
              0  -  Run Bytes Sent By Job
              632607  -  Run Bytes Received By Job

which suggests that $$(...) is not being expanded.
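
To make the expectation concrete, here is a minimal Python sketch of the
substitution I assumed would happen. It is only an illustration, not Condor's
actual implementation: expand_dollar_dollar is a hypothetical helper, and the
attribute values are copied from the pool listing further down in this message.

```python
import re

# Hypothetical helper: mimics the $$(Attr) substitution I expect, where each
# $$(Name) is replaced by the matched machine's value for that attribute.
def expand_dollar_dollar(template, machine_ad):
    def lookup(match):
        name = match.group(1)
        # Leave unknown attributes untouched so a failed expansion stays visible.
        return str(machine_ad.get(name, match.group(0)))
    return re.sub(r"\$\$\((\w+)\)", lookup, template)

# The executable line from my job description.
EXECUTABLE = "run_condor_parallel_java.$$(OpSys).$$(Arch)"

# OpSys/Arch values as reported for my two kinds of machines.
OSX_AD = {"OpSys": "OSX", "Arch": "PPC"}
WINDOWS_AD = {"OpSys": "WINNT51", "Arch": "INTEL"}

if __name__ == "__main__":
    print(expand_dollar_dollar(EXECUTABLE, OSX_AD))
    print(expand_dollar_dollar(EXECUTABLE, WINDOWS_AD))
```

Under these assumptions the two results are run_condor_parallel_java.OSX.PPC
and run_condor_parallel_java.WINNT51.INTEL. The second has OpSys before Arch,
which is the reverse of the Windows file name as I typed it above, so the
name on disk is worth double-checking against the executable line.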

2) In my master node's MatchLog:

      7/9 17:53:14 (fd:12) (pid:119) Matched 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <> preempting none <> vm1@xxxxxxxxxxxxxxxxxxxxxx
      7/9 17:53:14 (fd:13) (pid:119) Matched 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <> preempting none <> vm2@xxxxxxxxxxxxxxxxxxxxxx
      7/9 17:53:15 (fd:13) (pid:119) Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <>: no match
      7/9 17:53:15 (fd:13) (pid:119) Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <>: no match
      7/9 17:53:35 (fd:8) (pid:119) Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <>: no match

I'm not sure about the significance of (2) since the job does run, but it
throws the shadow exception.

The expected behavior, of course, is to create three slaves and one master,
and have them run.  My master simply waits for three incoming connections,
then exits; my slaves connect to the master via the hostname passed on the
command line.  (Incidentally, I believe that detail is wrong -- the hostname
I need to pass on the command line is that of the master or the scheduler,
and my Condor macro-fu is not yet high enough to know how to do that.)
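
In case the rendezvous is unclear, the following is a rough Python sketch of
the behavior I described. It is not the code in condor-test.jar (which is
Java); run_master, run_worker and demo are hypothetical names, and the demo
runs everything on the loopback interface of a single machine.

```python
import socket
import threading

def run_master(server_sock, num_workers):
    """Accept num_workers incoming connections, then return the count."""
    accepted = 0
    while accepted < num_workers:
        conn, _addr = server_sock.accept()
        conn.close()
        accepted += 1
    return accepted

def run_worker(master_host, master_port):
    """Connect to the master at the host name given on the command line."""
    with socket.create_connection((master_host, master_port), timeout=10):
        pass  # Connecting is all the test job does.

def demo(num_workers=3):
    # Bind to an ephemeral local port so the sketch runs on one machine.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.settimeout(10)  # Fail instead of hanging if a worker never arrives.
    server.bind(("127.0.0.1", 0))
    server.listen(num_workers)
    port = server.getsockname()[1]

    workers = [
        threading.Thread(target=run_worker, args=("127.0.0.1", port))
        for _ in range(num_workers)
    ]
    for worker in workers:
        worker.start()
    accepted = run_master(server, num_workers)
    for worker in workers:
        worker.join()
    server.close()
    return accepted

if __name__ == "__main__":
    print("master saw", demo(3), "connections")
```

In the real job the host passed to each slave would have to be the machine
where the master actually lands, which is the part I do not know how to
express in the job description.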

Thank you.

Some perhaps relevant details:


Name          OpSys       Arch   State      Activity   LoadAv Mem

vm1@xxxxxxxxx OSX         PPC    Unclaimed  Idle       0.550  1024
vm2@xxxxxxxxx OSX         PPC    Unclaimed  Idle       0.000  1024
vm1@xxxxxxxxx WINNT51     INTEL  Unclaimed  Idle       0.010  1023
vm2@xxxxxxxxx WINNT51     INTEL  Unclaimed  Idle       0.000  1023


# -*- mode: conf -*-
### Condor job description file for condor-test

### My variables

### Job universe & ancillary settings.
universe = parallel
machine_count = 4
#getenv = true

# Where to direct output
output = condor-test.out$(Node)
error = condor-test.out$(Node)
log = log.$(Node)

### The files we need to use.
transfer_input_files =
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

### Notify someone on finish of job?  This doesn't work, probably ID10T.
Notification = always

### Where should I run?
Requirements = ((Arch == "PPC" && OpSys == "OSX") \
                || (Arch == "INTEL" && OpSys == "WINNT51"))

### What to run
executable = run_condor_parallel_java.$$(OpSys).$$(Arch)

arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3

# arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=1
# queue
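
As a sanity check on the Requirements expression, here is a small Python
sketch that evaluates the same condition against the four machines listed
above. The slot labels are made up because the host names are scrubbed, and
plain Python equality is only an approximation: this does not reproduce full
ClassAd evaluation semantics.

```python
# OpSys/Arch pairs copied from the pool listing; the labels are hypothetical.
POOL = [
    {"Name": "vm1@osx-host", "OpSys": "OSX", "Arch": "PPC"},
    {"Name": "vm2@osx-host", "OpSys": "OSX", "Arch": "PPC"},
    {"Name": "vm1@windows-host", "OpSys": "WINNT51", "Arch": "INTEL"},
    {"Name": "vm2@windows-host", "OpSys": "WINNT51", "Arch": "INTEL"},
]

MACHINE_COUNT = 4  # machine_count from the job description

def meets_requirements(ad):
    """Python rendering of the Requirements line in the job description."""
    return ((ad["Arch"] == "PPC" and ad["OpSys"] == "OSX")
            or (ad["Arch"] == "INTEL" and ad["OpSys"] == "WINNT51"))

def matching_machines(pool):
    """Return the names of the machines that satisfy the requirements."""
    return [ad["Name"] for ad in pool if meets_requirements(ad)]

if __name__ == "__main__":
    names = matching_machines(POOL)
    print(len(names), "of", len(POOL), "machines match; need", MACHINE_COUNT)
```

By this reading all four machines satisfy the expression, so I would not
expect the "no match" rejections in the MatchLog to come from the
Requirements line alone.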