
Re: [Condor-users] Heterogeneous 4-machine parallel job



Hey Denis,

I remember having the same problem on my pool, and the fix was to put all the files I wanted to transfer with Condor's transfer mechanism (i.e. transfer_input_files = ...) into a single directory and to point at that directory with the initialdir macro:
initialdir = /home/condor/jobs/myjob (or whatever path you are using)

Then just list the bare file names in transfer_input_files instead of their full paths.
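
For your setup that might look roughly like this (just a sketch based on the paths in your posting; it assumes you copy both jars into that one directory):

    initialdir = /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test
    transfer_input_files = condor-test.jar,java-getopt-1.0.13.jar

As far as I understand it, Condor then resolves the bare names relative to initialdir rather than trying to interpret the longer relative paths from the submit directory.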

To be honest, I don't really know why that worked, but it may be worth a shot.

Good luck!
Claudiu Udrea
The George Washington University
Washington DC

On 7/9/07, Denis Bueno <denbuen@xxxxxxxxxx> wrote:
All,

I'm trying to submit a heterogeneous four-machine parallel job.  I have a
four-machine condor pool -- 2 OSX machines and 2 Windows boxen.  (There are
really only 2 physical machines, each dual-proc/core, but I think that
detail doesn't matter). I have included what I think is the correct job
description [1], in which I use the $$(OpSys) and $$(Arch) macros to specify
the executables to run.  Those executables exist (as
run_condor_parallel_java.OSX.PPC and run_condor_parallel_java.INTEL.WINNT51)
and are simply scripts (bash and batch, respectively) that invoke java.  The
OSX machine is the dedicated scheduler.

There are two symptoms, and I'm not sure which is the more significant:

1) In my log.#pArAlLeLnOdE# (I don't know why it's weirdly named), I see:

      000 (062.000.000) 07/09 17:48:13 Job submitted from host:
      <134.253.202.158:52016>
      ...
      007 (062.000.000) 07/09 17:53:45 Shadow exception!
              Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxx: STARTER failed to
              receive file(s) from <134.253.202.158:52124>; SHADOW at 134.253.202.158
              failed to send file(s) to <134.253.202.216:4055>: error reading from
              /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/run_condor_parallel_java.$$(OpSys).$$(Arch):
              (errno 2) No such file or directory
              0  -  Run Bytes Sent By Job
              632607  -  Run Bytes Received By Job
      ...

which suggests that $$(...) is not being expanded.
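
(If the $$() substitution were happening, the shadow should be looking for a name already expanded from the matched machine's ClassAd, e.g. something like

      /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/run_condor_parallel_java.OSX.PPC

for the PPC/OSX match, rather than a path still containing the literal "$$(OpSys).$$(Arch)" string.)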

2) In my master node's MatchLog:

      7/9 17:53:14 (fd:12) (pid:119)   Matched 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016> preempting none <134.253.202.216:4005> vm1@xxxxxxxxxxxxxxxxxxxxxx
      7/9 17:53:14 (fd:13) (pid:119)   Matched 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016> preempting none <134.253.202.216:4005> vm2@xxxxxxxxxxxxxxxxxxxxxx
      7/9 17:53:15 (fd:13) (pid:119)   Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found
      7/9 17:53:15 (fd:13) (pid:119)   Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found
      7/9 17:53:35 (fd:8) (pid:119)   Rejected 62.0 DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found


I'm not sure about the significance of (2) since the job does run, but it
throws the shadow exception.

The expected behavior, of course, is to create three slaves and one master,
and have them run.  My master simply waits for three incoming connections,
then exits; my slaves connect to the master via the hostname passed on the
command-line.  (Incidentally, I believe that detail is wrong -- the hostname
I need to pass on the command-line is the master or scheduler hostname, and
my Condor macro-fu is not yet high enough to know how to do that.)

Thank you.

Some perhaps relevant details:

Output of condor_status:

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@xxxxxxxxx OSX         PPC    Unclaimed  Idle       0.550  1024  0+00:02:07
vm2@xxxxxxxxx OSX         PPC    Unclaimed  Idle       0.000  1024  0+00:02:08
vm1@xxxxxxxxx WINNT51     INTEL  Unclaimed  Idle       0.010  1023  0+00:04:05
vm2@xxxxxxxxx WINNT51     INTEL  Unclaimed  Idle       0.000  1023  0+00:04:06

-Denis

[1]
# -*- mode: conf -*-
### Condor job description file for condor-test

### My variables


### Job universe & ancillary settings.
universe = parallel
machine_count = 4
#getenv = true

# Where to direct output
output = condor-test.out$(Node)
error = condor-test.out$(Node)
log = log.$(Node)

### The files we need to use.
transfer_input_files = dist/condor-test.jar,lib/runtime/java-getopt-1.0.13.jar
should_transfer_files = YES
when_to_transfer_output = ON_EXIT

### Notify someone on finish of job?  This doesn't work, probably ID10T.
Notification = always


### Where should I run?
Requirements = ((Arch == "PPC" && OpSys == "OSX") \
                || (Arch == "INTEL" && OpSys == "WINNT51"))

### What to run
executable = run_condor_parallel_java.$$(OpSys).$$(Arch)


arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
queue

# arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=1 $(Node)
# queue



_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/