All,
I'm trying to submit a heterogeneous four-machine parallel job. I have a
four-machine Condor pool -- 2 OSX machines and 2 Windows boxen. (There are
really only 2 physical machines, each dual-proc/core, but I don't think that
detail matters.) I've included what I believe is the correct job
description [1], in which I use the $$(OpSys) and $$(Arch) macros to select
the executable to run on each platform. Those executables exist (as
run_condor_parallel_java.OSX.PPC and run_condor_parallel_java.INTEL.WINNT51)
and are simply scripts (bash and batch, respectively) that invoke java. The
OSX machine is the dedicated scheduler.
There are two symptoms, and I'm not sure which is the more significant:
1) In my log.#pArAlLeLnOdE# (I don't know why it's weirdly named), I see:
000 (062.000.000) 07/09 17:48:13 Job submitted from host:
    <134.253.202.158:52016>
...
007 (062.000.000) 07/09 17:53:45 Shadow exception!
    Error from starter on vm1@xxxxxxxxxxxxxxxxxxxxxx: STARTER failed to
    receive file(s) from <134.253.202.158:52124>; SHADOW at 134.253.202.158
    failed to send file(s) to <134.253.202.216:4055>: error reading from
    /Volumes/snl/work/orca/ORCAmsf/trunk/condor-test/run_condor_parallel_java.$$(OpSys).$$(Arch):
    (errno 2) No such file or directory
    0       - Run Bytes Sent By Job
    632607  - Run Bytes Received By Job
...
which suggests that $$(...) is not being expanded.
2) In my master node's MatchLog:
7/9 17:53:14 (fd:12) (pid:119) Matched 62.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016> preempting
none <134.253.202.216:4005> vm1@xxxxxxxxxxxxxxxxxxxxxx
7/9 17:53:14 (fd:13) (pid:119) Matched 62.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016> preempting
none <134.253.202.216:4005> vm2@xxxxxxxxxxxxxxxxxxxxxx
7/9 17:53:15 (fd:13) (pid:119) Rejected 62.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found
7/9 17:53:15 (fd:13) (pid:119) Rejected 62.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found
7/9 17:53:35 (fd:8) (pid:119) Rejected 62.0
DedicatedScheduler@xxxxxxxxxxxxxxxxxxxxxx <134.253.202.158:52016>: no match found
I'm not sure about the significance of (2), since the job does run; it just
throws the shadow exception.
The expected behavior, of course, is to create three slaves and one master
and have them run. My master simply waits for three incoming connections,
then exits; my slaves connect to the master via the hostname passed on the
command line. (Incidentally, I believe that detail is wrong -- the hostname
I need to pass on the command line is the master's or the scheduler's, and
my Condor macro-fu is not yet high enough to know how to do that.)
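For what it's worth, here's the sort of Unix-side wrapper I have in mind; a
sketch only. The environment variables _CONDOR_PROCNO and _CONDOR_NPROCS are
set by the parallel universe in each node's environment, but the hostfile
handoff, the file names, and the commented-out java invocations are all my
own invention:

```shell
#!/bin/bash
# Hypothetical wrapper (names are mine, not Condor's): the parallel universe
# sets _CONDOR_PROCNO and _CONDOR_NPROCS in each node's environment, so the
# script can decide master vs. slave itself. The hostfile handoff assumes a
# filesystem visible to every node -- adjust for your site.
PROCNO=${_CONDOR_PROCNO:-0}
NPROCS=${_CONDOR_NPROCS:-4}
HOSTFILE=${HOSTFILE:-/tmp/condor-test-master-host}

if [ "$PROCNO" -eq 0 ]; then
    # Node 0 plays master: record our hostname so the slaves can find us.
    hostname > "$HOSTFILE"
    echo "master on $(hostname), expecting $((NPROCS - 1)) slaves"
    # java -jar condor-test.jar --num-slaves=$((NPROCS - 1))   # placeholder
else
    # Other nodes play slave: read the master's hostname from the hostfile.
    MASTER_HOST=$(cat "$HOSTFILE")
    echo "slave $PROCNO connecting to $MASTER_HOST"
    # java -jar condor-test.jar --master="$MASTER_HOST"        # placeholder
fi
```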
Thank you.
Some perhaps relevant details:
condor_status:
Name           OpSys    Arch   State      Activity  LoadAv  Mem   ActvtyTime
vm1@xxxxxxxxx  OSX      PPC    Unclaimed  Idle      0.550   1024  0+00:02:07
vm2@xxxxxxxxx  OSX      PPC    Unclaimed  Idle      0.000   1024  0+00:02:08
vm1@xxxxxxxxx  WINNT51  INTEL  Unclaimed  Idle      0.010   1023  0+00:04:05
vm2@xxxxxxxxx  WINNT51  INTEL  Unclaimed  Idle      0.000   1023  0+00:04:06
-Denis
[1]
# -*- mode: conf -*-
### Condor job description file for condor-test
### My variables
### Job universe & ancillary settings.
universe = parallel
machine_count = 4
#getenv = true
# Where to direct output
output = condor-test.out$(Node)
error = condor-test.out$(Node)
log = log.$(Node)
### The files we need to use.
transfer_input_files = dist/condor-test.jar,lib/runtime/java-getopt-1.0.13.jar
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
### Notify someone on finish of job? This doesn't work, probably ID10T.
Notification = always
### Where should I run?
Requirements = ((Arch == "PPC" && OpSys == "OSX") \
|| (Arch == "INTEL" && OpSys == "WINNT51"))
### What to run
executable = run_condor_parallel_java.$$(OpSys).$$(Arch)
arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=3 $(Node)
queue
# arguments = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=1 $(Node)
# queue
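In case it helps in diagnosing, the next thing I plan to try is a minimal
homogeneous variant -- restricted to the OSX/PPC pair with the executable
named literally, so no $$() expansion is involved. A sketch only, untested:

```
# Hypothetical minimal variant for isolating the $$() problem: homogeneous,
# so the executable can be named literally.
universe                = parallel
machine_count           = 2
requirements            = (Arch == "PPC" && OpSys == "OSX")
executable              = run_condor_parallel_java.OSX.PPC
arguments               = condor-test.jar --master=s889069.srn.sandia.gov --num-slaves=1 $(Node)
transfer_input_files    = dist/condor-test.jar,lib/runtime/java-getopt-1.0.13.jar
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
```

If that runs cleanly, the shadow exception presumably points at $$() expansion
in the parallel universe rather than at file transfer itself.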
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users
The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/