
Re: [Condor-users] Jobs are executed only on the submitting machines



On Oct 2, 2006, at 3:10 AM, Dr. Raffaele Montella wrote:

I'm new to installing and configuring Condor, so I don't know if there is something
else I need to do to avoid this behaviour:
I installed Condor 6.8.1 on a cluster built from 9 P4@xxxxxx with 1 master host and 9 working nodes (basically a Beowulf system) running Fedora Core 4 Linux. All nodes share /home and /opt. I followed the install procedure, choosing a full installation on the master node and configuring it as the Condor
central manager. All daemons on the controller start up correctly.
In the CONTROL_CONFIG file I chose:
LOCAL_DIR               = /home/condor/hosts/$(HOSTNAME)
LOCAL_CONFIG_FILE       = $(RELEASE_DIR)/etc/$(HOSTNAME).local
REQUIRE_LOCAL_CONFIG_FILE = FALSE
HOSTALLOW_WRITE = *

Then I configured each working node, defining
CONDOR_HOME=/opt/condor-6.8.1 and
CONDOR_CONFIG=/opt/condor-6.8.1/etc/condor_config. Condor starts up correctly on each working node. The condor_status command shows all machines in the
pool:

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000  1012  0+20:51:56
vm2@xxxxxxxxx LINUX       INTEL  Owner      Idle       0.070  1012  0+00:10:05
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+03:05:04
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+20:50:50
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+03:05:04
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+20:50:53
vm1@xxxxxxxxx LINUX       INTEL  Owner      Idle       1.000   250  0+22:20:50
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+02:50:08
vm1@xxxxxxxxx LINUX       INTEL  Owner      Idle       1.000   250  0+22:20:42
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+02:55:05
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+03:05:06
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+20:50:54
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   504  0+03:05:04
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   504  0+20:50:55
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+03:05:05
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+20:50:53
vm1@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.010   250  0+03:05:05
vm2@xxxxxxxxx LINUX       INTEL  Unclaimed  Idle       0.000   250  0+20:50:54

                     Total Owner Claimed Unclaimed Matched Preempting Backfill

         INTEL/LINUX    18     3       0        15       0          0        0

               Total    18     3       0        15       0          0        0


It appears to be working correctly, but if I submit a job using the following script with the command condor_submit -a "log = out.log" -a "error = error.log" ex02.submit:

Executable     = /bin/hostname
Universe       = vanilla
Requirements   = OpSys == "LINUX" && Arch == "INTEL"
Error          = err.$(Process)
Output         = out.$(Process)
Log            = foo.log

Queue 50

The jobs are queued but executed only on the submitting machine.
I tried with more jobs, for example 500, with all machines unclaimed, but
nothing changed. If I submit from the master node, all jobs are executed on the
master node; if I submit from node01, all jobs are executed on node01,
and so on.

What is wrong?

This is probably a shared filesystem problem. On Unix, if you don't say otherwise, Condor assumes your job's data files are on a shared filesystem. If your machines don't have a shared filesystem, then Condor will only run the job on the submit machine.

You can tell Condor not to rely on a shared filesystem and to transfer the job's files itself by including the following in your submit file:
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
Then you can also use transfer_input_files to say what input files need to be transferred in addition to the executable.
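
As a rough sketch, assuming the ex02.submit file quoted above is otherwise unchanged, the combined submit file might look something like the following. The filename input.dat is purely hypothetical and is left commented out, since /bin/hostname needs no input files.

```
# ex02.submit with Condor's file transfer enabled (illustrative sketch)
Executable     = /bin/hostname
Universe       = vanilla
Requirements   = OpSys == "LINUX" && Arch == "INTEL"
Error          = err.$(Process)
Output         = out.$(Process)
Log            = foo.log

# Do not rely on a shared filesystem; let Condor move the job's files.
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

# Hypothetical example: list extra input files here if the job needs any.
# transfer_input_files  = input.dat

Queue 50
```

Once the jobs finish, the out.N files should show which machine ran each job.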

+--------------------------------+-----------------------------------+
|           Jaime Frey           | I used to be a heavy gambler.     |
|       jfrey@xxxxxxxxxxx        | But now I just make mental bets.  |
| http://www.cs.wisc.edu/~jfrey/ | That's how I lost my mind.        |
+--------------------------------+-----------------------------------+