
[Condor-users] Condor-C + parallel universe oddity



Hi all,

I'm attempting to submit parallel (MPI) jobs to run on a single SMP host with delegated scheduling via Condor-C, but I've run into an annoying problem that feels like a bug, though I stand to be corrected. The environment variable _CONDOR_NPROCS does not get set correctly on the execute node: it is always set to 1, regardless of what I request in the submit script, so only a single-threaded instance gets executed. For example, to submit a job that needs to run on 4 cores from machine_A, delegating to machine_B so that it runs on machine_C, I'd use the following submit script:

########
universe = grid
executable = <mpi wrapper>
transfer_input_files = <actual executable>, <other input files>
WhenToTransferOutput = ON_EXIT
output = myoutput
error = myerror
log = mylog
notification = never

grid_resource = condor machine_B.domain central.manager.domain

+remote_WantParallelSchedulingGroups = True
+remote_grid_resource = condor
+remote_jobuniverse = 11
+remote_arguments = "<actual executable> <other arguments>"
+remote_requirements = OpSys == "LINUX" && Arch == "X86_64"
+remote_MachineCount = 4
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"

queue
########
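For context, the wrapper in question consumes _CONDOR_NPROCS roughly as follows. This is only a sketch: the real wrapper, the mpirun invocation, and the executable name are placeholders, not the actual files used above. It shows why the job degrades to a single rank when the variable arrives as 1.

```shell
#!/bin/sh
# Minimal sketch of an MPI wrapper (hypothetical names throughout).
# Condor's parallel universe exports _CONDOR_NPROCS on the execute node;
# the symptom described above is that it arrives as 1 instead of the
# requested 4, so only one rank is ever launched.
NP=${_CONDOR_NPROCS:-1}
echo "launching with $NP processes"
# mpirun -np "$NP" ./my_mpi_executable "$@"   # actual launch, elided here
```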

I can see the value for MachineCount (or "machine_count", if I use that tag instead) being set correctly in the CGAHPLog file on the submit node, but on the execute host _CONDOR_NPROCS only ever gets set to 1. If I forgo Condor-C and submit the job directly, i.e. from machine_A to machine_C, with the submit script:

#######
universe = parallel
executable = <mpi wrapper>
transfer_input_files = <actual executable>, <other input files>
WhenToTransferOutput = ON_EXIT
output = myoutput
error = myerror
log = mylog
notification = never
machinecount = 4
arguments = "<actual executable> <other arguments>"
+WantParallelSchedulingGroups = True
requirements = OpSys == "LINUX" && Arch == "X86_64"

queue
#######

then the job runs correctly. I should say that I *can* get the delegated parallel/MPI job to use the correct number of processors (actually single-core slots) by adding the following hack to the original submit script:

+remote_MinHosts = 4
+remote_MaxHosts = 4

However, Condor only shows one of the slots (the MPI head node) as claimed, e.g. via condor_status, even though all the nodes in use are correctly listed in the job's classad under RemoteHosts, meaning the execute host could end up oversubscribed with jobs. Note that unless I set these values by hand, MinHosts and MaxHosts always have a value of 1, regardless of what I request via MachineCount.
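The mismatch is easy to spot by dumping the job's classad (e.g. with condor_q -long on the relevant schedd) and pulling out the attributes that drive slot allocation. The sample ad below is illustrative only, reproducing the symptom I see: MachineCount survives delegation but MinHosts/MaxHosts stay at 1.

```shell
# Sketch of the diagnostic: job.ad stands in for the output of
# `condor_q -long <cluster.proc>`. The sample values mirror the symptom
# described above (MachineCount = 4, but MinHosts/MaxHosts still 1).
cat > job.ad <<'EOF'
MachineCount = 4
MinHosts = 1
MaxHosts = 1
EOF
grep -E '^(MachineCount|MinHosts|MaxHosts)' job.ad
```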

In all the foregoing I've been using Condor 7.2.4, with machines A and B running Debian 5.0, and machine C running Ubuntu 9.04. In all cases I'm running the Debian 5.0 version of Condor.

Any hints as to what may be going wrong?

Aloo

