
[HTCondor-users] Problem with HTCondor, Dynamic Slots, and the Parallel Universe



Hi all,


I'm having an issue with HTCondor while using the parallel universe and dynamic slots, and I'm hoping someone here might be able to point me in the right direction. On a small cluster of two identical nodes (24 virtual processor cores each), we are trying to run an MPI program, but the problem described below occurs even with programs that do not use MPI or any other communication library/protocol.


When attempting to run a parallel job, it seems that in general the more processors I tell the job to use (a greater machine_count), the less often the job actually runs (machine_count is always less than the total number of processors). When machine_count is 2, the job always runs. If it's 4, it usually runs. If it's 10, it sometimes runs. If it's 35, it rarely runs. If it's 40, it never runs. When it doesn't run, the job just sits idle and never does anything, even after a day. Also note that no one else is using the machines.


The weird thing is that condor_q says that slots were matched, but also says that there are no matches. It seems like HTCondor is not always partitioning the dynamic slots the way it's supposed to, although sometimes it does work perfectly. There are no errors; the job just stays idle.


I'm using the Linux sleep command in the following example to keep things as simple as possible; the problem occurs regardless of the program. If I submit the job to the vanilla universe with "queue 40", everything runs fine (but that's not what I want, since I need the parallel universe).


For example, here is the job I'm trying to run:


universe = parallel
executable = /bin/sleep
arguments = 20
machine_count = 30
request_memory = 500
request_disk = 500
log = output/test.log
queue
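For comparison, the vanilla-universe version mentioned above is essentially the same file with the universe changed, machine_count dropped, and a count on the queue statement, roughly:

universe = vanilla
executable = /bin/sleep
arguments = 20
request_memory = 500
request_disk = 500
log = output/test.log
queue 40

That version always matches and runs all 40 processes without any trouble.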


This is what condor_status reports before the job is submitted:


Name                  OpSys    Arch    State      Activity  LoadAv  Mem    ActvtyTime

slot1@xxxxxxxxxxxxx   LINUX    X86_64  Unclaimed  Idle      0.000   80533  0+00:13:00
slot1@xxxxxxxxxxxxx   LINUX    X86_64  Unclaimed  Idle      0.000   80533  0+00:12:53

                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill

 X86_64/LINUX       2      0        0          2        0           0         0

        Total       2      0        0          2        0           0         0


This is what condor_status reports after the job is submitted:


Name                   OpSys    Arch    State      Activity  LoadAv  Mem    ActvtyTime

slot1@xxxxxxxxxxxxx    LINUX    X86_64  Unclaimed  Idle      0.000   80021  0+00:13:00
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64  Claimed    Idle      0.000   512    0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX    X86_64  Unclaimed  Idle      0.000   80021  0+00:12:53
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64  Claimed    Idle      0.000   512    0+00:12:53

                Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill

 X86_64/LINUX       4      0        2          2        0           0         0

        Total       4      0        2          2        0           0         0
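In case it's useful, I can also query the slots with an autoformat option to see how the partitionable slot is being carved up. I believe something along these lines works (the exact attribute list is just my guess at what is most relevant):

condor_status -af:h Name SlotType Cpus Memory Disk

That should show each slot's type (Partitionable vs. Dynamic) together with the CPUs, memory, and disk it currently advertises.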



This is what condor_q -better-analyze reports:


-- Submitter: comp1.site.ca : <ip:port> : comp1.site.ca

User priority for englers@xxxxxxx is not available, attempting to analyze without it.
---
107.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue Aug 5 13:09:59 2014
        Reason for last match failure: no match found

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
      ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    FileSystemDomain = "comp1.site.ca"
    RequestDisk = 500
    RequestMemory = 500

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           4  TARGET.Arch == "X86_64"
[1]           4  TARGET.OpSys == "LINUX"
[3]           4  TARGET.Disk >= RequestDisk
[5]           4  TARGET.Memory >= RequestMemory
[7]           4  TARGET.HasFileTransfer

Suggestions:

    Condition                          Machines Matched    Suggestion
    ---------                          ----------------    ----------
1   ( TARGET.Arch == "X86_64" )                       4
2   ( TARGET.OpSys == "LINUX" )                       4
3   ( TARGET.Disk >= 500 )                            4
4   ( TARGET.Memory >= 500 )                          4
5   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "comp1.site.ca" ) )
                                                      4


The HTCondor configuration on each node contains:


START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
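Since the parallel universe uses the dedicated scheduler, I should mention that part of the configuration too. I haven't pasted our exact lines here, but I believe they follow the standard setup from the manual, something like the following (the hostname is just a placeholder for our submit node):

DedicatedScheduler = "DedicatedScheduler@comp1.site.ca"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler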




I have also attached a file with the relevant sections of the log files that were updated. Some of the lines look like something went wrong, but I don't understand what they mean.


I have also tried changing user priorities (the only two users are englers and DedicatedScheduler), but that made no difference.
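For reference, the priority changes were made with condor_userprio; the exact value isn't important, but the command was along these lines:

condor_userprio -setprio englers@xxxxxxx 1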


Note: all instances of the hostnames and IP addresses/ports have been replaced with "comp1.site.ca"/"comp2.site.ca" and "ip:port".


It would be a great help if someone could provide some insight into what is going on. I'm not really sure where to start.


Thanks for your time!

Steve