
Re: [HTCondor-users] Problem with HTCondor, Dynamic Slots, and the Parallel Universe



Sorry about that, everyone; the mail ended up in the wrong place and I thought one of my colleagues had written it :-(
My mistake.




On 24 August 2014 at 19:55, Steven <smengler3@xxxxxxxxx> wrote:

Hi all,

I’m having an issue with HTCondor when using the parallel universe together with dynamic slots, and I’m hoping someone here might be able to point me in the right direction. On a small two-node cluster (the two identical nodes have 24 virtual processor cores each), we are trying to run an MPI program, but the problem occurs even with programs that do not use MPI or any other library/protocol to communicate.

When attempting to run a parallel job, the general pattern is that the more machines I request (the greater machine_count), the less often the job actually runs (machine_count is always less than the total number of processor cores). When machine_count is 2, the job always runs. If it’s 4, it usually runs. If it’s 10, it sometimes runs. If it’s 35, it rarely runs. If it’s 40, it never runs. When it doesn’t run, the job just sits idle and never does anything, even after a day. Also note that no one else is using the machines.

The weird thing is that condor_q says that slots were matched, but at the same time reports that there are no matches. It seems that HTCondor is not always partitioning the dynamic slots the way it is supposed to, although sometimes it works perfectly. There are no errors; the job just stays idle.

I’m using the Linux sleep command in the following example to keep things as simple as possible; the problem occurs regardless of the program. If I submit the same job to the vanilla universe with “queue 40”, everything runs fine (a sketch of that vanilla file is just below), but that is not what I want, since I need the parallel universe.
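
For reference, the vanilla-universe version that works is essentially the same job (the log file name here is just illustrative):

universe = vanilla
executable = /bin/sleep
arguments = 20
request_memory = 500
request_disk = 500
log = output/test_vanilla.log
queue 40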

For example, if I’m trying to run the following job (just an example):

universe = parallel
executable = /bin/sleep
arguments = 20
machine_count = 30
request_memory = 500
request_disk = 500
log = output/test.log
queue
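
Since request_cpus is not set, it should default to 1, so 30 single-core slots ought to fit comfortably on 2 x 24 cores. As a sanity check on what the job actually requests, the job ad can be dumped (107.0 is the cluster/proc id that appears in the analysis further below):

condor_q -l 107.0 | grep -iE 'request(cpus|memory|disk)'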

This is what condor_status reports before the job is submitted:

Name                   OpSys    Arch      State        Activity    LoadAv   Mem      ActvtyTime

slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80533    0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80533    0+00:12:53

                Total    Owner    Claimed    Unclaimed    Matched    Preempting    Backfill

X86_64/LINUX    2        0        0          2            0          0             0

Total           2        0        0          2            0          0             0

This is what condor_status reports after the job is submitted:

Name                   OpSys    Arch      State        Activity    LoadAv   Mem      ActvtyTime

slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80021    0+00:13:00
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64    Claimed      Idle        0.000    512      0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80021    0+00:12:53
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64    Claimed      Idle        0.000    512      0+00:12:53

                Total    Owner    Claimed    Unclaimed    Matched    Preempting    Backfill

X86_64/LINUX    4        0        2          2            0          0             0

Total           4        0        2          2            0          0             0
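
To see how much of each partitionable slot remains unclaimed after the dynamic slots are carved off, something like the following may also help (PartitionableSlot, Cpus, and Memory are standard slot attributes; the exact output will of course differ):

condor_status -constraint 'PartitionableSlot =?= True' -autoformat Name Cpus Memory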


This is what condor_q -better-analyze reports:

-- Submitter: comp1.site.ca : <ip:port> : comp1.site.ca
User priority for englers@xxxxxxx is not available, attempting to analyze without it.
---
107.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue Aug 5 13:09:59 2014

        Reason for last match failure: no match found

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
    ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    FileSystemDomain = "comp1.site.ca"
    RequestDisk = 500
    RequestMemory = 500

The Requirements expression for your job reduces to these conditions:

           Slots
Step       Matched    Condition
-----      --------   ---------
[0]        4          TARGET.Arch == "X86_64"
[1]        4          TARGET.OpSys == "LINUX"
[3]        4          TARGET.Disk >= RequestDisk
[5]        4          TARGET.Memory >= RequestMemory
[7]        4          TARGET.HasFileTransfer

Suggestions:

    Condition                                                                          Machines Matched    Suggestion
    ---------                                                                          ----------------    ----------
1   ( TARGET.Arch == "X86_64" )                                                        4
2   ( TARGET.OpSys == "LINUX" )                                                        4
3   ( TARGET.Disk >= 500 )                                                             4
4   ( TARGET.Memory >= 500 )                                                           4
5   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "comp1.site.ca" ) )   4
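
For what it’s worth, the per-slot view of the same contradiction (two dynamic slots Claimed/Idle, nothing else matched) can be pulled with something like this (SlotType is a standard attribute; it reports Partitionable or Dynamic):

condor_status -autoformat Name SlotType State Activity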

The HTCondor configuration on each node contains:

START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

NUM_SLOTS=1
NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1=100%
SLOT_TYPE_1_PARTITIONABLE=true
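
The dedicated-scheduler attributes are not shown in this excerpt, but they must be in place somewhere, since small parallel jobs do match. They follow the usual pattern from the manual’s dedicated-resource example, roughly along these lines (the schedd hostname is site-specific):

DedicatedScheduler = "DedicatedScheduler@comp1.site.ca"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler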



I have also attached a file with the relevant sections of the log files that were updated. Some of the lines look like something went wrong, but I don’t understand what they mean.

I have also tried changing user priorities (the only two users are englers and DedicatedScheduler), but it made no difference.
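
For reference, the priority changes were along these lines (the user name is as it appears in condor_userprio output):

condor_userprio -allusers
condor_userprio -setprio englers@xxxxxxx 0.5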

Note: all instances of the hostnames and IP addresses/ports have been replaced with “comp1.site.ca”/“comp2.site.ca” and “ip:port”.

It would be a great help if someone could provide some insight into what is going on. I’m not really sure where to start.

Thanks for your time!
Steve
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

__________________________________________
Guillaume Thibault