
Re: [HTCondor-users] Problem with HTCondor, Dynamic Slots, and the Parallel Universe



Sorry about that, everyone; the mail ended up in the wrong place and I thought one of my colleagues had written it :-(
My mistake.




On 24 August 2014 at 19:55, Steven <smengler3@xxxxxxxxx> wrote:

Hi all,

I’m having an issue with HTCondor when using the parallel universe together with dynamic slots, and I’m hoping someone here might be able to point me in the right direction. On a small two-node cluster (the two identical nodes have 24 virtual processor cores each), we are trying to run an MPI program, but the problem occurs even with programs that do not use MPI or any other library/protocol to communicate.

When attempting to run a parallel job, the general pattern is that the more machines I request (the greater machine_count), the less often the job actually runs (machine_count is always less than the total number of processor cores). When machine_count is 2, the job always runs. If it’s 4, it usually runs. If it’s 10, it sometimes runs. If it’s 35, it rarely runs. If it’s 40, it never runs. When it doesn’t run, the job just sits idle and never does anything, even after a day. Also note that no one else is using the machines.

The weird thing is that condor_q says that slots were matched, but at the same time reports that there are no matches. It seems that HTCondor is not always partitioning the dynamic slots the way it is supposed to, although sometimes it works perfectly. There are no errors; the job just stays idle.

I’m using the Linux sleep command in the following example to keep things as simple as possible; the problem occurs regardless of the program. If I submit the same job to the vanilla universe with “queue 40”, everything runs fine (a sketch of that vanilla file is just below), but that is not what I want, since I need the parallel universe.
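
For reference, the vanilla-universe version that works is essentially the same job (the log file name here is just illustrative):

universe = vanilla
executable = /bin/sleep
arguments = 20
request_memory = 500
request_disk = 500
log = output/test_vanilla.log
queue 40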

For example, if I’m trying to run the following job (just an example):

universe = parallel
executable = /bin/sleep
arguments = 20
machine_count = 30
request_memory = 500
request_disk = 500
log = output/test.log
queue
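
Since request_cpus is not set, it should default to 1, so 30 single-core slots ought to fit comfortably on 2 x 24 cores. As a sanity check on what the job actually requests, the job ad can be dumped (107.0 is the cluster/proc id that appears in the analysis further below):

condor_q -l 107.0 | grep -iE 'request(cpus|memory|disk)'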

This is what condor_status reports before the job is submitted:

Name                   OpSys    Arch      State        Activity    LoadAv   Mem      ActvtyTime

slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80533    0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80533    0+00:12:53

                Total    Owner    Claimed    Unclaimed    Matched    Preempting    Backfill

X86_64/LINUX    2        0        0          2            0          0             0

Total           2        0        0          2            0          0             0

This is what condor_status reports after the job is submitted:

Name                   OpSys    Arch      State        Activity    LoadAv   Mem      ActvtyTime

slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80021    0+00:13:00
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64    Claimed      Idle        0.000    512      0+00:13:00
slot1@xxxxxxxxxxxxx    LINUX    X86_64    Unclaimed    Idle        0.000    80021    0+00:12:53
slot1_1@xxxxxxxxxxxxx  LINUX    X86_64    Claimed      Idle        0.000    512      0+00:12:53

                Total    Owner    Claimed    Unclaimed    Matched    Preempting    Backfill

X86_64/LINUX    4        0        2          2            0          0             0

Total           4        0        2          2            0          0             0
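
To see how much of each partitionable slot remains unclaimed after the dynamic slots are carved off, something like the following may also help (PartitionableSlot, Cpus, and Memory are standard slot attributes; the exact output will of course differ):

condor_status -constraint 'PartitionableSlot =?= True' -autoformat Name Cpus Memory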


This is what condor_q -better-analyze reports:

-- Submitter: comp1.site.ca : <ip:port> : comp1.site.ca
User priority for englers@xxxxxxx is not available, attempting to analyze without it.
---
107.000:  Run analysis summary.  Of 4 machines,
      0 are rejected by your job's requirements
      0 reject your job because of their own requirements
      2 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
        No successful match recorded.
        Last failed match: Tue Aug 5 13:09:59 2014

        Reason for last match failure: no match found

The Requirements expression for your job is:

    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( ( TARGET.HasFileTransfer ) ||
    ( TARGET.FileSystemDomain == MY.FileSystemDomain ) )

Your job defines the following attributes:

    FileSystemDomain = "comp1.site.ca"
    RequestDisk = 500
    RequestMemory = 500

The Requirements expression for your job reduces to these conditions:

           Slots
Step       Matched    Condition
-----      --------   ---------
[0]        4          TARGET.Arch == "X86_64"
[1]        4          TARGET.OpSys == "LINUX"
[3]        4          TARGET.Disk >= RequestDisk
[5]        4          TARGET.Memory >= RequestMemory
[7]        4          TARGET.HasFileTransfer

Suggestions:

    Condition                                                                          Machines Matched    Suggestion
    ---------                                                                          ----------------    ----------
1   ( TARGET.Arch == "X86_64" )                                                        4
2   ( TARGET.OpSys == "LINUX" )                                                        4
3   ( TARGET.Disk >= 500 )                                                             4
4   ( TARGET.Memory >= 500 )                                                           4
5   ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "comp1.site.ca" ) )   4
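
For what it’s worth, the per-slot view of the same contradiction (two dynamic slots Claimed/Idle, nothing else matched) can be pulled with something like this (SlotType is a standard attribute; it reports Partitionable or Dynamic):

condor_status -autoformat Name SlotType State Activity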

The HTCondor configuration on each node contains:

START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE

NUM_SLOTS=1
NUM_SLOTS_TYPE_1=1
SLOT_TYPE_1=100%
SLOT_TYPE_1_PARTITIONABLE=true
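
The dedicated-scheduler attributes are not shown in this excerpt, but they must be in place somewhere, since small parallel jobs do match. They follow the usual pattern from the manual’s dedicated-resource example, roughly along these lines (the schedd hostname is site-specific):

DedicatedScheduler = "DedicatedScheduler@comp1.site.ca"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler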



I have also attached a file with the relevant sections of the log files that were updated. Some of the lines look like something went wrong, but I don’t understand what they mean.

I have also tried changing user priorities (the only two users are englers and DedicatedScheduler), but it made no difference.
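
For reference, the priority changes were along these lines (the user name is as it appears in condor_userprio output):

condor_userprio -allusers
condor_userprio -setprio englers@xxxxxxx 0.5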

Note: all instances of the hostnames and IP addresses/ports have been replaced with “comp1.site.ca”/“comp2.site.ca” and “ip:port”.

It would be a great help if someone could provide some insight into what is going on. I’m not really sure where to start.

Thanks for your time!
Steve
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/

__________________________________________
Guillaume Thibault