
Re: [HTCondor-users] Parallel scheduling group problem



Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).
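
To illustrate the idea, here is a minimal Python sketch (not HTCondor code) of what "all nodes in the same scheduling group" amounts to. The machine list is copied from the condor_status output in the quoted message; the helper name groups_with_capacity is hypothetical, and the real scheduler does considerably more than this.

```python
from collections import defaultdict

# Machine -> ParallelSchedulingGroup, copied from the condor_status
# output in the quoted message.
MACHINES = {
    "centos7-node01.cst.de": "linux-cluster",
    "centos7-node02.cst.de": "linux-cluster",
    "win2012-master.cst.de": "windows-cluster",
    "win2012-node01.cst.de": "windows-cluster",
}


def groups_with_capacity(machines, node_count):
    """Return the groups holding at least node_count machines.

    Hypothetical helper: a parallel job needing node_count nodes can only
    be placed inside one of these groups, never spread across two of them.
    """
    by_group = defaultdict(list)
    for name, group in machines.items():
        by_group[group].append(name)
    return {
        group: sorted(names)
        for group, names in by_group.items()
        if len(names) >= node_count
    }


print(groups_with_capacity(MACHINES, 2))
```

With this pool, a two-node job fits in either group, while a three-node job fits in neither, even though four machines exist in total.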

 

I think you just want to add Opsys=="WINDOWS" to your job's requirements expression.

 

As for your question about -better-analyze: it is not saying that all 4 machines match.

This line

[0]           2  ParallelSchedulingGroup is my.Matched_PSG

indicates that only two machines match that clause, whereas these lines

 

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"

2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE

 

(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze: it does not correctly analyze complex sub-clauses and almost never makes useful suggestions. The suggestions clause has been removed from HTCondor 8.6 and later for that reason.
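
As a cross-check on how to read the per-step counts, here is a rough Python sketch that recounts the first few steps by hand. It is not how condor_q works internally. The machine names and ParallelSchedulingGroup values come from the condor_status output in the quoted message; the OpSys values are an assumption inferred from the host names, and the remaining clauses (Arch, CST_INSTALLED_VERSIONS, CST_CLUSTER_HAS_DC) are left out because the analysis reports that all four slots satisfy them.

```python
# Slot attributes. PSG values are from the quoted condor_status output;
# OpSys values are ASSUMED from the host names (centos7-* / win2012-*).
SLOTS = [
    {"Machine": "centos7-node01.cst.de", "OpSys": "LINUX", "PSG": "linux-cluster"},
    {"Machine": "centos7-node02.cst.de", "OpSys": "LINUX", "PSG": "linux-cluster"},
    {"Machine": "win2012-master.cst.de", "OpSys": "WINDOWS", "PSG": "windows-cluster"},
    {"Machine": "win2012-node01.cst.de", "OpSys": "WINDOWS", "PSG": "windows-cluster"},
]
JOB = {"Matched_PSG": "windows-cluster"}  # from the job ad in the quoted output


def eq_ci(a, b):
    """Mimic ClassAd '==' on strings, which ignores case."""
    return a.lower() == b.lower()


def psg_matches(slot):
    # ClassAd 'is' compares strings exactly (case-sensitive).
    return slot["PSG"] == JOB["Matched_PSG"]


def is_linux(slot):
    return eq_ci(slot["OpSys"], "Linux")


def is_windows(slot):
    return eq_ci(slot["OpSys"], "Windows")


def count(pred):
    """Number of slots for which the predicate holds."""
    return sum(1 for slot in SLOTS if pred(slot))


step0 = count(psg_matches)                              # step [0]
step1 = count(is_linux)                                 # step [1]
step2 = count(is_windows)                               # step [2]
step3 = count(lambda s: is_linux(s) or is_windows(s))   # step [3]
whole = count(lambda s: psg_matches(s) and (is_linux(s) or is_windows(s)))

print(step0, step1, step2, step3, whole)
```

Each step is counted independently against all four slots, so a 4 on a later step does not mean four machines match the whole expression. Under these assumptions, only the two slots that pass step [0] can match overall.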

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

 

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

 

# on both Linux machines

ParallelSchedulingGroup = "linux-cluster"

 

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

 

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:


---------------------------------------------------------------------------------------------------------

 The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                      Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

 

---------------------------------------------------------------------------------------------------------

It's strange that, on the one hand, condor_q tells me that basically all four machines match my requirements expression, but on the other hand tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster" 

 

which is certainly not true, as I have also checked with condor_status:


condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS=1 on all machines, as a job is supposed to have exclusive access to all resources on a compute node.