Re: [HTCondor-users] Parallel scheduling group problem


Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

I think you just want to add Opsys=="WINDOWS" to your job's requirements expression.
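To illustrate why the extra clause helps, here is a minimal Python sketch (not HTCondor code) that mimics how the OpSys part of the requirements selects machines. The machine names come from the condor_status output in the original message; the OpSys values "LINUX" and "WINDOWS" are assumptions about what those machines advertise, and the helper names are hypothetical. It relies on the fact that ClassAd `==` compares strings without regard to case.

```python
# Hypothetical slot ads. Machine names are from the condor_status output in
# this thread; the OpSys values are assumed, not taken from the thread.
SLOTS = [
    {"Machine": "centos7-node01.cst.de", "OpSys": "LINUX"},
    {"Machine": "centos7-node02.cst.de", "OpSys": "LINUX"},
    {"Machine": "win2012-master.cst.de", "OpSys": "WINDOWS"},
    {"Machine": "win2012-node01.cst.de", "OpSys": "WINDOWS"},
]


def classad_eq(a, b):
    """Mimic ClassAd `==` on strings, which ignores case."""
    return a.lower() == b.lower()


def matching_machines(slots, opsys_values):
    """Return machines whose OpSys equals any of the given values."""
    return [
        s["Machine"]
        for s in slots
        if any(classad_eq(s["OpSys"], v) for v in opsys_values)
    ]


# The job's current clause: ( Opsys == "Linux" ) || ( Opsys == "Windows" )
original = matching_machines(SLOTS, ["Linux", "Windows"])

# The suggested extra clause: Opsys == "WINDOWS"
windows_only = matching_machines(SLOTS, ["WINDOWS"])

print(len(original), "machines pass the original OpSys clause")
print(windows_only)
```

Under these assumptions the original clause accepts every machine in the pool, so it does nothing to keep a job on one platform, while the added clause leaves only the two Windows machines.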

As for your question about -better-analyze: it is not saying that all 4 machines match.

This line

[0] 2 ParallelSchedulingGroup is my.Matched_PSG

indicates that only two machines match that clause, whereas these lines

1 ( ParallelSchedulingGroup is "windows-cluster" )
0 MODIFY TO "windows-cluster"

2 ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
0 REMOVE

(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze. It does not correctly analyze complex sub-clauses and almost never makes useful suggestions; the suggestions clause has been removed from HTCondor 8.6 and later for that reason.
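As a rough cross-check of the per-clause counts, the following Python sketch (again, not HTCondor code) recomputes the "Slots Matched" column for steps [0] to [3] from the four machines in the original message. The ParallelSchedulingGroup values and Matched_PSG are taken from the thread; the OpSys values and all function names are assumptions made for illustration. It models ClassAd `is` as an exact, case-sensitive comparison and `==` as a case-insensitive one.

```python
# Slot attributes: ParallelSchedulingGroup is from the thread's condor_status
# output; the OpSys values are assumed for illustration.
SLOTS = [
    {"ParallelSchedulingGroup": "linux-cluster", "OpSys": "LINUX"},
    {"ParallelSchedulingGroup": "linux-cluster", "OpSys": "LINUX"},
    {"ParallelSchedulingGroup": "windows-cluster", "OpSys": "WINDOWS"},
    {"ParallelSchedulingGroup": "windows-cluster", "OpSys": "WINDOWS"},
]

# Job attribute as shown in the -better-analyze output.
JOB = {"Matched_PSG": "windows-cluster"}


def classad_is(a, b):
    """Mimic ClassAd `is` on strings: exact, case-sensitive match."""
    return a == b


def classad_eq(a, b):
    """Mimic ClassAd `==` on strings: case-insensitive match."""
    return a.lower() == b.lower()


# One predicate per analysis step, in the order -better-analyze printed them.
CLAUSES = {
    "[0] ParallelSchedulingGroup is my.Matched_PSG": lambda s: classad_is(
        s["ParallelSchedulingGroup"], JOB["Matched_PSG"]
    ),
    '[1] Opsys == "Linux"': lambda s: classad_eq(s["OpSys"], "Linux"),
    '[2] Opsys == "Windows"': lambda s: classad_eq(s["OpSys"], "Windows"),
    "[3] [1] || [2]": lambda s: classad_eq(s["OpSys"], "Linux")
    or classad_eq(s["OpSys"], "Windows"),
}

# Count how many slots satisfy each clause on its own.
counts = {label: sum(1 for s in SLOTS if pred(s)) for label, pred in CLAUSES.items()}

for label, n in counts.items():
    print(f"{n:>2}  {label}")
```

With these inputs the counts come out as 2, 2, 2 and 4, which agrees with the step table and not with the 0 reported in the Suggestions section.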

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I see in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

# on both Linux machines

ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:

---------------------------------------------------------------------------------------------------------

The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched Condition
----- -------- ---------
[0]           2 ParallelSchedulingGroup is my.Matched_PSG
[1]           2 Opsys == "Linux"
[2]           2 Opsys == "Windows"
[3]           4 [1] || [2]
[4]           4 Arch == "X86_64"
[6]           4 stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4 CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

---------------------------------------------------------------------------------------------------------

It's strange that, on the one hand, condor_q tells me that basically all four machines match my requirements expression, but on the other hand it tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster"

which is certainly not true, as I have also checked with condor_status:

condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS=1 on all machines, as a job is supposed to have exclusive access to all resources on a compute node.
