
Re: [HTCondor-users] Parallel scheduling group problem



Sorry, I don't really know why the parallel jobs would stop matching.

 

I can tell you that condor_q -analyze doesn't work for parallel, local, or scheduler universe jobs; in HTCondor 8.6 and later it will notice that the job you are trying to analyze is one of these and will print a message to that effect.
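A quick way to check which universe a job is in before trying to analyze it is to ask the queue directly (a sketch; the job id 123.0 is a placeholder, and JobUniverse is numeric: 5 = vanilla, 7 = scheduler, 11 = parallel, 12 = local):

```
# Print the universe number of job 123.0
condor_q 123.0 -af JobUniverse
```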

 

-tj

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Tuesday, August 1, 2017 3:04 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Parallel scheduling group problem

 

Hi John,

my apologies for the late reply to your message. Yes, the affected jobs are only parallel universe jobs. The vanilla jobs, which have essentially the same requirements expression but don't use parallel scheduling groups, are not affected by this behavior.

Regarding your suggestion to make the jobs exclusively Windows (Opsys == "Windows") or exclusively Linux (Opsys == "Linux"): our setup is a bit special, as we have a mixed Windows/Linux cluster. Often people don't really care whether their job runs on Windows or on Linux (the software used is cross-platform), which is why I use the (Opsys == "Windows" || Opsys == "Linux") expression. But sometimes people want to select a specific OS (e.g. to reproduce an issue one of our customers reported). I'll experiment a bit more over the next few days to find out what triggers the behavior that suddenly these parallel jobs are no longer matched to resources. If you can give me any hint, it would be helpful though.
 

It's good to know that the "analyze" output will be gone in 8.6. It was always a bit confusing. ;-)

 

2017-07-25 20:01 GMT+02:00 John M Knoeller <johnkn@xxxxxxxxxxx>:

Are the jobs parallel universe jobs?   The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

 

I think you just want to add Opsys == "WINDOWS" to your job's requirements expression.
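For example, in the submit description file (a sketch; the remaining clauses are taken from the requirements expression quoted later in this thread, and the CST_* attributes are specific to Felix's pool):

```
# Pin the job to Windows machines only
requirements = ( Opsys == "WINDOWS" ) && ( Arch == "X86_64" ) && \
               stringListMember("2017", TARGET.CST_INSTALLED_VERSIONS, ",") && \
               ( CST_CLUSTER_HAS_DC is true )
```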

 

As for your question about -better-analyze: it is not saying that all 4 machines match.

This line

[0]           2  ParallelSchedulingGroup is my.Matched_PSG

indicates that only two machines match that clause, whereas these lines

 

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"

2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST
_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE

 

(incorrectly) indicate that 0 machines match.  There is a known problem with the "Suggestions:" section of -better-analyze: it does not correctly analyze complex sub-clauses and almost never makes useful suggestions. The suggestions section has been removed from HTCondor 8.6 and later for that reason.

 

-tj

 

 

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

 

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups configured as follows:

 

# on both Linux machines

ParallelSchedulingGroup = "linux-cluster"

 

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

 

After a while, jobs submitted to the parallel universe won't be started anymore, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:


---------------------------------------------------------------------------------------------------------

 The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST
_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

 

---------------------------------------------------------------------------------------------------------

It's strange that on the one hand condor_q tells me that basically all four machines match my requirements expression, but on the other hand it tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster" 

 

which is certainly not true, as I have also checked with condor_status:


condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS = 1 for all machines, as a job is supposed to have exclusive access to all resources on a compute node.
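For completeness, the relevant configuration on each execute node looks roughly like this (a sketch; the group names are the ones quoted above, and STARTD_ATTRS is needed so the custom ParallelSchedulingGroup attribute is actually advertised in the machine ad):

```
# condor_config fragment on each execute node
NUM_CPUS = 1                                  # advertise a single slot per machine
ParallelSchedulingGroup = "linux-cluster"     # "windows-cluster" on the Windows nodes
STARTD_ATTRS = $(STARTD_ATTRS) ParallelSchedulingGroup
```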


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/