
Re: [HTCondor-users] Parallel scheduling group problem



Hi John,

my apologies for the late reply to your message. Yes, only parallel universe jobs are affected. Vanilla jobs, which have basically the same requirements expression but don't use parallel scheduling groups, do not show this behavior. Regarding your suggestion to make the jobs exclusively Windows (Opsys == "Windows") or exclusively Linux (Opsys == "Linux"): our setup is a bit special, as we have a mixed Windows/Linux cluster. Often people don't really care whether their job runs on Windows or on Linux (the software used is cross-platform), which is why I use the (Opsys == "Windows" || Opsys == "Linux") expression. But sometimes people want to select a specific OS (e.g. to reproduce an issue one of our customers reported). I'll experiment a bit more in the next few days to find out what triggers the behavior where these parallel jobs suddenly are no longer matched to resources. If you can give me any hint, that would be helpful.

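For illustration, the OS selection described above could be assembled along the following lines. This is only a minimal sketch in Python: the helper name os_requirement and its argument are hypothetical and not part of HTCondor; only the resulting clause text mirrors the expressions quoted in this message.

```python
def os_requirement(os_choice=None):
    """Build the operating-system clause of a job's requirements expression.

    os_choice is None when either OS is acceptable, or "Windows" / "Linux"
    to pin the job to one OS. Hypothetical helper, not part of HTCondor.
    """
    supported = ("Windows", "Linux")
    if os_choice is None:
        # Cross-platform job: accept either OS.
        return "(" + " || ".join(f'Opsys == "{name}"' for name in supported) + ")"
    if os_choice not in supported:
        raise ValueError(f"unsupported OS: {os_choice!r}")
    # Pinned job, e.g. to reproduce a customer issue on one specific OS.
    return f'(Opsys == "{os_choice}")'


print(os_requirement())
print(os_requirement("Windows"))
```

The returned string would then be combined with the remaining clauses (Arch, installed versions, and so on) when the job's requirements are written.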
It's good to know that the "Suggestions" part of the analyze output will be gone in 8.6. It was always a bit confusing. ;-)

2017-07-25 20:01 GMT+02:00 John M Knoeller <johnkn@xxxxxxxxxxx>:

Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

I think you just want to add Opsys=="WINDOWS" to your job's requirements expression.

As for your question about -better-analyze: it is not saying that all 4 machines match.

This line

[0]           2  ParallelSchedulingGroup is my.Matched_PSG

indicates that only two machines match that clause, whereas these lines

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE

(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze. It does not correctly analyze complex sub-clauses, and almost never makes useful suggestions; the suggestions clause has been removed from HTCondor 8.6 and later for that reason.
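
To make the reading of the step table concrete, here is a small Python sketch (not HTCondor code) that counts how many of the four test machines from the original message satisfy each clause. It models only the ParallelSchedulingGroup and Opsys clauses; the machine list is copied from the condor_status output in the original message, the OS of each machine is taken from the 2x Windows / 2x Linux description there, and the function names are made up for this illustration.

```python
# Attributes of the four test machines, taken from the original message.
MACHINES = [
    {"Machine": "centos7-node01.cst.de", "Opsys": "Linux",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "centos7-node02.cst.de", "Opsys": "Linux",
     "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "win2012-master.cst.de", "Opsys": "Windows",
     "ParallelSchedulingGroup": "windows-cluster"},
    {"Machine": "win2012-node01.cst.de", "Opsys": "Windows",
     "ParallelSchedulingGroup": "windows-cluster"},
]

# Job attribute shown in the analyze output.
JOB = {"Matched_PSG": "windows-cluster"}


def psg_matches(machine):
    # Models: ParallelSchedulingGroup is my.Matched_PSG
    return machine["ParallelSchedulingGroup"] == JOB["Matched_PSG"]


def os_matches(machine):
    # Models: ( Opsys == "Linux" ) || ( Opsys == "Windows" )
    return machine["Opsys"] in ("Linux", "Windows")


def count(predicate):
    # Number of machines for which the predicate holds.
    return sum(1 for machine in MACHINES if predicate(machine))


psg_count = count(psg_matches)
os_count = count(os_matches)
both_count = count(lambda m: psg_matches(m) and os_matches(m))

print(f"{psg_count}  ParallelSchedulingGroup is my.Matched_PSG")
print(f"{os_count}  Opsys clause")
print(f"{both_count}  both clauses together")
```

Under these assumptions, two machines satisfy the scheduling-group clause and the same two satisfy both clauses together, which agrees with step [0] of the table rather than with the 0 shown in the Suggestions section.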


-tj


From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

# on both Linux machines
ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:


---------------------------------------------------------------------------------------------------------

The Requirements expression for your job is:

    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

---------------------------------------------------------------------------------------------------------

It's strange that, on the one hand, condor_q tells me that basically all four machines match my requirements expression, but on the other hand it tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster"

which is certainly not true, as I have also checked with condor_status:


condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS=1 on all machines, as a job is supposed to have exclusive access to all resources on a compute node.


_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@cs.wisc.edu with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/