
[HTCondor-users] Parallel scheduling group problem



We have a mixed Windows/Linux pool managed by HTCondor, and I have configured parallel scheduling groups for all machines. In a test setup where I can reproduce the issue that I see in the production pool, I have four execution hosts (2x Windows, 2x Linux). Their parallel scheduling groups are configured as follows:

# on both Linux machines
ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:

---------------------------------------------------------------------------------------------------------
The Requirements expression for your job is:
    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )
Your job defines the following attributes:
    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75
The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true

Suggestions:
    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

---------------------------------------------------------------------------------------------------------
It's strange that, on the one hand, condor_q tells me that basically all four machines match my Requirements expression, but on the other hand it tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster"

which is definitely not true, as I have also checked with condor_status:

condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
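
As a sanity check, the condor_status listing above can be tallied per group with a few lines of Python. This is only a sketch: the names STATUS_OUTPUT and group_machines are made up for illustration, and the text is pasted in from the listing rather than read from a live pool.

```python
from collections import defaultdict

# Output of `condor_status -af Machine ParallelSchedulingGroup`,
# pasted from the listing above (not queried from a live pool).
STATUS_OUTPUT = """\
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
"""


def group_machines(status_text):
    """Map each ParallelSchedulingGroup value to the machines advertising it."""
    groups = defaultdict(list)
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        machine, psg = fields
        groups[psg].append(machine)
    return dict(groups)


groups = group_machines(STATUS_OUTPUT)
for psg, machines in sorted(groups.items()):
    print(f"{psg}: {len(machines)} machine(s): {', '.join(machines)}")
```

For the listing above this reports two machines in each group, so two machines do advertise "windows-cluster", whereas the analyzer's suggestion table reports 0 matches for that clause.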

Does anyone have an idea what may cause this strange behavior?

I don't know whether this is relevant, but I've set NUM_CPUS=1 on all machines, since a job is supposed to have exclusive access to all resources on a compute node.