[HTCondor-users] Re: HTCondor-users Digest, Vol 45, Issue 5

Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab

Hi, I want to know why my parallel universe job is not running.

universe = parallel
executable = /bin/sleep
arguments = 30
machine_count = 3
queue

And my slot info is:

Name                               OpSys      Arch   State     Activity LoadAv Mem    ActvtyTime

ip-172-31-74-224.ec2.internal      LINUX      X86_64 Unclaimed Idle      0.000 15290  0+00:14:39
slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  7645  0+00:14:45
slot2@xxxxxxxxxxxxxxxxxxxxxxxxxxxx LINUX      X86_64 Unclaimed Idle      0.000  7645  0+00:15:03

Why is this job not running?

------------------------------------------------------------------
From: htcondor-users-request <htcondor-users-request@xxxxxxxxxxx>
Sent: Thursday, August 3, 2017 02:02
To: htcondor-users <htcondor-users@xxxxxxxxxxx>
Subject: HTCondor-users Digest, Vol 45, Issue 5

Send HTCondor-users mailing list submissions to
htcondor-users@xxxxxxxxxxx

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
or, via email, send a message with subject or body 'help' to
htcondor-users-request@xxxxxxxxxxx

You can reach the person managing the list at
htcondor-users-owner@xxxxxxxxxxx

When replying, please edit your Subject line so it is more specific
than "Re: Contents of HTCondor-users digest..."

Today's Topics:

   1. Re: Parallel scheduling group problem (John M Knoeller)

----------------------------------------------------------------------

Message: 1
Date: Wed, 02 Aug 2017 18:00:44 +0000
From: John M Knoeller <johnkn@xxxxxxxxxxx>
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Parallel scheduling group problem
Message-ID:
<CY1PR0601MB2021E070F4B47D21CADA6EAD96B00@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>

Content-Type: text/plain; charset="utf-8"

Sorry, I don't really know why the parallel jobs would stop matching.

I can tell you that condor_q -analyze doesn't work for parallel, local or scheduler universe jobs, and in HTCondor 8.6 and later it will notice that the job you are trying to analyze is one of these and will print a message to that effect.

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Tuesday, August 1, 2017 3:04 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] Parallel scheduling group problem

Hi John,
my apologies for the late reply to your message. Yes, the affected jobs are only parallel universe jobs. The vanilla jobs, which have basically the same requirements expression but don't use parallel scheduling groups, are not affected by this behavior. Regarding your suggestion to make the jobs exclusively Windows (Opsys == "Windows") or exclusively Linux (Opsys == "Linux"): our setup is a bit special, as we have a mixed Windows/Linux cluster. Often people don't really care whether their job runs on Windows or on Linux (the software used is cross-platform), and this is why I use the (Opsys == "Windows" || Opsys == "Linux") expression. But sometimes people want to select a specific OS (e.g. to reproduce an issue one of our customers reported). I'll experiment a bit more in the next few days to find out what triggers the behavior that these parallel jobs are suddenly no longer matched to resources. If you can give me any hint, it would be helpful though.

It's good to know that the "analyze" output will be gone in 8.6. It was always a bit confusing. ;-)

2017-07-25 20:01 GMT+02:00 John M Knoeller <johnkn@xxxxxxxxxxx>:
Are the jobs parallel universe jobs? The purpose of ParallelSchedulingGroup is to ensure that all of the nodes of a parallel universe job run in the same "scheduling group" (usually used to indicate that the machines have fast network access to each other).

I think you just want to add Opsys == "WINDOWS" to your job's requirements expression.

As for your question about -better-analyze: it is not saying that all 4 machines match.
This line
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
indicates that only two machines match that clause, whereas these lines

1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE

(incorrectly) indicate that 0 machines match. There is a known problem with the "Suggestions:" clause of -better-analyze. It does not correctly analyze complex sub-clauses, and almost never makes useful suggestions; the suggestions clause has been removed from HTCondor 8.6 and later for that reason.
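
To illustrate how the step table should be read, here is a minimal sketch in plain Python (not HTCondor code). It counts matches per clause and then for the conjunction, using the four machines from the condor_status output quoted further down. The OpSys values are assumptions inferred from the host names, and the helper names are made up for this example. The clauses that matched all 4 slots in the report (Arch, stringListMember, CST_CLUSTER_HAS_DC) are left out. The sketch models ClassAd `==` on strings as case-insensitive and `is` as an exact comparison.

```python
# Machine ads reduced to the two attributes of interest.
# OpSys values are assumed from the host names, not taken from the report.
machines = [
    {"Machine": "centos7-node01.cst.de", "OpSys": "LINUX", "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "centos7-node02.cst.de", "OpSys": "LINUX", "ParallelSchedulingGroup": "linux-cluster"},
    {"Machine": "win2012-master.cst.de", "OpSys": "WINDOWS", "ParallelSchedulingGroup": "windows-cluster"},
    {"Machine": "win2012-node01.cst.de", "OpSys": "WINDOWS", "ParallelSchedulingGroup": "windows-cluster"},
]
job = {"Matched_PSG": "windows-cluster"}  # from the job attributes in the report


def eq(a, b):
    """Model of ClassAd '==' on strings: case-insensitive."""
    return a.lower() == b.lower()


def is_(a, b):
    """Model of ClassAd 'is' (=?=) on strings: exact comparison."""
    return a == b


def psg(m):
    return is_(m["ParallelSchedulingGroup"], job["Matched_PSG"])


def linux(m):
    return eq(m["OpSys"], "Linux")


def windows(m):
    return eq(m["OpSys"], "Windows")


def linux_or_windows(m):
    return linux(m) or windows(m)


# Each step is counted on its own, like one row of the step table.
step_counts = {
    clause.__name__: sum(1 for m in machines if clause(m))
    for clause in (psg, linux, windows, linux_or_windows)
}

# The job can only use machines that satisfy all clauses at once.
full_match = [m["Machine"] for m in machines if psg(m) and linux_or_windows(m)]

print(step_counts)
print(full_match)
```

Each row counts its own clause independently, so a row showing 4 does not mean that 4 machines satisfy the whole expression; in this sketch only the two Windows machines pass every clause.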

-tj

From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Felix Wolfheimer
Sent: Monday, July 24, 2017 3:02 PM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Parallel scheduling group problem

We have a mixed Windows/Linux setup managed by HTCondor. I configured parallel scheduling groups for all systems. In a test setup where I can reproduce the issues that I experience in the production pool, I have four execution hosts (2x Windows, 2x Linux). The execution hosts have parallel scheduling groups as follows:

# on both Linux machines
ParallelSchedulingGroup = "linux-cluster"

# on the Windows machines
ParallelSchedulingGroup = "windows-cluster"

After a while, jobs submitted to the parallel universe are no longer started, and condor_q -better-analyze for such a job gives the following somewhat inconsistent information:

---------------------------------------------------------------------------------------------------------
The Requirements expression for your job is:
    ( ParallelSchedulingGroup is my.Matched_PSG ) &&
    ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) &&
      ( Arch == "X86_64" ) &&
      ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) &&
      ( CST_CLUSTER_HAS_DC is true ) ) && ( TARGET.Disk >= RequestDisk ) &&
    ( TARGET.HasFileTransfer )
Your job defines the following attributes:
    DiskUsage = 75
    Matched_PSG = "windows-cluster"
    RequestDisk = 75
The Requirements expression for your job reduces to these conditions:
         Slots
Step    Matched  Condition
-----  --------  ---------
[0]           2  ParallelSchedulingGroup is my.Matched_PSG
[1]           2  Opsys == "Linux"
[2]           2  Opsys == "Windows"
[3]           4  [1] || [2]
[4]           4  Arch == "X86_64"
[6]           4  stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",")
[8]           4  CST_CLUSTER_HAS_DC is true
Suggestions:
    Condition                      Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ParallelSchedulingGroup is "windows-cluster" )
                                      0                   MODIFY TO "windows-cluster"
2   ( ( ( Opsys == "Linux" ) || ( Opsys == "Windows" ) ) && ( Arch == "X86_64" ) && ( stringListMember("2017",TARGET.CST_INSTALLED_VERSIONS,",") ) && ( CST_CLUSTER_HAS_DC is true ) )
                                      0                   REMOVE
3   ( TARGET.Disk >= 75 )             4
4   ( TARGET.HasFileTransfer )        4

---------------------------------------------------------------------------------------------------------
It's strange that on one hand condor_q tells me that basically all four machines match my requirements expression, but on the other hand tells me that no machine matches the condition

ParallelSchedulingGroup is "windows-cluster"

which is certainly not true, as I have also checked with condor_status:

condor_status -pool centos7-master.cst.de -af Machine ParallelSchedulingGroup
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
Does anyone have an idea what may cause this strange behavior?
I don't know whether this is relevant, but I have set NUM_CPUS=1 for all machines, as a job is supposed to have exclusive access to all resources on a compute node.
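
As a quick sanity check of the setup described above, the following minimal Python sketch groups the machines from the condor_status output by scheduling group and reports which groups could hold a job of a given size. It assumes one slot per machine (NUM_CPUS=1, as stated above) and that all nodes of a parallel job must come from a single group. The function names and the machine_count values are hypothetical, and the sketch does not query HTCondor itself.

```python
from collections import defaultdict

# Output of: condor_status -af Machine ParallelSchedulingGroup
# (pasted from the message above)
status_output = """\
centos7-node01.cst.de linux-cluster
centos7-node02.cst.de linux-cluster
win2012-master.cst.de windows-cluster
win2012-node01.cst.de windows-cluster
"""


def group_by_psg(text):
    """Map each ParallelSchedulingGroup to the list of machines in it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        machine, psg = parts
        groups[psg].append(machine)
    return dict(groups)


def groups_with_capacity(groups, machine_count):
    """Groups with at least machine_count machines (one slot per machine assumed)."""
    return sorted(g for g, members in groups.items() if len(members) >= machine_count)


groups = group_by_psg(status_output)
print(groups)
print(groups_with_capacity(groups, 2))  # hypothetical machine_count = 2
print(groups_with_capacity(groups, 3))  # hypothetical machine_count = 3
```

With two machines per group and one slot each, a job asking for more than two nodes could not fit into either group under these assumptions.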

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/


End of HTCondor-users Digest, Vol 45, Issue 5
*********************************************
