Mailing List Archives Public Access	UW Madison Computer Sciences Department Computer Systems Lab


Re: [HTCondor-users] Limit jobs per node

Date: Fri, 06 Apr 2018 18:43:48 +0000
From: Michael Pelletier <Michael.V.Pelletier@xxxxxxxxxxxx>
Subject: Re: [HTCondor-users] Limit jobs per node

Glad I could help!

The machine resource is a global limit for each exec node, which is applicable to any user making reference to it.

If you want each user to run only one of their own jobs on an exec node, while still letting several such "one-each" users share that node, there's a technique available when you're using dynamic slots: a requirements expression that uses the "ChildRemoteOwner" machine attribute found in partitionable slot ClassAds.

This attribute was added in a more recent version, though I don't recall which one. To check if you have it, run:

	condor_status -constraint 'SlotType == "Partitionable"' ChildRemoteOwner

A job that wants to ensure that only one copy of itself runs on any given machine can require that it only be matched to a partitionable slot via TARGET.SlotType == "Partitionable", and then require that the TARGET.ChildRemoteUser list does not contain the job's MY.User attribute. I'm not sure if stringListMember will work for that test due to the braces - I think maybe not. A regexp match would work.
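To make the matching rule above concrete, here is a minimal sketch in plain Python. It is only a model of the logic, not the HTCondor API and not a submit-file expression: slot ads are represented as dictionaries, the user and node names are made up, and it assumes that a partitionable slot with no dynamic children has a missing or empty ChildRemoteUser list.

```python
# Plain-Python model of the matching rule; this is NOT the HTCondor API.
# Slot ads are modeled as dicts; attribute names follow the message above.

def one_per_node_ok(job_user, slot_ad):
    """Return True if a one-per-node job owned by job_user may match slot_ad."""
    # Only match partitionable slots, mirroring TARGET.SlotType == "Partitionable".
    if slot_ad.get("SlotType") != "Partitionable":
        return False
    # Assumption: a slot with no dynamic children has a missing or empty list.
    child_users = slot_ad.get("ChildRemoteUser") or []
    # Reject the machine if this user already has a job in any child slot.
    return job_user not in child_users


def matching_nodes(job_user, pool):
    """Return the sorted names of the nodes that a one-per-node job could match."""
    return sorted(name for name, ad in pool.items() if one_per_node_ok(job_user, ad))


# Hypothetical pool: user names and node names are invented for illustration.
pool = {
    "node1": {"SlotType": "Partitionable", "ChildRemoteUser": ["alice@example.org"]},
    "node2": {"SlotType": "Partitionable", "ChildRemoteUser": ["bob@example.org"]},
    "node3": {"SlotType": "Partitionable", "ChildRemoteUser": []},
    "node4": {"SlotType": "Static"},
}

print(matching_nodes("alice@example.org", pool))
print(matching_nodes("carol@example.org", pool))
```

In this model alice is turned away from node1, where she already has a job, but can still share node2 with bob, which is the "one-each" behavior described above.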

Needless to say, this will also prevent a one-per-node job from running on any machine where the user is already running another type of job, since the negotiator , but it won't prevent other types of jobs without the requirements expression from matching machines which already have a one-per-node job on them.

You might be able to use another Child* attribute to get around this - perhaps you'd only pass up a machine which is running other jobs of yours if one of those jobs had a certain disk or memory value suggestive of the one-per-node job. However, with the quantizing of requests that goes on in the negotiator, that may not be entirely straightforward to evaluate.
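The refinement suggested above could look roughly like the following plain-Python model. Several things here are assumptions rather than confirmed behavior: that a ChildMemory list exists alongside ChildRemoteUser, that the two lists are index-aligned with one entry per dynamic slot, and that the hypothetical marker value of 3072 MB reaches the slot ad unchanged after any rounding of requests. None of this is HTCondor API code.

```python
# Plain-Python model only; NOT the HTCondor API.
# Hypothetical memory request (in MB) used only by one-per-node jobs.
MARKER_MEMORY_MB = 3072


def has_marked_job(job_user, slot_ad, marker=MARKER_MEMORY_MB):
    """Return True if job_user already runs a job with the marker memory value."""
    users = slot_ad.get("ChildRemoteUser") or []
    memory = slot_ad.get("ChildMemory") or []
    # Assumption: the two lists are index-aligned, one entry per dynamic slot.
    return any(u == job_user and m == marker for u, m in zip(users, memory))


def one_per_node_ok(job_user, slot_ad):
    """Pass up a slot only if the user's own marked job is already on it."""
    if slot_ad.get("SlotType") != "Partitionable":
        return False
    return not has_marked_job(job_user, slot_ad)


# Invented example ads: alice runs a marked job on "busy", an ordinary job on "other".
busy = {
    "SlotType": "Partitionable",
    "ChildRemoteUser": ["alice@example.org", "bob@example.org"],
    "ChildMemory": [3072, 1024],
}
other = {
    "SlotType": "Partitionable",
    "ChildRemoteUser": ["alice@example.org"],
    "ChildMemory": [1024],
}

print(one_per_node_ok("alice@example.org", busy))
print(one_per_node_ok("alice@example.org", other))
```

The marker only works if no ordinary job happens to request the same value, and if the value stored in the slot ad equals the value requested, which is the quantizing concern raised above.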

Looks like there's a bug in the ChildRemoteOwner attribute - the ChildRemoteOwner should contain "Owner" job attributes but contains User attributes instead, just like ChildRemoteUser does. (v8.6.9)

	-Michael Pelletier.

-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Mathieu Bahin
Sent: Friday, April 6, 2018 12:49 PM
To: htcondor-users@xxxxxxxxxxx
Subject: [External] Re: [HTCondor-users] Limit jobs per node

Hi,

Thanks for the 2 solutions proposed (request_disk and resource name).

Actually, if I understood correctly, both solutions seem nice, but if several users in the same pool want to use them at the same time, I guess it doesn't work, right? Only one job from any of the users currently using it will run on a node, I guess?
If I understood correctly, the "basic" user (the one who doesn't run "condor_q -l" on other users' jobs!) has no way to know that someone else is already consuming the resource for their jobs (except by noticing that another user's jobs are very spread out), even though user A might be using the resource trick to avoid I/O problems whereas user B would use it for storage issues.

Cheers,
Mathieu

--
---------------------------------------------------------------------------------------
| Mathieu Bahin
| IE CNRS
|
| Institut de Biologie de l'Ecole Normale Supérieure (IBENS) Biocomp team
| 46 rue d'Ulm
| 75230 PARIS CEDEX 05
| 01.44.32.23.56
---------------------------------------------------------------------------------------

_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/
