
Re: [HTCondor-users] dynamic slots



Ah, yes.  Dynamic slots only work when your jobs have request_memory/request_cpus statements.  Setting requirements is not enough, because the startd needs to know how big to make a dynamic slot and can't determine that just by evaluating the requirements expression.
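
As a rough illustration, a submit file along these lines gives the startd explicit sizes to carve a dynamic slot from. This is only a sketch: the executable name is a placeholder, and the values are examples (request_memory is in MB by default).

```
# Minimal submit description sketch; my_job.sh is a hypothetical placeholder
executable     = my_job.sh

# Explicit resource requests: these tell the startd how large to make
# the dynamic slot. A Requirements expression alone cannot do that.
request_cpus   = 1
request_memory = 10000

queue
```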

Now, once you start using dynamic slots, you will notice that jobs with larger resource requirements tend to starve as long as there are enough smaller jobs to keep the startds busy.  The way we generally solve this is by using the defrag daemon (condor_defrag) to periodically drain some percentage of machines, so that they stop accepting jobs and their dynamic slots get a chance to coalesce back into a single large slot.

See section "3.7.1.21 Defragmenting Dynamic Slots" of the manual for more information.
http://research.cs.wisc.edu/htcondor/manual/current/3_7Policy_Configuration.html#38894
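
For a sense of what that looks like, here is a hedged configuration sketch for enabling condor_defrag, typically on the central manager. The numbers are illustrative assumptions rather than recommendations, so check the manual section above before adopting them.

```
# Run the defrag daemon alongside the existing daemons
DAEMON_LIST = $(DAEMON_LIST) DEFRAG

# How often (in seconds) the defrag daemon evaluates the pool
DEFRAG_INTERVAL = 600

# Throttles: drain at most one machine per hour, and only one at a time
DEFRAG_DRAINING_MACHINES_PER_HOUR = 1.0
DEFRAG_MAX_CONCURRENT_DRAINING = 1

# Stop initiating drains once this many whole machines are available
DEFRAG_MAX_WHOLE_MACHINES = 4
```

Lower limits drain less aggressively, which wastes fewer cycles on draining machines but makes large jobs wait longer for a whole slot.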



-----Original Message-----
From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Larry Martell
Sent: Tuesday, February 27, 2018 11:00 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: Re: [HTCondor-users] dynamic slots

I solved this particular issue - I had

'Requirements':  '(Memory > 10000)'

When I changed it to

'request_memory': '10000'

This issue was solved. But then I ended up not using dynamic slots as
they are not doing what I need.

My need is to have condor hold jobs if there is not some amount of
memory available and submit them when memory does become available. I
have not figured out how to do that. I have another thread on the ML
about that (https://www-auth.cs.wisc.edu/lists/htcondor-users/2018-February/msg00102.shtml)
, but it has not received any replies.

On Tue, Feb 27, 2018 at 9:36 AM, John M Knoeller <johnkn@xxxxxxxxxxx> wrote:
> The condor_q -analyze output below shows that the job matches the slot, but it also shows 0 machines for all of the counters in the last clause, and
>
> No successful match recorded.
> Last failed match: Fri Feb 23 14:38:52 2018
>
> That probably indicates that the slot doesn't match the job for some reason.  Try running
>
> condor_q -better:reverse 38720 -machine slot1@chopin
>
> -tj
>
> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf Of Larry Martell
> Sent: Friday, February 23, 2018 1:47 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] dynamic slots
>
> I am trying to use dynamic slots as documented here:
>
> http://research.cs.wisc.edu/htcondor/CondorWeek2012/presentations/thain-dynamic-slots.pdf
>
> I have configured 1 slot thusly:
>
> NUM_SLOTS = 1
> NUM_SLOTS_TYPE_1 = 1
> SLOT_TYPE_1 = cpus=75%
> SLOT_TYPE_1 = mem=64000
> SLOT_TYPE_1_PARTITIONABLE = true
>
> I submit a job that requires 10G of memory and it does not run:
>
> $ condor_q -better-analyze 38720
>
>
> -- Schedd: bach.elucid.local : <192.168.10.2:9618?...
> The Requirements expression for job 38720.000 is
>
>     ( ( Memory >= 10000 ) ) && ( TARGET.Arch == "X86_64" ) &&
>     ( TARGET.OpSys == "LINUX" ) && ( TARGET.Disk >= RequestDisk ) &&
>     ( TARGET.Memory >= RequestMemory ) && ( TARGET.HasFileTransfer )
>
> Job 38720.000 defines the following attributes:
>
>     DiskUsage = 0
>     ImageSize = 0
>     RequestDisk = DiskUsage
>     RequestMemory = ifthenelse(MemoryUsage =!= undefined,MemoryUsage,(
> ImageSize + 1023 ) / 1024)
>
> slot1@chopin has the following attributes:
>
>     TARGET.Memory = 64000
>     TARGET.Arch = "X86_64"
>     TARGET.Disk = 90191948
>     TARGET.HasFileTransfer = true
>     TARGET.OpSys = "LINUX"
>
> The Requirements expression for job 38720.000 reduces to these conditions:
>
>          Slots
> Step    Matched  Condition
> -----  --------  ---------
> [0]           1  Memory >= 10000
> [1]           1  TARGET.Arch == "X86_64"
> [3]           1  TARGET.OpSys == "LINUX"
> [5]           1  TARGET.Disk >= RequestDisk
> [7]           1  TARGET.Memory >= RequestMemory
> [9]           1  TARGET.HasFileTransfer
>
> No successful match recorded.
> Last failed match: Fri Feb 23 14:38:52 2018
>
> Reason for last match failure: no match found
>
> 38720.000:  Run analysis summary ignoring user priority.  Of 1 machines,
>       0 are rejected by your job's requirements
>       0 reject your job because of their own requirements
>       0 match and are already running your jobs
>       0 match but are serving other users
>       0 are available to run your job
>
> Can anyone tell me why it's not running?