
Re: [HTCondor-users] cgroups applied at job level or process level



Hi Vikrant,

How many cores/hyperthreads do you have on your node? AFAIK a cpu-shares cgroup can consume more cycles than its share if no other processes compete for them.
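
To see which share a job actually got, one can look up the job's cpu cgroup and read its cpu.shares, roughly like this (a minimal sketch, assuming cgroup v1 mounted under /sys/fs/cgroup and a made-up job PID of 12345):

def cpu_shares_of(pid):
    # lines in /proc/<pid>/cgroup look like "4:cpu,cpuacct:/some/path"
    with open("/proc/%d/cgroup" % pid) as f:
        for line in f:
            _, controllers, path = line.rstrip("\n").split(":", 2)
            if "cpu" in controllers.split(","):
                # mount point may differ; /sys/fs/cgroup/cpu is a common symlink
                with open("/sys/fs/cgroup/cpu" + path + "/cpu.shares") as s:
                    return int(s.read())
    return None

print(cpu_shares_of(12345))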

Instead of cpu shares, one could use a cpuset to bind a cgroup to a dedicated CPU id. I don't think that would make much sense in a production environment, but it might be useful for benchmarking, or to avoid losses from processes being switched between cores (however, I have no idea whether the CPU actually listens to the kernel ;) )
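
For a quick experiment one can also pin a process from the inside, without setting up a cpuset cgroup (a sketch; CPU id 3 is arbitrary, and child processes inherit the mask):

import os

# pid 0 means the calling process; restrict it to CPU 3 only
os.sched_setaffinity(0, {3})
print(os.sched_getaffinity(0))   # -> {3}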

Cheers,
  Thomas

On 2019-10-21 14:01, Vikrant Aggarwal wrote:
Hello HTCondor Experts,

I have a query regarding the cgroup implementation in Condor. AFAIK, cgroups work at the job level, so if I give request_cpus as 1 then, irrespective of how many processes the job spawns, all the processes combined can only take 100% of one core's CPU time share; and if the request_cpus value is 2, then all processes combined belonging to the single job can take 200% of the CPU time share.
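
If I understand HTCondor's cgroup support correctly, it sets cpu.shares in proportion to request_cpus, so under full contention the split should look roughly like this (a back-of-the-envelope sketch, not HTCondor code; the 100-shares-per-core value is an assumption):

def expected_cores(my_shares, all_shares, n_cores):
    # cpu.shares divides CPU time proportionally, but only while every
    # core is contended; otherwise a cgroup may soak up the idle cycles
    return n_cores * my_shares / float(sum(all_shares))

# 23 jobs with request_cpus = 1 on a 23-core node: ~1 core each
print(expected_cores(100, [100] * 23, 23))   # -> 1.0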

However, this theoretical understanding seems to be proven wrong by a test we conducted on one of the nodes, which has 23 cores available to run jobs.

1) Started a batch of 22 jobs with the Python code below to hog a CPU core; each job was supposed to complete in 300 s. The loop_per_sec value was chosen after some tests to ensure that the job runs for 300 s.

import time

def hog_for_seconds(n):
    # busy-loop sized so that one fully dedicated core takes ~n seconds
    loop_per_sec = 16304387
    count = 0
    while count < (loop_per_sec * n):
        count += 1
    return count

start = time.time()
hog_for_seconds(300)
end = time.time()
print(end - start)   # elapsed wall time
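
For reference, the loop_per_sec constant can be calibrated along these lines (a sketch of how such a value might be derived on a given machine):

import time

def calibrate(n=10 * 1000 * 1000):
    # time a fixed number of increments and scale to increments per second
    start = time.time()
    count = 0
    while count < n:
        count += 1
    return int(n / (time.time() - start))

print(calibrate())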

2) At the same time, submitted one more job which internally spawns 10 processes; it was expected to take at least twice as long to complete.

import time
from concurrent.futures import ProcessPoolExecutor

def hog_for_seconds(n):
    loop_per_sec = 16304387
    count = 0
    while count < (loop_per_sec * n):
        count += 1
    return count

start = time.time()
with ProcessPoolExecutor(max_workers=10) as ex:
    # one task per worker, so all 10 processes really burn CPU;
    # a single submit() would leave nine of the workers idle
    for _ in range(10):
        ex.submit(hog_for_seconds, 300)
end = time.time()
print(end - start)   # elapsed wall time

But to our surprise, both jobs completed in approximately the same time.

I am aware of cgroups' opportunistic behavior, hence we also used the ASSIGN_CPU_AFFINITY setting, but the results remained the same.
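
To verify that the affinity mask is really applied, one can query it for the job's processes (a sketch; 12345 stands in for a job PID taken from ps or condor_q):

import os

# set of CPU ids the process is allowed to run on
print(os.sched_getaffinity(12345))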

Can anyone please help me understand how cgroups restrict a job's CPU share?

Condor version: 8.5.8 (dynamic slots)

Thanks & Regards,
Vikrant Aggarwal



