
Re: [HTCondor-users] ImageSize increase too big



Henning:

A couple suggestions.

(1) Use ResidentSetSize rather than ImageSize, which over-counts memory usage. ResidentSetSize is exactly the job's RAM usage. The manual actually suggests ProportionalSetSize, which doesn't seem to exist anymore.
(2) Understand that both ResidentSetSize and ImageSize are rounded by the schedd. Their unrounded values are ResidentSetSize_RAW and ImageSize_RAW.

According to the manual, the default behavior is to round these values up to a multiple of 25% of their "order of magnitude". This is a bit hard to understand, so break out your logarithms.

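The pattern, judging from the values below, is that if you're between roughly 1/2 and 5 times a given power of 10, the value is rounded up to the next multiple of 25% of that power of 10: ceil(value / step) * step, where step = 0.25 * (that power of 10). For your job, ImageSize_RAW = 7522980 falls between 0.5 and 5 times 10^7, so step = 2500000 and ceil(7522980 / 2500000) * 2500000 = 4 * 2500000 = 10000000, which is the ImageSize you see. Does that make sense?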

Here are some corresponding values (rounded value on the left, raw value on the right):

32500 30024
35000 34492
37500 35832
40000 37748
75000 57688
100000 76448
225000 201740
250000 225504
275000 261584
300000 289272
325000 309824
350000 328628
425000 422872
750000 541520
1000000 900412
1250000 1003828
1500000 1362132
1750000 1516100
2000000 1873404
2250000 2044680
2500000 2250200
7500000 5369316
10000000 7597772
17500000 15783236
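
If it helps, here is a minimal Python sketch of that rounding pattern. It is not HTCondor's actual code: the function name round_up_by_magnitude is made up for illustration, and the point where the power of 10 switches (at 5 times a power of 10) is an assumption inferred only from the values listed above. It reproduces every pair in that list, plus the 7522980 -> 10000000 case from the original question.

```python
import math


def round_up_by_magnitude(raw, percent=25):
    """Round raw up to a multiple of percent% of its order of magnitude.

    Assumption (inferred from the observed values, not from HTCondor
    source): the power of 10 is chosen so that
    0.5 * 10**k <= raw < 5 * 10**k.
    """
    if raw <= 0:
        return raw
    # floor(log10(2 * raw)) gives the k with 0.5 * 10**k <= raw < 5 * 10**k.
    k = math.floor(math.log10(2 * raw))
    step = (10 ** k) * percent // 100
    if step == 0:
        # Values too small for a meaningful step are left unchanged.
        return raw
    # Integer ceiling division, then scale back up to a multiple of step.
    return -(-raw // step) * step


# (rounded, raw) pairs copied from the list above.
OBSERVED = [
    (32500, 30024), (35000, 34492), (37500, 35832), (40000, 37748),
    (75000, 57688), (100000, 76448), (225000, 201740), (250000, 225504),
    (275000, 261584), (300000, 289272), (325000, 309824), (350000, 328628),
    (425000, 422872), (750000, 541520), (1000000, 900412),
    (1250000, 1003828), (1500000, 1362132), (1750000, 1516100),
    (2000000, 1873404), (2250000, 2044680), (2500000, 2250200),
    (7500000, 5369316), (10000000, 7597772), (17500000, 15783236),
]

# Collect any pair the sketch fails to reproduce (expected to be empty).
mismatches = [(rounded, raw) for rounded, raw in OBSERVED
              if round_up_by_magnitude(raw) != rounded]
```

The exact switch point cannot be pinned down from this data alone: anything between about 4.3 and 5.3 times a power of 10 would fit equally well, so treat the 5 as a guess.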

Tom
On 5/7/18, 8:24 AM, "HTCondor-users on behalf of Henning Fehrmann" <htcondor-users-bounces@xxxxxxxxxxx on behalf of henning.fehrmann@xxxxxxxxxx> wrote:

    Hi,
    
    we observed an unexplainable jump in the ImageSize of a job.
    
    
    -- Schedd: atlas2.atlas.local : <10.20.30.2:38705?... @ 05/05/18 12:52:07
     ID         OWNER            SUBMITTED     RUN_TIME ST PRI SIZE   CMD
    2728869.0   XXXXX         4/28 11:29   7+00:14:54 R  0   9766.0	  XXXXXX 
    
    But it never was using that much memory:
    
    000 (2728869.000.000) 04/28 11:29:25 Job submitted from host: <10.20.30.2:38705?addrs=10.20.30.2-38705+[--1]-38705>
    001 (2728869.000.000) 04/28 11:29:47 Job executing on host: <10.10.20.16:33435?addrs=10.10.20.16-33435+[--1]-33435>
    006 (2728869.000.000) 04/28 11:29:56 Image size of job updated: 48676
    006 (2728869.000.000) 04/28 11:34:56 Image size of job updated: 188768
    006 (2728869.000.000) 04/28 11:39:56 Image size of job updated: 237552
    006 (2728869.000.000) 04/28 11:44:57 Image size of job updated: 272380
    006 (2728869.000.000) 04/28 12:19:59 Image size of job updated: 7411552
    006 (2728869.000.000) 04/28 12:24:59 Image size of job updated: 7522440
    ...
    006 (2728869.000.000) 05/05 02:16:14 Image size of job updated: 7522984
    001 (2728869.000.000) 05/05 02:43:55 Job executing on host: <10.10.17.14:46639?addrs=10.10.17.14-46639+[--1]-46639>
    001 (2728869.000.000) 05/05 04:36:51 Job executing on host: <10.10.23.1:46285?addrs=10.10.23.1-46285+[--1]-46285>
    007 (2728869.000.000) 05/05 06:32:57 Shadow exception!
    001 (2728869.000.000) 05/05 07:00:50 Job executing on host: <10.10.9.13:41637?addrs=10.10.9.13-41637+[--1]-41637>
    
    The job still runs on 10.10.9.13, within the expected memory usage.
    
    The imagesize however is
    condor_q 2728869 -l|grep "^Image"
    ImageSize = 10000000
    ImageSize_RAW = 7522980
    
    Which hasn't been manipulated by the user.
    
    Is this a known issue?
    
    We are running condor 8.6.
    Do you need more config or logs?
    
    
    Cheers,
    Henning