
[HTCondor-users] Docker universe ImageSize/MemoryUsage



Hello,

 

We're using Condor's Docker universe (HTCondor 8.6.1, Docker 1.27) on Amazon EC2 instances. Jobs are being terminated intermittently (the same job sometimes executes successfully) when the reported MemoryUsage is unreasonably higher than the specified RequestMemory. This was NOT the case earlier, when we ran the same jobs in HTCondor's Standard Universe with our own wrapper to execute docker run. Any suggestions or help would be appreciated.

 

The job log and the Condor submit file are pasted below:

 

000 (112667.000.000) 08/04 20:49:24 Job submitted from host: <10.XXX.X.XXX:9618?addrs=10.XXX.X.XXX-9618+[--1]-9618&noUDP&sock=24501_8a68_3>

    DAG Node: block_0000

...

001 (112667.000.000) 08/04 21:54:23 Job executing on host: <10.XXX.X.XXX:34479?addrs=10.XXX.X.XXX-34479+[--1]-34479>

...

006 (112667.000.000) 08/04 21:54:24 Image size of job updated: 133664575

    133664575  -  MemoryUsage of job (MB)

...

005 (112667.000.000) 08/04 21:54:25 Job terminated.

    (0) Abnormal termination (signal 1)

    (0) No core file

        Usr 0 00:00:00, Sys 0 00:00:01  -  Run Remote Usage

        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

        Usr 0 00:00:00, Sys 0 00:00:01  -  Total Remote Usage

        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage

    182  -  Run Bytes Sent By Job

    9689  -  Run Bytes Received By Job

    182  -  Total Bytes Sent By Job

    9689  -  Total Bytes Received By Job

    Partitionable Resources :     Usage  Request Allocated

       Cpus                 :                  1         1

       Disk (KB)            :        23       10   1047702

       Memory (MB)          : 133664575     1844      1844

 

UNIVERSE = docker

...

...

...

LOG = job.log

JOB_MACHINE_ATTRS = Machine

JOB_MACHINE_ATTRS_HISTORY_LENGTH = 5

JobLeaseDuration = 600

REQUIREMENTS = HAS_DOCKER && HAS_RCP_DFS && (WORKER_TYPE == "SMALL") && target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2

RequestMemory = 1.8G

RequestCpus = 1

PRIORITY = 1201

PERIODIC_REMOVE = ((JobStatus==5) && (CurrentTime - EnteredCurrentStatus) > 300) || \
                  ((JobStatus==2) && (CurrentTime - EnteredCurrentStatus) > 3600)

QUEUE
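
To help spot affected jobs, here is a minimal Python sketch that pulls the MemoryUsage figures out of a job log in the format pasted above and flags values far beyond the request. The function names and the factor of 100 are hypothetical choices for illustration, not anything HTCondor defines, and the regular expression assumes the "MemoryUsage of job (MB)" line looks exactly as it does in the excerpt.

```python
import re

# Matches lines such as "    133664575  -  MemoryUsage of job (MB)",
# as they appear in the job log excerpt above (assumed format).
MEMORY_LINE = re.compile(r"^\s*(\d+)\s+-\s+MemoryUsage of job \(MB\)")


def memory_usage_values(log_text):
    """Return every MemoryUsage value (in MB) found in the log text."""
    values = []
    for line in log_text.splitlines():
        match = MEMORY_LINE.match(line)
        if match:
            values.append(int(match.group(1)))
    return values


def implausible(values, request_mb, factor=100):
    """Return the values larger than factor * request_mb.

    The default factor is an arbitrary illustrative threshold.
    """
    return [v for v in values if v > request_mb * factor]


# Two lines copied from the log excerpt above.
SAMPLE = (
    "006 (112667.000.000) 08/04 21:54:24 Image size of job updated: 133664575\n"
    "    133664575  -  MemoryUsage of job (MB)\n"
)

if __name__ == "__main__":
    usage = memory_usage_values(SAMPLE)
    print(implausible(usage, request_mb=1844))
```

To check a real log, read job.log (the LOG file named in the submit file) into a string and pass it to memory_usage_values in place of SAMPLE.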