
Re: [HTCondor-users] jobs killed due to memory?



Hi Kris,

From your logs I suspect your job was killed by SIGKILL (signal 9): the return value 137 is 128 + 9. The most likely cause is a mechanism on the host such as the kernel's OOM killer, which kicks in once memory and swap space are exhausted.
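To make the reading of the return value concrete, here is a minimal sketch that decodes an exit status above 128 into a signal name. It assumes the `docker_run` wrapper follows the common shell convention of reporting 128 plus the signal number; the function name `describe_exit` is only illustrative.

```python
import signal


def describe_exit(return_value):
    """Return the name of the signal implied by a shell-style exit status.

    Assumption: statuses above 128 mean "128 + signal number", which is a
    shell/wrapper convention, not something HTCondor itself guarantees.
    Returns None when the status does not look like a signal exit.
    """
    if return_value > 128:
        signum = return_value - 128
        try:
            return signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"unknown signal {signum}"
    return None


print(describe_exit(137))
```

On Linux this names SIGKILL for 137, which a process cannot catch, so the job gets no chance to log why it died.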

Check the logs on the worker node to see what is actually going on, and whether more memory really would be available to the process.
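As a rough aid for that check, the sketch below filters kernel log lines for phrases the Linux OOM killer typically writes. The marker list is an assumption and may not cover every kernel version; `find_oom_lines` is a hypothetical helper name.

```python
import re

# Phrases commonly seen in Linux kernel logs when the OOM killer acts.
# Assumption: this list is illustrative, not exhaustive.
OOM_MARKERS = re.compile(
    r"invoked oom-killer|out of memory|oom-kill|oom_reaper",
    re.IGNORECASE,
)


def find_oom_lines(lines):
    """Return the lines that mention OOM-killer activity, newline stripped."""
    return [line.rstrip("\n") for line in lines if OOM_MARKERS.search(line)]
```

On the worker node the input could be the output of `dmesg` or `journalctl -k`, read line by line; matching lines around 09/29 20:31 would support the OOM explanation.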

The mechanisms inside HTCondor itself for removing jobs because of high memory usage are either

SYSTEM_PERIODIC_REMOVE
SYSTEM_PERIODIC_HOLD
(both on the scheduler)

or you can use cgroups on the worker node:
CGROUP_MEMORY_LIMIT_POLICY = soft/hard

You should also check CGROUP_MEMORY_LIMIT_POLICY on the worker node and set it to soft.
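One way to see what is currently configured is to ask `condor_config_val`, HTCondor's command-line tool for querying configuration. The wrapper below is a minimal sketch; the helper name `read_condor_setting` and its injectable `run` parameter are illustrative choices, and it treats a non-zero exit status as "not defined".

```python
import subprocess


def read_condor_setting(name, run=subprocess.run):
    """Return the configured value of `name`, or None if it cannot be read.

    `run` defaults to subprocess.run and is injectable so the function can
    be exercised on a machine without HTCondor installed.
    """
    try:
        result = run(["condor_config_val", name],
                     capture_output=True, text=True)
    except FileNotFoundError:
        # condor_config_val is not on PATH on this machine.
        return None
    if result.returncode != 0:
        # Assumption: a non-zero status means the knob is not defined.
        return None
    return result.stdout.strip()
```

Run on the worker node, `read_condor_setting("CGROUP_MEMORY_LIMIT_POLICY")` shows the cgroup policy; run on the submit host, the same call with `SYSTEM_PERIODIC_HOLD` or `SYSTEM_PERIODIC_REMOVE` shows whether a schedd-side policy is in place.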

Best
christoph


--
Christoph Beyer
DESY Hamburg
IT-Department

Notkestr. 85
Building 02b, Room 009
22607 Hamburg

phone:+49-(0)40-8998-2317
mail: christoph.beyer@xxxxxxx


From: "Kristian Kvilekval" <kris@xxxxxxxx>
To: "htcondor-users" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, 30 September 2020 04:18:38
Subject: [HTCondor-users] jobs killed due to memory?

Hello 
  
I am seeing jobs killed when they exceed their requested memory.
I believe I have shut off any preemption or eviction, but that does not seem to be the case. Below are our condor_local, a typical submit file (we are running under DAGMan), and a submit log. Note that request_memory is 24 GB and the job seems to exit prematurely at approximately 24 GB. I believe the process may be requesting more, and these particular nodes have a lot more (unused) memory on them.

Is there a way to never kill a job due to memory? Or am I misreading the logs?

Any help appreciated,
Kris


===========================================
condor_local:

CONDOR_HOST = master
COLLECTOR_NAME = GRID
COLLECTOR_HOST = $(CONDOR_HOST):9886?sock=collector
DAEMON_LIST = {{getv "/condor/daemons"}}
# DAEMON_LIST = MASTER, SCHEDD, STARTD
# DAEMON_LIST = MASTER, SCHEDD
##  When something goes wrong with condor at your site, who should get
##  the email?

CONDOR_ADMIN          = admin@xxxxxxxx
#UID_DOMAIN            = viqi.org
#TRUST_UID_DOMAIN      = TRUE
#SOFT_UID_DOMAIN       = TRUE
#FILESYSTEM_DOMAIN     = viqi.org
##  Do you want to use NFS for file access instead of remote system calls
ALLOW_READ  = $(ALLOW_READ), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_NEGOTIATOR      = 172.*, 10.*, 128.111.*, {{getv "/condor/allowextra" ""}}

#ALLOW_READ  = $(ALLOW_READ), 172.*, 10.*, *.viqi.org
#ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, *.viqi.org
#ALLOW_NEGOTIATOR      = 172.*, 10.*, 128.111.*
#ALLOW_ADMINISTRATOR   = 172.*, 10.*,128.111.*
#ALLOW_CONFIG          = 172.*,10.*,128.111.*
#ALLOW_DAEMON          = 172.*,10.*,128.111.*


# Use CCB with shared port so outside units can talk to

USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9886
UPDATE_COLLECTOR_WITH_TCP = TRUE
CCB_ADDRESS = $(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = VIQI
BIND_ALL_INTERFACES = TRUE

SEC_DEFAULT_AUTHENTICATION = NEVER
SEC_DEFAULT_NEGOTIATION = NEVER
#https://lists.cs.wisc.edu/archive/htcondor-users/2016-December/msg00046.shtml
DISCARD_SESSION_KEYRING_ON_STARTUP = false


# Slots for multi-cpu machines
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

START = True
PREEMPT = False
SUSPEND = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
CONTINUE = True



===========================

universe = vanilla
executable=/run/bisque/data/SYSTEMS/data/staging/00-taCQiAyS5g6kWBSbtTGGCn/docker_run
error = ./launcher.err
output = ./launcher.out
log = ./launcher.log
# == False)&&(ExitCode == 0)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
notification = never
# variables from local environment here
match_list_length=3
request_memory=24000
request_cpus=1
requirements=(Arch == "x86_64") && (TARGET.Name =!= LastMatchName1) && (OpSys == "LINUX")
# store mex id for stopping
+MexID = "00-taCQiAyS5g6kWBSbtTGGCn"

initialdir = /run/bisque/data/SYSTEMS/data/staging/00-taCQiAyS5g6kWBSbtTGGCn
#transfer_input_files  =
#transfer_output_files = output_files/
transfer_output_files = .
arguments  = python fibronest.py https://data.viqi.org/module_service/mex/00-taCQiAyS5g6kWBSbtTGGCn admin:00-taCQiAyS5g6kWBSbtTGGCn:7504b1ad64de8cdc54c8fd4564191401ac36dd65
queue

=====================================

cat launcher.log
000 (305.000.000) 09/29 19:57:46 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
    DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (305.000.000) 09/29 19:58:02 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (305.000.000) 09/29 19:58:11 Image size of job updated: 20472
        20  -  MemoryUsage of job (MB)
        20472  -  ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:03:12 Image size of job updated: 20824
        21  -  MemoryUsage of job (MB)
        20824  -  ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:08:12 Image size of job updated: 22548
        23  -  MemoryUsage of job (MB)
        22548  -  ResidentSetSize of job (KB)
...
005 (305.000.000) 09/29 20:31:19 Job terminated.
        (1) Normal termination (return value 137)
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1487908352  -  Run Bytes Sent By Job
        932  -  Run Bytes Received By Job
        1487908352  -  Total Bytes Sent By Job
        932  -  Total Bytes Received By Job
        Partitionable Resources :      Usage  Request Allocated
           Cpus                 :       0.00        1         1
           Disk (KB)            : 1453026           1     61449
           Memory (MB)          :      23       24000     24064
...
000 (306.000.000) 09/29 20:31:32 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
    DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (306.000.000) 09/29 20:31:32 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (306.000.000) 09/29 20:31:41 Image size of job updated: 20140
        20  -  MemoryUsage of job (MB)
        20140  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:36:42 Image size of job updated: 20304
        20  -  MemoryUsage of job (MB)
        20304  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:51:43 Image size of job updated: 22100
        22  -  MemoryUsage of job (MB)
        22100  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 21:03:48 Image size of job updated: 314076
        22  -  MemoryUsage of job (MB)
        22100  -  ResidentSetSize of job (KB)
...
005 (306.000.000) 09/29 21:04:05 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1490578432  -  Run Bytes Sent By Job
        932  -  Run Bytes Received By Job
        1490578432  -  Total Bytes Sent By Job
        932  -  Total Bytes Received By Job
        Partitionable Resources :      Usage  Request Allocated
           Cpus                 :       0.16        1         1
           Disk (KB)            : 1455635           1     61449
           Memory (MB)          :      22       24000     24064
...
000 (307.000.000) 09/29 21:04:18 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
    DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (307.000.000) 09/29 21:04:19 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (307.000.000) 09/29 21:04:28 Image size of job updated: 20072
        20  -  MemoryUsage of job (MB)
        20072  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:09:29 Image size of job updated: 20224
        20  -  MemoryUsage of job (MB)
        20224  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:14:30 Image size of job updated: 20340
        20  -  MemoryUsage of job (MB)
        20340  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:24:30 Image size of job updated: 22084
        22  -  MemoryUsage of job (MB)
        22084  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:34:32 Image size of job updated: 22168
        22  -  MemoryUsage of job (MB)
        22084  -  ResidentSetSize of job (KB)
...
005 (307.000.000) 09/29 21:37:06 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1490579456  -  Run Bytes Sent By Job
        932  -  Run Bytes Received By Job
        1490579456  -  Total Bytes Sent By Job
        932  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :        0        1         1
           Disk (KB)            :  1455636        1     61449
           Memory (MB)          :       22    24000     24064
...
000 (308.000.000) 09/29 21:37:19 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
    DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (308.000.000) 09/29 21:37:40 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (308.000.000) 09/29 21:37:48 Image size of job updated: 22048
        22  -  MemoryUsage of job (MB)
        22048  -  ResidentSetSize of job (KB)
...
006 (308.000.000) 09/29 21:42:48 Image size of job updated: 22248
        22  -  MemoryUsage of job (MB)
        22248  -  ResidentSetSize of job (KB)
...
005 (308.000.000) 09/29 22:10:33 Job terminated.
        (1) Normal termination (return value 0)
                Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        1490580608  -  Run Bytes Sent By Job
        932  -  Run Bytes Received By Job
        1490580608  -  Total Bytes Sent By Job
        932  -  Total Bytes Received By Job
        Partitionable Resources :    Usage  Request Allocated
           Cpus                 :        0        1         1
           Disk (KB)            :  1455637        1     61449
           Memory (MB)          :       22    24000     24064
...
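To help with reading the log, the sketch below pulls the "Memory (MB)" row out of each termination event so that usage, request, and allocation can be compared side by side. The regular expression assumes the row layout shown in the log above; `memory_rows` is a hypothetical helper name.

```python
import re

# Matches rows such as:
#    Memory (MB)          :      23       24000     24064
MEMORY_ROW = re.compile(r"^\s*Memory \(MB\)\s*:\s*(\d+)\s+(\d+)\s+(\d+)\s*$")


def memory_rows(log_text):
    """Return one dict per 'Memory (MB)' row found in an HTCondor user log."""
    rows = []
    for line in log_text.splitlines():
        match = MEMORY_ROW.match(line)
        if match:
            usage, request, allocated = map(int, match.groups())
            rows.append({
                "usage_mb": usage,
                "request_mb": request,
                "allocated_mb": allocated,
            })
    return rows
```

Applied to the log above, every row reports a usage of 22 or 23 MB against a request of 24000 MB, so the usage HTCondor recorded for these jobs is in megabytes and far below the request.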


--
Kris Kvilekval, Ph.D.
President ViQi Inc
(805)-699-6081
