
[HTCondor-users] jobs killed due to memory?



Hello,

I am seeing jobs killed when they exceed their requested memory.
I believe I have shut off any preemption or eviction, but that does not seem to be the case. Below are our condor_local, a typical submit file (we are running under DAGMan), and a submit log. Note that our request_memory is 24 GB (24000 MB) and the jobs seem to be exiting prematurely at approximately 24 GB. I believe the process may be requesting more, and these particular nodes have a lot more (unused) memory on them.

Is there a way to never kill a job due to memory? Or am I misreading the logs?

Any help appreciated,
Kris


===========================================
condor_local:

CONDOR_HOST = master
COLLECTOR_NAME = GRID
COLLECTOR_HOST = $(CONDOR_HOST):9886?sock=collector
DAEMON_LIST = {{getv "/condor/daemons"}}
# DAEMON_LIST = MASTER, SCHEDD, STARTD
# DAEMON_LIST = MASTER, SCHEDD
##  When something goes wrong with condor at your site, who should get
##  the email?

CONDOR_ADMIN          = admin@xxxxxxxx
#UID_DOMAIN            = viqi.org
#TRUST_UID_DOMAIN      = TRUE
#SOFT_UID_DOMAIN       = TRUE
#FILESYSTEM_DOMAIN     = viqi.org
##  Do you want to use NFS for file access instead of remote system calls
ALLOW_READ  = $(ALLOW_READ), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, {{getv "/condor/allowextra" ""}}
ALLOW_NEGOTIATOR      = 172.*, 10.*, 128.111.*, {{getv "/condor/allowextra" ""}}

#ALLOW_READ  = $(ALLOW_READ), 172.*, 10.*, *.viqi.org
#ALLOW_WRITE = $(ALLOW_WRITE), 172.*, 10.*, *.viqi.org
#ALLOW_NEGOTIATOR      = 172.*, 10.*, 128.111.*
#ALLOW_ADMINISTRATOR   = 172.*, 10.*, 128.111.*
#ALLOW_CONFIG          = 172.*, 10.*, 128.111.*
#ALLOW_DAEMON          = 172.*, 10.*, 128.111.*


# Use CCB with shared port so outside units can talk to

USE_SHARED_PORT = TRUE
SHARED_PORT_ARGS = -p 9886
UPDATE_COLLECTOR_WITH_TCP = TRUE
CCB_ADDRESS = $(COLLECTOR_HOST)
PRIVATE_NETWORK_NAME = VIQI
BIND_ALL_INTERFACES = TRUE

SEC_DEFAULT_AUTHENTICATION = NEVER
SEC_DEFAULT_NEGOTIATION = NEVER
#https://lists.cs.wisc.edu/archive/htcondor-users/2016-December/msg00046.shtml
DISCARD_SESSION_KEYRING_ON_STARTUP = false


# Slots for multi-cpu machines
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = true

START = True
PREEMPT = False
SUSPEND = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
CONTINUE = True



===========================

universe = vanilla
executable=/run/bisque/data/SYSTEMS/data/staging/00-taCQiAyS5g6kWBSbtTGGCn/docker_run
error = ./launcher.err
output = ./launcher.out
log = ./launcher.log
# == False)&&(ExitCode == 0)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
notification = never
# variables from local environment here
match_list_length=3
request_memory=24000
request_cpus=1
requirements=(Arch == "x86_64") && (TARGET.Name =!= LastMatchName1) && (OpSys == "LINUX")
# store mex id for stopping
+MexID = "00-taCQiAyS5g6kWBSbtTGGCn"

initialdir = /run/bisque/data/SYSTEMS/data/staging/00-taCQiAyS5g6kWBSbtTGGCn
#transfer_input_files  =
#transfer_output_files = output_files/
transfer_output_files = .
arguments  = python fibronest.py https://data.viqi.org/module_service/mex/00-taCQiAyS5g6kWBSbtTGGCn admin:00-taCQiAyS5g6kWBSbtTGGCn:7504b1ad64de8cdc54c8fd4564191401ac36dd65
queue

=====================================

cat launcher.log
000 (305.000.000) 09/29 19:57:46 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
  DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (305.000.000) 09/29 19:58:02 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (305.000.000) 09/29 19:58:11 Image size of job updated: 20472
    20  -  MemoryUsage of job (MB)
    20472  -  ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:03:12 Image size of job updated: 20824
    21  -  MemoryUsage of job (MB)
    20824  -  ResidentSetSize of job (KB)
...
006 (305.000.000) 09/29 20:08:12 Image size of job updated: 22548
    23  -  MemoryUsage of job (MB)
    22548  -  ResidentSetSize of job (KB)
...
005 (305.000.000) 09/29 20:31:19 Job terminated.
    (1) Normal termination (return value 137)
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1487908352  -  Run Bytes Sent By Job
    932  -  Run Bytes Received By Job
    1487908352  -  Total Bytes Sent By Job
    932  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :     0.00        1         1
       Disk (KB)            :  1453026        1     61449
       Memory (MB)          :       23    24000     24064
...
000 (306.000.000) 09/29 20:31:32 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
  DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (306.000.000) 09/29 20:31:32 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (306.000.000) 09/29 20:31:41 Image size of job updated: 20140
    20  -  MemoryUsage of job (MB)
    20140  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:36:42 Image size of job updated: 20304
    20  -  MemoryUsage of job (MB)
    20304  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 20:51:43 Image size of job updated: 22100
    22  -  MemoryUsage of job (MB)
    22100  -  ResidentSetSize of job (KB)
...
006 (306.000.000) 09/29 21:03:48 Image size of job updated: 314076
    22  -  MemoryUsage of job (MB)
    22100  -  ResidentSetSize of job (KB)
...
005 (306.000.000) 09/29 21:04:05 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1490578432  -  Run Bytes Sent By Job
    932  -  Run Bytes Received By Job
    1490578432  -  Total Bytes Sent By Job
    932  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :     0.16        1         1
       Disk (KB)            :  1455635        1     61449
       Memory (MB)          :       22    24000     24064
...
000 (307.000.000) 09/29 21:04:18 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
  DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (307.000.000) 09/29 21:04:19 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (307.000.000) 09/29 21:04:28 Image size of job updated: 20072
    20  -  MemoryUsage of job (MB)
    20072  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:09:29 Image size of job updated: 20224
    20  -  MemoryUsage of job (MB)
    20224  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:14:30 Image size of job updated: 20340
    20  -  MemoryUsage of job (MB)
    20340  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:24:30 Image size of job updated: 22084
    22  -  MemoryUsage of job (MB)
    22084  -  ResidentSetSize of job (KB)
...
006 (307.000.000) 09/29 21:34:32 Image size of job updated: 22168
    22  -  MemoryUsage of job (MB)
    22084  -  ResidentSetSize of job (KB)
...
005 (307.000.000) 09/29 21:37:06 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1490579456  -  Run Bytes Sent By Job
    932  -  Run Bytes Received By Job
    1490579456  -  Total Bytes Sent By Job
    932  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :        0        1         1
       Disk (KB)            :  1455636        1     61449
       Memory (MB)          :       22    24000     24064
...
000 (308.000.000) 09/29 21:37:19 Job submitted from host: <10.42.149.147:9886?CCBID=10.42.91.93:9886%3faddrs%3d10.42.91.93-9886%26noUDP%26sock%3dcollector#40&PrivNet=VIQI&addrs=10.42.149.147-9886&noUDP&sock=204_d1d0_3>
  DAG Node: 00-taCQiAyS5g6kWBSbtTGGCn
...
001 (308.000.000) 09/29 21:37:40 Job executing on host: <10.42.154.157:9886?PrivNet=VIQI&addrs=10.42.154.157-9886&noUDP&sock=82_2342_3>
...
006 (308.000.000) 09/29 21:37:48 Image size of job updated: 22048
    22  -  MemoryUsage of job (MB)
    22048  -  ResidentSetSize of job (KB)
...
006 (308.000.000) 09/29 21:42:48 Image size of job updated: 22248
    22  -  MemoryUsage of job (MB)
    22248  -  ResidentSetSize of job (KB)
...
005 (308.000.000) 09/29 22:10:33 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:02  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:02  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    1490580608  -  Run Bytes Sent By Job
    932  -  Run Bytes Received By Job
    1490580608  -  Total Bytes Sent By Job
    932  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :        0        1         1
       Disk (KB)            :  1455637        1     61449
       Memory (MB)          :       22    24000     24064
...
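
=====================================

One way to cross-check the log above is to decode it mechanically. The Python sketch below uses hypothetical helper names and rests on two assumptions that are not confirmed anywhere in this post: (1) that docker_run passes the job's exit status through using the usual shell convention of 128 + signal number, under which 137 would mean SIGKILL (signal 9), and (2) that the MemoryUsage lines are the ResidentSetSize in KB rounded up to whole MB, which is at least consistent with every pair of values in this log. It only reads the text of launcher.log and does not call any HTCondor API.

```python
import re
import signal

# Header of a user-log event, e.g. "005 (305.000.000) 09/29 20:31:19 Job terminated."
EVENT_RE = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.(\d+)\)")
# Return value line inside a 005 (job terminated) event.
RETURN_RE = re.compile(r"\(return value (\d+)\)")


def describe_exit(code):
    """Interpret an exit status, ASSUMING the shell convention 128 + signal."""
    if code > 128:
        try:
            return "killed by " + signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return "exit status %d" % code


def memory_usage_mb(rss_kb):
    """Round a resident set size in KB up to whole MB (assumed relationship)."""
    return (rss_kb + 1023) // 1024


def terminations(log_text):
    """Yield (cluster_id, return_value) for each 005 'Job terminated' event."""
    cluster = None
    for line in log_text.splitlines():
        header = EVENT_RE.match(line)
        if header:
            # Remember the cluster only while inside a termination event.
            cluster = int(header.group(2)) if header.group(1) == "005" else None
            continue
        found = RETURN_RE.search(line)
        if found and cluster is not None:
            yield cluster, int(found.group(1))
            cluster = None


if __name__ == "__main__":
    with open("launcher.log") as handle:
        for cluster_id, value in terminations(handle.read()):
            print(cluster_id, value, describe_exit(value))
```

If those assumptions hold, job 305 was ended by SIGKILL, while the usage HTCondor recorded for it was about 23 MB (22548 KB) against the 24000 MB request, so the figures in this log alone do not show the job reaching 24 GB.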


--
Kris Kvilekval, Ph.D.
President, ViQi Inc
(805)-699-6081