
Re: [HTCondor-users] dynamic slots configuration



Hello,

I have increased the log verbosity, and I now see the following error:

12/02/20 10:31:29 (pid:2754) Got SIGTERM. Performing graceful shutdown.
12/02/20 10:31:29 (pid:2754) Started timer to call main_shutdown_fast in
1800 seconds
12/02/20 10:31:29 (pid:2754) ShutdownGraceful all jobs.
12/02/20 10:31:29 (pid:2754) in VanillaProc::ShutdownGraceful()
12/02/20 10:31:29 (pid:2754) Send_Signal(): Doing kill(2878,15) [SIGTERM]
12/02/20 10:31:29 (pid:2754) Process exited, pid=2878, signal=15

But it still doesn't say why.
Do you have any idea why the jobs are being killed?
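
A side note on where to look: the StarterLog above only records *that* the
starter received SIGTERM; the reason is normally logged by its parent, the
condor_startd, in the StartLog. A minimal helper for scanning it (a sketch
only: the default log path and the keyword list are assumptions, adjust
them to your LOG setting):

```shell
# check_startd_log: print recent StartLog lines that usually explain why a
# starter was told to shut down (draining by the defrag daemon, condor_off,
# claim deactivation, preemption, vacate). The path is the stock default.
check_startd_log() {
    local log="${1:-/var/log/condor/StartLog}"
    grep -iE 'drain|shutdown|deactivat|preempt|vacat' "$log" | tail -n 20
}
```

Given the defragmentation setup quoted below, a "Draining this machine"
line in the StartLog around 10:31 would point at the defrag daemon as the
sender of the SIGTERM.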

Below is the entire log for a job:

12/02/20 10:21:51 (pid:2754) Using config source: /etc/condor/condor_config
12/02/20 10:21:51 (pid:2754) Using local config sources:
12/02/20 10:21:51 (pid:2754)    /etc/condor/config.d/00-general
12/02/20 10:21:51 (pid:2754)    /etc/condor/config.d/01-security
12/02/20 10:21:51 (pid:2754)    /etc/condor/config.d/50-dynamic
12/02/20 10:21:51 (pid:2754)    /etc/condor/condor_config.local
12/02/20 10:21:51 (pid:2754) config Macros = 116, Sorted = 115,
StringBytes = 3604, TablesBytes = 4240
12/02/20 10:21:51 (pid:2754) CLASSAD_CACHING is OFF
12/02/20 10:21:51 (pid:2754) Daemon Log is logging: D_FULLDEBUG D_ALWAYS
D_ERROR
12/02/20 10:21:51 (pid:2754) SharedPortEndpoint: waiting for connections
to named socket 21735_060d_39
12/02/20 10:21:51 (pid:2754) DaemonCore: command socket at
<192.168.181.110:9618?addrs=192.168.181.110-9618&noUDP&sock=21735_060d_39>
12/02/20 10:21:51 (pid:2754) DaemonCore: private command socket at
<192.168.181.110:9618?addrs=192.168.181.110-9618&noUDP&sock=21735_060d_39>
12/02/20 10:21:51 (pid:2754) Setting maximum accepts per cycle 8.
12/02/20 10:21:51 (pid:2754) Will use TCP to update collector
condor1atlas.nipne.ro <192.168.181.11:9618>
12/02/20 10:21:51 (pid:2754) Entering JICShadow::receiveMachineAd
12/02/20 10:21:51 (pid:2754) Communicating with shadow
<81.180.86.113:9618?addrs=81.180.86.113-9618&noUDP&sock=2035_976c_124730>
12/02/20 10:21:51 (pid:2754) Shadow version: $CondorVersion: 8.8.10 Aug 05
2020 BuildID: 513586 PackageID: 8.8.10-1 $
12/02/20 10:21:51 (pid:2754) Submitting machine is "arc6atlas1i.nipne.ro"
12/02/20 10:21:51 (pid:2754) SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION:
failed to get security session info from shadow.  Skipping creation of
match security session.
12/02/20 10:21:51 (pid:2754) Instantiating a StarterHookMgr
12/02/20 10:21:51 (pid:2754) Job does not define HookKeyword, not invoking
any job hooks.
12/02/20 10:21:51 (pid:2754) setting the orig job name in starter
12/02/20 10:21:51 (pid:2754) setting the orig job iwd in starter
12/02/20 10:21:51 (pid:2754) Submit FsDomain: "arc6atlas1.nipne.ro"
12/02/20 10:21:51 (pid:2754)  Local FsDomain: "wn100.nipne.ro"
12/02/20 10:21:51 (pid:2754) ShouldTransferFiles is "IF_NEEDED" but job's
FileSystemDomain does NOT match local value, transfering files
12/02/20 10:21:51 (pid:2754) Submit UidDomain: "nipne.ro"
12/02/20 10:21:51 (pid:2754)  Local UidDomain: "nipne.ro"
12/02/20 10:21:51 (pid:2754) Initialized user_priv as "pillhcb01"
12/02/20 10:21:51 (pid:2754) Copied machine ad's MachineResources to
ProvisionedResources
12/02/20 10:21:51 (pid:2754) Copied machine ad's Cpus to job ad's
CpusProvisioned
12/02/20 10:21:51 (pid:2754) Copied machine ad's AssignedCpus to job ad
12/02/20 10:21:51 (pid:2754) Copied machine ad's Memory to job ad's
MemoryProvisioned
12/02/20 10:21:51 (pid:2754) Copied machine ad's AssignedMemory to job ad
12/02/20 10:21:51 (pid:2754) Copied machine ad's Disk to job ad's
DiskProvisioned
12/02/20 10:21:51 (pid:2754) Copied machine ad's AssignedDisk to job ad
12/02/20 10:21:51 (pid:2754) Copied machine ad's Swap to job ad's
SwapProvisioned
12/02/20 10:21:51 (pid:2754) Copied machine ad's AssignedSwap to job ad
12/02/20 10:21:51 (pid:2754) Updating Provisioned and Assigned attributes:
12/02/20 10:21:51 (pid:2754) Entering JICShadow::updateShadow()
12/02/20 10:21:51 (pid:2754) Sent job ClassAd update to startd.
12/02/20 10:21:51 (pid:2754) Leaving JICShadow::updateShadow(): success
12/02/20 10:21:51 (pid:2754) Done moving to directory
"/var/lib/condor/execute/dir_2754"
12/02/20 10:21:51 (pid:2754) JICShadow::initIOProxy(): Job does not define
WantIOProxy; setting to false
12/02/20 10:21:51 (pid:2754) JICShadow::initIOProxy(): Job does not define
WantRemoteUpdates; setting to false.
12/02/20 10:21:51 (pid:2754) JICShadow::initIOProxy(): Job does not define
WantDelayedUpdates; enabling delayed updates.
12/02/20 10:21:51 (pid:2754) Chirp config summary: IO false, Updates
false, Delayed updates true.
12/02/20 10:21:51 (pid:2754) Initializing IO proxy with config file at
/var/lib/condor/execute/dir_2754/.chirp.config.
12/02/20 10:21:51 (pid:2754) Initialized IO Proxy.
12/02/20 10:21:51 (pid:2754) LocalUserLog::initFromJobAd: path_attr =
StarterUserLog
12/02/20 10:21:51 (pid:2754) LocalUserLog::initFromJobAd: xml_attr =
StarterUserLogUseXML
12/02/20 10:21:51 (pid:2754) No StarterUserLog found in job ClassAd
12/02/20 10:21:51 (pid:2754) Starter will not write a local UserLog
12/02/20 10:21:51 (pid:2754) JICShadow::initUserCredentials(): Job does
not define SendCredential; setting to false
12/02/20 10:21:51 (pid:2754) Done setting resource limits
12/02/20 10:21:51 (pid:2754) Changing the executable name
12/02/20 10:21:51 (pid:2754) entering FileTransfer::Init
12/02/20 10:21:51 (pid:2754) entering FileTransfer::SimpleInit
12/02/20 10:21:51 (pid:2754) Entering FileTransfer::AddInputFilenameRemaps
12/02/20 10:21:51 (pid:2754) FILETRANSFER: protocol "http" handled by
"/usr/libexec/condor/curl_plugin"
12/02/20 10:21:51 (pid:2754) FILETRANSFER: protocol "https" handled by
"/usr/libexec/condor/curl_plugin"
12/02/20 10:21:51 (pid:2754) FILETRANSFER: protocol "ftp" handled by
"/usr/libexec/condor/curl_plugin"
12/02/20 10:21:51 (pid:2754) FILETRANSFER: protocol "file" handled by
"/usr/libexec/condor/curl_plugin"
12/02/20 10:21:51 (pid:2754) FILETRANSFER: protocol "data" handled by
"/usr/libexec/condor/data_plugin"
12/02/20 10:21:51 (pid:2754) TransferIntermediate="(none)"
12/02/20 10:21:51 (pid:2754) entering FileTransfer::DownloadFiles
12/02/20 10:21:51 (pid:2754) SharedPortClient: sent connection request to
daemon at <81.180.86.113:9618> for shared port id 2035_976c_124730
12/02/20 10:21:51 (pid:2754) entering FileTransfer::Download
12/02/20 10:21:51 (pid:2754) FileTransfer: created download transfer
process with id 2776
12/02/20 10:21:51 (pid:2754) DaemonKeepAlive: in SendAliveToParent()
12/02/20 10:21:51 (pid:2754) SharedPortClient: sent connection request to
daemon at <192.168.181.110:9618> for shared port id 21700_f93a_3
12/02/20 10:21:51 (pid:2754) Completed DC_CHILDALIVE to daemon at
<192.168.181.110:9618>
12/02/20 10:21:51 (pid:2754) DaemonKeepAlive: Leaving SendAliveToParent()
- success
12/02/20 10:21:51 (pid:2754) File transfer completed successfully.
12/02/20 10:21:52 (pid:2754) Calling client FileTransfer handler function.
12/02/20 10:21:52 (pid:2754) HOOK_PREPARE_JOB not configured.
12/02/20 10:21:52 (pid:2754) Job 3585166.0 set to execute immediately
12/02/20 10:21:52 (pid:2754) Starting a VANILLA universe job with ID:
3585166.0
12/02/20 10:21:52 (pid:2754) In OsProc::OsProc()
12/02/20 10:21:52 (pid:2754) Main job KillSignal: 15 (SIGTERM)
12/02/20 10:21:52 (pid:2754) Main job RmKillSignal: 15 (SIGTERM)
12/02/20 10:21:52 (pid:2754) Main job HoldKillSignal: 15 (SIGTERM)
12/02/20 10:21:52 (pid:2754) in VanillaProc::StartJob()
12/02/20 10:21:52 (pid:2754) Requesting cgroup
/system.slice/condor.service/condor_var_lib_condor_execute_slot1_1@xxxxxxxxxxxxxx
for job.
12/02/20 10:21:52 (pid:2754) Value of RequestedChroot is unset.
12/02/20 10:21:52 (pid:2754) Adding mapping:
/var/lib/condor/execute/dir_2754/tmp/ -> /tmp.
12/02/20 10:21:52 (pid:2754) Checking the mapping of mount point /tmp.
12/02/20 10:21:52 (pid:2754) Current mount, /, is shared.
12/02/20 10:21:52 (pid:2754) Adding mapping:
/var/lib/condor/execute/dir_2754/var/tmp/ -> /var/tmp.
12/02/20 10:21:52 (pid:2754) Checking the mapping of mount point /var/tmp.
12/02/20 10:21:52 (pid:2754) Current mount, /, is shared.
12/02/20 10:21:52 (pid:2754) PID namespace option: false
12/02/20 10:21:52 (pid:2754) in OsProc::StartJob()
12/02/20 10:21:52 (pid:2754) IWD: /var/lib/condor/execute/dir_2754
12/02/20 10:21:52 (pid:2754) Input file: /dev/null
12/02/20 10:21:52 (pid:2754) Output file:
/var/lib/condor/execute/dir_2754/_condor_stdout
12/02/20 10:21:52 (pid:2754) Error file:
/var/lib/condor/execute/dir_2754/_condor_stdout
12/02/20 10:21:52 (pid:2754) Renice expr "10" evaluated to 10
12/02/20 10:21:52 (pid:2754) Env = TEMP=/var/lib/condor/execute/dir_2754
OMP_NUM_THREADS=1 BATCH_SYSTEM=HTCondor
TMPDIR=/var/lib/condor/execute/dir_2754 CONDOR_JOB_PIDS=
CHIRP_DELAYED_UPDATE_PREFIX=Chirp*
CONDOR_JOB_IWD=/var/lib/condor/execute/dir_2754
TMP=/var/lib/condor/execute/dir_2754 CONDOR_BIN=/usr/bin
CONDOR_JOB_AD=/var/lib/condor/execute/dir_2754/.job.ad
CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_2754 CONDOR_SLOT=slot1_1
CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_2754/.chirp.config
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_2754/.machine.ad
X509_USER_PROXY=/var/lib/condor/execute/dir_2754/user.proxy
12/02/20 10:21:52 (pid:2754) ENFORCE_CPU_AFFINITY not true, not setting
affinity
12/02/20 10:21:52 (pid:2754) Running job as user pillhcb01
12/02/20 10:21:52 (pid:2754) About to exec
/var/lib/condor/execute/dir_2754/condor_exec.exe
12/02/20 10:21:52 (pid:2754) Create_Process succeeded, pid=2878
12/02/20 10:21:52 (pid:2754) Initializing cgroup library.
12/02/20 10:21:52 (pid:2754) Not enforcing memory limit.
12/02/20 10:21:52 (pid:2754) Subscribed the starter to OOM notification
for this cgroup; jobs triggering an OOM will be put on hold.
12/02/20 10:22:01 (pid:2754) Entering JICShadow::updateShadow()
12/02/20 10:22:01 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:22:01 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:22:01 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:22:01 (pid:2754) Sent job ClassAd update to startd.
12/02/20 10:22:01 (pid:2754) Leaving JICShadow::updateShadow(): success
12/02/20 10:22:01 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:22:01 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:22:01 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Entering JICShadow::updateShadow()
12/02/20 10:27:02 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Sent job ClassAd update to startd.
12/02/20 10:27:02 (pid:2754) Leaving JICShadow::updateShadow(): success
12/02/20 10:27:02 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:27:02 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:31:29 (pid:2754) Got SIGTERM. Performing graceful shutdown.
12/02/20 10:31:29 (pid:2754) Started timer to call main_shutdown_fast in
1800 seconds
12/02/20 10:31:29 (pid:2754) ShutdownGraceful all jobs.
12/02/20 10:31:29 (pid:2754) in VanillaProc::ShutdownGraceful()
12/02/20 10:31:29 (pid:2754) Send_Signal(): Doing kill(2878,15) [SIGTERM]
12/02/20 10:31:29 (pid:2754) Process exited, pid=2878, signal=15
12/02/20 10:31:29 (pid:2754) Inside VanillaProc::JobReaper()
12/02/20 10:31:29 (pid:2754) Inside OsProc::JobReaper()
12/02/20 10:31:29 (pid:2754) Inside UserProc::JobReaper()
12/02/20 10:31:30 (pid:2754) Reaper: all=1 handled=1 ShuttingDown=1
12/02/20 10:31:30 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) HOOK_JOB_EXIT not configured.
12/02/20 10:31:30 (pid:2754) In VanillaProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) Inside OsProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) Inside UserProc::PublishUpdateAd()
12/02/20 10:31:30 (pid:2754) Entering JICShadow::updateShadow()
12/02/20 10:31:30 (pid:2754) Sent job ClassAd update to startd.
12/02/20 10:31:30 (pid:2754) Leaving JICShadow::updateShadow(): success
12/02/20 10:31:30 (pid:2754) Inside JICShadow::transferOutput(void)
12/02/20 10:31:30 (pid:2754) JICShadow::transferOutput(void): Transferring...
12/02/20 10:31:30 (pid:2754) Inside JICShadow::transferOutputMopUp(void)
12/02/20 10:31:30 (pid:2754) Inside OsProc::JobExit()
12/02/20 10:31:30 (pid:2754) Notifying exit status=15 reason=107
12/02/20 10:31:30 (pid:2754) Sent job ClassAd update to startd.
12/02/20 10:31:30 (pid:2754) All jobs have exited... starter exiting
12/02/20 10:31:30 (pid:2754) Removing /var/lib/condor/execute/dir_2754
12/02/20 10:31:30 (pid:2754) Attempting to remove
/var/lib/condor/execute/dir_2754 as SuperUser (root)
12/02/20 10:31:30 (pid:2754) **** condor_starter (condor_STARTER) pid 2754
EXITING WITH STATUS 0
12/02/20 10:31:30 (pid:2754) Deleting the StarterHookMgr


Thanks,
Mihai


>
> Hello,
>
> I'm running ATLAS (LHC) single-core and multi-core jobs, and I have
> observed that nodes with 20 cores behave as black holes when they run a
> mix of multi-core and single-core jobs. I have configured dynamic slots
> and defragmentation, and I'm wondering whether something is wrong in my
> configuration. To use all 20 cores of the machine, I have split it into
> 6 slots: 2 slots with 8 cores and 4 slots with 1 core:
>
> CLAIM_WORKLIFE=3600
> CONTINUE=TRUE
> JOB_RENICE_INCREMENT=10
> KILL=FALSE
> NUM_SLOTS=6
> NUM_SLOTS_TYPE_1=2
> SLOT_TYPE_1_PARTITIONABLE=TRUE
> SLOT_TYPE_1=cpus=8
> NUM_SLOTS_TYPE_2=4
> SLOT_TYPE_2_PARTITIONABLE=TRUE
> SLOT_TYPE_2=cpus=1
> PREEMPT=FALSE
> RANK=0
> SUSPEND=FALSE
> SLOT_TYPE_1_CONSUMPTION_POLICY=False
> SLOT_TYPE_2_CONSUMPTION_POLICY=False
> CONSUMPTION_POLICY=False
> CLAIM_PARTITIONABLE_LEFTOVERS=False
>
> Also, below you can see my defragmentation configuration file:
>
> SETTABLE_ATTRS_CONFIG=DEFRAG_MAX_WHOLE_MACHINES
> ,DEFRAG_MAX_CONCURRENT_DRAINING ,DEFRAG_DRAINING_MACHINES_PER_HOUR
> ENABLE_RUNTIME_CONFIG=TRUE
>
> DEFRAG_MAX_WHOLE_MACHINES = 1
> DEFRAG_MAX_CONCURRENT_DRAINING = 1
> DEFRAG_DRAINING_MACHINES_PER_HOUR = 20
> DEFRAG_WHOLE_MACHINE_EXPR = (Cpus >= 8 && PartitionableSlot)
>
> and the defragmentation script looks like this:
>
> #!/bin/bash
>
>
> function setDefrag () {
>
>    defrag_address=$(condor_status -any -autoformat MyAddress -constraint
> 'MyType =?= "Defrag"')
>
>    echo "Setting DEFRAG_MAX_CONCURRENT_DRAINING=$3,
> DEFRAG_DRAINING_MACHINES_PER_HOUR=$4, DEFRAG_MAX_WHOLE_MACHINES=$5
> (idle multicore=$1, running multicore=$2)"
>
>    /usr/bin/condor_config_val -address "$defrag_address" -rset
> "DEFRAG_MAX_CONCURRENT_DRAINING = $3" >& /dev/null
>    /usr/bin/condor_config_val -address "$defrag_address" -rset
> "DEFRAG_DRAINING_MACHINES_PER_HOUR = $4" >& /dev/null
>    /usr/bin/condor_config_val -address "$defrag_address" -rset
> "DEFRAG_MAX_WHOLE_MACHINES = $5" >& /dev/null
>    /usr/sbin/condor_reconfig -daemon defrag >& /dev/null
> }
>
> idle_jobs=$(condor_q atlas01 -constraint 'JobStatus==1' -af RequestCpus
> -name arc6atlas1.nipne.ro | grep -cx 8)
> running_jobs=$(condor_q atlas01 -constraint 'JobStatus==2' -af RequestCpus
> -name arc6atlas1.nipne.ro | grep -cx 8)
>
> if [ "$idle_jobs" -gt 15 ] && [ "$running_jobs" -lt 150 ]
> then
>    setDefrag "$idle_jobs" "$running_jobs" 40 25 120
> elif [ "$idle_jobs" -gt 15 ] && [ "$running_jobs" -gt 150 ]
> then
>    setDefrag "$idle_jobs" "$running_jobs" 4 4 120
> else
>    setDefrag "$idle_jobs" "$running_jobs" 1 1 4
> fi
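
For testing those thresholds without touching a live pool, the branch
logic can be factored into a pure function (a hypothetical refactor of
the script above, not part of the running setup; note that running_jobs
exactly equal to 150 falls through to the quiet branch in the original,
which may or may not be intended):

```shell
# defrag_params IDLE RUNNING -> echoes "MAX_CONCURRENT PER_HOUR MAX_WHOLE"
# Mirrors the if/elif/else above: drain hard when multicore jobs pile up
# idle, drain gently when many are already running, and barely at all
# otherwise.
defrag_params() {
    local idle="$1" running="$2"
    if [ "$idle" -gt 15 ] && [ "$running" -lt 150 ]; then
        echo "40 25 120"
    elif [ "$idle" -gt 15 ] && [ "$running" -gt 150 ]; then
        echo "4 4 120"
    else
        echo "1 1 4"
    fi
}
```

The tail of the script would then reduce to a single call:
setDefrag "$idle_jobs" "$running_jobs" $(defrag_params "$idle_jobs" "$running_jobs")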
>
>
> Does my configuration look OK to you? Or should I look for the problem
> elsewhere, outside the HTCondor configuration?
>
> Thank you,
> Mihai
>
>
>
>


Dr. Mihai Ciubancan
IT Department
National Institute of Physics and Nuclear Engineering "Horia Hulubei"
Str. Reactorului no. 30, P.O. BOX MG-6
077125, Magurele - Bucharest, Romania
http://www.ifin.ro
Work:   +40214042360
Mobile: +40761345687
Fax:    +40214042395