
[HTCondor-users] Please help me; about Shadow exception!



Hi, I'm following the 'HTCondor Quick Start Guide' (https://research.cs.wisc.edu/htcondor/manual/quickstart.html).
After I submitted a job, it ran for about 5 seconds and then went from the RUN state back to the IDLE state.
Much later, the job did finish and its output file was written correctly.
I did not time it exactly, but I estimate it took about 20~30 minutes; the job log below shows roughly 28 minutes between submission (21:34:25) and termination (22:02:41).
I think something is wrong, so I am asking the condor-users mailing list about this problem.

Below is all the information I have about the current state of my machines.

The job file is:

#!/bin/bash
# file name: sleep.sh
TIMETOWAIT="10"
echo "sleeping for $TIMETOWAIT seconds"
/bin/sleep $TIMETOWAIT

The submit specification file is:

executable              = sleep.sh
log                     = sleep.log
output                  = outfile.txt
error                   = errors.txt
should_transfer_files   = Yes
when_to_transfer_output = ON_EXIT
queue

The job log file (sleep.log) is:

000 (012.000.000) 02/10 21:34:25 Job submitted from host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_3>
...
001 (012.000.000) 02/10 21:34:26 Job executing on host: <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=297370_fa77_62>
...
007 (012.000.000) 02/10 21:34:28 Shadow exception!
    Error from slot1@ubuntu: Create_Process failed to register the job with the ProcD
    0  -  Run Bytes Sent By Job
    114  -  Run Bytes Received By Job
...
## the above message repeated ##
001 (012.000.000) 02/10 22:02:30 Job executing on host: <10.150.21.171:9618?addrs=10.150.21.171-9618+[--1]-9618&noUDP&sock=42970_bd0c_4>
...
006 (012.000.000) 02/10 22:02:38 Image size of job updated: 380
    1  -  MemoryUsage of job (MB)
    380  -  ResidentSetSize of job (KB)
...
005 (012.000.000) 02/10 22:02:41 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    24  -  Run Bytes Sent By Job
    114  -  Run Bytes Received By Job
    24  -  Total Bytes Sent By Job
    2052  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :        9        1  27474539
       Memory (MB)          :        1        1      4025
...
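The total delay can be computed from the first and last timestamps in the log above instead of being estimated. This is only a minimal bash sketch, assuming both events fall on the same day; the function names to_seconds and elapsed_seconds are my own helpers, not part of HTCondor:

```bash
#!/bin/bash
# Convert an HH:MM:SS timestamp to seconds since midnight.
# The 10# prefix forces base 10 so fields like "08" are not parsed as octal.
to_seconds() {
    local h m s
    IFS=: read -r h m s <<< "$1"
    echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Elapsed seconds between two timestamps on the same day
# (hypothetical helper; does not handle runs that cross midnight).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}

submitted="21:34:25"    # event 000 (job submitted) in sleep.log
terminated="22:02:41"   # event 005 (job terminated) in sleep.log

total=$(elapsed_seconds "$submitted" "$terminated")
echo "elapsed: $(( total / 60 )) min $(( total % 60 )) s"
# prints "elapsed: 28 min 16 s"
```

So a job that only sleeps for 10 seconds needed about 28 minutes from submission to termination.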

When I checked the ShadowLog (/var/log/condor/ShadowLog), it said:

02/07/17 15:57:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
02/07/17 15:57:54 ** /usr/sbin/condor_shadow
02/07/17 15:57:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
02/07/17 15:57:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
02/07/17 15:57:54 ** $CondorVersion: 8.6.0 Jan 26 2017 BuildID: 395190 $
02/07/17 15:57:54 ** $CondorPlatform: x86_64_Debian7 $
02/07/17 15:57:54 ** PID = 209324
02/07/17 15:57:54 ** Log last touched 2/7 15:57:52
02/07/17 15:57:54 ******************************************************
02/07/17 15:57:54 Using config source: /etc/condor/condor_config
02/07/17 15:57:54 Using local config sources:
02/07/17 15:57:54    /etc/condor/condor_config.local
02/07/17 15:57:54 config Macros = 67, Sorted = 67, StringBytes = 1769, TablesBytes = 1112
02/07/17 15:57:54 CLASSAD_CACHING is OFF
02/07/17 15:57:54 Daemon Log is logging: D_ALWAYS D_ERROR
02/07/17 15:57:54 SharedPortEndpoint: waiting for connections to named socket 209324_bd59
02/07/17 15:57:54 DaemonCore: command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 DaemonCore: private command socket at <10.150.21.170:9618?addrs=10.150.21.170-9618+[--1]-9618&noUDP&sock=209324_bd59>
02/07/17 15:57:54 ERROR "Assertion ERROR on (job_ad_file)" at line 165 in file /slots/01/dir_17483/sources/src/condor_shadow.V6.1/shadow_v61_main.cpp


Here is my HTCondor configuration as well.

1. condor_config on the central manager machine:

LOCAL_DIR = /var
##  Where is the machine-specific local config file for each host?
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local
##  If your configuration is on a shared file system, then this might be a better default
#LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
##  If the local config file is not present, is it an error? (WARNING: This is a potential security issue.)
REQUIRE_LOCAL_CONFIG_FILE = false
STARTER_ALLOW_RUNAS_OWNER = TRUE
##  The normal way to do configuration with RPMs is to read all of the
##  files in a given directory that don't match a regex as configuration files.
##  Config files are read in lexicographic order.
LOCAL_CONFIG_DIR = /etc/condor/config.d
#LOCAL_CONFIG_DIR_EXCLUDE_REGEXP = ^((\..*)|(.*~)|(#.*)|(.*\.rpmsave)|(.*\.rpmnew))$
##  Use a host-based security policy. By default CONDOR_HOST and the local machine will be allowed
use SECURITY : HOST_BASED
##  To expand your condor pool beyond a single host, set ALLOW_WRITE to match all of the hosts
ALLOW_WRITE = nickeys-*.xxxxx.ac.kr
ALLOW_READ = nickeys-*.xxxxx.ac.kr
##  FLOCK_FROM defines the machines that grant access to your pool via flocking. (i.e. these machines can join your pool).
FLOCK_FROM = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr
##  FLOCK_TO defines the central managers that your schedd will advertise itself to (i.e. these pools will give matches to your schedd).
FLOCK_TO = nickeys-fs.xxxxx.ac.kr, nickeys-1.xxxxx.ac.kr, nickeys-2.xxxxx.ac.kr, nickeys-3.xxxxx.ac.kr, nickeys-4.xxxxx.ac.kr, nickeys-5.xxxxx.ac.kr, nickeys-6.xxxxx.ac.kr, nickeys-7.xxxxx.ac.kr, nickeys-8.xxxxx.ac.kr
UID_DOMAIN = xxxxx.ac.kr
RUN     = $(LOCAL_DIR)/run/condor
LOG     = $(LOCAL_DIR)/log/condor
LOCK    = $(LOCAL_DIR)/lock/condor
SPOOL   = $(LOCAL_DIR)/lib/condor/spool
EXECUTE = $(LOCAL_DIR)/lib/condor/execute
BIN     = $(RELEASE_DIR)/bin
LIB     = $(RELEASE_DIR)/lib/condor
INCLUDE = $(RELEASE_DIR)/include/condor
SBIN    = $(RELEASE_DIR)/sbin
LIBEXEC = $(RELEASE_DIR)/lib/condor/libexec
SHARE   = $(RELEASE_DIR)/share/condor
GANGLIA_LIB64_PATH = /lib,/usr/lib,/usr/local/lib
PROCD_ADDRESS = $(RUN)/procd_pipe
##  What machine is your central manager?
CONDOR_HOST = nickeys-fs.xxxxx.ac.kr
FILESYSTEM_DOMAIN = xxxxx.ac.kr
##  This macro determines what daemons the condor_master will start and keep its watchful eyes on.
##  The list is a comma or space separated list of subsystem names
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

2. condor_config.local on the execution machine:

FILESYSTEM_DOMAIN = xxxxx.ac.kr

That is all the information I have about my HTCondor setup.

Any hint, however small, would be appreciated; I have been struggling with this problem for 3 days.
I could not find any clue about it, even by searching Google.

Sincerely,