
Re: [HTCondor-users] Job disconnected



Very mysterious.  A few ideas to try out:

1. In your condor_config.local, please append

  MASTER_DEBUG = D_ALL

and then run condor_reconfig or restart HTCondor. This sends much more information to the MasterLog, which may show what the master was trying to do at the time of the crash.

2. Another idea is to try updating to the latest stable release (currently v8.6.5) and see whether the problem persists.

3. Another idea is to install the HTCondor debug symbols package.
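
Before and after trying any of these, it may help to measure how often the master really restarts and when it dumps a stack. The following is only a rough Python sketch, not an HTCondor tool: it assumes the MasterLog contains the "STARTING UP" banner and "Stack dump for process ... at timestamp ..." lines in the form quoted in this thread, and the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Patterns assumed from the MasterLog excerpt in this thread; other
# HTCondor versions or log settings may format these lines differently.
START_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \*\* condor_master "
    r"\(CONDOR_MASTER\) STARTING UP"
)
DUMP_RE = re.compile(r"Stack dump for process (\d+) at timestamp (\d+)")


def master_starts(log_text):
    """Return the local-time timestamps of every master startup banner."""
    return [
        datetime.strptime(stamp, "%m/%d/%y %H:%M:%S")
        for stamp in START_RE.findall(log_text)
    ]


def restart_gaps_seconds(starts):
    """Return the seconds elapsed between consecutive master startups."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(starts, starts[1:])]


def stack_dumps(log_text):
    """Return (pid, UTC datetime) for every stack dump header in the log."""
    return [
        (int(pid), datetime.fromtimestamp(int(epoch), tz=timezone.utc))
        for pid, epoch in DUMP_RE.findall(log_text)
    ]
```

To use it, read the whole MasterLog into a string (for example with open(path, errors="replace").read()) and pass that string to master_starts() and stack_dumps(). The stack dump times are reported in UTC, while the banner timestamps are in the machine's local time, so allow for the offset when comparing them.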

regards,
Todd


On 8/21/2017 6:11 AM, Hervé Lemaitre wrote:

When I looked in log directory I found core files and the MasterLog displays:

08/21/17 12:38:48 Can't open directory "/condor/local/paraty/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
08/21/17 12:38:48 Cannot open /condor/local/paraty/config: No such file or directory
08/21/17 12:38:48 ******************************************************
08/21/17 12:38:48 ** condor_master (CONDOR_MASTER) STARTING UP
08/21/17 12:38:48 ** /condor/install_ubuntu/sbin/condor_master
08/21/17 12:38:48 ** SubsystemInfo: name=MASTER type=MASTER(2) class="DAEMON"(1)
08/21/17 12:38:48 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
08/21/17 12:38:48 ** $CondorVersion: 8.6.3 May 08 2017 BuildID: 404928 $
08/21/17 12:38:48 ** $CondorPlatform: x86_64_Ubuntu14 $
08/21/17 12:38:48 ** PID = 4380
08/21/17 12:38:48 ** Log last touched 8/21 12:37:46
08/21/17 12:38:48 ******************************************************
08/21/17 12:38:48 Using config source: /condor/install_centos/etc/condor_config
08/21/17 12:38:48 Using local config sources:
08/21/17 12:38:48    /condor/local/paraty/condor_config.local
08/21/17 12:38:48 config Macros = 72, Sorted = 72, StringBytes = 2083, TablesBytes = 2640
08/21/17 12:38:48 CLASSAD_CACHING is OFF
08/21/17 12:38:48 Daemon Log is logging: D_ALWAYS D_ERROR
08/21/17 12:38:49 Removed /tmp/condor-lock.0.533212494411604/shared_port_ad (assuming it is left over from previous run)
08/21/17 12:38:49 SharedPortEndpoint: waiting for connections to named socket 4380_e4e7
08/21/17 12:38:49 SharedPortEndpoint: failed to open /tmp/condor-lock.0.533212494411604/shared_port_ad: No such file or directory
08/21/17 12:38:49 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
08/21/17 12:38:49 DaemonCore: private command socket at <10.42.0.25:0?sock=4380_e4e7>
08/21/17 12:38:49 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
08/21/17 12:38:49 Master restart (GRACEFUL) is watching /condor/install_ubuntu/sbin/condor_master (mtime:1502961752)
08/21/17 12:38:49 Collector port not defined, will use default: 9618
08/21/17 12:38:49 Started DaemonCore process "/condor/install_ubuntu/libexec/condor_shared_port", pid and pgroup = 4403
08/21/17 12:38:49 Waiting for /tmp/condor-lock.0.533212494411604/shared_port_ad to appear.
08/21/17 12:38:49 systemd watchdog notification support not available.
08/21/17 12:38:50 Found /tmp/condor-lock.0.533212494411604/shared_port_ad.
08/21/17 12:38:50 Started DaemonCore process "/condor/install_ubuntu/sbin/condor_schedd", pid and pgroup = 4404
08/21/17 12:38:50 Started DaemonCore process "/condor/install_ubuntu/sbin/condor_startd", pid and pgroup = 4405
08/21/17 12:38:54 Setting ready state 'Ready' for STARTD
Stack dump for process 4380 at timestamp 1503313127 (9 frames)
/condor/install_ubuntu/sbin/../lib/libcondor_utils_8_6_3.so(dprintf_dump_stack+0x72)[0x7f8af261fd32]
/condor/install_ubuntu/sbin/../lib/libcondor_utils_8_6_3.so(_Z18linux_sig_coredumpi+0x24)[0x7f8af27f5ca4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f8af0d10390]
/lib/x86_64-linux-gnu/libc.so.6(__select+0x13)[0x7f8af0a32573]
/condor/install_ubuntu/sbin/../lib/libcondor_utils_8_6_3.so(_ZN8Selector7executeEv+0xa6)[0x7f8af261ca16]
/condor/install_ubuntu/sbin/../lib/libcondor_utils_8_6_3.so(_ZN10DaemonCore6DriverEv+0x1052)[0x7f8af27e8fb2]
/condor/install_ubuntu/sbin/../lib/libcondor_utils_8_6_3.so(_Z7dc_mainiPPc+0x13a4)[0x7f8af27f9314]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f8af0955830]
/condor/install_ubuntu/sbin/condor_master[0x40a70f]
08/21/17 12:59:48 Can't open directory "/condor/local/paraty/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
08/21/17 12:59:48 Cannot open /condor/local/paraty/config: No such file or directory
08/21/17 12:59:48 ******************************************************
08/21/17 12:59:48 ** condor_master (CONDOR_MASTER) STARTING UP
08/21/17 12:59:48 ** /condor/install_ubuntu/sbin/condor_master
08/21/17 12:59:48 ** SubsystemInfo: name=MASTER type=MASTER(2) class="DAEMON"(1)
08/21/17 12:59:48 ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
08/21/17 12:59:48 ** $CondorVersion: 8.6.3 May 08 2017 BuildID: 404928 $
08/21/17 12:59:48 ** $CondorPlatform: x86_64_Ubuntu14 $
08/21/17 12:59:48 ** PID = 4718
08/21/17 12:59:48 ** Log last touched 8/21 12:58:47
08/21/17 12:59:48 ******************************************************
08/21/17 12:59:48 Using config source: /condor/install_centos/etc/condor_config
08/21/17 12:59:48 Using local config sources:
08/21/17 12:59:48    /condor/local/paraty/condor_config.local
08/21/17 12:59:48 config Macros = 72, Sorted = 72, StringBytes = 2083, TablesBytes = 2640
08/21/17 12:59:48 CLASSAD_CACHING is OFF
08/21/17 12:59:48 Daemon Log is logging: D_ALWAYS D_ERROR
08/21/17 12:59:49 Removed /tmp/condor-lock.0.533212494411604/shared_port_ad (assuming it is left over from previous run)
08/21/17 12:59:49 SharedPortEndpoint: waiting for connections to named socket 4718_bbea
08/21/17 12:59:49 SharedPortEndpoint: failed to open /tmp/condor-lock.0.533212494411604/shared_port_ad: No such file or directory
08/21/17 12:59:49 SharedPortEndpoint: did not successfully find SharedPortServer address. Will retry in 60s.
08/21/17 12:59:49 DaemonCore: private command socket at <10.42.0.25:0?sock=4718_bbea>
08/21/17 12:59:49 Adding SHARED_PORT to DAEMON_LIST, because USE_SHARED_PORT=true (to disable this, set AUTO_INCLUDE_SHARED_PORT_IN_DAEMON_LIST=False)
08/21/17 12:59:49 Master restart (GRACEFUL) is watching /condor/install_ubuntu/sbin/condor_master (mtime:1502961752)
08/21/17 12:59:49 Collector port not defined, will use default: 9618
08/21/17 12:59:49 Started DaemonCore process "/condor/install_ubuntu/libexec/condor_shared_port", pid and pgroup = 4740
08/21/17 12:59:49 Waiting for /tmp/condor-lock.0.533212494411604/shared_port_ad to appear.
08/21/17 12:59:49 systemd watchdog notification support not available.
08/21/17 12:59:50 Found /tmp/condor-lock.0.533212494411604/shared_port_ad.
08/21/17 12:59:50 Started DaemonCore process "/condor/install_ubuntu/sbin/condor_schedd", pid and pgroup = 4743
08/21/17 12:59:50 Started DaemonCore process "/condor/install_ubuntu/sbin/condor_startd", pid and pgroup = 4744
08/21/17 12:59:54 Setting ready state 'Ready' for STARTD

It seems that my daemons are restarting every 15 minutes and I do not know why.


Université Paris-Sud
*Hervé LEMAITRE*

U1000 "Neuroimagerie en Psychiatrie"
Service hospitalier Frédéric Joliot - 4, Place du Général Leclerc
91401 Orsay




