[HTCondor-users] job cannot reconnect to starter running MPI



Hello Condor experts!

I do not have much experience with Condor. My problem is the following:

I have a Python application that uses mpi4py (Open MPI). When I submit it as a Condor job to my 2 dedicated nodes, where I can run MPI jobs, the job misbehaves.

After the job has been running for some time, Condor sets it back to Idle, the claimed slots go to Preempting/Vacating and then to Unclaimed, and Condor restarts the job from scratch, keeping the same job ID. The problem seems to be related to the node where mpirun is started, but I do not know how to solve it. On the other hand, when I run the same application outside Condor, using only mpirun, I do not have any problems.

This is part of the ShadowLog on the submit machine; maybe it can be useful.

06/06/17 19:45:05 ******************************************************
06/06/17 19:45:05 ** condor_shadow (CONDOR_SHADOW) STARTING UP
06/06/17 19:45:05 ** /opt/condor/sbin/condor_shadow
06/06/17 19:45:05 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
06/06/17 19:45:05 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
06/06/17 19:45:05 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
06/06/17 19:45:05 ** $CondorPlatform: x86_64_rhap_6.3 $
06/06/17 19:45:05 ** PID = 2409
06/06/17 19:45:05 ** Log last touched 6/6 19:41:35
06/06/17 19:45:05 ******************************************************
06/06/17 19:45:05 Using config source: /opt/condor/etc/condor_config
06/06/17 19:45:05 Using local config sources: 
06/06/17 19:45:05    /opt/condor/etc/condor_config.local
06/06/17 19:45:05 DaemonCore: command socket at <10.1.1.12:41168?noUDP>
06/06/17 19:45:05 DaemonCore: private command socket at <10.1.1.12:41168>
06/06/17 19:45:05 Setting maximum accepts per cycle 8.
06/06/17 19:45:05 Initializing a PARALLEL shadow for job 1115.0
06/06/17 19:45:06 (1115.0) (2409): Request to run on slot6@xxxxxxxxxx <10.1.255.219:42920> was ACCEPTED
06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300> <10.1.255.217:47300> was ACCEPTED
[...]
06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300> <10.1.255.217:47300> was ACCEPTED
06/06/17 19:47:52 (1115.0) (2409): Can no longer talk to condor_starter <10.1.255.217:47300>
06/06/17 19:47:52 (1115.0) (2409): This job cannot reconnect to starter, so job exiting
06/06/17 19:47:52 (1115.0) (2409): ERROR "Can no longer talk to condor_starter <10.1.255.217:47300>" at line 208 in file /slots/11/dir_17560/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
06/06/17 19:47:54 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
06/06/17 19:47:54 Setting maximum accepts per cycle 8.
06/06/17 19:47:54 ******************************************************
06/06/17 19:47:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
06/06/17 19:47:54 ** /opt/condor/sbin/condor_shadow
06/06/17 19:47:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
06/06/17 19:47:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
06/06/17 19:47:54 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
06/06/17 19:47:54 ** $CondorPlatform: x86_64_rhap_6.3 $
06/06/17 19:47:54 ** PID = 2783
06/06/17 19:47:54 ** Log last touched 6/6 19:47:52
06/06/17 19:47:54 ******************************************************
06/06/17 19:47:54 Using config source: /opt/condor/etc/condor_config
06/06/17 19:47:54 Using local config sources: 
06/06/17 19:47:54    /opt/condor/etc/condor_config.local
06/06/17 19:47:54 DaemonCore: command socket at <10.1.1.12:41168?noUDP>
06/06/17 19:47:54 DaemonCore: private command socket at <10.1.1.12:41168>
06/06/17 19:47:54 Setting maximum accepts per cycle 8.
06/06/17 19:47:54 Initializing a PARALLEL shadow for job 1115.0
06/06/17 19:47:55 (1115.0) (2783): Request to run on slot6@xxxxxxxxxx <10.1.255.219:42920> was DELAYED (previous job still being vacated)
[...]
06/06/17 19:48:15 (1115.0) (2783): Request to run on slot6@xxxxxxxxxx <10.1.255.219:42920> was DELAYED (previous job still being vacated)
06/06/17 19:48:15 (1115.0) (2783): activateClaim(): Too many retries, giving up.
06/06/17 19:48:15 (1115.0) (2783): Job 1115.0 is being evicted
06/06/17 19:48:16 (1115.0) (2783): logEvictEvent with unknown reason (108), aborting
06/06/17 19:48:16 (1115.0) (2783): **** condor_shadow (condor_SHADOW) pid 2783 EXITING WITH STATUS 108
06/06/17 19:48:38 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
06/06/17 19:48:38 Setting maximum accepts per cycle 8.
06/06/17 19:48:38 ******************************************************
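
To make the timing in the log above easier to read, here is a minimal Python sketch that measures how long the first shadow (PID 2409) ran before it lost contact with the starter. The function names `parse` and `seconds_until_disconnect` are hypothetical helpers written for this illustration, not part of Condor, and the timestamp format is assumed to be month/day/year. The log lines are copied from the excerpt above.

```python
import re
from datetime import datetime

# Lines copied from the ShadowLog excerpt above (first shadow, PID 2409).
LOG = """\
06/06/17 19:45:05 Initializing a PARALLEL shadow for job 1115.0
06/06/17 19:45:06 (1115.0) (2409): Request to run on slot6@xxxxxxxxxx <10.1.255.219:42920> was ACCEPTED
06/06/17 19:47:52 (1115.0) (2409): Can no longer talk to condor_starter <10.1.255.217:47300>
06/06/17 19:47:52 (1115.0) (2409): This job cannot reconnect to starter, so job exiting
"""

# Timestamp followed by the message text.
STAMP = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse(line):
    """Return (timestamp, message) for a log line, or None if it has no timestamp."""
    match = STAMP.match(line)
    if match is None:
        return None
    # Assumption: the date is written as month/day/two-digit year.
    when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return when, match.group(2)


def seconds_until_disconnect(text):
    """Return (seconds, starter address) from shadow start to the first lost starter."""
    start = None
    for line in text.splitlines():
        parsed = parse(line)
        if parsed is None:
            continue
        when, message = parsed
        if "Initializing a PARALLEL shadow" in message:
            start = when  # a new shadow started for the job
        elif "Can no longer talk to condor_starter" in message and start is not None:
            address = re.search(r"<([^>]+)>", message).group(1)
            return (when - start).total_seconds(), address
    return None


if __name__ == "__main__":
    print(seconds_until_disconnect(LOG))
```

For this excerpt the result is 167 seconds, with the lost starter at 10.1.255.217:47300. This only summarizes what the log already shows; it does not explain why the connection was lost.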


Thank you for the help.


--
Carlos Adean
www.linea.gov.br