
Re: [HTCondor-users] job cannot reconnect to starter running MPI



Hello,


----- Original Message -----
> From: "Todd L Miller" <tlmiller@xxxxxxxxxxx>
> To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
> Sent: Wednesday, June 7, 2017 17:59:58
> Subject: Re: [HTCondor-users] job cannot reconnect to starter running MPI
> 
>  	So what this says is that about two minutes into the job, the
> starter either crashed or hung (or the network went away, but that seems
> unlikely), and the shadow doesn't know why.  At this point, it would make
> sense to look at the execute node(s) -- their startd and starter logs --
> and see what's going on at the same time.
> 
> - ToddM


Thanks, ToddM, for your reply.

I was monitoring the logs on the execute nodes as suggested, and I caught some errors just after HTCondor moved the job from Running -> Idle.
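
In case it helps anyone reproduce this, the daemon log locations on the execute nodes can be found with condor_config_val; a quick sketch, assuming the default configuration knob names:

    # run on an execute node
    condor_config_val LOG            # base log directory
    condor_config_val STARTD_LOG     # path to the StartLog
    condor_config_val STARTER_LOG    # path to the StarterLog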

At the same time, a shadow exception is recorded in the job's log file. Is it possible that the application is requesting more resources than are available?
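
If that could be the cause, would raising the resource requests in the submit description file be the right fix? A minimal sketch of what I mean (the request values are made up, and machine_count is only guessed from the Node 17/19 entries below):

    universe        = parallel
    executable      = mp1script
    arguments       = /mnt/eups/packages/Linux64/cosmosis_portal/1.4+1/bin/cosmosis --mpi acf_grid.ini
    machine_count   = 20         # guessed; nodes up to 19 appear in the log
    request_cpus    = 1          # per slot; hypothetical value
    request_memory  = 2048       # MB per slot; hypothetical value
    output          = demo.out
    error           = demo.err
    queue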


== Job directory - log file ==
014 (1202.000.017) 06/08 14:37:13 Node 17 executing on host: <10.1.255.217:45695>
014 (1202.000.019) 06/08 14:37:13 Node 19 executing on host: <10.1.255.217:45695>
001 (1202.000.000) 06/08 14:37:13 Job executing on host: MPI_job
007 (1202.000.000) 06/08 14:40:04 Shadow exception!
        Can no longer talk to condor_starter <10.1.255.217:45695>
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job

== NODE01 StartLog ==

06/08/17 14:40:04 slot6: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot7: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot8: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot9: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot10: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot1: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot2: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot3: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot4: Called deactivate_claim_forcibly()
06/08/17 14:40:04 slot5: Called deactivate_claim_forcibly()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:44453>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot6: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:47520>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot7: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:46790>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot8: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:47811>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot9: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:46864>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot10: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:42263>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot1: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:45623>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot2: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:43942>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot3: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:49709>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot4: Called deactivate_claim()
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 28 bytes to <10.1.1.12:42044>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 slot5: Called deactivate_claim()
06/08/17 14:40:07 slot6: Got activate claim while starter is still alive.
06/08/17 14:40:07 slot6: Telling shadow to try again later.
06/08/17 14:40:08 slot6: Got activate claim while starter is still alive.
06/08/17 14:40:08 slot6: Telling shadow to try again later.
06/08/17 14:40:09 slot6: Got activate claim while starter is still alive.
06/08/17 14:40:09 slot6: Telling shadow to try again later.

== NODE02 StartLog.slot* ==

06/08/17 14:40:04 Got SIGQUIT.  Performing fast shutdown.
06/08/17 14:40:04 ShutdownFast all jobs.
06/08/17 14:40:04 Process exited, pid=25337, status=0
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 329 bytes to <10.1.1.12:47154>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 78 bytes to <10.1.1.12:47154>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 Failed to send job exit status to shadow
06/08/17 14:40:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/08/17 14:40:04 Returning from CStarter::JobReaper()
06/08/17 14:40:04 Got SIGTERM. Performing graceful shutdown.
06/08/17 14:40:04 ShutdownGraceful all jobs.
06/08/17 14:40:04 condor_write(): Socket closed when trying to write 78 bytes to <10.1.1.12:47154>, fd is 6
06/08/17 14:40:04 Buf::write(): condor_write() failed
06/08/17 14:40:04 Failed to send job exit status to shadow
06/08/17 14:40:04 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
06/08/17 14:40:04 **** condor_starter (condor_STARTER) pid 25327 EXITING WITH STATUS 0
06/08/17 14:40:44 Can't open directory "/var/opt/condor/config" as PRIV_UNKNOWN, errno: 2 (No such file or directory)
06/08/17 14:40:44 Setting maximum accepts per cycle 8.
06/08/17 14:40:44 ******************************************************
06/08/17 14:40:44 ** condor_starter (CONDOR_STARTER) STARTING UP
06/08/17 14:40:44 ** /opt/condor/sbin/condor_starter
06/08/17 14:40:44 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
06/08/17 14:40:44 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
06/08/17 14:40:44 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
06/08/17 14:40:44 ** $CondorPlatform: x86_64_rhap_6.3 $
06/08/17 14:40:44 ** PID = 26290
06/08/17 14:40:44 ** Log last touched 6/8 14:40:04
06/08/17 14:40:44 ******************************************************
06/08/17 14:40:44 Using config source: /opt/condor/etc/condor_config
06/08/17 14:40:44 Using local config sources: 
06/08/17 14:40:44    /opt/condor/etc/condor_config.local
06/08/17 14:40:44 DaemonCore: command socket at <10.1.255.217:42701>
06/08/17 14:40:44 DaemonCore: private command socket at <10.1.255.217:42701>
06/08/17 14:40:44 Setting maximum accepts per cycle 8.
06/08/17 14:40:45 Communicating with shadow <10.1.1.12:42046?noUDP>
06/08/17 14:40:45 Submitting machine is "webback2.local"
06/08/17 14:40:45 setting the orig job name in starter
06/08/17 14:40:45 setting the orig job iwd in starter
06/08/17 14:40:45 Job has WantIOProxy=true
06/08/17 14:40:45 Initialized IO Proxy.
06/08/17 14:40:45 Done setting resource limits
06/08/17 14:40:45 Job 1202.0 set to execute immediately
06/08/17 14:40:45 Starting a PARALLEL universe job with ID: 1202.0
06/08/17 14:40:45 IWD: /home/carlosadean/condor-examples/test_cosmosis
06/08/17 14:40:45 Output file: /home/carlosadean/condor-examples/test_cosmosis/demo.out
06/08/17 14:40:45 Error file: /home/carlosadean/condor-examples/test_cosmosis/demo.err
06/08/17 14:40:45 About to exec /home/carlosadean/condor-examples/test_cosmosis/mp1script /mnt/eups/packages/Linux64/cosmosis_portal/1.4+1/bin/cosmosis --mpi /home/carlosadean/condor-examples/test_cosmosis/acf_grid.ini
06/08/17 14:40:45 Running job as user root
06/08/17 14:40:45 Create_Process succeeded, pid=26301
06/08/17 14:40:46 IOProxy: accepting connection from 10.1.255.217
06/08/17 14:40:46 condor_write() failed: send() 1 bytes to <10.1.255.217:55669> returned -1, timeout=0, errno=32 Broken pipe.
06/08/17 14:40:46 IOProxyHandler: closing connection to 10.1.255.217
06/08/17 14:41:02 IOProxy: accepting connection from 10.1.255.217
06/08/17 14:41:02 IOProxyHandler: closing connection to 10.1.255.217
06/08/17 14:41:02 IOProxy: accepting connection from 10.1.255.217
06/08/17 14:41:02 IOProxyHandler: closing connection to 10.1.255.217
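
One more detail: the starter above says "waiting for job lease to expire or for a reconnect attempt". If the reconnect window turns out to be the problem, my understanding is that the lease can be inspected and lengthened like this (untested on our setup; 2400 seconds is just an example value):

    # inspect the lease currently attached to the job
    condor_q -l 1202.0 | grep -i JobLeaseDuration

    # in the submit description file, ask for a longer lease (in seconds)
    job_lease_duration = 2400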


--
Carlos Adean
IT Team
linea.gov.br
skype: carlosadean