
Re: [HTCondor-users] ERROR "Can no longer talk to condor_starter <host:slot>" at line 209 in file src/condor_shadow.V6.1/NTreceivers.cpp



Dear experts,

I have some questions for debugging:
Can I avoid the restart of a job in the vanilla and/or parallel universe if I use
Requirements = (NumJobStarts==0)
in the condor submit description file?
If it works, will the job stay idle or will it be removed?
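For illustration, this is roughly the submit description I have in mind (just a
sketch; the executable, arguments, machine count and the periodic_remove line are
only my guesses, not something we use yet):

    universe        = parallel
    executable      = openmpiscript
    arguments       = my_mpi_program
    machine_count   = 8
    # do not match again once the job has already been started once
    requirements    = (NumJobStarts == 0)
    # idea: remove the job instead of leaving it idle forever after an eviction
    # (JobStatus == 1 means the job is idle again)
    periodic_remove = (JobStatus == 1) && (NumJobStarts >= 1)
    queue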

I found a job in the vanilla universe that started on 12/9, was restarted shortly
before Christmas, and is still running. I assume the reason was also network
problems, but unfortunately the condor and system log files we still have only go
back to January.
Is there any way to make condor a little bit more robust against network
problems via configuration? For example, by waiting a little longer or making
more reconnection attempts?
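For example, if the reconnect time is controlled by the job lease, would something
like this in the submit file already help (a sketch; 7200 is an arbitrary value,
and as far as I understand the default lease is 40 minutes)?

    # keep the claim alive and let the shadow try to reconnect to the starter
    # for up to 2 hours instead of the default 2400 seconds
    job_lease_duration = 7200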

We are working on automatic restart of the mpi jobs and are trying to use more
frequent checkpoints, but it seems to be a lot of work, so any idea would be
welcome.
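The direction we are thinking of is to let the job write its own checkpoint file
and have condor bring it back when the job is evicted, roughly like this (sketch
from a vanilla universe test; the file name is made up and I am not sure how well
ON_EXIT_OR_EVICT carries over to the parallel universe):

    # transfer the intermediate checkpoint back if the job is evicted,
    # so a restarted job can continue from it instead of from scratch
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT_OR_EVICT
    transfer_output_files   = checkpoint.dat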

Best
Harald 


On Monday 06 February 2017 23:43:47 Harald van Pee wrote:
> There is one important argument why I think the problem is condor related,
> not mpi (of course I could be wrong).
> The condor communication goes via ethernet, and the ethernet connection had
> a problem for several minutes.
> The mpi communication goes via infiniband, and there was no infiniband
> problem during this time.
> 
> Harald
> 
> On Monday 06 February 2017 23:04:01 Harald van Pee wrote:
> > Hi Greg,
> > 
> > thanks for your answer.
> > 
> > On Monday 06 February 2017 22:18:08 Greg Thain wrote:
> > > On 02/06/2017 02:40 PM, Harald van Pee wrote:
> > > > Hello,
> > > > 
> > > > we got mpi running in the parallel universe with htcondor 8.4 using
> > > > openmpiscript, and in general it is working without any problem.
> > > 
> > > In general, the MPI jobs themselves cannot survive a network outage or
> > > partition, even a temporary one.  HTCondor will reconnect the shadow to
> > > the starters, if the problem is just between the submit machine and the
> > > execute machines, but if the network problem also impacts node-to-node
> > > communication, then the job has to be aborted and restarted from
> > > scratch because of the way MPI works.
> > 
> > The problem seems to be between the submit machine and one running node (not
> > the node where mpirun was started).
> > If you are right, it should be possible to find an error from mpirun because
> > it lost one node, right?
> > But it seems condor kills the job because of a shadow exception.
> > Unfortunately we do not see the output of the stopped job because it is
> > overwritten by the newly started one.
> > Any suggestion on how to find out if it is really an mpi related problem?
> > 
> > > If possible, we would recommend that long-running jobs that suffer from
> > > this problem try to self-checkpoint themselves, so that when they are
> > > restarted, they don't need to be restarted from scratch.
> > > 
> > > -greg