
Re: [HTCondor-users] First experience with the parallel universe.

I just finally figured this out, so I thought I should post an update for
future people like me, looking through the archives.

It turns out it was my own stupidity. Of course.

I eventually found my way back to the config file for my dedicated
parallel nodes, and found this:

##  There are three basic options for the policy on dedicated
##  resources: 
##  1) Only run dedicated jobs
##  2) Always run jobs, but prefer dedicated ones
##  3) Always run dedicated jobs, but only allow non-dedicated jobs to
##     run on an opportunistic basis.   
##  You MUST uncomment the set of policy expressions you want to use
##  at your site.

And then three sets of policy expressions, all of which were commented
out. I uncommented one, and now things seem to work properly.
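
For anyone who lands here the same way: those policy sets live in the sample
dedicated-resource config (often a file like condor_config.local.dedicated.resource).
The first option, "only run dedicated jobs", looks roughly like the sketch below.
This is from memory of the manual's sample file, so treat your own config as
authoritative, and the DedicatedScheduler name has to match your actual schedd host:

```
DedicatedScheduler = "DedicatedScheduler@full.host.name.here"
STARTD_ATTRS = $(STARTD_ATTRS), DedicatedScheduler

##  Policy 1: only run dedicated jobs
START         = Scheduler =?= $(DedicatedScheduler)
SUSPEND       = False
CONTINUE      = True
PREEMPT       = False
KILL          = False
WANT_SUSPEND  = False
WANT_VACATE   = False
RANK          = Scheduler =?= $(DedicatedScheduler)
```

The key part for my problem is PREEMPT = False: the startd never kicks a running
dedicated subjob, which is what stops the restart loop.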

I think it was defaulting to option 3, and some of the parallel subjobs
were getting preempted. That's why it would restart, restart, restart,
but then eventually run to completion. It just required luck that none
of the jobs would get preempted.

Anyway. Embarrassing. But maybe it'll help someone else in the future.

On Wed, Sep 13, 2017 at 03:16:27PM +0000, Michael Pelletier wrote:
> One thing I discovered in my ongoing foray into the parallel universe is that some multi-machine jobs make assumptions about their execution environments that don't always hold in HTCondor. For example, CST Microwave Studio's job scheduler tries to find the same port available on all of the job's nodes so that they can all use the same solver-daemon configuration file.
> However, it's entirely possible under normal circumstances for two nodes of the same parallel universe job to be assigned to the same physical machine, particularly if you've adjusted the negotiator rank to do depth-first fill of machines, say to allow the disk buffer cache to absorb some of the NFS server network traffic.
> Needless to say, if two different instances of the daemon in two different slots on the same machine are configured to use the same port, bad things happen. Could it be that when the job succeeds it's only because it got lucky enough to spawn all 12 followers on separate physical machines? Given the assertion failure this may not be too likely, but worth double-checking.
> Under the hood, here's the shadow function which is failing the assertion:
> void
> ParallelShadow::spawnAllComrades( void )
> {
>     /*
>        If this function is being called, we've already spawned the
>        root node and gotten its ip/port from our special pseudo
>        syscall.  So, all we have to do is loop over our remote
>        resource list, modify each resource's job ad with the
>        appropriate info, and spawn our node on each resource.
>     */
>     MpiResource *rr;
>     int last = ResourceList.getlast();
>     while( nextResourceToStart <= last ) {
>         rr = ResourceList[nextResourceToStart];
>         spawnNode( rr );  // This increments nextResourceToStart
>     }
>     ASSERT( nextResourceToStart == numNodes );
> }
> That assertion suggests that the activateClaim() call in spawnNode() did not succeed on at least one of the nodes, or that nextResourceToStart was not starting at the right value when spawnAllComrades() was entered.
> 	-Michael Pelletier.
> > -----Original Message-----
> > From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> > Of Amy Bush
> > Sent: Tuesday, September 12, 2017 5:01 PM
> > To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> > Subject: Re: [HTCondor-users] First experience with the parallel universe.
> > 
> > Thanks for the reply, Jason.
> > 
> > I'm not using partitionable slots, but just in case, I went ahead and
> > upgraded to 8.6.5 and had the user try again. Alas, still failing in the
> > same way.
> > 
> > Job executing on host: MPI_job
> > 
> > Greetings from all the nodes
> > 
> > Job started on all the nodes, and then
> > 
> > 007 (070.000.000) 09/12 15:55:58 Shadow exception!
> >         Assertion ERROR on (nextResourceToStart == numNodes)
> >         0  -  Run Bytes Sent By Job
> >         0  -  Run Bytes Received By Job
> > 
> > And the whole thing starts over again. And over and over and over.
> > 
> > And if it follows the same trend, eventually it will run. After 70
> > failures. That's one of the most mystifying parts.
> > 
> > If anyone else has any thoughts or things I might try or log files I might
> > look in to figure this out, I'm desperately in need of all those things.
> > 
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/