
Re: [HTCondor-users] First experience with the parallel universe.

One thing I discovered in my ongoing foray into the parallel universe is that some multi-machine jobs make assumptions about their execution environments that don't always hold in HTCondor. For example, CST Microwave Studio's job scheduler tries to find the same port available on all of the job's nodes so that they can all use the same solver-daemon configuration file.

However, it's entirely possible under normal circumstances for two nodes of the same parallel universe job to be assigned to the same physical machine, particularly if you've adjusted the negotiator rank to do depth-first fill of machines, say to allow the disk buffer cache to absorb some of the NFS server network traffic.

Needless to say, if two different instances of the daemon in two different slots on the same machine are configured to use the same port, bad things happen. Could it be that when the job succeeds it's only because it got lucky enough to spawn all 12 followers on separate physical machines? Given the assertion failure this may not be too likely, but worth double-checking.
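
One way to double-check is to list the host each node landed on and look for repeats. Here is a minimal C++ sketch of that check. The function name hostsWithMultipleNodes and the idea of feeding it a hand-collected list of host names (one entry per node, taken from wherever you record them) are my own assumptions for illustration, not anything provided by HTCondor or CST.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: given one host name per job node, return the hosts
// that were assigned more than one node. Any host in the result is a place
// where two daemons could end up configured with the same port.
std::vector<std::string> hostsWithMultipleNodes(
        const std::vector<std::string>& nodeHosts) {
    std::map<std::string, int> nodesPerHost;
    for (const auto& host : nodeHosts) {
        ++nodesPerHost[host];
    }

    std::vector<std::string> shared;
    for (const auto& entry : nodesPerHost) {
        if (entry.second > 1) {
            shared.push_back(entry.first);
        }
    }
    return shared;  // sorted by host name, because std::map iterates in key order
}
```

If the result is empty for the runs that succeed and non-empty for the runs that fail, that would point at port collisions rather than at the shadow.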

Under the hood, here's the shadow function which is failing the assertion:

ParallelShadow::spawnAllComrades( void )
{
		/*
		   If this function is being called, we've already spawned the
		   root node and gotten its ip/port from our special pseudo
		   syscall.  So, all we have to do is loop over our remote
		   resource list, modify each resource's job ad with the
		   appropriate info, and spawn our node on each resource.
		*/

	MpiResource *rr;
	int last = ResourceList.getlast();
	while( nextResourceToStart <= last ) {
		rr = ResourceList[nextResourceToStart];
		spawnNode( rr );  // This increments nextResourceToStart
	}
	ASSERT( nextResourceToStart == numNodes );
}

That assertion suggests that the activateClaim() call in spawnNode() did not succeed on at least one of the nodes.

Alternatively, nextResourceToStart may not be starting at the right value when spawnAllComrades() is called.
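
To make the possibilities concrete, here is a toy C++ model of that loop. It is not HTCondor code: ModelShadow, its std::vector resource list, and the boolean return value in place of ASSERT are stand-ins invented for illustration. It assumes that spawnNode() always advances the counter by exactly one, that getlast() returns the index of the last entry, and that the counter starts at 1 because the root node has already been spawned.

```cpp
#include <vector>

// Toy model of the shadow's bookkeeping (hypothetical, for illustration only).
struct ModelShadow {
    std::vector<int> resourceList;  // stand-in for ResourceList
    int nextResourceToStart;        // assumed non-negative
    int numNodes;

    // Assumption: spawning a node always bumps the counter by one.
    void spawnNode(int /*resource*/) { ++nextResourceToStart; }

    // Mirrors the loop in spawnAllComrades(), but returns the result of the
    // final check instead of asserting on it.
    bool spawnAllComrades() {
        int last = static_cast<int>(resourceList.size()) - 1;
        while (nextResourceToStart <= last) {
            spawnNode(resourceList[nextResourceToStart]);
        }
        return nextResourceToStart == numNodes;
    }
};
```

Under those assumptions the loop leaves nextResourceToStart at last + 1, or untouched if it was already past the end of the list. In this model the check therefore fails if the resource list holds fewer (or more) entries than numNodes, or if the counter starts beyond numNodes.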

	-Michael Pelletier.

> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Amy Bush
> Sent: Tuesday, September 12, 2017 5:01 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: Re: [HTCondor-users] First experience with the parallel universe.
>
> Thanks for the reply, Jason.
>
> I'm not using partitionable slots, but just in case, I went ahead and
> upgraded to 8.6.5 and had the user try again. Alas, still failing in the
> same way.
>
> Job executing on host: MPI_job
> Greetings from all the nodes
>
> Job started on all the nodes, and then
>
> 007 (070.000.000) 09/12 15:55:58 Shadow exception!
>         Assertion ERROR on (nextResourceToStart == numNodes)
>         0  -  Run Bytes Sent By Job
>         0  -  Run Bytes Received By Job
>
> And the whole thing starts over again. And over and over and over.
>
> And if it follows the same trend, eventually it will run. After 70
> failures. That's one of the most mystifying parts.
>
> If anyone else has any thoughts or things I might try or log files I might
> look in to figure this out, I'm desperately in need of all those things.