
Re: [HTCondor-users] job cannot reconnect to starter running MPI



Hi Michael,

First of all, good luck!

For a new user it is not so easy to understand how the parallel universe works, and your last email was very useful for clarifying the big picture. I'll update to version 8.6 as soon as possible.

Many thanks!

--
Carlos Adean
IT Team
linea.gov.br
skype: carlosadean

----- Original Message -----
From: "Michael Pelletier" <Michael.V.Pelletier@xxxxxxxxxxxx>
To: "HTCondor-Users Mail List" <htcondor-users@xxxxxxxxxxx>
Sent: Wednesday, June 7, 2017 13:47:08
Subject: Re: [HTCondor-users] job cannot reconnect to starter running MPI

Carlos,

The Parallel Universe can be a bit of a challenge to wrap your mind around, I found. I've been poking at it for the better part of four years and only recently did I finally have a full-on light-bulb moment about it. I'm sure it would have come earlier if I'd had any call to use it in production.

The 8.6 version has substantial improvements to the code and documentation that also helped significantly. Here's the basic gist of what's going on...

The parallel universe starts the "executable" with its arguments at essentially the same time on every slot. Each slot is distinguished by the "NodeID" ClassAd attribute and a matching environment variable. The executable should be a script which sets up the MPI environment.
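
To make that concrete, here's a minimal, illustrative submit description (the wrapper name "mpi_setup.sh" and the program name are made up); "universe = parallel" and "machine_count" are what cause the same executable to be started on every slot at once:

    universe                = parallel
    executable              = mpi_setup.sh
    arguments               = my_mpi_program
    machine_count           = 4
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue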

Now, under a normal MPI setup you'd have a list of hostnames in a machine list file which the script would build, and the node-0 process would SSH out to each of those hostnames and start up the MPI daemon.
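
For comparison, a plain OpenMPI launch outside of HTCondor looks roughly like this (hostnames made up), relying on password-less SSH between the hosts:

    $ cat machinefile
    node01 slots=8
    node02 slots=8
    $ mpirun --hostfile machinefile -np 16 ./my_mpi_program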

However, with HTCondor, each Linux slot has resources reserved for it via cgroups and its own scratch space, so an SSH session coming in through the "front door," as it were, would not wind up in the slot's sandbox, would not be accounted for by HTCondor, would not be subject to machine policy, and would be treated just like a direct user login to the exec node. Thus a list of hostnames doesn't do any good.

In addition, a single host might have more than one slot associated with the MPI job, so a plain hostname list wouldn't work in that case anyway.

So in order to get the SSH connection from node-0 to land in the sandbox, where it can start an accounted-for process, we have to do something different.

Enter the "sshd.sh" script and "condor_ssh."

Each slot which is part of the parallel universe job will have the startup script first run the "/usr/libexec/condor/sshd.sh" script. When you look over an 8.6 copy of this script, you'll see that it fires up an SSH daemon on a private port within the slot, creating a unique host keypair in the process. Then, using "chirp," it sends the public key, address, its node ID, and port number back to the node-0 scratch space.
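
Stripped down to the idea, the per-slot half looks something like this sketch (it is not the real 8.6 script; the port, file names, and chirp mechanics are simplified):

    # simplified sketch of what sshd.sh accomplishes - not the actual script
    PORT=4444                                  # the real script hunts for a free port
    KEY=$_CONDOR_SCRATCH_DIR/hostkey
    ssh-keygen -q -t rsa -N '' -f $KEY         # unique host keypair for this slot
    /usr/sbin/sshd -p $PORT -h $KEY -o PidFile=$_CONDOR_SCRATCH_DIR/sshd.pid
    # publish node ID, address, and port so node-0 can build its contact file;
    # the real script ships this (plus the public key) via condor_chirp
    echo "$_CONDOR_PROCNO $(hostname -i) $PORT" > $_CONDOR_SCRATCH_DIR/contact.$_CONDOR_PROCNO
    condor_chirp put $_CONDOR_SCRATCH_DIR/contact.$_CONDOR_PROCNO contact.$_CONDOR_PROCNO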

On node-0, in addition to doing all of this, the script waits for the other slots to finish delivering their sshd.sh keys and info, and once that's complete, it builds what's called a "contact" file containing those details for each node ID in the job. It also creates a machine list for "mpirun" (or whatever you're running) containing the node numbers rather than hostnames - 0 1 2 3 4, etc.
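
Purely as an illustration (the real contact file carries more fields than this), think of the result as a small per-node table plus a machine list that is nothing but node numbers:

    # contact file: node ID, address, port
    0  10.1.255.217  4444
    1  10.1.255.219  4445

    # machine list handed to mpirun
    0
    1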

Then it uses condor_ssh, pointing at that contact file and targeting node ID numbers rather than hostnames, to SSH to each node's slot-sandboxed, HTCondor-accounted ssh daemon on the proper port and start up the MPI daemon or what have you. The condor_ssh command uses the contact file to translate the node number into an IP address and port number.
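
So, roughly speaking (the command path below is just a placeholder),

    condor_ssh 1 /path/to/start_mpi_daemon

ends up behaving like "ssh -p <node 1's port> <node 1's address> /path/to/start_mpi_daemon", with the port and address pulled from the contact file rather than from DNS.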

Once that's done and all the MPI daemons are up and running, the node-0 script will then start mpirun (or call a vendor script which does so), with its SSH command setting or environment variable pointing to "condor_ssh" and its machine list option pointing to the "0 1 2 3 4" file, and it will go ahead and start whatever mpirun wants under the sandboxed SSH daemon on each slot in the parallel job's collection of slots.
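
With OpenMPI, for example, the final launch on node-0 ends up looking something along these lines (the machine file path is illustrative, and option names vary between MPI implementations and versions, so treat this as a sketch):

    # tell mpirun to use condor_ssh instead of ssh, against "hosts" that are really node numbers
    mpirun --mca plm_rsh_agent condor_ssh \
           --machinefile $_CONDOR_SCRATCH_DIR/machines \
           -np $_CONDOR_NPROCS ./my_mpi_program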

I hope that makes sense, and is reasonably correct. I'm sure the CHTC team will put finer points on it if necessary.

I'm getting ready to dive into it in earnest to get CST Microwave Studio HTCondor jobs up and running under the vendor's launch script.

Good luck, and wish me luck too!

	-Michael Pelletier.



> -----Original Message-----
> From: HTCondor-users [mailto:htcondor-users-bounces@xxxxxxxxxxx] On Behalf
> Of Carlos Adean
> Sent: Tuesday, June 06, 2017 7:48 PM
> To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
> Subject: [HTCondor-users] job cannot reconnect to starter running MPI
> 
> Hello Condor experts!
> 
> I do not have much experience with Condor and basically my problem is:
> 
> I have a Python application that uses mpi4py (OpenMPI). When I submit a Condor
> job to my 2 dedicated nodes, where I can run MPI jobs, the job goes crazy.
> 
> After running for some time, Condor sets it to Idle, the claimed slots are
> set to Preempting/Vacating followed by Unclaimed, and Condor restarts
> the job from scratch keeping the same job ID. It seems to be something between
> the nodes where mpirun is started, but I do not know how I can solve it. On
> the other hand, running the same application outside Condor, just using
> mpirun, I do not have any problems.
> 
> This is part of the ShadowLog on the submit machine; maybe it can be useful.
> 
> 06/06/17 19:45:05 ******************************************************
> 06/06/17 19:45:05 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 06/06/17 19:45:05 ** /opt/condor/sbin/condor_shadow
> 06/06/17 19:45:05 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
> class=DAEMON(1)
> 06/06/17 19:45:05 ** Configuration: subsystem:SHADOW local:<NONE>
> class:DAEMON
> 06/06/17 19:45:05 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
> 06/06/17 19:45:05 ** $CondorPlatform: x86_64_rhap_6.3 $
> 06/06/17 19:45:05 ** PID = 2409
> 06/06/17 19:45:05 ** Log last touched 6/6 19:41:35
> 06/06/17 19:45:05 ******************************************************
> 06/06/17 19:45:05 Using config source: /opt/condor/etc/condor_config
> 06/06/17 19:45:05 Using local config sources:
> 06/06/17 19:45:05    /opt/condor/etc/condor_config.local
> 06/06/17 19:45:05 DaemonCore: command socket at <10.1.1.12:41168?noUDP>
> 06/06/17 19:45:05 DaemonCore: private command socket at <10.1.1.12:41168>
> 06/06/17 19:45:05 Setting maximum accepts per cycle 8.
> 06/06/17 19:45:05 Initializing a PARALLEL shadow for job 1115.0
> 06/06/17 19:45:06 (1115.0) (2409): Request to run on slot6@xxxxxxxxxx
> <10.1.255.219:42920> was ACCEPTED
> 06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300>
> <10.1.255.217:47300> was ACCEPTED [...]
> 06/06/17 19:45:06 (1115.0) (2409): Request to run on <10.1.255.217:47300>
> <10.1.255.217:47300> was ACCEPTED
> 06/06/17 19:47:52 (1115.0) (2409): Can no longer talk to condor_starter
> <10.1.255.217:47300>
> 06/06/17 19:47:52 (1115.0) (2409): This job cannot reconnect to starter,
> so job exiting
> 06/06/17 19:47:52 (1115.0) (2409): ERROR "Can no longer talk to
> condor_starter <10.1.255.217:47300>" at line 208 in file
> /slots/11/dir_17560/userdir/src/condor_shadow.V6.1/NTreceivers.cpp
> 06/06/17 19:47:54 Can't open directory "/var/opt/condor/config" as
> PRIV_UNKNOWN, errno: 2 (No such file or directory)
> 06/06/17 19:47:54 Setting maximum accepts per cycle 8.
> 06/06/17 19:47:54 ******************************************************
> 06/06/17 19:47:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
> 06/06/17 19:47:54 ** /opt/condor/sbin/condor_shadow
> 06/06/17 19:47:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6)
> class=DAEMON(1)
> 06/06/17 19:47:54 ** Configuration: subsystem:SHADOW local:<NONE>
> class:DAEMON
> 06/06/17 19:47:54 ** $CondorVersion: 7.8.5 Oct 09 2012 BuildID: 68720 $
> 06/06/17 19:47:54 ** $CondorPlatform: x86_64_rhap_6.3 $
> 06/06/17 19:47:54 ** PID = 2783
> 06/06/17 19:47:54 ** Log last touched 6/6 19:47:52
> 06/06/17 19:47:54 ******************************************************
> 06/06/17 19:47:54 Using config source: /opt/condor/etc/condor_config
> 06/06/17 19:47:54 Using local config sources:
> 06/06/17 19:47:54    /opt/condor/etc/condor_config.local
> 06/06/17 19:47:54 DaemonCore: command socket at <10.1.1.12:41168?noUDP>
> 06/06/17 19:47:54 DaemonCore: private command socket at <10.1.1.12:41168>
> 06/06/17 19:47:54 Setting maximum accepts per cycle 8.
> 06/06/17 19:47:54 Initializing a PARALLEL shadow for job 1115.0
> 06/06/17 19:47:55 (1115.0) (2783): Request to run on slot6@xxxxxxxxxx
> <10.1.255.219:42920> was DELAYED (previous job still being vacated) [...]
> 06/06/17 19:48:15 (1115.0) (2783): Request to run on slot6@xxxxxxxxxx
> <10.1.255.219:42920> was DELAYED (previous job still being vacated)
> 06/06/17 19:48:15 (1115.0) (2783): activateClaim(): Too many retries,
> giving up.
> 06/06/17 19:48:15 (1115.0) (2783): Job 1115.0 is being evicted
> 06/06/17 19:48:16 (1115.0) (2783): logEvictEvent with unknown reason
> (108), aborting
> 06/06/17 19:48:16 (1115.0) (2783): **** condor_shadow (condor_SHADOW) pid
> 2783 EXITING WITH STATUS 108
> 06/06/17 19:48:38 Can't open directory "/var/opt/condor/config" as
> PRIV_UNKNOWN, errno: 2 (No such file or directory)
> 06/06/17 19:48:38 Setting maximum accepts per cycle 8.
> 06/06/17 19:48:38 ******************************************************
> 
> 
> Thank you for the help.
> 
> 
> --
> Carlos Adean
> www.linea.gov.br
> 
_______________________________________________
HTCondor-users mailing list
To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/htcondor-users/