
Re: [HTCondor-users] HTCondor as NoSql database.



Alexander,

Take a look at the scripts associated with the parallel universe. There are illustrations of the use of condor_chirp to transfer SSH keys in the /usr/libexec/condor/sshd.sh script, for example. There are also lamscript, mp1script, mp2script, and openmpiscript in /usr/share/doc/condor-8.8.3/examples. (Replace 8.8.3 with your version number.)

In sshd.sh, the non-proc-0 processes each create a "$_CONDOR_PROCNO.key" file containing the host identity key and stash it in the $_CONDOR_REMOTE_SPOOL_DIR path using "condor_chirp put," and the proc-0 process uses "condor_chirp fetch" to pull down each of those files, gathering the SSH keys for all processes in the job.

So in your case, I think you'll want to use "condor_chirp put" in the main process to create a file in the spool dir containing the address and port number. The worker-startup script can watch for that file and "condor_chirp fetch" it when it appears, and the main process can then simply wait until all the workers check in before proceeding.
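
Here is a minimal sketch of that pattern in Python, not a tested recipe. It assumes condor_chirp is on the PATH of the execute nodes and that chirp is enabled for the job (typically "+WantIOProxy = true" in the submit file). The file name main_endpoint.txt, the local scratch file names, and the function names are hypothetical, chosen only for illustration. The "run" and "sleep" parameters exist only so the logic can be exercised without a pool.

```python
import subprocess
import time

ENDPOINT_FILE = "main_endpoint.txt"  # hypothetical name, not an HTCondor convention


def remote_path(spool_dir, name=ENDPOINT_FILE):
    """Build the path of the endpoint file inside the job's spool directory."""
    return f"{spool_dir.rstrip('/')}/{name}"


def publish_endpoint(host, port, spool_dir, run=subprocess.run,
                     local="endpoint.tmp"):
    """Main process: write host:port locally, then 'condor_chirp put' it."""
    with open(local, "w") as f:
        f.write(f"{host}:{port}\n")
    # condor_chirp put LocalFileName RemoteFileName
    run(["condor_chirp", "put", local, remote_path(spool_dir)], check=True)


def fetch_endpoint(spool_dir, run=subprocess.run, local="endpoint.fetched",
                   retries=60, delay=5.0, sleep=time.sleep):
    """Worker: poll with 'condor_chirp fetch' until the file appears."""
    for _ in range(retries):
        # condor_chirp fetch RemoteFileName LocalFileName
        result = run(["condor_chirp", "fetch", remote_path(spool_dir), local],
                     check=False)
        if result.returncode == 0:
            with open(local) as f:
                host, port = f.read().strip().rsplit(":", 1)
            return host, int(port)
        sleep(delay)  # file not there yet; wait and try again
    raise TimeoutError("main process endpoint was never published")
```

The main process (for example, the one with _CONDOR_PROCNO equal to 0) would call publish_endpoint(host, port, os.environ["_CONDOR_REMOTE_SPOOL_DIR"]) once it has bound its port, and each worker would call fetch_endpoint with the same spool dir before connecting back.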


Michael V. Pelletier
Information Technology
Digital Transformation & Innovation
Integrated Defense Systems
Raytheon Company

From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of Alexander Prokhorov
Sent: Friday, August 9, 2019 4:32 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [External] Re: [HTCondor-users] HTCondor as NoSql database.

Dear Colleagues,

Thank you for the responses. The goal Ivan and I are trying to achieve is the following. Perhaps you can help us find a proper HTCondor-based solution.

We run a parallel universe job to be sure all processes are running at the same time (to guarantee there are no deadlocks when many such jobs run). Then we need to establish connections between all worker processes and the main one. The difficulty here is that all running processes (both main and workers) bind to arbitrary ports, so we need some discovery mechanism to let them find each other. The idea was to publish the main process endpoint somewhere (that is where we thought of HTCondor as a key-value store) and let workers request this endpoint and check in with the main process.

Maybe you can advise something here. Thanks in advance.

All the best,
Alexander A. Prokhorov
mailto:prokher@xxxxxxxxx