
Re: [Condor-users] Condor 7.4.2 (x86_64) on Fedora 14 - Condor_chirp error in Parallel runs



Hello again,

A Good Day to everyone!

After turning the Internet inside out with various "Google search strategies", I came across a Red Hat MRG bug report (https://bugzilla.redhat.com/show_bug.cgi?id=549432) which addresses exactly the same issue.

It turns out it was a bug (or a change / improvement) which was first noticed in Condor 7.4.1 and was later corrected (further changed / improved) in Condor 7.4.4. Unfortunately for me, the version of Condor currently available in the Fedora 14 repository (7.4.2) sits right in the middle and, as I discovered, still contains the issue.

So I downloaded the latest stable version of Condor (7.4.4) and adapted the Fedora 14 Condor 7.4.2 SPEC file to work with this version. I then compiled and generated the new RPMs and installed them on the Primary (master) system and on one slave Node.
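
For anyone wanting to repeat the rebuild, the version bump can be sketched roughly as follows. This is only a minimal sketch: the helper name "bump_spec_version" and the file name "condor.spec" are assumptions for illustration, and a real adaptation may also need changes to the patches, sources, and file lists in the SPEC file.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Fedora package): rewrite the
# "Version:" tag of an RPM SPEC file in place.
bump_spec_version() {
    spec="$1"
    new_version="$2"
    # Write to a temporary copy first so a failed sed leaves the original intact.
    sed "s/^Version:.*/Version: ${new_version}/" "$spec" > "${spec}.new" &&
        mv "${spec}.new" "$spec"
}

# Example usage (assumed file name; run from the directory holding the SPEC):
#   bump_spec_version condor.spec 7.4.4
#   rpmbuild -ba condor.spec   # builds both the binary and the source RPMs
```

The rpmbuild step expects the 7.4.4 source tarball to be present in the SOURCES directory of the build tree.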

This immediately corrected the issue with "condor_chirp" and the shifting of the SSH keys from the execute to the submit machine, and a parallel test isolated to just the master machine (which is a 4-core SMP) worked without a hitch.

However, a parallel test across two machines (6 cores) failed with some "permission denied" errors.

I then had to further modify the "sshd.sh" and the sample "openmpiscript" scripts based on an attachment I found on one of the follow-up bug fixes in the Red Hat MRG Bugzilla.

After these modifications, everything fell in place, and now the cluster is back on track :-)!

Note: I also do not need to explicitly include "+WantIOProxy = True" in the submit file.

It is surprising that this issue was not reported by anyone on the Condor-users mailing list. Before posting this message I had gone through all the threads on the mailing list back to around June 2010 (though I may have missed something).

Anyway, I was wondering: does anyone know who maintains the Fedora Condor RPMs? The screen name of the person is "sharkcz". I guess we should be moving up to Condor 7.4.4 in the Fedora repository as soon as possible.


Have a great weekend!

Regards,
Philippose





On Fri, Jan 7, 2011 at 1:55 PM, Philippose Rajan <philippose.rajan@xxxxxxxxx> wrote:
Hello,

I have been using Condor (7.4.0 I think) on our cluster running on Fedora 10 (x86_64) for around 2 years now. We recently upgraded our cluster to Fedora 14 and installed the Condor package using "yum install condor", which installed Condor 7.4.2.

After setting up the condor_config and condor_config.local files as before (but accounting for the changes between 7.4.0 and 7.4.2), the cluster works fine for normal "vanilla" jobs.

All the machines have ports 9600-9700 as well as ports 4400-5000 (required by the sshd.sh script when running parallel jobs) open, and machines within the cluster are given trusted full access using specific firewall rules.
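
The firewall side of this can be sketched as below. The function only prints the iptables commands rather than applying them. Restricting the rules to TCP and appending them to the INPUT chain are assumptions, and the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not apply) iptables rules that
# accept inbound TCP traffic on the given LOW-HIGH port ranges.
print_port_rules() {
    for range in "$@"; do
        low=${range%-*}    # text before the dash
        high=${range#*-}   # text after the dash
        # iptables expects port ranges written as low:high
        echo "iptables -A INPUT -p tcp --dport ${low}:${high} -j ACCEPT"
    done
}

# Port ranges taken from the setup described above.
print_port_rules 9600-9700 4400-5000
```

Printing the rules first makes it easy to review them before running them as root.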

Test programs which check if "mpirun" runs across multiple machines on the cluster all work fine (indicating that passwordless ssh and the firewall settings are all ok).

When trying to run an OpenMPI parallel job using the sample "openmpiscript" wrapper, the job refuses to go through due to errors thrown by "condor_chirp" very early on during the job execution process.

Basically, the "openmpiscript" wrapper calls "/usr/libexec/condor/sshd.sh" in order to prepare the ssh environment (key generation, passing keys between the machines, and starting the ssh server daemon) before finally running "mpirun".

Within the "sshd.sh" script, after generation of the "hostkey" and "idkey" on the respective machines, Condor uses "condor_chirp put -perm 0700 $idkey _CONDOR_REMOTE_SPOOL_DIR/_CONDOR_PROCNO.key". The execution always fails at this point.

The condor Error log shows:
Can't chirp_client_open /var/lib/condor/spool/cluster9.proc0.subproc0/1.key:-1
Can't chirp_client_open /var/lib/condor/spool/cluster9.proc0.subproc0/0.key:-1

And the normal log file shows:
error 0 chirp putting identity keys back
error 0 chirp putting identity keys back

This is also the case when trying to run a parallel job using multiple cores on the same physical machine (so network access issues cannot be the cause).

By putting in some debug lines into the "sshd.sh" script, I was able to confirm that these files referred to above were actually created during the key generation phase, and exist.
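
A check of this kind could look roughly like the following sketch. The function name and message format are assumptions for illustration, and these are not the lines actually added to "sshd.sh".

```shell
#!/bin/sh
# Hypothetical debug helper; not part of the stock sshd.sh.
# Reports whether a file exists and is readable by the current user.
debug_check_file() {
    f="$1"
    if [ -r "$f" ]; then
        echo "DEBUG: $f exists and is readable"
        return 0
    else
        echo "DEBUG: $f is missing or unreadable" >&2
        return 1
    fi
}

# Example usage, placed just before the condor_chirp call:
#   debug_check_file "$idkey"
```

Since the output goes to stdout and stderr, it should end up in the job's output and error files.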

I have tried everything I can think of, but to no avail:
1. Disabled the firewall
2. Disabled SELinux
3. Gave complete write access to the "execute" and "spool" folders
4. Added the user accounts to the "condor" group
5. Used "+WantIOProxy = True" in the submit file
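
For reference, item 5 can be sketched as a small script that writes a parallel-universe submit file. Only the "+WantIOProxy = True" line and the "openmpiscript" wrapper come from the description above. The helper name, file names, program name, and machine count are placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal parallel-universe submit description.
# File and program names are placeholders, not taken from the real setup.
write_submit_file() {
    # The quoted 'EOF' keeps the shell from expanding $(NODE).
    cat > "$1" <<'EOF'
universe = parallel
executable = openmpiscript
arguments = my_mpi_program
machine_count = 2
+WantIOProxy = True
should_transfer_files = yes
when_to_transfer_output = on_exit
log = mpi_test.log
output = mpi_test.out.$(NODE)
error = mpi_test.err.$(NODE)
queue
EOF
}

write_submit_file mpi_test.submit
# Then submit with: condor_submit mpi_test.submit
```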

I have run out of ideas, and am beginning to wonder if it might be a bug in the "condor_chirp" program.

I would greatly appreciate it if someone could give me some pointers regarding this issue. It is currently holding up an otherwise perfectly functioning cluster.

A great day ahead!

Philippose


(P.S. Correction: the original Condor installation on the Fedora 10 system was Condor 7.2.4 and not Condor 7.4.0. Sorry for the slip-up.)