
[Condor-users] Condor 7.4.2 (x86_64) on Fedora 14 - Condor_chirp error in Parallel runs



Hello,

I have been using Condor (7.4.0, I think) on our cluster, running Fedora 10 (x86_64), for around two years now. I recently upgraded the cluster to Fedora 14 and installed the Condor package using "yum install condor", which installed Condor 7.4.2.

After setting up the condor_config and condor_config.local files as before (but accounting for the changes between 7.4.0 and 7.4.2), the cluster works fine for normal "vanilla" jobs.

All the machines have ports 9600-9700 as well as ports 4400-5000 (required by the sshd.sh script when running parallel jobs) open, and machines within the cluster are given trusted full access using specific firewall rules.

Test programs which check if "mpirun" runs across multiple machines on the cluster all work fine (indicating that passwordless ssh and the firewall settings are all ok).

When trying to run an OpenMPI parallel job using the sample "openmpiscript" wrapper, the job refuses to go through due to errors thrown by "condor_chirp" very early on during the job execution process.

Basically, the "openmpiscript" wrapper calls "/usr/libexec/condor/sshd.sh" in order to prepare the ssh environment (key generation, passing keys between the machines, and starting the ssh server daemon) before finally running "mpirun".

Within the "sshd.sh" script, after generation of the "hostkey" and "idkey" on the respective machines, Condor uses "condor_chirp put -perm 0700 $idkey _CONDOR_REMOTE_SPOOL_DIR/_CONDOR_PROCNO.key". The execution always fails at this point.

The condor Error log shows:
Can't chirp_client_open /var/lib/condor/spool/cluster9.proc0.subproc0/1.key:-1
Can't chirp_client_open /var/lib/condor/spool/cluster9.proc0.subproc0/0.key:-1

And the normal log file shows:
error 0 chirp putting identity keys back
error 0 chirp putting identity keys back

This is also the case when running a parallel job on multiple cores of the same physical machine (so network access issues cannot be the cause).

By putting in some debug lines into the "sshd.sh" script, I was able to confirm that these files referred to above were actually created during the key generation phase, and exist.
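
For illustration, a debug check of the kind described above could look roughly like the sketch below. The function name check_key_file is hypothetical and not part of the stock "sshd.sh"; it only reports whether a given key file exists, is readable, and is non-empty at the point just before the "condor_chirp put" call.

```shell
#!/bin/sh
# Hypothetical debug helper (NOT part of the stock sshd.sh).
# Reports whether a key file exists, is readable, and is non-empty.
check_key_file() {
    f="$1"
    if [ ! -e "$f" ]; then
        echo "MISSING: $f"
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "UNREADABLE: $f"
        return 1
    fi
    if [ ! -s "$f" ]; then
        echo "EMPTY: $f"
        return 1
    fi
    echo "OK: $f"
    return 0
}

# Assumed usage inside sshd.sh, just before the chirp call:
#   check_key_file "$idkey" >&2
```

Sending the message to stderr, as in the commented usage line, would make it show up in the job's error file next to the chirp errors.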

I have tried everything I can think of, but to no avail:
1. Disabled the firewall
2. Disabled SELinux
3. Gave complete write access to the "execute" and "spool" folders
4. Added the user accounts to the "condor" group
5. Used "+WantIOProxy = True" in the submit file
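
For reference, a submit file of the kind used in item 5 could be generated roughly as in the sketch below. This is an assumed minimal example and not the exact file from this cluster: the program name my_mpi_program, the file names, and the machine count of 2 are placeholders, and write_submit_file is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: writes a minimal parallel-universe submit file
# to the path given as the first argument.
write_submit_file() {
    cat > "$1" <<'EOF'
# Placeholder values: my_mpi_program and the file names are assumptions.
universe = parallel
executable = openmpiscript
arguments = my_mpi_program
machine_count = 2
+WantIOProxy = True
output = out.$(NODE)
error = err.$(NODE)
log = parallel.log
queue
EOF
}
```

Calling "write_submit_file parallel_test.sub" and then "condor_submit parallel_test.sub" would submit the test job; the heredoc delimiter is quoted so that $(NODE) reaches Condor unexpanded.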

I have run out of ideas, and am beginning to wonder if it might be a bug in the "condor_chirp" program.

I would greatly appreciate it if someone could give me some pointers regarding this issue. It is currently holding up an otherwise perfectly functioning cluster.

A great day ahead!

Philippose


(P.S. Correction: the original Condor installation on the Fedora 10 system was Condor 7.2.4, not Condor 7.4.0. Sorry for the slip-up.)