[Condor-users] Parallel Jobs + Chirp in 7.6.4
- Date: Thu, 03 Nov 2011 11:24:15 -0400
- From: William Strecker-Kellogg <willsk@xxxxxxx>
- Subject: [Condor-users] Parallel Jobs + Chirp in 7.6.4
I'm having trouble debugging a cluster that needs to run MPI jobs. The jobs
are failing in the sshd.sh script that ships with Condor, with:
    chirp: couldn't putfile: No such file or directory
    /usr/libexec/condor/sshd.sh: line 69: 23981 Aborted    $CONDOR_CHIRP put -perm 0700 $idkey
Tracing the relevant processes, I see the following sent from chirp to
"putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448
The traced process then gets "\1\0\0\0\20" and
"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2" back from the shadow,
and writes "-3" to chirp, which then fails.
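As a cross-check, the reply bytes can be decoded by hand. This sketch ASSUMES (not verified against the 7.6.4 source) that the shadow's reply uses a CEDAR-style frame: a 1-byte end-of-message flag and a 4-byte big-endian length, followed by 8-byte big-endian integers. Under that assumption the reply is a return value of -1 with errno 2 (ENOENT), which would be consistent with chirp's "No such file or directory"; as far as I recall, -3 is Chirp's "doesn't exist" error code. `decode_reply` is a hypothetical helper, not part of Condor.

```python
import errno
import struct


def decode_reply(header: bytes, payload: bytes) -> tuple:
    """Decode a reply frame, ASSUMING a CEDAR-style layout (see above)."""
    # Assumed header: 1-byte end-of-message flag, 4-byte big-endian length.
    eom, length = struct.unpack(">BI", header)
    # Assumed payload: two 8-byte big-endian signed ints (result, errno).
    result, err = struct.unpack(">qq", payload[:length])
    return eom, length, result, err


# The bytes from the strace output, with C octal escapes translated:
# "\1\0\0\0\20", then "\377" x 8 followed by "\0" x 7 and "\2".
header = b"\x01\x00\x00\x00\x10"
payload = b"\xff" * 8 + b"\x00" * 7 + b"\x02"

eom, length, result, err = decode_reply(header, payload)
print(eom, length, result, err, errno.errorcode[err])

# sshd.sh's "-perm 0700" is octal; the putfile request shows it in decimal.
print(oct(448), 0o700 == 448)
```

On Linux this should print `1 16 -1 2 ENOENT` and then `0o700 True`, i.e. the 448 in the putfile line is just the 0700 permission from sshd.sh.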
In the shadow log I'm getting errors like:
ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxx: File
25.proc0.subproc0/contact maps to url 1320272782, which I don't know how
When I strace the shadow, it tries to open "var/spool/...etc..." without the
leading forward slash, and the open fails (I'm not sure whether this matters).
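For what it's worth, a POSIX path without its leading slash is resolved relative to the process's current working directory, so "var/spool/..." only finds the file if the process happens to be running in "/". Whether that is actually the cause here is not established. The sketch below only illustrates the path-resolution behaviour; the temporary directory and `demo.key` file are hypothetical stand-ins for the spool file.

```python
import os
import tempfile


def can_open(path: str) -> bool:
    """Return True if `path` can be opened for reading from the current cwd."""
    try:
        with open(path, "rb"):
            return True
    except FileNotFoundError:
        return False


def demo() -> tuple:
    """Compare an absolute path with the same path minus its leading slash."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical stand-in for a spool file: <root>/var/spool/demo.key
        spool = os.path.join(root, "var", "spool")
        os.makedirs(spool)
        key = os.path.join(spool, "demo.key")
        open(key, "wb").close()

        relative = key.lstrip("/")  # e.g. "tmp/.../var/spool/demo.key"
        try:
            os.chdir(spool)  # any working directory other than "/"
            abs_ok, rel_ok_elsewhere = can_open(key), can_open(relative)
            os.chdir("/")  # relative paths now resolve from the root
            rel_ok_at_root = can_open(relative)
        finally:
            os.chdir(old_cwd)
    return abs_ok, rel_ok_elsewhere, rel_ok_at_root


print(demo())
```

On a POSIX system this should print `(True, False, True)`: the absolute path always works, while the slash-less path only works when the working directory is "/".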
I've checked the obvious (to me) things, such as the permissions on the spool
directory, and they look OK. Any help would be greatly appreciated.