
[Condor-users] Parallel Jobs + Chirp in 7.6.4



Hi all,

I'm having trouble debugging a cluster that is supposed to run MPI jobs.
The jobs fail in the sshd.sh script that ships with Condor, and the
following shows up in the job's stderr:

chirp: couldn't putfile: No such file or directory
/usr/libexec/condor/sshd.sh: line 69: 23981 Aborted $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key

Tracing the relevant processes I see the following sent from chirp to
the starter:

"putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448
1675"
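For reference, the two trailing numbers in that request look like the
permission bits and the file size, both in decimal. That is my reading
of the trace, not something taken from the chirp documentation. A quick
Python check of the arithmetic:

```python
# Interpret the numeric arguments of the traced "putfile" request.
# Assumption: the fields are <path> <mode> <length>, both in decimal.
request = "putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448 1675"

command, path, mode_text, length_text = request.split()
mode = int(mode_text)
length = int(length_text)

print(command, path)
print("mode as octal:", oct(mode))   # compare with "-perm 0700" in sshd.sh
print("length in bytes:", length)
```

So 448 is just the 0700 passed to -perm, and the request itself looks
sane at this point.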

The starter then sends
"\1\0\0\0S\0\0\0\0\0\0\1&var/spool/condor/astro/32/0/cluster32.proc0.subproc0/0.key\0\0\0\0\0\0\0\1\300\0\0\0\0\0\0\6\213"
and gets "\1\0\0\0\20" and
"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2" from the shadow, and
then writes "-3" to chirp, which fails.
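The octal escapes in those strace strings can be decoded by hand. The
sketch below assumes (my guess, not confirmed from the Condor source)
that the bytes after the NUL-terminated path are two big-endian 64-bit
integers, and that the second chunk of the shadow's reply is a return
value followed by an errno in the same encoding:

```python
import errno
import struct

# Tail of the starter's message after the NUL-terminated path, copied from the trace.
# Assumption: two big-endian signed 64-bit integers (mode, then length).
tail = b"\0\0\0\0\0\0\1\300" + b"\0\0\0\0\0\0\6\213"
mode, length = struct.unpack(">qq", tail)

# Second chunk of the shadow's reply, copied from the trace.
# Assumption: a return value followed by an errno, same encoding.
reply = b"\377\377\377\377\377\377\377\377" + b"\0\0\0\0\0\0\0\2"
result, err = struct.unpack(">qq", reply)

print("mode:", oct(mode), "length:", length)
print("result:", result, "errno:", err, errno.errorcode.get(err))
```

Under that reading the starter forwards the same mode and length that
chirp sent, and the shadow answers with -1 and errno 2 (ENOENT), which
would match the "No such file or directory" that chirp reports.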

In the shadow log I'm getting things like:

ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxx: File
var/spool/condor/astro/25/0/cluster25.proc0.subproc0/contact maps to url
1320272782, which I don't know how to open.

and when I strace it, it tries to open "var/spool/...etc..." without a
leading forward slash and fails (not sure if this matters).
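To illustrate why the missing slash could matter: a path without a
leading slash is resolved against the process's current working
directory, so the open only succeeds if that directory happens to
contain a matching tree. The sketch below uses a made-up temporary
layout (the names are hypothetical stand-ins, not the real spool paths):

```python
import os
import tempfile

# Hypothetical stand-in for the spool layout; not the real Condor paths.
root = tempfile.mkdtemp()
relative = os.path.join("var", "spool", "demo.key")
os.makedirs(os.path.join(root, "var", "spool"))
with open(os.path.join(root, relative), "w") as handle:
    handle.write("key material\n")

def can_open(path):
    """Return True if path can be opened for reading from the current directory."""
    try:
        with open(path):
            return True
    except OSError:
        return False

original_cwd = os.getcwd()
elsewhere = tempfile.mkdtemp()

os.chdir(elsewhere)                  # cwd unrelated to the tree above
opened_from_elsewhere = can_open(relative)

os.chdir(root)                       # cwd is the directory the path is relative to
opened_from_root = can_open(relative)

os.chdir(original_cwd)
print(opened_from_elsewhere, opened_from_root)
```

Whether the shadow's working directory is what it should be here is
exactly the part I can't tell.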

I've checked the things that seemed obvious to me, such as permissions
on the spool directory, and they look OK.  Any help would be greatly
appreciated.

Thanks,
William Strecker-Kellogg