
Re: [Condor-users] Parallel Jobs + Chirp in 7.6.4


We've been seeing the same problem since upgrading from 7.4.4 to 7.6.4, which is a real pain. Another user reported the same error with 7.6.2 back in August (https://lists.cs.wisc.edu/archive/condor-users/2011-August/msg00033.shtml), but unfortunately that post didn't get a reply. We're loath to downgrade to 7.4.4, so any help from the community and/or developers on the issue would be greatly appreciated.


On 03/11/11 15:24, William Strecker-Kellogg wrote:
Hi all,

I'm having trouble debugging a cluster that is supposed to run MPI jobs.
The jobs fail in the sshd.sh script that ships with Condor, with the
following in the job's stderr:

chirp: couldn't putfile: No such file or directory
/usr/libexec/condor/sshd.sh: line 69: 23981 Aborted $CONDOR_CHIRP put -perm 0700 $idkey

Tracing the relevant processes I see the following sent from chirp to
the starter:

"putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448

The starter then sends a request to the shadow, gets back
"\1\0\0\0\20" and
"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2", and
then writes "-3" to chirp, which fails.
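
As a reading aid for the trace above, here is a minimal Python sketch that decodes the quoted values. It rests on assumptions not confirmed anywhere in this thread: that the 5-byte reply is a one-byte flag followed by a 4-byte big-endian length, and that the 16-byte reply is two signed 64-bit big-endian integers. The names header, payload, flag, length, result and code are made up for illustration. The only certain part is the first step: 448 is the decimal form of the octal mode 0700 passed to "put -perm".

```python
import errno
import struct

# Octal 0700 (from "put -perm 0700") is the decimal 448 seen in the putfile line.
perm = 0o700
print(perm)

# Byte strings copied from the strace output, using the same octal escapes.
header = b"\1\0\0\0\20"
payload = b"\377" * 8 + b"\0\0\0\0\0\0\0\2"

# ASSUMPTION: one flag byte followed by a 4-byte big-endian length.
flag, length = struct.unpack(">BI", header)
print(flag, length, len(payload))

# ASSUMPTION: two signed 64-bit big-endian integers (a return value and a code).
result, code = struct.unpack(">qq", payload)
print(result, code, errno.errorcode.get(code))
```

Under those assumptions, the length field matches the 16 bytes that follow, and the payload reads as a return value of -1 with code 2, which is ENOENT on common platforms. That would be consistent with the "No such file or directory" message from chirp, but it is an interpretation of the trace, not a confirmed description of the wire format.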

In the shadow log I'm getting things like:

ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxx: File
25.proc0.subproc0/contact maps to url 1320272782, which I don't know how
to open.

and when I strace it, it tries to open "var/spool/...etc..." without a
leading forward slash and fails (not sure if this matters).

I've checked the obvious (to me) things like permissions on spool,
etc... and they look OK.  Any help would be greatly appreciated.

William Strecker-Kellogg