
Re: [Condor-users] Parallel Jobs + Chirp in 7.6.4



Just to keep the online record complete, the relevant bug has been identified by the Condor developers (it occurs in the condor_starter daemon) and the fix will appear in 7.6.5 and 7.7.3.

Mark

On 09/11/2011 11:54, Mark Calleja wrote:
Hi,

We're seeing the same problem since upgrading 7.4.4 -> 7.6.4, which is a real pain. I note that another user reported the error for 7.6.2 back in August (https://lists.cs.wisc.edu/archive/condor-users/2011-August/msg00033.shtml), but unfortunately his post didn't get a reply. We're loath to downgrade to 7.4.4, so any help from the community and/or developers on the issue would be greatly appreciated.

Best,
Mark

On 03/11/11 15:24, William Strecker-Kellogg wrote:
Hi all,

I'm having trouble debugging a cluster that is meant to run MPI jobs.  The
jobs fail in the sshd.sh script that ships with Condor, with the following
in the job's stderr:

chirp: couldn't putfile: No such file or directory
/usr/libexec/condor/sshd.sh: line 69: 23981 Aborted $CONDOR_CHIRP put -perm 0700 $idkey $_CONDOR_REMOTE_SPOOL_DIR/$_CONDOR_PROCNO.key

Tracing the relevant processes I see the following sent from chirp to
the starter:

"putfile /var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key 448 1675"

starter sends
"\1\0\0\0S\0\0\0\0\0\0\1&var/spool/condor/astro/32/0/cluster32.proc0.subproc0/0.key\0\0\0\0\0\0\0\1\300\0\0\0\0\0\0\6\213"
and gets "\1\0\0\0\20" and
"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2" from the shadow, and
then writes "-3" to chirp, which fails.
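
The numbers in these traces can be cross-checked against each other. The sketch below assumes the trailing fields are 8-byte big-endian integers; that is an assumption made for illustration, inferred from the byte patterns rather than taken from the Condor wire protocol documentation. The helper name decode_be64 is hypothetical.

```python
import errno
import os
import struct


def decode_be64(raw: bytes) -> int:
    """Interpret 8 bytes as a signed big-endian 64-bit integer (assumed layout)."""
    return struct.unpack(">q", raw)[0]


# Last two 8-byte groups of the starter's message, copied from the strace output.
mode = decode_be64(b"\0\0\0\0\0\0\1\300")
length = decode_be64(b"\0\0\0\0\0\0\6\213")

# The 16-byte reply from the shadow, split into two 8-byte fields.
reply = b"\377\377\377\377\377\377\377\377\0\0\0\0\0\0\0\2"
result = decode_be64(reply[:8])
code = decode_be64(reply[8:])

print("mode:", oct(mode), "decimal:", mode)
print("length:", length)
print("result:", result)
# Reading the second field as an errno value is an assumption.
print("code:", code, "->", os.strerror(code))
```

Under that assumption, the mode and length match the "448 1675" in the chirp request (448 decimal is 0700 octal), and the reply reads as -1 followed by 2. If the 2 is an errno, it corresponds to ENOENT, which would agree with the "No such file or directory" message printed by chirp.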

In the shadow log I'm getting things like:

ERROR "Error from slot2@xxxxxxxxxxxxxxxxxxxxx: File var/spool/condor/astro/25/0/cluster25.proc0.subproc0/contact maps to url 1320272782, which I don't know how to open.

and when I strace it, it tries to open "var/spool/...etc..." without a
leading forward slash and fails (not sure if this matters).
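
To illustrate why the missing leading slash could matter, here is a minimal sketch using the two paths quoted in the traces above. The resolve helper and the working directory "/tmp/work" are hypothetical; which directory the shadow actually runs in is not known from this thread.

```python
import posixpath

# Paths copied from the traces: chirp sends an absolute path,
# the starter forwards one without the leading slash.
sent_by_chirp = "/var/spool/condor/astro/30/0/cluster30.proc0.subproc0/1.key"
sent_by_starter = "var/spool/condor/astro/32/0/cluster32.proc0.subproc0/0.key"


def resolve(path: str, cwd: str) -> str:
    """Show where a path would point if opened from the given working directory."""
    return posixpath.normpath(posixpath.join(cwd, path))


print(posixpath.isabs(sent_by_chirp))    # absolute: independent of cwd
print(posixpath.isabs(sent_by_starter))  # relative: depends on cwd

# Hypothetical working directory, for illustration only.
print(resolve(sent_by_starter, "/tmp/work"))
# Only a process whose working directory is "/" would reach the spool.
print(resolve(sent_by_starter, "/"))
```

A relative path is resolved against the working directory of the process that opens it, so unless that directory happens to be "/", the open would look in the wrong place and fail with "No such file or directory".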

I've checked the obvious (to me) things like permissions on spool,
etc... and they look OK.  Any help would be greatly appreciated.

Thanks,
William Strecker-Kellogg
_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/
