
Re: [Condor-users] job problems



Yu,

You didn't say so explicitly, but it looks like you are using
the VDT nfs-lite job manager and transferring all the
job files via condor_transfer_files from the Globus
gatekeeper to the worker node. Is that right?

If so, then you need to check the /etc/grid-security/certificates
directories on both ends: the Globus gatekeeper/schedd and the
worker node in question. Probably one of them is not working.
For more information, add D_SECURITY to the SCHEDD_DEBUG and
SHADOW_DEBUG settings on the gatekeeper, and to the STARTD_DEBUG
and STARTER_DEBUG settings on the worker nodes.
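
As a rough sketch, the debug settings might look like the fragment
below. It assumes they go into each host's local Condor config file
(for example the condor_config.local listed in your log) and that the
daemons are told to re-read their configuration afterwards, for
instance with condor_reconfig. Neither detail comes from your report,
so adjust to your setup.

```
# On the gatekeeper / schedd host
SCHEDD_DEBUG = D_SECURITY
SHADOW_DEBUG = D_SECURITY

# On the worker node
STARTD_DEBUG = D_SECURITY
STARTER_DEBUG = D_SECURITY
```

If any of these settings already lists other debug flags, add
D_SECURITY to the existing list instead of replacing it.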

Steve
On Sat, 31 Oct 2009, Yu Fu wrote:

We have recently run into problems with jobs submitted to an OSG/Condor
cluster. Soon after a job is picked up for execution, it is
terminated with "Failed to transfer files" errors like:

10/30 22:24:30 Using config source: /etc/condor/condor_config
10/30 22:24:30 Using local config sources:
10/30 22:24:30    /opt/condor/condor_config.local
10/30 22:24:30 DaemonCore: Command Socket at <128.227.221.101:38240>
10/30 22:24:30 Done setting resource limits
10/30 22:24:47 Communicating with shadow <128.227.221.11:38425>
10/30 22:24:47 Submitting machine is "pg.ihepa.ufl.edu"
10/30 22:24:47 setting the orig job name in starter
10/30 22:24:47 setting the orig job iwd in starter
10/30 22:25:07 relisock_gsi_get (read from socket) failure
10/30 22:25:07 ReliSock::get_x509_delegation(): delegation failed: x509_receive_delegation failed at line 850
10/30 22:25:07 DoDownload: STARTER at 128.227.221.101 failed to receive file /wntmp/execute/dir_11877/x509_up
10/30 22:25:07 condor_write(): Socket closed when trying to write 193 bytes to <128.227.221.11:38425>, fd is 9
10/30 22:25:07 Buf::write(): condor_write() failed
10/30 22:25:07 Failed to send download failure report to <128.227.221.11:38425>.
10/30 22:25:07 File transfer failed (status=0).
10/30 22:25:07 ERROR "Failed to transfer files" at line 1781 in file jic_shadow.cpp
10/30 22:25:07 ShutdownFast all jobs.

We have made sure the corresponding directory is writable by the
account. This looks like a GSI problem, since we see "relisock_gsi_get
(read from socket) failure", but we are not sure. Could this be a Condor
problem? What could cause it?

We have tried upgrading to OSG 1.2.3 and Condor 7.2.4 but still see the
same problem.

Thanks,

Yu


--
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
timm@xxxxxxxx  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.