
[Condor-users] job problems



We have recently run into problems with jobs submitted to an OSG/Condor
cluster. Soon after a job is picked up for execution, it is terminated
with "Failed to transfer files" errors like:

10/30 22:24:30 Using config source: /etc/condor/condor_config
10/30 22:24:30 Using local config sources:
10/30 22:24:30    /opt/condor/condor_config.local
10/30 22:24:30 DaemonCore: Command Socket at <128.227.221.101:38240>
10/30 22:24:30 Done setting resource limits
10/30 22:24:47 Communicating with shadow <128.227.221.11:38425>
10/30 22:24:47 Submitting machine is "pg.ihepa.ufl.edu"
10/30 22:24:47 setting the orig job name in starter
10/30 22:24:47 setting the orig job iwd in starter
10/30 22:25:07 relisock_gsi_get (read from socket) failure
10/30 22:25:07 ReliSock::get_x509_delegation(): delegation
failed: x509_receive_delegation failed at line 850
10/30 22:25:07 DoDownload: STARTER at 128.227.221.101 failed to receive
file /wntmp/execute/dir_11877/x509_up
10/30 22:25:07 condor_write(): Socket closed when trying to write 193
bytes to <128.227.221.11:38425>, fd is 9
10/30 22:25:07 Buf::write(): condor_write() failed
10/30 22:25:07 Failed to send download failure report to
<128.227.221.11:38425>.
10/30 22:25:07 File transfer failed (status=0).
10/30 22:25:07 ERROR "Failed to transfer files" at line 1781 in file
jic_shadow.cpp
10/30 22:25:07 ShutdownFast all jobs.

We have made sure the corresponding directory is writable by the
account. This looks like a GSI problem, since we see "relisock_gsi_get
(read from socket) failure", but we are not sure. Could this be a Condor
problem? What could cause it?

We have tried upgrading OSG to 1.2.3 and Condor to 7.2.4, but the
problem persists.
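
To compare several failed jobs, a minimal Python sketch like the one
below could pull the GSI and transfer-related entries out of a log in the
format of the excerpt above. The function names and the marker strings
are illustrative assumptions taken from that excerpt, not part of Condor,
and the timestamp pattern assumes every entry starts with "MM/DD HH:MM:SS".

```python
import re

# Assumed entry prefix, based on the excerpt: "MM/DD HH:MM:SS message".
TIMESTAMP = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Illustrative substrings copied from the excerpt; extend as needed.
MARKERS = ("relisock_gsi_get", "x509", "Failed to transfer files")


def join_wrapped(lines):
    """Re-join log entries that were wrapped across several lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = TIMESTAMP.match(line)
        if match:
            # A new entry starts with a timestamp.
            entries.append([match.group(1), match.group(2).strip()])
        elif entries and line.strip():
            # A line without a timestamp continues the previous entry.
            entries[-1][1] += " " + line.strip()
    return [tuple(entry) for entry in entries]


def flag_entries(lines, markers=MARKERS):
    """Return (timestamp, message) pairs whose message contains a marker."""
    return [
        (stamp, message)
        for stamp, message in join_wrapped(lines)
        if any(marker in message for marker in markers)
    ]
```

Passing an open log file (or a list of lines) to flag_entries() returns
the matching entries in order, which makes it easier to see whether the
delegation failure always precedes the transfer error.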

Thanks,

Yu