
Re: [Condor-users] pseudo_ops.cpp on starter

On Thu, Oct 21, 2010 at 11:41 AM, Mag Gam <magawake@xxxxxxxxx> wrote:

It is. It happens when I submit jobs from a Linux machine with condor_submit, from a share mounted on that machine via Samba, targeting a Linux machine to run them and using Condor to write the data back. It's incredibly, weirdly specific. If I submit the same way but target Windows nodes, the jobs run fine.

It has something to do with copying the data back from the job to the mounted share. That's quite odd, though, since all the data transfer should go through the condor_shadow and the shadows run on the same machine in both cases; only the starters differ.

What do you see in your StarterLog? And in the StartLog?

On the execute side I see a similar assert error message but from a different source file:

10/20 23:16:45 Create_Process succeeded, pid=15975
10/20 23:17:06 DaemonCore: pid 15975 exited with status 0, invoking reaper 1 <Reaper>
10/20 23:17:06 Process exited, pid=15975, status=0
10/20 23:17:06 condor_write(): Socket closed when trying to write 13 bytes to daemon at <>, fd is 9, errno=104 Connection reset by peer
10/20 23:17:06 Buf::write(): condor_write() failed
10/20 23:17:06 ReliSock::put_file_with_permissions(): Failed to send permissions
10/20 23:17:06 DoUpload: STARTER at failed to send file(s) to <>: error sending /opt/condor.local/execute/dir_15972/45-0.stdout.txt
10/20 23:17:06 ERROR "Assertion ERROR on (m_ft_info.hold_code != 0)" at line 435 in file jic_shadow.cpp
10/20 23:17:06 ShutdownFast all jobs.
10/20 23:17:06 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <>.
10/20 23:17:06 IO: Failed to read packet header
10/20 23:17:06 Failed to send job exit status to shadow
10/20 23:17:06 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
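
For picking these lines out of a longer StarterLog, here is a minimal sketch in Python. It is not part of Condor: the function name scan_starter_log and the pattern names are made up for illustration, and the regular expressions are derived only from the wording in the excerpt above, so other Condor versions may phrase these messages differently.

```python
import re

# Patterns derived from the log excerpt above (an assumption, not a Condor spec).
PATTERNS = {
    "conn_reset": re.compile(r"errno\s*=?\s*104\b"),
    "assertion": re.compile(
        r'ERROR "Assertion ERROR on \((?P<expr>.*)\)" '
        r"at line (?P<line>\d+) in file (?P<file>\S+)"
    ),
    "upload_failed": re.compile(r"DoUpload: .* error sending (?P<path>\S+)"),
}

# Timestamps in the excerpt look like "10/20 23:17:06".
STAMPED = re.compile(r"(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)")


def scan_starter_log(lines):
    """Return (kind, timestamp, detail) for each line matching a known pattern.

    detail is a dict of captured fields, or the raw message when the
    pattern has no named groups.
    """
    hits = []
    for raw in lines:
        m = STAMPED.match(raw.rstrip("\n"))
        if not m:
            continue  # skip lines without a leading timestamp
        stamp, msg = m.groups()
        for kind, pattern in PATTERNS.items():
            found = pattern.search(msg)
            if found:
                hits.append((kind, stamp, found.groupdict() or msg))
    return hits


# Example use (the log path is hypothetical and depends on your LOG setting):
# with open("/opt/condor.local/log/StarterLog") as f:
#     for kind, stamp, detail in scan_starter_log(f):
#         print(stamp, kind, detail)
```

Run against the excerpt, this should flag the two "Connection reset by peer" lines, the failed upload of the stdout file, and the assertion in jic_shadow.cpp, which makes it easier to line the timestamps up against the ShadowLog on the submit side.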

- Ian