
[Condor-users] Assertion Error



Hi

We keep getting an Assertion Error on some of our jobs.

 

From StarterLog:

 

condor_write(): Socket closed when trying to write 13 bytes to daemon at <10.69.200.126:34008>, fd is 9, errno=104 Connection reset by peer

08/01 13:32:03 Buf::write(): condor_write() failed

08/01 13:32:03 ERROR "Assertion ERROR on (m_ft_info.hold_code != 0)" at line 435 in file jic_shadow.cpp

08/01 13:32:03 ShutdownFast all jobs.

08/01 13:32:03 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from <10.69.200.126:46675>.

08/01 13:32:03 IO: Failed to read packet header

08/01 13:32:03 Failed to send job exit status to shadow

08/01 13:32:03 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

 

Then the job restarts. We are using file transfer. The files are originally located on a Samba share. When a job is submitted, the files are moved to a local drive on the executing machine. The problem seems to arise when the files are moved back after the job is done: some files are not moved, and the job restarts. The affected jobs are now stuck in a loop.
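In case it helps with narrowing this down, here is a minimal sketch of how the assertion lines could be pulled out of a StarterLog. It is not part of Condor: the function name find_assertions and both regular expressions are my own assumptions, based only on the line format in the excerpt above.

```python
import re

# Assumed line layout, taken from the excerpt: "MM/DD HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Assumed assertion layout, taken from the excerpt:
# ERROR "Assertion ERROR on (<condition>)" at line <n> in file <name>
ASSERT_RE = re.compile(
    r'ERROR "Assertion ERROR on \((.+)\)" at line (\d+) in file (\S+)'
)


def find_assertions(log_text):
    """Return one dict per assertion failure found in the log text."""
    hits = []
    for raw in log_text.splitlines():
        line_match = LINE_RE.match(raw.strip())
        if not line_match:
            # Skip blank lines and lines without a leading timestamp.
            continue
        timestamp, message = line_match.groups()
        assert_match = ASSERT_RE.search(message)
        if assert_match:
            hits.append({
                "time": timestamp,
                "condition": assert_match.group(1),
                "line": int(assert_match.group(2)),
                "file": assert_match.group(3),
            })
    return hits
```

Calling find_assertions() on the contents of a StarterLog gives a list that can be counted per job or per machine, to see whether the failures cluster on particular execute nodes.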

 

Any ideas?

 

Peter