
[Condor-users] Problems with final transfer of files



Hello,

We are having trouble with some vanilla jobs that hit an error _after_ they have finished, and apparently after the final file transfer has taken place. This makes them restart from the beginning over and over again. I have turned on full debugging in both the starter and the shadow daemons, and yet I have found no clue about the cause.

I should say that this doesn't happen with all jobs. The ones where it happens are arguably the longest-running ones and the ones that generate the biggest files, but those files are still all below 2G (there is one 1.4G results file).

Here is the relevant part from ShadowLog:

7/5 09:11:10 (2.0) (5950): wrote 8149 bytes
7/5 09:11:10 (2.0) (5950): Entering BaseShadow::updateJobInQueue
7/5 09:11:10 (2.0) (5950): SHADOW_TIMEOUT_MULTIPLIER is undefined, using default value of 0
7/5 09:11:10 (2.0) (5950): SEC_DEBUG_PRINT_KEYS is undefined, using default value of False
7/5 09:11:10 (2.0) (5950): AUTHENTICATE_FS: used file /tmp/qmgr_Is2ATk, status: 1
7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesSent, -4030657536.000000)
7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesRecvd, 8546448.000000)
7/5 09:11:10 (2.0) (5950): condor_read(): Socket closed when trying to read buffer
7/5 09:11:10 (2.0) (5950): ERROR "Can no longer talk to condor_starter on execute machine (aaa.bbb.ccc.ddd)" at line 63 in file NTreceivers.C
7/5 09:11:10 (2.0) (5950): FileLock::obtain(1) failed - errno 37 (No locks available)
7/5 09:11:11 PASSWD_CACHE_REFRESH is undefined, using default value of 300
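
One detail in the excerpt above is the negative BytesSent value. I don't know whether it is related to the failure, but to check other jobs for the same thing, here is a minimal sketch in Python. It is my own helper, not part of Condor; the function name and the regular expression are assumptions based only on the log lines shown here, so other versions may need a different pattern.

```python
import re

# Pattern assumed from the ShadowLog lines quoted above; other Condor
# versions may format these lines differently.
SET_ATTR = re.compile(r"SetAttribute\((\w+),\s*(-?\d+(?:\.\d+)?)\)")


def negative_counters(lines):
    """Return (attribute, value) pairs whose logged value is negative."""
    found = []
    for line in lines:
        match = SET_ATTR.search(line)
        if match and float(match.group(2)) < 0:
            found.append((match.group(1), float(match.group(2))))
    return found


# Two lines copied from the ShadowLog excerpt above.
sample = [
    "7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesSent, -4030657536.000000)",
    "7/5 09:11:10 (2.0) (5950): Updating Job Queue: SetAttribute(BytesRecvd, 8546448.000000)",
]
print(negative_counters(sample))
```

To run it against a whole log, pass an open file object instead of the sample list, for example negative_counters(open("ShadowLog")), with the path adjusted to wherever the log lives on your submit machine.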


And from the corresponding StarterLog:

7/5 09:46:46 DoUpload: send file ModHarp153630.sta
7/5 09:46:46 ReliSock: put_file: sent 8149 bytes
7/5 09:46:46 DoUpload: exiting at 1413
7/5 09:46:46 ERROR "Assertion ERROR on (filetrans->UploadFiles(true, final_transfer))" at line 336 in file jic_shadow.C
7/5 09:46:46 ShutdownFast all jobs.
7/5 09:46:46 Got ShutdownFast when no jobs running.
7/5 09:46:51 PASSWD_CACHE_REFRESH is undefined, using default value of 300



(Yes, I have just realized that the clock on this machine is not set to the right time. Anyway, the difference between the two is less than 1h, and I think it shouldn't matter, as we have had problems with other machines in the pool as well.)


Thanks in advance,
   Joan