
Re: [Condor-users] Problems with final transfer of files



On Wed July 6 2005 2:53 am, Joan J. Piles Contreras wrote:
Hello,

> We are having trouble with some vanilla jobs that get an error _after_
> they are finished, and apparently after the final file transfer has
> taken place. This makes them start from the beginning over and over
> again. I have put full debug both in the starter and in the shadow
> daemons, and yet I have found no clue about it.

What version of Condor are you running on what O/S?

> It must be said that this doesn't happen in all the jobs; the ones where
> this happens are arguably the longest ones and the ones that generate
> the biggest files, but they are all still below 2G (there is one 1.4G
> results file).
>
> Here is the relevant part from ShadowLog:
>
> 7/5 09:11:10 (2.0) (5950): condor_read(): Socket closed when trying to
> read buffer
> 7/5 09:11:10 (2.0) (5950): ERROR "Can no longer talk to condor_starter
> on execute machine (aaa.bbb.ccc.ddd)" at line 63 in file NTreceivers.C
> 7/5 09:11:10 (2.0) (5950): FileLock::obtain(1) failed - errno 37 (No
> locks available)

This isn't the cause of the problems, but it concerns me.  If I'm reading the
code correctly, this error means that the user/job log code couldn't lock the
log file to log the error.  Do you have a low limit on file locks set on your
system, or some such?  In general, you shouldn't see this, I think.
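
One way to check this independently of Condor is to try taking a lock on a
file in the same directory as the user log.  The following is only a rough
sketch, not Condor's own locking code: it assumes a POSIX system with Python
available, it assumes that an fcntl-style lock is a fair stand-in for what
the log code does, and the names try_lock and describe are made up for
illustration.  On Linux, errno 37 is ENOLCK, which is what "No locks
available" refers to.

```python
import errno
import fcntl
import os


def try_lock(path):
    """Try to take and release an exclusive fcntl-style lock on `path`.

    Returns None if the lock was obtained, otherwise the errno value.
    The file is created if it does not exist (illustrative helper only).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            # Non-blocking, so a lock held by another process shows up
            # as an error instead of hanging the test.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            return exc.errno
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return None
    finally:
        os.close(fd)


def describe(result):
    """Turn the result of try_lock() into a short human-readable message."""
    if result is None:
        return "lock obtained and released"
    if result == errno.ENOLCK:
        return "ENOLCK: no locks available on this filesystem"
    return "lock failed: " + os.strerror(result)
```

Running print(describe(try_lock("/path/to/log/dir/locktest"))) with a
scratch file name in the log directory (the path here is a placeholder)
should show whether locks can be obtained there at all.  If it reports
ENOLCK, the problem lies with the system or filesystem rather than with
the job.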

> 7/5 09:11:11 PASSWD_CACHE_REFRESH is undefined, using default value of 300
>
> And from the equivalent StarterLog
>
> 7/5 09:46:46 DoUpload: send file ModHarp153630.sta
> 7/5 09:46:46 ReliSock: put_file: sent 8149 bytes
> 7/5 09:46:46 DoUpload: exiting at 1413
> 7/5 09:46:46 ERROR "Assertion ERROR on (filetrans->UploadFiles(true,
> final_transfer))" at line 336 in file jic_shadow.C

Now, this is where the actual error occurred.  Knowing which version of Condor
you are running could help narrow down where it went wrong.

> (Yes, I have just realized that the clock in this machine hasn't got the
> right time. Anyway, it's less than 1h between them, and I think it
> shouldn't matter, as we have got problems as well with other machines in
> the pool).

I don't think that this should be an issue.

> Thanks in advance,
>     Joan

Glad to help

-Nick

-- 
           <<< There is no spoon. >>>
 /`-_    Nicholas R. LeRoy               The Condor Project
{     }/ http://www.cs.wisc.edu/~nleroy  http://www.cs.wisc.edu/condor
 \    /  nleroy@xxxxxxxxxxx              The University of Wisconsin
 |_*_|   608-265-5761                    Department of Computer Sciences