
Re: [Condor-users] Too many open files

Hi Chris,

On Thursday, 3 November, 2011 at 4:30 PM, Christopher Martin wrote:


We're getting errors in the job log files indicating that there are too many files open:
007 (196430.005.000) 11/03 08:13:00 Shadow exception!
Error from slot12@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx: Failed to open '/mnt/render/jobs/job_141798_rndrgatebegin_yko_120_0400_syanye/chr_all_rp_tcrender-196430-5-stdout.txt' as standard output: Too many open files (errno 24)
0  -  Run Bytes Sent By Job
0  -  Run Bytes Received By Job

The file it's complaining about is the stdout from the job's executable. I've taken a look at the submit/scheduler machine and we're nowhere near the file limit. Same thing on the execution machine.

We are, however, logging to a Windows share mounted on the submit/scheduler machine over CIFS. We've been experiencing extremely heavy load on the Windows filer we're logging to, so I'm guessing this is a result of that, but I wanted to throw it out there in case anyone else has run into similar issues before.
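One thing worth double-checking: errno 24 (EMFILE) is a *per-process* limit, so a system-wide count can look healthy while a single daemon (the condor_schedd or a condor_shadow on the submit machine, in this case) has exhausted its own allowance. A quick sketch of the checks, assuming Linux (the PID below is illustrative -- substitute the daemon's actual PID):

```shell
# Soft per-process descriptor limit for the current shell
# (the limit that errno 24 actually reports against):
ulimit -n

# System-wide view: allocated handles, unused handles, system maximum:
cat /proc/sys/fs/file-nr

# Descriptors currently held by one specific process
# ($$ = this shell; replace with the schedd/shadow PID on your submit host):
ls /proc/$$/fd | wc -l
```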
Samba mount? I'm not particularly fond of Samba in large deployments -- it doesn't scale up well. Windows file-access semantics use locks overzealously, SMB is an aging protocol, and Samba can't really keep up. It usually adds up to disaster above 200 or so concurrent handles, no matter how powerful the underlying hardware.

Your best bet is to move logging to local disk. You could try an NFS mount instead, but NFS has file-locking issues of its own to contend with.
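As a minimal sketch of what that looks like in a submit description (paths and filenames here are illustrative -- point them at whatever local scratch directory exists on your submit host):

```
executable = render.sh
# Keep job I/O on local disk on the submit machine, not the CIFS mount;
# $(Cluster) and $(Process) keep the filenames unique per job.
output = /var/tmp/condor/out.$(Cluster).$(Process).txt
error  = /var/tmp/condor/err.$(Cluster).$(Process).txt
log    = /var/tmp/condor/log.$(Cluster).$(Process).txt
queue
```

You can always sweep the finished logs over to the filer afterwards, outside the critical path of the shadows.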

- Ian

Ian Chesal

Cycle Computing, LLC
Leader in Open Compute Solutions for Clouds, Servers, and Desktops
Enterprise Condor Support and Management Tools