
Re: [Condor-users] confusion around new spool in 7.5.5

On 2/18/11 11:12 AM, Peter Doherty wrote:

On Feb 18, 2011, at 11:26 , Peter Doherty wrote:

I upgraded to v7.5.5 and there's one thing I'm scratching my head over.

I used to have a SPOOL directory filled with directories with names like:
cluster15093481.proc0.subproc0.tmp/

According to the changelog I should now have dirs in the format of:
$(SPOOL)/<#>/<#>/cluster<#>.proc<#>.subproc<#>
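
If I'm reading the changelog right, the two <#> levels are just hash buckets derived from the cluster and proc numbers. A minimal sketch, assuming the buckets are cluster and proc modulo 10000 (an assumption on my part, but it matches the spool/1845/0/ path in the ShadowLog errors below):

    import os

    def job_spool_dir(spool, cluster, proc, subproc=0):
        # Assumption: the two bucket levels are cluster % 10000 and
        # proc % 10000 -- consistent with cluster 15101845, proc 0
        # landing in spool/1845/0/ below.
        return os.path.join(
            spool,
            str(cluster % 10000),
            str(proc % 10000),
            "cluster%d.proc%d.subproc%d" % (cluster, proc, subproc),
        )

    print(job_spool_dir("/raid0/gwms_schedd/spool", 15101845, 0))
    # -> /raid0/gwms_schedd/spool/1845/0/cluster15101845.proc0.subproc0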


But the thing is, I don't have anything.
My SPOOL just has:
job_queue.log
local_univ_execute
spool_version

I've got a few thousand jobs in the queue right now.
Where are the spool files? I'm sure I'm looking in the correct directory. I've tried to find them, but I can't. I do see a lot of lock files in $(TMP_DIR), though.

I believe the constant I/O on all the spool files was one of the bottlenecks of our Schedd, so if that's really been improved I'm eager to see the effect. But from reading the changelog, the only difference should have been subdirectories in the spool, to keep from hitting ext3 limits.
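
(For context: ext3 caps a directory at roughly 32,000 subdirectory links, and large flat directories get slow well before that. A quick way to see whether the new bucketing is actually spreading entries out -- the path here is just our SPOOL, adjust to taste:

    import os

    def largest_dir(root):
        # Walk the spool tree and report the directory holding the
        # most entries; under the old flat layout this was the spool
        # root itself, which is what could approach ext3's link limit.
        worst, count = root, 0
        for path, dirs, files in os.walk(root):
            n = len(dirs) + len(files)
            if n > count:
                worst, count = path, n
        return worst, count

    print(largest_dir("/raid0/gwms_schedd/spool"))

)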


Hmm, okay. Jobs seem to be running fine, but I see a lot of these errors in the ShadowLog:

02/18/11 12:09:25 (pid:649) (15101845.0) (649): Directory::setOwnerPriv() -- failed to find owner of /raid0/gwms_schedd/spool/1845/0/cluster15101845.proc0.subproc0.tmp
02/18/11 12:09:25 (pid:649) (15101845.0) (649): Directory::Rewind(): failed to find owner of "/raid0/gwms_schedd/spool/1845/0/cluster15101845.proc0.subproc0.tmp"

I guess that's part of the problem. I checked the permissions on the spool directory, then set it to 777 and verified that regular users can write to it, but that neither stopped the errors nor caused any files to be created there.
So I'm not really clear on what's going on.
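
(A quick sanity check that separates "directory missing" from "wrong owner or mode" -- the path is just the one from the error above:

    import os
    import pwd

    path = ("/raid0/gwms_schedd/spool/1845/0/"
            "cluster15101845.proc0.subproc0.tmp")
    if not os.path.isdir(path):
        # If this fires, "failed to find owner" really means the
        # directory was never created, not that its ownership is wrong.
        print("missing:", path)
    else:
        st = os.stat(path)
        print("owner:", pwd.getpwuid(st.st_uid).pw_name,
              "mode:", oct(st.st_mode & 0o777))

)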

It appears that these messages are expected for jobs that do not have spool directories (see my other message). Therefore, we should fix the shadow not to generate noise in this case.
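
The shape of that fix is just a guard before the owner lookup. A rough sketch, in Python for brevity (the shadow itself is C++, and the function name here is made up):

    import os

    def maybe_set_owner_priv(path, log):
        # Hypothetical guard: jobs that never spooled input have no
        # spool directory, so skip quietly instead of logging an error.
        if not os.path.isdir(path):
            log("no spool directory for this job, skipping: %s" % path)
            return False
        # ... the existing find-owner / switch-priv logic would go here ...
        return True

    maybe_set_owner_priv("/raid0/gwms_schedd/spool/1845/0/"
                         "cluster15101845.proc0.subproc0.tmp", print)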

--Dan