
Re: [Condor-users] HUGE condor spool files?



Joseph:

The large files in your SPOOL directory are checkpoint files.  They are
written because your jobs were submitted to the STANDARD universe, which
checkpoints jobs periodically.  Each checkpoint file is about as large
as the job's runtime image, which is why the files in your listing run
to roughly 280 MB each.  The Condor manual has more on checkpointing:
http://www.cs.wisc.edu/condor/manual/v6.6/4_2Condor_s_Checkpoint.html#17477

There are several ways to reduce the disk space your SPOOL directory
needs:

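1). First, consider whether you need checkpointing.  If not, submitting
to the VANILLA universe may suit your requirements.  Vanilla jobs are
not checkpointed, so nothing is written to SPOOL for them, but a job
that is evicted restarts from the beginning.  See the condor_submit
manual page: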
http://www.cs.wisc.edu/condor/manual/v6.6/condor_submit.html

2). Consider increasing the size of your SPOOL directory, for example by
pointing the SPOOL configuration setting at a larger partition.
http://www.cs.wisc.edu/condor/manual/v6.6/3_3Configuration.html#7678

3). Checkpoints are stored in the SPOOL directory of the job submit
host.  Consider submitting jobs from multiple submit hosts to divide the
per-host SPOOL size required.

4). Finally, the Condor Checkpoint Server is intended for exactly this
situation.  Consider installing a Checkpoint Server; checkpoints will
then be stored on the server rather than in the SPOOL directory.
http://www.cs.wisc.edu/condor/manual/v6.6/3_4Contrib_Module.html#10907
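
As a rough sketch of strategies 1, 2 and 4: the program name, file
names, path and host name below are placeholders, not taken from your
setup, and the setting names should be checked against the manual pages
linked above before use.

A minimal submit description file for the VANILLA universe might look
like this:

```
# Hypothetical submit description file, e.g. my_job.submit
# Vanilla universe: no checkpoints are written to SPOOL, but an
# evicted job restarts from the beginning.
universe   = vanilla
executable = my_program
output     = my_program.out
error      = my_program.err
log        = my_program.log
queue
```

For strategies 2 and 4, the changes go in the Condor configuration
rather than the submit file.  The two are alternatives; you would
normally pick one:

```
# Hypothetical additions to the local Condor configuration file.

# Strategy 2: point SPOOL at a larger partition (placeholder path).
SPOOL = /bigdisk/condor/spool

# Strategy 4: send checkpoints to a Checkpoint Server instead of SPOOL.
# The host name is a placeholder; the server itself must be installed
# and configured separately as described in the manual.
USE_CKPT_SERVER  = True
CKPT_SERVER_HOST = ckpt.example.edu
```

If you relocate SPOOL, the new directory presumably needs the same
ownership and permissions as the old one, and the existing contents
(job queue and accountant logs) should be moved across while Condor is
stopped.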

	Jeff


On Mon, 2005-01-03 at 21:04, Joseph Turian wrote:
> I am finding HUGE condor spool files. They exhaust disk space and then
> everything starts going wrong.
> 
> Why are these spool files so large? Is there some way to disable them?
> 
>    Joseph
> 
> condor_version:
> $CondorVersion: 6.6.7 Oct 11 2004 $
> $CondorPlatform: I386-LINUX_RH80 $
> 
> # ls -l /usr/local/condor/local.coop/spool/
> total 9056520
> -rw-------    1 condor   condor     176128 Jan  3 21:51 Accountantnew.log
> -rw-------    1 condor   condor          0 Jan  3 22:00 Accountantnew.log.tmp
> -rw-r--r--    1 condor   condor   283403263 Jan  3 18:18 cluster840.proc0.subproc0
> -rw-r--r--    1 condor   condor   233517056 Jan  3 21:17 cluster840.proc0.subproc0.tmp
> -rw-r--r--    1 condor   condor   283952127 Jan  3 18:20
[snip]
-- 
Jeff Weber
University of Wisconsin, Madison