
Re: [Condor-users] Leftovers of checkpointed jobs accumulate in SPOOL



On Mon, Mar 12, 2012 at 11:24:09AM -0500, Alan De Smet wrote:
> Michael Hanke <michael.hanke@xxxxxxxxx> wrote:
> > I'm testing DMTCP-based checkpointing of vanilla jobs in our Condor pool
> > (all version 7.7.5). I noticed that jobs once evicted remain in SPOOL
> > even after they got restarted on an exec node again. Checkpoint files,
> > executable, restart script and various other files remain -- I assume
> > that is just everything.
> 
> I'm not clear what you're reporting.  Files in SPOOL should
> remain as long as the associated job is still in the queue.  Are
> you saying that the job in question left the queue (is no longer
> visible in condor_q), but still has a subdirectory in SPOOL? 

Yes, that is the case -- sorry for having been vague. It doesn't seem to
make a difference whether a job terminates normally or gets
condor_rm'ed. Once a job has been checkpointed at least once, its files
stay behind in SPOOL after it has left the queue.
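
To make the report easier to reproduce, here is a rough sketch of how one
could list SPOOL entries that belong to jobs no longer in the queue. It is
only an illustration: the entry naming pattern (cluster<N>.proc<M>.subproc<K>)
is an assumption on my part and should be checked against the actual
contents of SPOOL, and find_leftovers and parse_job_ids are hypothetical
helper names, not part of Condor.

```python
import os
import re

# ASSUMPTION: per-job spool entries are named like "cluster12.proc0.subproc0"
# (optionally with a ".tmp" suffix). Verify this against your own SPOOL.
SPOOL_ENTRY = re.compile(r"^cluster(\d+)\.proc(\d+)\.subproc\d+(?:\.tmp)?$")


def parse_job_ids(text):
    """Parse lines such as '12.0' into a set of (cluster, proc) tuples."""
    jobs = set()
    for line in text.splitlines():
        line = line.strip()
        if line:
            cluster, proc = line.split(".")
            jobs.add((int(cluster), int(proc)))
    return jobs


def find_leftovers(spool_dir, queued_jobs):
    """Return sorted paths of spool entries whose job is not in queued_jobs."""
    leftovers = []
    for root, dirs, files in os.walk(spool_dir):
        for name in dirs + files:
            match = SPOOL_ENTRY.match(name)
            if match is None:
                continue
            job = (int(match.group(1)), int(match.group(2)))
            if job not in queued_jobs:
                leftovers.append(os.path.join(root, name))
        # Do not descend into per-job directories themselves.
        dirs[:] = [d for d in dirs if not SPOOL_ENTRY.match(d)]
    return sorted(leftovers)
```

The queued job IDs could come from the output of
condor_q -format "%d." ClusterId -format "%d\n" ProcId, and the directory
from condor_config_val SPOOL. The script only reports candidates; it
deliberately deletes nothing.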

> If so, that would likely be a bug.  That it's using DMTCP
> checkpointing shouldn't have any impact on the behavior, although it's
> possible that the DMTCP integration code is somehow tickling a Condor
> bug other code isn't.

This might be related to Condor transferring checkpoint files back to
the submit machine upon job completion -- maybe those files are
considered job output.

Michael

-- 
Michael Hanke
http://mih.voxindeserto.de