
Re: [HTCondor-users] Are flocking jobs considered remote jobs?

Hi Jacek,

Whether a job is flocking is independent of whether SPOOL is used; use of SPOOL is a per-job setting.

Now, the *use* of flocking might trigger an unexpected use of SPOOL.  For example, if you have:

when_to_transfer_output = ON_EXIT_OR_EVICT

That setting transfers the job's output files (potentially many) to the spool whenever the job is preempted.  To quote the manual:

The ON_EXIT_OR_EVICT option is intended for fault tolerant jobs which periodically save their own state and can restart where they left off. In this case, files are spooled to the submit machine any time the job leaves a remote site, either because it exited on its own, or was evicted by the HTCondor system for any reason prior to job completion. The files spooled back are placed in a directory defined by the value of the SPOOL configuration variable. Any output files transferred back to the submit machine are automatically sent back out again as input files if the job restarts.

So, if the job can get evicted while flocking, but you don't have eviction enabled on the "home pool", then this previously-innocuous setting can certainly cause problems!
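If jobs don't actually need evict-time checkpointing, the default transfer mode avoids the spooling entirely. A submit-file sketch (executable name is illustrative):

    # Illustrative submit file: ON_EXIT (the default) transfers output
    # only when the job completes, so nothing is spooled on eviction.
    # Trade-off: a fault-tolerant job loses its saved state on eviction.
    executable = my_job.sh
    when_to_transfer_output = ON_EXIT
    queue

This only applies if the fault-tolerance behavior of ON_EXIT_OR_EVICT isn't needed, of course.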

A few options:

1.  The spool can be moved to a filesystem with per-user quotas, as the files are owned by the job owner in current versions of HTCondor.  (This is a common choice.)  The user may get strange error messages if the spool runs out of quota, but at least their jobs don't disrupt the schedd.
2.  There's the useful-but-tricky ALTERNATE_JOB_SPOOL configuration variable.  That would allow you to place a spool in each user's home directory (note: you may need to pre-create this directory).  Useful if quotas are enforced at the directory level (à la Ceph) rather than the user level.
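A sketch of option 2 (paths are illustrative): ALTERNATE_JOB_SPOOL is evaluated as a ClassAd expression in the context of each job ad, so job attributes such as Owner can be used to build a per-user path.

    # condor_config sketch: keep each job's spool under the owner's
    # home directory instead of the shared SPOOL on the submit node.
    # The target directories may need to be created in advance.
    ALTERNATE_JOB_SPOOL = strcat("/home/", Owner, "/.condor_spool")

With that in place, runaway spool usage counts against the individual user's quota rather than filling the filesystem that holds /var/lib/condor/spool.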

Sorry to hear about your dead schedd on Wednesday!


> On Apr 6, 2022, at 1:57 PM, Jacek Kominek via HTCondor-users <htcondor-users@xxxxxxxxxxx> wrote:
> Hi,
> We run into an issue today where our submit node run out of storage space, which killed schedd. It turns out that the culprit was a couple of job subdirectories under '/var/lib/condor/spool/' and it seems that these were submitted as flocking jobs to another pool. The manual states in various places that the SPOOL dir contains both input and output files from remote jobs and we wonder if that could be the case here? If so, a second question is whether flocking jobs can be configured so that thee outputs are written directly to the user's folder where they were submitted from, or do they always have to go through the SPOOL?
> Of course, one solution is to move SPOOL to a different location on our end that will not run out of space, and we might just end up doing that, but it would be good to know how it is supposed to work regardless.
> Thanks in advance,
> Best,
> -Jacek
> -- 
> Jacek Kominek, PhD
> University of Wisconsin-Madison
> 1552 University Avenue, Wisconsin Energy Institute 4154 Madison, WI
> 53726-4084, USA
> jkominek@xxxxxxxx
> _______________________________________________
> HTCondor-users mailing list
> To unsubscribe, send a message to htcondor-users-request@xxxxxxxxxxx with a
> subject: Unsubscribe
> You can also unsubscribe by visiting
> https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users
> The archives can be found at:
> https://lists.cs.wisc.edu/archive/htcondor-users/