
Re: [Condor-users] disk space on file server fills up and condor drops the complete output



> While the job is running the network drive fills up
> The job finishes and condor tries to transfer the results
> back to this network resource.
> This fails due to lack of space, but during this attempted
> copy it still deletes the original files from the condor server.
>
> Currently I take backups of /condor/execute fairly regularly
> throughout the day. However if this problem occurs at the
> beginning of the weekend we can lose two days of running time.
>
> Has anyone seen this issue before? Do you know of a workaround
> or fix for it?

Yup, we see it all the time. As part of your job flow, test the copy
back, and if it fails with an out-of-space error, put the job to sleep
instead of ending it. Wake up periodically, test again, and repeat. You
can even have the job send email if it ends up in this state, stuck on a
machine, so the user can grab the data from the remote machine's drive
and opt to just kill the job forcefully with condor_rm.
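
A rough sketch of that retry loop in Python, assuming the job wrapper
does the copy to the network drive itself rather than leaving it to
Condor. The names copy_back, on_stuck and retry_seconds are made up for
illustration, and on_stuck stands in for whatever sends the email at
your site:

```python
import errno
import shutil
import time
from pathlib import Path


def is_out_of_space(exc):
    """Return True if the OSError means the destination is full or over quota."""
    # EDQUOT is not defined on every platform, so look it up defensively.
    full_codes = {errno.ENOSPC, getattr(errno, "EDQUOT", errno.ENOSPC)}
    return exc.errno in full_codes


def copy_back(src_dir, dest_dir, retry_seconds=600, on_stuck=None,
              sleep=time.sleep, copy=shutil.copy2):
    """Copy every file in src_dir to dest_dir, sleeping and retrying when full.

    The source files are never deleted here, so nothing is lost while
    the job waits for space. on_stuck (hypothetical hook) is called once,
    on the first out-of-space failure, e.g. to send a notification.
    """
    notified = False
    while True:
        try:
            for src in sorted(Path(src_dir).iterdir()):
                if src.is_file():
                    copy(src, Path(dest_dir) / src.name)
            return  # everything copied
        except OSError as exc:
            if not is_out_of_space(exc):
                raise  # some other failure: do not hide it
            if on_stuck is not None and not notified:
                on_stuck(exc)
                notified = True
            sleep(retry_seconds)  # wait, then try the whole copy again
```

The wrapper would call copy_back(scratch_dir, nas_dir) as its last step
and only exit once it returns. A job stuck in the loop still holds its
machine, which is why the notification matters.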

We monitor our free NAS space with Nagios and admins with pagers get
emails on NAS events (like less than 10% free space left) and can do
things like add temporary space to get us through a weekend.
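
This is not our actual Nagios check, just a minimal sketch of the same
idea in Python. It uses the usual Nagios plugin exit codes (0 for OK, 2
for CRITICAL); the function name and the 10% default are assumptions
for illustration:

```python
import shutil


def check_free_space(path, critical=0.10, disk_usage=shutil.disk_usage):
    """Return (exit_code, message) in the style of a Nagios plugin.

    critical is the free-space fraction below which the check fails.
    disk_usage is injectable only so the check can be tested without
    filling a real disk.
    """
    usage = disk_usage(path)
    free_fraction = usage.free / usage.total
    if free_fraction < critical:
        return 2, f"CRITICAL - {free_fraction:.0%} free on {path}"
    return 0, f"OK - {free_fraction:.0%} free on {path}"
```

A small script would print the message and pass the code to sys.exit()
so the monitoring system can pick it up.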

- Ian
