
Re: [HTCondor-users] Files in /var/lib/condor/execute/ filling up disk



Well, as I said, I think the condor daemons crashed because the disk was full. And I think the disk was full because condor filled it. But it does not seem to be happening again. This may be an edge condition. I am not entirely sure the central manager was right. I am just going to forget about it unless it happens again.



On 6/12/19 12:07 PM, John M Knoeller wrote:
Do you know when or why the HTCondor daemons stopped running?

Hopefully that information is in the daemon logs. Run

condor_config_val STARTD_LOG STARTER_LOG

to find where the StartdLog and StarterLog.* files are.

The condor_startd daemon should clean up the execute directory when it starts up,
as well as when a job exits or crashes and leaves files behind in the execute directory.
condor_vacate should not be necessary in this situation. In fact, I would expect
condor_vacate to do nothing in this situation.

If the condor_startd was hard killed by some external process or user, it is plausible
that this would leave behind files in execute until the HTCondor daemons had a chance to run
again. But if the HTCondor daemons did an orderly shutdown, then it is a bug that
the execute directory was not cleaned up as part of shutdown, so please let us know.

-tj

-----Original Message-----
From: HTCondor-users <htcondor-users-bounces@xxxxxxxxxxx> On Behalf Of John G Heim
Sent: Tuesday, June 11, 2019 11:03 AM
To: HTCondor-Users Mail List <htcondor-users@xxxxxxxxxxx>
Subject: [HTCondor-users] Files in /var/lib/condor/execute/ filling up disk

This morning I found that 2 of the machines in my condor cluster were
essentially down. It turns out that it was because the disk was full, and
that was because there were several hundred gigabytes of files in
/var/lib/condor/execute/. I guessed that condor_vacate would remove them,
but condor_vacate returned an error message indicating that the condor
daemon was not running. I had to clear some space on the disk before I
could restart the condor daemons. At that point condor_vacate worked and
the machines returned to normal operation.

I would like to keep this from happening again. Any ideas on how?
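
One way to get early warning is to check the execute directory from cron or a monitoring system. The following is only a minimal sketch, not anything HTCondor provides: the function names and the 10% free-space threshold are made up for illustration, and the path is the one from this message. It reports how many bytes sit under the execute directory and whether the filesystem holding it is running low on space.

```python
import os
import shutil


def dir_size_bytes(path):
    """Return the total size in bytes of the files under path (symlinks are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # lstat reports the link itself rather than its target
                total += os.lstat(full).st_size
            except OSError:
                # The file vanished between listing and stat (a job may be running)
                pass
    return total


def execute_dir_report(execute_dir, min_free_fraction=0.10):
    """Summarize execute-directory usage; min_free_fraction is a hypothetical threshold."""
    usage = shutil.disk_usage(execute_dir)  # filesystem that holds execute_dir
    free_fraction = usage.free / usage.total
    return {
        "execute_bytes": dir_size_bytes(execute_dir),
        "free_fraction": free_fraction,
        "low_space": free_fraction < min_free_fraction,
    }


if __name__ == "__main__":
    # Path taken from this thread; adjust it if your configuration differs.
    path = "/var/lib/condor/execute"
    if os.path.isdir(path):
        print(execute_dir_report(path))
    else:
        print("execute directory not found:", path)
```

If the execute directory lives somewhere else on your machines, condor_config_val EXECUTE should print the configured location. The script only reports; it deliberately does not delete anything, since files there may belong to running jobs.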