
Re: [HTCondor-users] Schedd RAM usage exploding after condor_hold of 10k jobs



Hi Todd,


On 31 March 2016 at 21:09, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:


FWIW, I did a quick test with HTCondor 8.5.4 on my Windows 7 laptop, and doing the same condor_hold you did below on 10,000 jobs resulted in the private memory of the schedd increasing by 32MB and the job_queue.log increasing by 2MB. I know I am not comparing apples to apples, but there is a big difference between my observed increase of 32 MB and your observed increase of 15GB.
This is what I would naively expect.

Brian has a good point: are you using custom ClassAd functions at all?
No.
If there is a memory leak in the custom Python functions, that could explain the intense increase in RAM usage, but not necessarily the increase in job_queue.log file size... The PNG was memory usage of just the schedd process alone, right? I wonder if something else was happening with your schedd, and the condor_hold is a red herring...
This is the entire memory usage, but in 'top' the condor_schedd process was leading in memory consumption when I checked after the first restart attempt (and steadily growing), which fits the graph (minus some offset).
It seems that the condor metrics are not published for this particular machine ("Failed to execute gmetric -h: Permission denied")! Going to fix that.

Cheers,
Luke

regards,
Todd



As far as I know the user queued each job separately (i.e. the jobs are in
separate clusters rather than sharing one: 694488.0, 694489.0, ...).
The command I used to put the jobs on hold was:
condor_hold <username> -constraint JobStatus==1 -reason '<bad user, blah blah>'
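
(As an aside, a quick way to see how many jobs such a constraint would match
before issuing the hold is something like the following; this is an
illustrative sketch only, with <username> standing in for the actual owner:

    # count the user's idle jobs that the hold constraint would match
    condor_q <username> -constraint 'JobStatus == 1' -autoformat ClusterId | wc -l
)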

I tried to stop the condor service since we were at risk of running out
of RAM & swap; however, both condor_master and condor_schedd were
hanging. In the meantime, no command was able to query the scheduler.

After some deliberation it was decided to kill the processes and start
the service from scratch -> same pattern (RAM & swap).
While I was chatting with the users and debating whether to delete the
queue (most jobs were from the bad user), I added 100 GB of swap, updated
condor to 8.4.5 and attempted another restart.

Sometime during this the user in question must have deleted his held
jobs, as I found the queue containing only his running jobs and other users' jobs.

All in all, I am pleasantly surprised that, whatever happened with the
held jobs, other users did not suffer.

Thanks for the reply and for reading :).

Cheers,
Luke



On 31 March 2016 at 20:02, Todd Tannenbaum <tannenba@xxxxxxxxxxx> wrote:

  On 3/31/2016 7:55 AM, L Kreczko wrote:

    Dear experts,

    I am trying to understand the schedd behaviour I witnessed today.
    After sending 10k (bad) jobs to hold status, the RAM usage of the
    condor_schedd process exploded (see attached png).

    The job_queue log is now 9.3GB and contains all ClassAds of the held
    jobs (I assume this is what is causing the RAM usage).
    This was not the case when the jobs were idle. Is this
    behaviour expected?
    Can I do something to prevent this from happening?

    Cheers,
    Luke


  Hi Luke,

  What HTCondor version / operating system are you using?

  Including version information in any incident report is always a
  good idea. :)

  Also, did you submit these 10k jobs via 10,000 invocations of
  condor_submit, or via one invocation with "queue 10000" ?

  Just to be sure we have the correct facts: you submitted the 10k
  jobs, and memory usage of the schedd was fine (i.e. less than 5 gig
  according to your graph). Then schedd memory usage exploded to
  15GB+ as soon as you did the condor_hold, and most (all?) of the
  jobs you put on hold were previously in the idle state.

  Also, could you send the output of
     condor_schedd -v
  and
     condor_config_val -dump QUEUE

  As to whether there is something you can do to prevent this: once we have
  clarification on the above, we can investigate more (i.e. reproduce
  here) and hopefully give better advice. Until then I cannot say
  precisely what is going on, so my naive advice in the meantime would
  be to run the latest release in whatever series you are using, and
  perhaps hold jobs a chunk at a time, e.g. 500 at a time could be
  done like
     condor_hold -constraint 'ClusterId > 5000 && ClusterId <= 5500'
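
  A rough sketch of doing that in a loop (untested; it assumes bash, and that
  the bad jobs fall in one contiguous ClusterId range -- adjust the bounds,
  chunk size and reason to taste):

     # hold the jobs 500 clusters at a time instead of all 10k in one transaction
     for start in $(seq 5000 500 14500); do
         condor_hold -constraint "ClusterId > $start && ClusterId <= $((start + 500))" \
                     -reason 'bad user jobs, held in chunks'
         sleep 10   # give the schedd a moment to commit each transaction
     done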

  Certainly HTCondor should be able to handle putting 10k jobs on hold
  in one go. As to what I think is going on: when you do condor_hold
  (or similar) on a large group of jobs all at once, either all the
  jobs will go on hold or none of them will (i.e. database-style
  transactional processing). The schedd will therefore store the 10k
  changes in a transaction log in RAM... I wouldn't expect this log to
  take many gigs of RAM, however! One improvement we've had in mind
  for a while (mainly for speed) is, instead of writing 10k transaction
  log entries, to write a single transaction log action that
  effectively records a constraint like "all jobs" or whatever you
  gave to condor_hold... A downside of implementing this is that it
  would not be forwards compatible - i.e. after upgrading to a new
  schedd with this feature, you may not be able to downgrade anymore
  (because the job_queue.log file may contain entries an old schedd
  would not understand).
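
  To make that concrete, here is an illustrative sketch of what each hold adds
  to job_queue.log (the record numbers and exact attribute set are quoted from
  memory of the ClassAd log format, so treat them as approximate and check
  against your own file):

     # each held job contributes a small batch of SetAttribute (103) records
     # inside a BeginTransaction (105) / EndTransaction (106) pair, roughly:
     #   105
     #   103 694488.0 JobStatus 5
     #   103 694488.0 HoldReason "<bad user, blah blah>"
     #   103 694488.0 EnteredCurrentStatus 1459460000
     #   106
     # a rough count of hold-related records in the log:
     grep -c '^103 .* JobStatus 5' job_queue.log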

  Absolute worst case, you could shut down HTCondor and remove
  everything in the $(SPOOL) directory, effectively flushing all your
  jobs to the bit bucket. Then, before restarting, you could set the
  config knob SCHEDD_CLUSTER_INITIAL_VALUE to a number higher than your
  previous job id so that you don't repeat job id numbers, if you care
  about that. Of course it shouldn't have to come down to this
  extreme option, but I thought I'd mention it just in case everything
  is on fire and restarting HTCondor doesn't help.
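
  A hedged sketch of that worst-case procedure (the local config file path and
  the 'condor' user name are assumptions -- adapt to your install, and set the
  old spool aside rather than deleting it outright):

     condor_off -daemon schedd                  # or stop the condor service entirely
     SPOOL=$(condor_config_val SPOOL)
     mv "$SPOOL" "$SPOOL.bad.$(date +%F)"       # keep the old queue around, just in case
     mkdir "$SPOOL" && chown condor: "$SPOOL"   # 'condor' user assumed; match the old dir's owner
     # pick a value safely above the last ClusterId you handed out (e.g. 694489 here):
     echo 'SCHEDD_CLUSTER_INITIAL_VALUE = 800000' >> /etc/condor/condor_config.local
     condor_on -daemon schedd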

  Thanks
  Todd

--
Todd Tannenbaum <tannenba@xxxxxxxxxxx>  University of Wisconsin-Madison
Center for High Throughput Computing    Department of Computer Sciences
HTCondor Technical Lead                 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132                   Madison, WI 53706-1685

--
*********************************************************
 Dr Lukasz Kreczko
 Research Associate
 Department of Physics
 Particle Physics Group

 University of Bristol
 HH Wills Physics Lab
 Tyndall Avenue
 Bristol
 BS8 1TL

 +44 (0)117 928 8724

 A top 5 UK university with leading employers (2015)
 A top 5 UK university for research (2014 REF)
 A world top 40 university (QS Ranking 2015)
*********************************************************