Run Python Applications Efficiently with malloc_trim (reliability.substack.com)
76 points by ublaze on Nov 16, 2020 | hide | past | favorite | 30 comments


Memory usage is an interesting problem, because the failure mode is so much more painful (crashes and complete lock-ups) than high CPU usage.

And even if you return memory to the OS, that doesn't actually solve too-high memory usage.

Some things you can do:

The article mentions `__slots__` for reducing object memory use, and other approaches include just having fewer objects: for example, a dict of lists uses far less memory than a list of dicts with repeating fields. And you can also in many cases switch to a dataframe with Pandas, saving even more memory (https://pythonspeed.com/articles/python-object-memory/ covers all of those).
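A minimal sketch of the `__slots__` saving mentioned above, measured with `sys.getsizeof` (the class names here are illustrative):

```python
import sys

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class SlotPoint:
    __slots__ = ("x", "y")  # no per-instance __dict__ is created
    def __init__(self, x, y):
        self.x = x
        self.y = y

p, sp = Point(1, 2), SlotPoint(1, 2)
plain = sys.getsizeof(p) + sys.getsizeof(p.__dict__)  # instance + attribute dict
slotted = sys.getsizeof(sp)                           # slots live inline in the instance
print(plain, slotted)  # the slotted instance is substantially smaller
```

Multiply the difference by millions of instances and it adds up quickly.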

For numeric data, a NumPy array gets rid of the per-integer overhead of Python objects, so a Python list of numbers uses way more memory than an equivalent NumPy array (https://pythonspeed.com/articles/python-integers-memory/).
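The same idea can be seen with the stdlib `array` module (used here instead of NumPy to keep the sketch dependency-free); summing `sys.getsizeof` per element is an approximation, since small ints are shared, but the contrast is stark either way:

```python
import sys
from array import array

nums = list(range(100_000))
packed = array("q", nums)  # 8-byte signed integers, stored contiguously

# A list stores 100k pointers plus 100k full int objects (~28 bytes each);
# the array stores 100k raw 8-byte machine integers.
list_bytes = sys.getsizeof(nums) + sum(sys.getsizeof(n) for n in nums)
array_bytes = sys.getsizeof(packed)
print(list_bytes, array_bytes)
```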


Another thing you can do is deduplicate strings in memory. I got a 17% memory reduction when loading JSON objects with some repeated values.

https://thehftguy.com/2020/09/25/how-to-deduplicate-string-o...
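A hedged sketch of one way to do this: keep a canonical-copy cache so equal strings share a single object (the `dedup` helper is hypothetical; the stdlib `sys.intern` offers something similar):

```python
_strings: dict = {}

def dedup(s: str) -> str:
    """Return a canonical copy of s, so equal strings share one object."""
    return _strings.setdefault(s, s)

# Strings built at runtime are usually distinct objects even when equal
# (CPython only interns compile-time constants and some identifiers):
a = "".join(["user-", "42"])
b = "".join(["user-", "42"])

da, db = dedup(a), dedup(b)
print(da is db)  # one object now backs both references
```

Applied while parsing a stream of JSON records with repetitive field values, each duplicate costs one dict lookup instead of a full string's worth of memory.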


There's a great talk from a company that had some Netflow SaaS product, that I can't find right now.

They were suffering from GC pauses etc. on their ingestion hosts. They spent ages experimenting with various ways of tweaking garbage collection in Java, using different GCs, tweaking settings, etc. They even spent time experimenting with manual memory allocation, but found that to be extremely painful and somewhat fragile.

In the end they found that all they really needed to do was just produce less garbage, which was their ultimate "well duh!" revelation. They spent time looking at what was actually producing the garbage, and how they could avoid it. Got rid of a lot of standard coding patterns from their code in favour of new patterns that reduced object allocation, and away went all their problems.


DataClassFrames also takes the approach of storing lists of dataclasses as a “dataclass of arrays”, known as data-oriented design: https://github.com/joshlk/dataclassframe

(Full disclosure I’m the author of the project)


FWIW your comment caught my interest as a longstanding user of data frames and Python [1].

But I found the README quite confusing. I guess it's a new project so that's understandable.

Why use DataClassFrames and not pandas? Because it's statically typed? The comparison table puts pandas and dataclassframes on equal footing, and the rest of the README doesn't make much sense to me.

[1] and I thought about putting them in my shell project: http://www.oilshell.org/blog/2018/11/30.html

Although QTSV is probably in the more immediate future: https://news.ycombinator.com/item?id=25022836


Disabling swap on Linux has helped me handle memory better in low-memory environments, especially avoiding lock-ups, even when the code is memory-optimised.


I've also seen some tricks like stripping docstrings from the release since those strings (hopefully) aren't used in production.
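For illustration, `compile(..., optimize=2)` mirrors what `python -OO` does: docstrings (and asserts) are dropped from the compiled code.

```python
src = 'def f():\n    """A big docstring."""\n    return 1\n'

code_normal = compile(src, "<mem>", "exec", optimize=0)
code_stripped = compile(src, "<mem>", "exec", optimize=2)  # same as -OO

ns0, ns2 = {}, {}
exec(code_normal, ns0)
exec(code_stripped, ns2)
print(ns0["f"].__doc__)  # the docstring survives
print(ns2["f"].__doc__)  # None: stripped at compile time
```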


I can't imagine that has any significant impact on memory usage, at least for normal multi-GB-RAM servers: the docstrings will only be loaded once.


That's nothing. Armin Ronacher was complaining somewhere about the performance hit from static typing in Python, and I never thought about that before. It'd be best if all of that were removed in the prod build along with unused imports, kind of like what TypeScript does.


Imports can't be safely stripped automatically in Python (or even 100% safely re-ordered), since they can have side effects.

We run a linter that confirms with the user before stripping them, which works out well in practice.


It's apparently true for RPython too, which relies on the assumption that "methods and other class attributes do not change after startup".


There used to be a performance hit because the annotation objects were all instantiated and attached to the function objects. This is no longer the case: they are only instantiated when inspected.


like PEP484 type annotations having a runtime performance hit? I would be very interested in seeing a link to this if you can dig it up.


I use https://github.com/hakavlad/nohang and it's a game changer


Look at the new tools:

https://github.com/hakavlad/prelockd

https://github.com/hakavlad/memavaild

It can greatly improve responsiveness. Demo:

https://youtu.be/QquulJ06dAo - playing supertux + 12 `tail /dev/zero` in background

https://youtu.be/DsXEWvq60Rw - `tail /dev/zero`, swap on HDD, memavaild, no freezes


Very interesting: Should I still use nohang in addition to prelockd and memavaild? I mean prelockd could/should trigger the OOM earlier but nohang would still catch it earlier thanks to PSI?

What about zram and zswap advantages?

BTW you're helping make the world a better place, but only nerds download those tools. It would help even more if you could lobby for such tools being the default in distros such as Manjaro, Arch, Ubuntu, Fedora, etc.


There's... a lot of hand waving in this article and no numbers.

I'd be pretty surprised if `malloc_trim` had a significant effect on cpython memory usage as most python memory gets allocated in 256KiB "arenas", which, what with fragmentation, are unlikely to _ever_ be reclaimed.

On the other hand, the article dismisses threads with some vagaries around the GIL, suggesting people need to reach straight for processes if they're serious. Really, unless your code has almost no I/O or C-accelerated, GIL-less sections, if you're not using both threads and processes, you're just burning memory unnecessarily.

(edit: oh and then there's async but I'm a bit old-fashioned for that)
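The threads-plus-processes point can be sketched with the stdlib; a minimal example where `time.sleep` stands in for blocking I/O (during which the GIL is released), so eight tasks overlap inside one process instead of needing eight workers:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_task(i):
    time.sleep(0.05)  # stands in for a blocking read; the GIL is released here
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(io_task, range(8)))
elapsed = time.monotonic() - start
# Eight 50 ms waits overlap, so total wall time stays well under 8 * 50 ms.
print(results, elapsed)
```

A typical setup then runs a handful of processes (for the CPU-bound parts) each with a thread pool like this for the I/O, rather than one process per concurrent request.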


One thing that has been known for a while is that there are allocation patterns where LibC malloc doesn't give back a sufficient amount of pages even when the pages are clean.

See for example https://www.joyfulbikeshedding.com/blog/2019-03-14-what-caus... . In the end I think the consensus was that jemalloc was just better than invoking malloc_trim, but invoking malloc_trim now and then can certainly be a lot better than using neither malloc_trim nor jemalloc.


I discovered a similar trick a few years ago: https://stackoverflow.com/questions/35660899/reduce-memory-f...

The idea is to tweak when `mmap` or `malloc` is used by the Python interpreter. One allows memory to be released to the OS right away, whereas the other does not.

It is a useful trick if your application is generating lots of small objects.
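The trick boils down to glibc malloc tunables, which can be set via environment variables before the interpreter starts (glibc-specific; the 16 KiB values here are illustrative, not tuned):

```shell
# Force glibc malloc to satisfy allocations above 16 KiB via mmap, and to
# trim the heap aggressively, so freed blocks go back to the OS sooner:
MALLOC_MMAP_THRESHOLD_=16384 MALLOC_TRIM_THRESHOLD_=16384 python app.py
```

The trailing underscores are part of the variable names. Lowering the mmap threshold trades some allocation speed for promptly-returned memory.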


Keep in mind that’s from Python 2.5. Modern Python uses arenas that always use mmap.


I developed that tweak for Python 2.7 and I am running in production with Python 3.6. I tested with later Python versions, too.

Do you have some pointers about arenas always using mmap? I'd like to know how that trick can work if that was the case.


It doesn't work, that's their point. With modern python versions those env variables do next to nothing. It won't crash your python but it also won't help you.


My point is that it does work. I tested with Python 2.7 and Python 3.6+.


That's a neat trick and even simpler than what I wrote about.


From TFA...

> For example, GC pauses are notorious in other managed memory languages like Java.

(includes link to 2017 article:

https://dzone.com/articles/how-to-reduce-long-gc-pause

)

The author might like to know that since 2017, Java has two of the most sophisticated Garbage Collectors on the planet.

1. Red Hat's Shenandoah GC

2. Oracle's ZGC

(I could also mention Azul Systems' work on its C4 collector)

Both Shenandoah and ZGC claim to run on multi terabyte heaps with 1 ms max pause times.


Somewhat related, although this is for a huge scale:

Adaptive process and memory management for Python web servers https://instagram-engineering.com/adaptive-process-and-memor...

> Previously we used two per-worker thresholds to control respawn: reload-on-rss and evil-reload-on-rss.

> ...

> However, since worker respawn is expensive (i.e. there are warm-up costs, LRU cache repopulation, etc.)

For a smaller scale, I always wondered if FastCGI being the "default" would have saved a lot of headaches. Your workers just get recycled automatically all the time.

If you can make your startup fast enough (and I think most apps can), then you can just let the OS do its job. Although it's true that Python can be really slow to start if you import many modules...

https://news.ycombinator.com/item?id=24683304


If the app is long running and forks, you may also want to look into gc.freeze() (added in 3.7), which will save you from copy-on-write duplication of the imports' memory over time and make GC runs shorter.
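A minimal sketch of the pre-fork pattern (the heavy imports here are stand-ins for a real app's warm-up):

```python
import gc
import json, decimal  # stand-ins for a real app's heavy imports / warm-up

# Move everything allocated so far into a permanent generation: the GC will
# never touch these objects again, so their refcount/GC headers aren't
# dirtied in child processes, preserving copy-on-write sharing after fork.
gc.freeze()
frozen = gc.get_freeze_count()  # number of objects exempted from collection
print(frozen)

# ...in a real pre-fork server, os.fork() / worker spawn would happen here...

gc.unfreeze()  # for this demo only, put the objects back
print(gc.get_freeze_count())
```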


> libc = ctypes.CDLL("libc.so.6")

glibc SONAME is different on some architectures.

You can use ctypes.CDLL(None) instead, which should work everywhere.
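A sketch of that portable variant; `CDLL(None)` resolves symbols from the already-loaded process image, so no SONAME is hard-coded (note `malloc_trim` itself is glibc-specific and absent on e.g. musl or macOS, hence the guard):

```python
import ctypes

libc = ctypes.CDLL(None)  # symbols of the running process, whatever libc it uses
released = None
if hasattr(libc, "malloc_trim"):
    released = libc.malloc_trim(0)  # per glibc docs: 1 if memory was released, else 0
print(released)
```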


if your api surface supports it, a simple and effective solution is to restart the process


It would also make sense to compress large strings in memory as memory gets tight.
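For repetitive text (logs, serialized records) the stdlib makes this cheap; a minimal sketch with `zlib`:

```python
import zlib

big = "repetitive log line\n" * 50_000            # ~1 MB of repetitive text
blob = zlib.compress(big.encode(), level=6)       # keep only the compressed bytes
ratio = len(blob) / len(big.encode())
print(ratio)  # highly repetitive text compresses to a small fraction

restored = zlib.decompress(blob).decode()         # pay CPU only on access
```

The trade-off is CPU on every access, so it only pays off for strings that are large, rarely touched, and compressible.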



