A while ago I’ve read a nice story how Instagram disabled garbage collection for their Python apps and memory usage dropped and performance improved by 15%. This seems counter-intuitive at first but uncovers amazing details about Python (namely, CPython) memory management.
Instagram is a Python/Django app that is running on uWSGI.
To run a Python app uWSGI master process forks and launch apps in a child process. This should’ve been leveraging the Copy-on-Write (CoW) mechanism in Linux - memory is shared among the processes as long as it’s not modified. And shared memory is good because it doesn’t waste the RAM (because it’s shared) and it improves cache hit ratio because multiple processes read the same memory. Apps that are launched by uWSGI are mostly identical because it’s the same code and so there should be a lot of memory shared between uWSGI master and child processes. But, instead, shared memory was dropping at the start of the process.
At first, they thought that it was because of reference counting because every read of an object, including immutable ones like code objects, causes write to the memory for that reference counters. But disabling reference counting didn’t prove that, so they went for profiling!
With the help of perf, they found out that it was
the garbage collector that caused most of the page faults - the collect
function.
So they decided to disable garbage collector because there is a reference
counting that will still be used to free the memory. CPython provides a gc
module that allows you to control
garbage collection. Instagram guys found that it’s better to use
gc.set_threshold(0)
instead of gc.disable()
because some library (like
msgpack in their case) can reenable it back, but gc.set_threshold(0)
is
setting the collection frequency to zero effectively disabling it and also it’s
immune to any subsequent gc.enable()
calls.
This worked but the garbage collection was triggered at the exit of the child process and thrashed CPU for the whole minute which is useless because the process was about to be replaced by the new one. This can be dismissed in 2 ways:
atexit.register(os._exit, 0)
. This tells that at the exit of your
Python program just hard exit the process without further cleanup.--skip-atexit-teardown
option in the recent uWSGI.With all these hacks the next things now happen:
What I’ve discovered from this story is that CPython has an interesting scheme for automatic memory management – it uses reference counting to release the memory that is no longer used and tracing generational garbage collector to fight cyclic objects.
So this is how reference counting works. Each object in Python has reference
counter (ob_refcnt
in the PyObject
struct) - a special variable that is
incremented when the object is referenced (e.g. added to the list or passed to
the function) and decremented when it’s released. When the ref counter value is
decremented to zero it’s released by the runtime.
Reference counting is a very nice and simple method for automatic memory management. It’s deterministic and avoids any background processing which makes it more efficient on the low power systems such as mobile devices.
But, unfortunately, it has some really bad flaws. First, it adds overhead for storing reference counter in every single object. Second, for multithreaded apps ref counting has to be atomic and thus must be synchronized between CPU cores which is slow. And finally, the references can form cycles which prevent counters from decrementing and such cyclic objects remains allocated forever.
Anyway, CPython uses reference counting as the main method for memory management.
As for the drawbacks is not that scary in most cases. Memory overhead for
storing ref counters is not really noticeable - even for million objects, it
would be only 8 MiB (ref counter is ssize_t
which is 8 bytes). Synchronization
for ref counting is not applicable because CPython has Global Interpreter Lock
(GIL).
The only problem left is fighting cycles. That’s why CPython periodically invokes tracing garbage collector. CPython’s GC is generational, i.e. it has 3 generations - 0, 1 and 2, where 0 is the youngest generation where all objects are born and 2 is the oldest generation where objects live until the process exits. Objects that are survived GC get moved to the next generation.
The idea of dividing the objects into generations is based on the heuristic that most of the objects that are allocated are short lived and so GC should try to free these objects more frequently than longer lived objects that are usually live forever.
All of these might seem complicated but I think it’s good tradeoff for CPython to employ such scheme. Some might say - why not leave only GC like most of the languages do? Well, GC has its own drawbacks. First, it must run in the background which in CPython not really possible because of GIL, so GC is a stop-the-world process. And second, because GC happens in the background, the exact time frame for object releases is undetermined.
So I think for CPython it’s a good balance to use ref counting and GC to complement each other.
In the end, CPython is not the only language/runtime that is using reference counting. Objective-C, Swift has compile time automatic reference counting (ARC). Remember that ref counting is more deterministic, so it is a huge win for iOS devices.
Rust also uses reference counting
C++ has smart pointers which basically are objects with reference counters, which are destructed by C++ runtime.
Many others languages like Perl and PHP also uses reference counting for memory management.
But, yeah, most of the languages now are based on pure GC:
CPython has an interesting scheme for managing memory - objects lifetime are managed by reference counting and to fight cycles it employs tracing garbage collector.