The last time I was wondering on how to restrict program memory
consumption. As a trophy from my previous journey I've got
libmemrestrict - a library that implements wrappers over
allocation functions like
malloc to account memory usage - and
ptrace-restrict - ptrace-based tool that intercepts
mmap system calls to do the same thing.
But wait, why bother trying to work in memory restricted environment? Is that really such a common problem? Can you remember the last time when OOM killer shot your application? Do you always think about memory consumption when you're programming. I believe you don't, because memory is cheap and if your application is starving - just add a few gigs of RAM!
However, you can't add more and more RAM infinitely and it's not because you don't have infinite amount of memory. Today's software has to deal with things like Big Data, where you just can't fit your input into array. You need to distribute data between RAM, storage and network. You need an algorithms or at least techniques to handle data in that way.
So here I am, trying to get my hands dirty on such problems, though in somewhat trivial fashion, with popular question - how to sort 1 million integers (4 MiB of data) in 2 MiB of RAM? This could be generalized to the problem of sorting data that doesn't fit in RAM and this is what I will try solve here.
Program should output sorted array on stdout in text format.
Program should measure it's working time and output it to stderr1.
Program should work with amount of memory at least two times smaller than file
size (e.g. 2MiB of RAM for 4MiB file). To check this, we will use
Despite having such nice restriction tools, we will not use them for some
methods. For example, for
mmap it will be pointless - we need to
restrict physical RAM usage, not virtual (see details in corresponding
Programs will be tested to solve original problem (sort 4 MiB in 2 MiB). Also, I will run it in virtual machine with 128 MiB of RAM to sort 500 MB (125 millions of 4 bytes) of integers.
Let's try to sort it in naive way to get a baseline and see how much RAM we will
use. Here is an implementation that simply reads whole file into
array of integers and invokes
qsort on it (as a bonus you can find there a
simple dynamic array implementation via
We'll try to sort 4MB of data with this program. With no restrictions it works:
$ ./naive 4M.bin > /dev/null 4000000 bytes sorted in 0.323146 seconds
but it's not interesting. When we try to restrict it in 2 MiB we'll get segmentation fault.
$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted Segmentation fault
However, if we increase2 limit to 4 MiB to hold all the data we will still fail.
$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted Segmentation fault
Apparently, something is trying to get even more memory and, obviously, it's
qsort. Let's see how much memory it wants with help of Valgrind's massif:
$ valgrind --tool=massif ./naive ints $ ms_print massif.out.10676
Here is a nice picture:
MB 8.819^ :::::::::::::::::::::::::::# | : # | : # | : # | : # | : # | : # | : # | : # | :::::::@ #:::::::::::::::::::::::: | : @ # | : @ # | : @ # | : @ # | : @ # | @@@@@@: @ # | @ : @ # | @ : @ # | :::@ : @ # | ::: @ : @ # 0 +----------------------------------------------------------------------->Gi 0 1.721
You may see a few doubling allocations up to 4 MiB - that's my dynamic array and
then four more MiB for
qsort. Here are some stats:
-------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 21 173,222,581 5,247,504 4,000,568 1,246,936 0 22 173,223,802 5,246,920 4,000,000 1,246,920 0 23 173,226,655 5,247,504 4,000,568 1,246,936 0 24 173,229,202 5,246,920 4,000,000 1,246,920 0 25 173,229,311 9,246,928 8,000,000 1,246,928 0 26 869,058,772 9,246,928 8,000,000 1,246,928 0 86.52% (8,000,000B) (heap allocation functions) malloc/new/new, --alloc-fns, etc. ->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive) | ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive) | ->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so) | ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive) | ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
4 million (useful) bytes used by me and 4 more million bytes is used by
qsort_r. There is also 1.2 MB extra-heap used by massif.
Looks like in this case
qsort behaves as O(n) for space complexity.3
Ok, is it able to sort 500 MB in 128 MiB of RAM?
$ ./naive 500M.bin > /dev/null Segmentation fault
Of course, not. As for performance:
$ ./naive 4M.bin > /dev/null 4000000 bytes sorted in 0.322712 seconds
So, naive sorting works well when not restricted, fails when restricted and
qsort for some strange reason requires O(n) memory4.
mmap is a nice and hacky way to sort large amount of data in small amount
of RAM. By using
mmap you shift the burden of balancing data between RAM and
swap space to the operating system kernel.
This is how it works:
mmapwhole file with data into memory.
mmapyou get the pointer to your data.
And that's it! From now on you won't exceed physical memory limits even though you are sorting file much larger than your RAM. To understand why is it working, you need a little knowledge about memory management in operating systems.
Every program represented by a process has its own private and isolated from other processes virtual address space. Length of address space is bound by CPU address bus width, e.g. for 32 bit CPU it's 232 which is 4 GiB.
All allocations that happen in process via
new or whatever else is
a virtual memory allocations. That allocated virtual memory are mapped to
physical memory by kernel memory management subsystem. Usually this is done in
lazy mode. It means that whenever process asks for some amount of memory kernel
will satisfy this request immediately, but will not do actual allocation - in
this case we say that virtual memory page is not mapped (to physical page
frame). Whenever such unmapped page is accessed (for example it's written)
MMU will generate the
"page fault" exception that kernel will handle by mapping virtual memory page to
a physical page frame. From now on page is mapped and writing by virtual
address within that page will be translated by MMU to physical address in RAM.
On the other hand, if you remember that virtual address space is bound only by CPU address bus size, you might get into situation where program can allocate much more memory than available in RAM. For example, your 32 bit system has only 256 MiB of RAM, but process can allocate and use 1 GiB of memory. In this case, memory pages can't be held in RAM and going to be swapped, i.e. evicted from RAM to backing storage like hard drive. Whenever process requests that swapped page, kernel will fetch it from disk and load into RAM (possibly by replacing some other page).
As you can see kernel can do a pretty good job in distributing data between RAM
and disk, so it's very natural to exploit this facility in our task. When we do
mmap of our file, kernel will reserve range of virtual addresses for our file
that won't be mapped. Whenever we will try to access them by changing bytes,
moving array pivot or anything else, kernel will fetch data from input file on
disk into the RAM. Whenever we will exhaust physical memory, kernel will
evict some pages to swap space. That way we will be able to balance our data
between original file on disk, RAM and swap space.
The only restrictions in that method is virtual address space, which is not much a restriction (4 GiB for 32bit system and 256 TiB5 for 64 bit system), and swap space that can be huge because hard drives are cheap today.
Also, note that because
mmap load whole file into virtual memory we can't
ptrace-restrict because they account for virtual
memory itself. If we try to restrict sorting 100M of data in 10M of virtual
mmap program will fail:
$ qemu-x86_64 -R 10M ./mmaped 100M.bin mmap stack: Cannot allocate memory
That's not a surprise because
mmap loads whole file in virtual memory and
then kernel distribute it between actual physical memory and swap. So, we need at
least 100M of virtual memory (plus some extra space for
qemu itself) to map
file into memory.
That's a why for this sorting method I will use virtual machine with 128 MiB of memory. Here is my mmap sorting program. And, you know what? It works like a charm!
$ free -m total used free shared buffers cached Mem: 119 42 76 0 4 16 -/+ buffers/cache: 21 97 Swap: 382 0 382 $ ll -h 500M.bin -rw-r--r-- 1 root root 477M Feb 3 05:39 500M.bin $ ./mmaped 500M.bin > /dev/null 500000000 bytes sorted in 32.250000 seconds
Here is the
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 3167 root 20 0 480m 90m 90m R 84.6 76.4 1:18.65 mmaped
As you can see we use 500 MB of virtual memory6 while actual resident memory is only 90 MiB.
If we look at more detailed
vmstat output while sorting 500 MB we'll see how
kernel is balancing between swap, disk cache, buffers and free memory:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- r b swpd free buff cache si so bi bo in cs us sy id wa 0 0 0 77776 2120 15104 1 27 710 971 24 34 3 1 95 1 1 1 0 2060 488 90068 1 27 785 1057 25 37 3 1 95 1 1 0 928 3400 60 89744 1 27 799 1092 25 38 3 1 94 1 0 2 1908 1928 212 92040 1 27 852 1174 26 40 4 1 94 1 0 2 3436 2360 280 93056 1 27 911 1282 28 42 4 1 94 2 0 0 5272 3688 196 94688 1 27 1066 1471 31 48 4 1 93 2 0 0 5272 3720 204 94700 1 27 1064 1469 31 48 4 1 93 2
In the beginning we had ~70 MiB of free memory, empty swap and some memory
allocated for I/O buffers and disk cache. Then, we did a
mmap and all that
memory had gone to disk cache. When the free memory were exhausted, kernel had
started to use swap space and we can see it's increasing along with increasing
I/O load. And we end up in situation where almost all of the memory is
dedicated to disk cache, though it's OK because disk cache pages are first
victims to steal from when we need memory for application.
So, sorting with help of
mmap is a neat hack that requires simple understanding
of memory management, but gives you quick and easy solution to handle large
amount of data in small amount of RAM.
All right, suppose you can't use
mmap, for example, you want to sort 5 GiB file
on 32-bit system. What would you do?
There is a well-known and popular way to accomplish this named external merge sort. Motivation is simple - if you don't have enough memory to hold your data you just use some external storage like hard disk. Of course, you have to work with your data in piece by piece fashion because you still have only small amount of main memory.
External merge sort works like this:
filesize/buffersizechunks on your storage.
buffersize/#chunkspieces from each chunk to merge them in buffer and output to result file.
I made some trivial, absolutely-not-optimized-in-any-sense implementation that simply works:
$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null 4194304 bytes sorted in 0.383171 seconds
We've sorted in 2 MiB of memory using 1 MiB buffer.
Now let's sort 500MB. First, disable swap -- we're handling chunks swapping by hand:
$ swapoff /dev/vda5
$ echo 3 > /proc/sys/vm/drop_caches
Check free memory:
$ free -m total used free shared buffers cached Mem: 119 28 90 0 0 6 -/+ buffers/cache: 21 97 Swap: 0 0 0
We'll use 50M as a buffer -- 10 times smaller than file size.
$ ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 120.115180 seconds
Holy crap, two minutes! Why is that? Well, the main killer of performance here is
I/O. There are 3 things that cause lots of I/O and slow things down. On the
split phase we're sequentially reading file slipping it through a small buffer.
On the merge phase we're constantly opening and closing chunk files. And last
but not least is output -- on merge phase we output whole buffer (50 MB, 12.5
millions of integer) to stdout that, despite redirecting it to
creating a heavy load. We may turn it off. All in all, in
mmaped we output
everything in a single pass in the end of program and doesn't account it in
performance counters. So if we turn off output we'll run in ~90 seconds - 25%
$ ./external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 87.140000 seconds
As for memory consumption -- it's fine. If we try to run it under massif we'll see that peak consumption is our buffer size plus some extra heap.
-------------------------------------------------------------------------------- Command: ./external-merge 500M.bin 50000000 Massif arguments: (none) ms_print arguments: massif.out.17423 -------------------------------------------------------------------------------- MB 47.75^ ::::: |#::::::@:::::::::::@:::::::::@:::@::::@::::@::::::::@::::@::::@:::@ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ |# : : @ : : : : @ : : @ @ @ @ : @ @ @ @ 0 +----------------------------------------------------------------------->Gi 0 332.5 Number of snapshots: 98 Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94] -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 119,690 584 568 16 0 2 123,141 50,004,496 50,000,568 3,928 0 3 4,814,014 50,005,080 50,001,136 3,944 0 4 4,817,234 50,005,080 50,001,136 3,944 0 99.99% (50,001,136B) (heap allocation functions) malloc/new/new, --alloc-fns, etc. ->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge) | ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge) | ->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)
If we restrict memory to 50 MB7 we'll see that it works:
$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 87.900000 seconds
Ok, memory consumption is fine, but performance is not that good. Recall that
mmaped has done this in 32 seconds. Let's see how we can improve our 90
Let's profile this nice program with gprof. Build instrumented binary
$ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof
And invoke this program multiple times accumulating statistics. To do so we'll use nice script from my article on gprof. Here is the output:
% cumulative self self total time seconds seconds calls Ts/call Ts/call name 81.98 432.87 432.87 compar 18.17 528.82 95.95 print_arr 0.00 528.82 0.00 1100 0.00 0.00 form_filename 0.00 528.82 0.00 100 0.00 0.00 merge 0.00 528.82 0.00 100 0.00 0.00 save_buf 0.00 528.82 0.00 10 0.00 0.00 external_merge_sort 0.00 528.82 0.00 10 0.00 0.00 split
Most of the time we have spent in sorting and printing. But also don't forget that gprof won't show you time spent in syscalls and I/O.
I can think of 2 things to improve here:
So, generic external merge sort is quite simple idea to sort a bunch of data in small RAM, but usually without tuning and improving it's slow.
One of the things that we can try is to use multiple threads on split and merge phases of our external sort. However, in this case, it's not a really great idea.
Using multithreading on split phase doesn't make sense because there is a single buffer that can hold the data. But we may try to advise kernel on how we'll read data. There are 2 functions for that:
readahead, though it's Linux specific.
These functions will inform memory management subsystem on how we'll read the data. In our case we read it sequentially and thus it would be nice to see file content in page cache.
On a merge phase we could avoid constant
close of chunk files by
creating dedicated thread for each of the chunks. Each thread will keep its
file open, and will fill buffer at thread offset. When buffer is filled it will
be sorted8 and output. Also, we will employ
readahead for each
Here is tuned and multithreaded version of external merge sort.
OK, so as I said, multithreading here is not a great idea. All these threads things are nice and dandy, but for single core CPU it's not showing any effect:
$ ./mt-ext-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 117.380000 seconds
It is the version with printing. And here is the version without printing:
$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 91.040000 seconds
You may think that it's because we have hard times scheduling dozen of threads on a single CPU. Allright, let's compare it with other methods on quad-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):
$ ./naive 500M.bin > /dev/null 500000000 bytes sorted in 23.040499 seconds $ ./mmaped 500M.bin > /dev/null 500000000 bytes sorted in 23.542076 seconds $ ./external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 39.228695 seconds $ ./mt-external-merge 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 41.062793 seconds $ ./external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 28.893745 seconds $ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null 500000000 bytes sorted in 28.368976 seconds
Now with a 100 chunks (100 threads):
$ ./external-merge-no-output 500M.bin 5000000 > /dev/null 500000000 bytes sorted in 27.107728 seconds $ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null 500000000 bytes sorted in 28.558468 seconds
You can see no changes in performance between
mt-external-merge. Why is that? Because for most cases multithreading is not
a solution for I/O bound
applications. Still, there are situations when spawning some threads will
speed up things:
Examples here are some graphic application that renders image in independent areas, or scientific application that makes some heavy calculations for multiple results.
In multithreaded external sort threads are dependant - you have to wait for main thread to sort buffer before you can do further reading from the chunk. Also, reading for slice is done much faster that sorting whole buffer, so most of the time threads are waiting for main thread to finish.
That's why multithreading won't help us, so we need to look for other ways to improve.
Now, let's try to use some sorting algorithm that would perform better than QuickSort assuming we know distribution of data. Like we know that we're sorting integers, so why not play around this? There are few sorting algorithms designed specifically for special type of data can fall in either of 2 groups:
Such algorithms are known to work better than O(n log(n)) - a lower bound for comparison based sort like QuickSort. Of course, not every algorithm will work for us because we have memory restriction. For example, radix sort and other kinds of bucket sorting won't help us because it requires additional memory for buckets, though there are implementations of in-place Radix sort. Anyway, such implementations require reading whole data set multiple times for each radix - 32 times for binary-radix sort and that's way too much. Also, deep recursion that arise from MSD radix sort is a memory consumption itself. That's why I came up to using counting sort.
If we know that our data are big, but it's range is small, then we can use counting sort. The idea is dead simple - instead of holding data in memory we will hold array of counters. We'll read our data sequentially and then increment relevant counter. The most important is that counting sort time complexity is linear and space complexity is proportional to range that is usually small.
A simple implementation works with consecutive range of integers from 0 to some
N. There is an array size of range, where integers correspond to array
indices. Here is my implementation on github performing
really well without any tuning9:
$ ./counting-array 500M-range.bin 1000000 > /dev/null Range is 1000000 500000000 bytes sorted in 3.240000 seconds
Yep, half gig of data sorted in 3 and a half seconds on 128 MiB machine with
single CPU. Compare it with
$ ./mmaped 500M-range.bin > /dev/null 500000000 bytes sorted in 76.150000 seconds
23 times faster!
But anyway you should be aware of restrictions that counting sort implies: only integers (or equivalent) from small and consecutive range. I've also tried to develop counting sort for non-consecutive range with hash tables and binary search trees - here is the code. However, its performance is pretty bad and, unfortunately, I still can't explain this.
Anyway, we can go even further and assume that numbers in our range is unique. Then, counter for value might be only in 2 states - present or not, so we could use a single bit for counter. This will enormously compact our array, although in that case we don't even need any array. We could store each number as a single bit, thus transforming our data into a bit vector - while reading a file, set Nth bit if there is an integer N in file. After bit vector is formed, read it and write to output file numbers that correspond to bits that are set.
Bit vector solution requires even more attention because, despite its seem compactness, you might violate your restrictions, for example to sort array of numbers from whole integers range (232) you would need a 1 bit for every integer, which is 4294967296 bits = 536870912 bytes = 512 MiB. In my case I have only 128 MiB, so it won't work for me. But there are cases where you will benefit like a boss - read a great story from "Programming Pearls" by Jon Bentley.
A moral here is "knowing your data is extremely useful".
For the last 5 months of writing this article I did a lot of things - I've implemented a dozen of programs, come up with a few good ideas, failed with many more and still I have things to try and/or fix. Now it's time for conclusions.
The simple problem of sorting data in small memory revealed a whole class of peculiarities we don't usually think of:
As for sorting here is a result table:
mmap and QuickSort
External merge sort
Multithreaded external merge sort
4 MiB in 2 MiB
500 MB in 128 MiB
The bottom line is "Know your data and develop a simple algorithm for it".
Thank you for your attention.
Sorting a million 32-bit integers in 2MB of RAM using Python - simple
external sort in pythonic way with generators,
array module and
Also, has nice explanation of buffering advantages.
External sorting of large datasets - External sort implementation by Andrew Wang with performance measurements.
Efficiently sorting datasets bigger than memory - Nice reddit thread regarding Andrew Wang's article showing some different methods along with pros and cons.
Cracking the Oyster (Column 1 of Programming Pearls) - Amazingly written story on doing large sorting. Discussed the bit vector idea in great details.
Multithreaded File I/O - How multithreaded I/O behaves on single disk and RAID5 array.
Algorithm Improvement through Performance Measurement: Part 2 presents in-place radix sort implementation.
Parallel sorting algorithms - comprehensive list of resources regarding sorting on multinode computer systems.
timeutility because it will include time for reading file into memory and time for outputting to console. [return]
LD_LIBRARY_PATHare well-known environment variables for ld-linux.so, and there is also less known, but extremely useful
LD_DEBUGenvironment variable for linkage debugging. [return]