Intro

The last time I was wondering on how to restrict program memory consumption. As a trophy from my previous journey I’ve got libmemrestrict — a library that implements wrappers over allocation functions like malloc to account memory usage — and ptrace-restrict — ptrace-based tool that intercepts brk, sbrk and mmap system calls to do the same thing.

But wait, why bother trying to work in memory restricted environment? Is that really such a common problem? Can you remember the last time when OOM killer shot your application? Do you always think about memory consumption when you’re programming. I believe you don’t, because memory is cheap and if your application is starving — just add a few gigs of RAM!

However, you can’t add more and more RAM infinitely and it’s not because you don’t have infinite amount of memory. Today’s software has to deal with things like Big Data, where you just can’t fit your input into array. You need to distribute data between RAM, storage and network. You need an algorithms or at least techniques to handle data in that way.

So here I am, trying to get my hands dirty on such problems, though in somewhat trivial fashion, with popular question — how to sort 1 million integers (4 MiB of data) in 2 MiB of RAM? This could be generalized to the problem of sorting data that doesn’t fit in RAM and this is what I will try solve here.

Agenda

We need to write a program(s) that will sort an integers from file. To generate such type of file I’ve written simple utilities randints and rangeints

Program should output sorted array on stdout in text format.

Program should measure it’s working time and output it to stderr1.

Program should work with amount of memory at least two times smaller than file size (e.g. 2MiB of RAM for 4MiB file). To check this, we will use libmemrestrict or ptrace-restrict.

Despite having such nice restriction tools, we will not use them for some methods. For example, for mmap it will be pointless — we need to restrict physical RAM usage, not virtual (see details in corresponding section).

Programs will be tested to solve original problem (sort 4 MiB in 2 MiB). Also, I will run it in virtual machine with 128 MiB of RAM to sort 500 MB (125 millions of 4 bytes) of integers.

Naive way

Let’s try to sort it in naive way to get a baseline and see how much RAM we will use. Here is an implementation that simply reads whole file into array of integers and invokes qsort on it (as a bonus you can find there a simple dynamic array implementation via realloc).

We’ll try to sort 4MB of data with this program. With no restrictions it works:

$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.323146 seconds

but it’s not interesting. When we try to restrict it in 2 MiB we’ll get segmentation fault.

$ LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted 
Segmentation fault

However, if we increase2 limit to 4 MiB to hold all the data we will still fail.

$ MR_THRESHOLD=5000000 LD_PRELOAD=./libmemrestrict.so ./naive ints > ints.sorted 
Segmentation fault

Apparently, something is trying to get even more memory and, obviously, it’s qsort. Let’s see how much memory it wants with help of Valgrind’s massif:

$ valgrind --tool=massif ./naive ints 
$ ms_print massif.out.10676

Here is a nice picture:

    MB
    8.819^               :::::::::::::::::::::::::::#                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |                   :                          #                         
     |            :::::::@                          #:::::::::::::::::::::::: 
     |            :      @                          #                         
     |            :      @                          #                         
     |            :      @                          #                         
     |            :      @                          #                         
     |            :      @                          #                         
     |      @@@@@@:      @                          #                         
     |      @     :      @                          #                         
     |      @     :      @                          #                         
     |   :::@     :      @                          #                         
     | :::  @     :      @                          #                         
       0 +----------------------------------------------------------------------->Gi
     0                                                                   1.721

You may see a few doubling allocations up to 4 MiB — that’s my dynamic array and then four more MiB for qsort. Here are some stats:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 21    173,222,581        5,247,504        4,000,568     1,246,936            0
 22    173,223,802        5,246,920        4,000,000     1,246,920            0
 23    173,226,655        5,247,504        4,000,568     1,246,936            0
 24    173,229,202        5,246,920        4,000,000     1,246,920            0
 25    173,229,311        9,246,928        8,000,000     1,246,928            0
 26    869,058,772        9,246,928        8,000,000     1,246,928            0
86.52% (8,000,000B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.26% (4,000,000B) 0x400A26: readfile (in /home/avd/dev/cs/sorting/external/naive)
| ->43.26% (4,000,000B) 0x400ACD: main (in /home/avd/dev/cs/sorting/external/naive)
|   
->43.26% (4,000,000B) 0x35D40383F7: qsort_r (in /usr/lib64/libc-2.18.so)
| ->43.26% (4,000,000B) 0x400B3D: main (in /home/avd/dev/cs/sorting/external/naive)
|   
->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)

4 million (useful) bytes used by me and 4 more million bytes is used by qsort_r. There is also 1.2 MB extra-heap used by massif.

Looks like in this case qsort behaves as O(n) for space complexity.3

Ok, is it able to sort 500 MB in 128 MiB of RAM?

$ ./naive 500M.bin > /dev/null
Segmentation fault

Of course, not. As for performance:

$ ./naive 4M.bin > /dev/null
4000000 bytes sorted in 0.322712 seconds

So, naive sorting works well when not restricted, fails when restricted and qsort for some strange reason requires O(n) memory4.

mmap’ed file

Using mmap is a nice and hacky way to sort large amount of data in small amount of RAM. By using mmap you shift the burden of balancing data between RAM and swap space to the operating system kernel.

This is how it works:

  1. You mmap whole file with data into memory.
  2. From mmap you get the pointer to your data.
  3. You invoke sorting algorithm on data under mmaped pointer.

And that’s it! From now on you won’t exceed physical memory limits even though you are sorting file much larger than your RAM. To understand why is it working, you need a little knowledge about memory management in operating systems.

Every program represented by a process has its own private and isolated from other processes virtual address space. Length of address space is bound by CPU address bus width, e.g. for 32 bit CPU it’s 232 which is 4 GiB.

All allocations that happen in process via malloc, new or whatever else is a virtual memory allocations. That allocated virtual memory are mapped to physical memory by kernel memory management subsystem. Usually this is done in lazy mode. It means that whenever process asks for some amount of memory kernel will satisfy this request immediately, but will not do actual allocation — in this case we say that virtual memory page is not mapped (to physical page frame). Whenever such unmapped page is accessed (for example it’s written) MMU will generate the «page fault» exception that kernel will handle by mapping virtual memory page to a physical page frame. From now on page is mapped and writing by virtual address within that page will be translated by MMU to physical address in RAM.

On the other hand, if you remember that virtual address space is bound only by CPU address bus size, you might get into situation where program can allocate much more memory than available in RAM. For example, your 32 bit system has only 256 MiB of RAM, but process can allocate and use 1 GiB of memory. In this case, memory pages can’t be held in RAM and going to be swapped, i.e. evicted from RAM to backing storage like hard drive. Whenever process requests that swapped page, kernel will fetch it from disk and load into RAM (possibly by replacing some other page).

As you can see kernel can do a pretty good job in distributing data between RAM and disk, so it’s very natural to exploit this facility in our task. When we do mmap of our file, kernel will reserve range of virtual addresses for our file that won’t be mapped. Whenever we will try to access them by changing bytes, moving array pivot or anything else, kernel will fetch data from input file on disk into the RAM. Whenever we will exhaust physical memory, kernel will evict some pages to swap space. That way we will be able to balance our data between original file on disk, RAM and swap space.

The only restrictions in that method is virtual address space, which is not much a restriction (4 GiB for 32bit system and 256 TiB5 for 64 bit system), and swap space that can be huge because hard drives are cheap today.

Also, note that because mmap load whole file into virtual memory we can’t use libmemrestrict or ptrace-restrict because they account for virtual memory itself. If we try to restrict sorting 100M of data in 10M of virtual memory, mmap program will fail:

$ qemu-x86_64 -R 10M ./mmaped 100M.bin 
mmap stack: Cannot allocate memory

That’s not a surprise because mmap loads whole file in virtual memory and then kernel distribute it between actual physical memory and swap. So, we need at least 100M of virtual memory (plus some extra space for qemu itself) to map file into memory.

That’s a why for this sorting method I will use virtual machine with 128 MiB of memory. Here is my mmap sorting program. And, you know what? It works like a charm!

$ free -m
             total       used       free     shared    buffers     cached
Mem:           119         42         76          0          4         16
-/+ buffers/cache:         21         97
Swap:          382          0        382

$ ll -h 500M.bin
-rw-r--r-- 1 root root 477M Feb  3 05:39 500M.bin

$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 32.250000 seconds

Here is the top info:

PID  USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
3167 root      20   0  480m  90m  90m R  84.6 76.4   1:18.65 mmaped 

As you can see we use 500 MB of virtual memory6 while actual resident memory is only 90 MiB.

If we look at more detailed vmstat output while sorting 500 MB we’ll see how kernel is balancing between swap, disk cache, buffers and free memory:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0  77776   2120  15104    1   27   710   971   24   34  3  1 95  1
 1  1      0   2060    488  90068    1   27   785  1057   25   37  3  1 95  1
 1  0    928   3400     60  89744    1   27   799  1092   25   38  3  1 94  1
 0  2   1908   1928    212  92040    1   27   852  1174   26   40  4  1 94  1
 0  2   3436   2360    280  93056    1   27   911  1282   28   42  4  1 94  2
 0  0   5272   3688    196  94688    1   27  1066  1471   31   48  4  1 93  2
 0  0   5272   3720    204  94700    1   27  1064  1469   31   48  4  1 93  2

In the beginning we had ~70 MiB of free memory, empty swap and some memory allocated for I/O buffers and disk cache. Then, we did a mmap and all that memory had gone to disk cache. When the free memory were exhausted, kernel had started to use swap space and we can see it’s increasing along with increasing I/O load. And we end up in situation where almost all of the memory is dedicated to disk cache, though it’s OK because disk cache pages are first victims to steal from when we need memory for application.

So, sorting with help of mmap is a neat hack that requires simple understanding of memory management, but gives you quick and easy solution to handle large amount of data in small amount of RAM.

Generic external sort

All right, suppose you can’t use mmap, for example, you want to sort 5 GiB file on 32-bit system. What would you do?

There is a well-known and popular way to accomplish this named external merge sort. Motivation is simple — if you don’t have enough memory to hold your data you just use some external storage like hard disk. Of course, you have to work with your data in piece by piece fashion because you still have only small amount of main memory.

External merge sort works like this:

  1. You split your data to chunks of available memory buffer size.
  2. Each chunk is sorted in memory buffer and written to external storage.
  3. Now you have filesize/buffersize chunks on your storage.
  4. Finally, you read buffersize/#chunks pieces from each chunk to merge them in buffer and output to result file.

I made some trivial, absolutely-not-optimized-in-any-sense implementation that simply works:

$ LD_PRELOAD=./libmemrestrict.so ./external-merge 4M.bin 1048576 > /dev/null
4194304 bytes sorted in 0.383171 seconds

We’ve sorted in 2 MiB of memory using 1 MiB buffer.

Now let’s sort 500MB. First, disable swap — we’re handling chunks swapping by hand:

$ swapoff /dev/vda5

Drop caches:

$ echo 3 > /proc/sys/vm/drop_caches

Check free memory:

$ free -m
             total       used       free     shared    buffers     cached
Mem:           119         28         90          0          0          6
-/+ buffers/cache:         21         97
Swap:            0          0          0

We’ll use 50M as a buffer — 10 times smaller than file size.

$ ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 120.115180 seconds

Holy crap, two minutes! Why is that? Well, the main killer of performance here is I/O. There are 3 things that cause lots of I/O and slow things down. On the split phase we’re sequentially reading file slipping it through a small buffer. On the merge phase we’re constantly opening and closing chunk files. And last but not least is output — on merge phase we output whole buffer (50 MB, 12.5 millions of integer) to stdout that, despite redirecting it to /dev/null, creating a heavy load. We may turn it off. All in all, in mmaped we output everything in a single pass in the end of program and doesn’t account it in performance counters. So if we turn off output we’ll run in ~90 seconds — 25% faster.

$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.140000 seconds

As for memory consumption — it’s fine. If we try to run it under massif we’ll see that peak consumption is our buffer size plus some extra heap.

--------------------------------------------------------------------------------
Command:            ./external-merge 500M.bin 50000000
Massif arguments:   (none)
ms_print arguments: massif.out.17423
--------------------------------------------------------------------------------


    MB
47.75^                                                                  ::::: 
     |#::::::@:::::::::::@:::::::::@:::@::::@::::@::::::::@::::@::::@:::@     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
     |# : :  @ :  : :  : @  : :    @   @    @    @   :    @    @    @   @     
   0 +----------------------------------------------------------------------->Gi
     0                                                                   332.5

Number of snapshots: 98
 Detailed snapshots: [4 (peak), 10, 20, 29, 32, 35, 38, 45, 48, 54, 64, 74, 84, 94]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1        119,690              584              568            16            0
  2        123,141       50,004,496       50,000,568         3,928            0
  3      4,814,014       50,005,080       50,001,136         3,944            0
  4      4,817,234       50,005,080       50,001,136         3,944            0
99.99% (50,001,136B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->99.99% (50,000,000B) 0x400FA2: external_merge_sort (in /root/external-merge)
| ->99.99% (50,000,000B) 0x40128E: main (in /root/external-merge)
|   
->00.00% (1,136B) in 1+ places, all below ms_print's threshold (01.00%)

If we restrict memory to 50 MB7 we’ll see that it works:

$ LD_PRELOAD=./libmemrestrict.so MR_THRESHOLD=51000000 ./external-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 87.900000 seconds

Ok, memory consumption is fine, but performance is not that good. Recall that mmaped has done this in 32 seconds. Let’s see how we can improve our 90 seconds.

Lets profile this nice program with gprof. Build instrumented binary

$ gcc -pg -g -Wall -Wextra external-merge.c -o external-merge-gprof

And invoke this program multiple times accumulating statistics. To do so we’ll use nice script from my article on gprof. Here is the output:

%   cumulative   self              self     total           
time   seconds   seconds    calls  Ts/call  Ts/call  name    
81.98    432.87   432.87                             compar
18.17    528.82    95.95                             print_arr
0.00    528.82     0.00     1100     0.00     0.00  form_filename
0.00    528.82     0.00      100     0.00     0.00  merge
0.00    528.82     0.00      100     0.00     0.00  save_buf
0.00    528.82     0.00       10     0.00     0.00  external_merge_sort
0.00    528.82     0.00       10     0.00     0.00  split

Most of the time we have spent in sorting and printing. But also don’t forget that gprof won’t show you time spent in syscalls and I/O.

I can think of 2 things to improve here:

  • Tune external sorting with multithreading and I/O tricks
  • Think about different sorting algorithm

So, generic external merge sort is quite simple idea to sort a bunch of data in small RAM, but usually without tuning and improving it’s slow.

Tuning external sort

One of the things that we can try is to use multiple threads on split and merge phases of our external sort. However, in this case, it’s not a really great idea.

Using multithreading on split phase doesn’t make sense because there is a single buffer that can hold the data. But we may try to advise kernel on how we’ll read data. There are 2 functions for that:

  1. readahead, though it’s Linux specific.
  2. posix_fadvise with POSIX_FADV_SEQUENTIAL.

These functions will inform memory management subsystem on how we’ll read the data. In our case we read it sequentially and thus it would be nice to see file content in page cache.

On a merge phase we could avoid constant open/close of chunk files by creating dedicated thread for each of the chunks. Each thread will keep its file open, and will fill buffer at thread offset. When buffer is filled it will be sorted8 and output. Also, we will employ readahead for each thread.

Here is tuned and multithreaded version of external merge sort.

OK, so as I said, multithreading here is not a great idea. All these threads things are nice and dandy, but for single core CPU it’s not showing any effect:

$ ./mt-ext-merge 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 117.380000 seconds

It is the version with printing. And here is the version without printing:

$ ./mt-ext-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 91.040000 seconds

You may think that it’s because we have hard times scheduling dozen of threads on a single CPU. Allright, lets compare it with other methods on quad-core machine (Intel Core i7-3612QM CPU @ 2.10GHz):

$ ./naive 500M.bin > /dev/null 
500000000 bytes sorted in 23.040499 seconds

$ ./mmaped 500M.bin > /dev/null
500000000 bytes sorted in 23.542076 seconds

$ ./external-merge 500M.bin 50000000 > /dev/null 
500000000 bytes sorted in 39.228695 seconds

$ ./mt-external-merge 500M.bin 50000000 > /dev/null 
500000000 bytes sorted in 41.062793 seconds

$ ./external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.893745 seconds

$ ./mt-external-merge-no-output 500M.bin 50000000 > /dev/null
500000000 bytes sorted in 28.368976 seconds

Now with a 100 chunks (100 threads):

$ ./external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 27.107728 seconds

$ ./mt-external-merge-no-output 500M.bin 5000000 > /dev/null
500000000 bytes sorted in 28.558468 seconds

You can see no changes in performance between external-merge and mt-external-merge. Why is that? Because for most cases multithreading is not a solution for I/O bound applications. Still, there are situations when spawning some threads will speed up things:

  • Your execution threads are independent
  • Your I/O resource can work in parallel, e.g. it’s RAID

Examples here are some graphic application that renders image in independent areas, or scientific application that makes some heavy calculations for multiple results.

In multithreaded external sort threads are dependant — you have to wait for main thread to sort buffer before you can do further reading from the chunk. Also, reading for slice is done much faster that sorting whole buffer, so most of the time threads are waiting for main thread to finish.

That’s why multithreading won’t help us, so we need to look for other ways to improve.

Special sorting algorithms

Now, let’s try to use some sorting algorithm that would perform better than QuickSort assuming we know distribution of data. Like we know that we’re sorting integers, so why not play around this? There are few sorting algorithms designed specifically for special type of data can fall in either of 2 groups:

  1. Don’t use compares.
  2. Don’t even require to load input array in memory.

Such algorithms are known to work better than O(n log(n)) — a lower bound for comparison based sort like QuickSort. Of course, not every algorithm will work for us because we have memory restriction. For example, radix sort and other kinds of bucket sorting won’t help us because it requires additional memory for buckets, though there are implementations of in-place Radix sort. Anyway, such implementations require reading whole data set multiple times for each radix — 32 times for binary-radix sort and that’s way too much. Also, deep recursion that arise from MSD radix sort is a memory consumption itself. That’s why I came up to using counting sort.

Counting sort

If we know that our data are big, but it’s range is small, then we can use counting sort. The idea is dead simple — instead of holding data in memory we will hold array of counters. We’ll read our data sequentially and then increment relevant counter. The most important is that counting sort time complexity is linear and space complexity is proportional to range that is usually small.

A simple implementation works with consecutive range of integers from 0 to some N. There is an array size of range, where integers correspond to array indices. Here is my implementation on github performing really well without any tuning9:

$ ./counting-array 500M-range.bin 1000000 > /dev/null
Range is 1000000
500000000 bytes sorted in 3.240000 seconds

Yep, half gig of data sorted in 3 and a half seconds on 128 MiB machine with single CPU. Compare it with qsort on mmap:

$ ./mmaped 500M-range.bin > /dev/null
500000000 bytes sorted in 76.150000 seconds

23 times faster!

But anyway you should be aware of restrictions that counting sort implies: only integers (or equivalent) from small and consecutive range. I’ve also tried to develop counting sort for non-consecutive range with hash tables and binary search trees — here is the code. However, its performance is pretty bad and, unfortunately, I still can’t explain this.

Anyway, we can go even further and assume that numbers in our range is unique. Then, counter for value might be only in 2 states — present or not, so we could use a single bit for counter. This will enormously compact our array, although in that case we don’t even need any array. We could store each number as a single bit, thus transforming our data into a bit vector — while reading a file, set Nth bit if there is an integer N in file. After bit vector is formed, read it and write to output file numbers that correspond to bits that are set.

Bit vector solution requires even more attention because, despite its seem compactness, you might violate your restrictions, for example to sort array of numbers from whole integers range (232) you would need a 1 bit for every integer, which is 4294967296 bits = 536870912 bytes = 512 MiB. In my case I have only 128 MiB, so it won’t work for me. But there are cases where you will benefit like a boss — read a great story from «Programming Pearls» by Jon Bentley.

A moral here is «knowing your data is extremely useful».

Recap

For the last 5 months of writing this article I did a lot of things — I’ve implemented a dozen of programs, come up with a few good ideas, failed with many more and still I have things to try and/or fix. Now it’s time for conclusions.

The simple problem of sorting data in small memory revealed a whole class of peculiarities we don’t usually think of:

  • Common widely used algorithms are not suitable for any problems.
  • Dynamic debugging and profiling are extremely useful and demonstrative.
  • I/O is a bitch, unless you fully rely on kernel.
  • Multithreading is not a silver bullet for performance.
  • Know your data, know your environment.

As for sorting here is a result table:

Test case Naive QuickSort mmap and QuickSort External merge sort Multithreaded external merge sort Counting sort
4 MiB in 2 MiB Segfault N/A 0.38s 0.41s 0.01
500 MB in 128 MiB Segfault 32.25s 87.14s 91.04 3.24

The bottom line is «Know your data and develop a simple algorithm for it».

Thank you for your attention.

References


  1. We can’t simply launch program under time utility because it will include time for reading file into memory and time for outputting to console.

  2. libmemrestrict gets a configuration from environment. Common practice for libraries, for example LD_PRELOAD and LD_LIBRARY_PATH are well-known environment variables for ld-linux.so, and there is also less known, but extremely useful LD_DEBUG environment variable for linkage debugging.

  3. It’s strange because qsort is in-place and uses optimizations suggested by Robert Sedgewick to, among others, guarantee O(log n) space complexity. You can dive into glibc qsort implementation.

  4. There is a strange behaviour though. For example, if I restrict memory for 5.3 MB it will work and not require O(n) memory. I’m still investigating on this.

  5. http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

  6. Note that MiB is 220 and MB is 106 = 1 million. So 500 MB = 500 000 000 bytes which is 500 000 000 >> 20 = 476 MiB.

  7. Plus extra 500 KB for temporary strings holding chunks paths.

  8. In this way it’s kind of N-way merge.

  9. Second argument is a buffer size in elements. Buffering improves performance drastically because it doesn’t read from file by 4 bytes.