Restricting program memory

November 25, 2014

The other day I decided to solve a popular problem: how do you sort 1 million integers in 1 MiB of memory?

But before I even started, I thought: how can I restrict process memory to 1 MiB? Will it work? So, here are the answers.

Process virtual memory

Before diving into the various methods, you have to know how a process’s virtual memory is structured. Hands down, the best article you could ever find about that is Gustavo Duarte’s “Anatomy of a Program in Memory”. His whole blog is a treasure.

After reading Gustavo’s article I can propose 2 possible options for restricting memory: reduce the virtual address space or restrict the heap size.

The first is to limit the whole virtual address space of the process. This is nice and easy but not fully correct. We can’t limit the whole virtual address space of a process to 1 MiB: there would be no room to map the kernel and libraries.

The second is to limit the heap size. This is not so easy, and it seems like nobody tries to do it, because the only reasonable way is to play with the linker. But for limiting available memory to values as small as 1 MiB it would be absolutely correct.

I will also look at other methods: monitoring memory consumption by intercepting the library and system calls related to memory management, and changing the program environment with emulation and sandboxing.

For testing and illustration I will use this little program, big_alloc, that allocates (and frees) roughly 100 MiB.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// 1000 allocations of 100 KiB each = 100,000 KiB, roughly 100 MiB
#define NALLOCS 1000
#define ALLOC_SIZE (1024 * 100) // 100 KiB

int main(int argc, const char *argv[])
{
    int i = 0;
    int **pp;
    bool failed = false;

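    // calloc zeroes the array, so the cleanup loop below never reads an
    // uninitialized pointer if we bail out early.
    pp = calloc(NALLOCS, sizeof(int *));
    if (!pp)
    {
        perror("calloc");
        return 1;
    }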
    for(i = 0; i < NALLOCS; i++)
    {
        pp[i] = malloc(ALLOC_SIZE);
        if (!pp[i])
        {
            perror("malloc");
            printf("Failed after %d allocations\n", i);
            failed = true;
            break;
        }
        // Touch some bytes so the kernel actually backs the pages with memory
        // (untouched pages are only reserved, not committed).
        memset(pp[i], 0xA, 100);
        printf("pp[%d] = %p\n", i, (void *)pp[i]);
    }

    if (!failed)
        printf("Successfully allocated %d bytes\n", NALLOCS * ALLOC_SIZE);

    for(i = 0; i < NALLOCS; i++)
    {
        if (pp[i])
            free(pp[i]);
    }
    free(pp);

    return 0;
}

All the sources are on GitHub.

ulimit

It’s the first thing an old Unix hacker thinks of when asked to limit program memory. ulimit is a shell builtin (here, in bash) that lets you restrict the resources of a program; it is just an interface to setrlimit.

We can set a limit on the resident set size.

$ ulimit -m 1024

Now check:

$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7802
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) 1024
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

We set the memory limit (-m) to 1024 kbytes, i.e. 1 MiB. But when we run our program it doesn’t fail. Even setting the limit to something more reasonable like 30 MiB still lets our program allocate 100 MiB. ulimit simply doesn’t work. Despite setting the resident set size limit to 1024 kbytes, I can see in top that the resident memory of my program is 4872 KiB.

The reason is that Linux doesn’t honor this limit, and man ulimit says so directly:

ulimit [-HSTabcdefilmnpqrstuvx [limit]]
    ...
    -m     The maximum resident set size (many systems do not honor this limit)
    ...

There is also ulimit -d, which the kernel does respect, but the program still manages to allocate because malloc falls back to mmap (see the Linker section).

QEMU

When you want to modify the program environment, QEMU is the natural tool for the job. It has the -R option to limit the virtual address space. But like I said earlier, you can’t restrict the address space to small values: there will be no space to map libc and the kernel.

Look:

$ qemu-i386 -R 1048576 ./big_alloc
big_alloc: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory

Here, -R 1048576 reserves 1 MiB for guest virtual address space.

For the whole virtual address space we have to set something more reasonable like 20 MiB. Look:

$ qemu-i386 -R 20M ./big_alloc
malloc: Cannot allocate memory
Failed after 100 allocations

It successfully fails¹ after 100 allocations (about 10 MiB).

So, QEMU is the first winner in restricting a program’s memory size, though you have to play with the -R value to get the correct limit.

Container

Another option after QEMU is to launch the application in a container and restrict its resources. To do this you have several options:

Use fancy high-level docker.
Use regular usermode tools from lxc package.
Go hardcore and write your own script with libvirt.
Name it…

In the end, though, resources are restricted by the native Linux subsystem called cgroups. You can try to poke it directly, but I suggest using lxc. I would have liked to use docker, but it works only on 64-bit machines and my box is a small Intel Atom netbook, which is i386.

OK, quick info. LXC is LinuX Containers. It’s a collection of userspace tools and libraries for managing the kernel facilities used to create containers: isolated and secure environments for an application or a whole system.

The kernel facilities that provide such an environment are:

Control groups (cgroups)
Kernel namespaces
chroot
Kernel capabilities
SELinux, AppArmor
Seccomp policies

You can find nice documentation on the official site, on the author’s blog and all over the internet.

To simply run an application in a container you have to give lxc-execute a config file that describes your container. Every sane person should start from the examples in /usr/share/doc/lxc/examples. The man pages recommend starting with lxc-macvlan.conf. OK, let’s do this:

# cp /usr/share/doc/lxc/examples/lxc-macvlan.conf lxc-my.conf
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
Successfully allocated 102400000 bytes

It works!

Now let’s limit memory. This is what cgroups are for. LXC allows you to configure the memory subsystem of the container’s cgroup by setting memory limits.

You can find the available tunable parameters for the memory subsystem in this fine Red Hat manual. I’ve found 2:

memory.limit_in_bytes – sets the maximum amount of user memory (including file cache)
memory.memsw.limit_in_bytes – sets the maximum amount for the sum of memory and swap usage

Here is what I added to lxc-my.conf:

lxc.cgroup.memory.limit_in_bytes = 2M
lxc.cgroup.memory.memsw.limit_in_bytes = 2M

Launch again:

# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
#

Nothing happened; it looks like that’s way too little memory. Let’s try to launch it from a shell in the container.

# lxc-execute -n foo -f ./lxc-my.conf /bin/bash
#

Looks like bash failed to launch. Let’s try /bin/sh:

# lxc-execute -n foo -f ./lxc-my.conf -l DEBUG -o log /bin/sh
sh-4.2# ./dev/big_alloc/big_alloc 
Killed

Yay! We can see this nice act of killing in dmesg:

[15447.035569] big_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[15447.035779] Task in /lxc/foo
[15447.035785]  killed as a result of limit of 
[15447.035789] /lxc/foo

[15447.035795] memory: usage 3072kB, limit 3072kB, failcnt 127
[15447.035800] memory+swap: usage 3072kB, limit 3072kB, failcnt 0
[15447.035805] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[15447.035808] Memory cgroup stats for /lxc/foo: cache:32KB rss:3040KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:1588KB active_anon:1448KB inactive_file:16KB active_file:16KB unevictable:0KB
[15447.035836] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[15447.035963] [ 9225]     0  9225      942      308      10        0 0 init.lxc
[15447.035971] [ 9228]     0  9228      833      698       6        0 0 sh
[15447.035978] [ 9252]     0  9252    16106      843      36        0 0 big_alloc
[15447.035983] Memory cgroup out of memory: Kill process 9252 (big_alloc) score 1110 or sacrifice child
[15447.035990] Killed process 9252 (big_alloc) total-vm:64424kB, anon-rss:2396kB, file-rss:976kB

Though we haven’t seen an error message from big_alloc about the malloc failure and how much memory we were able to get, I think we’ve successfully restricted memory via container technology and can stop with it for now.

Linker

Now, let’s try to modify the binary image to limit the space available for the heap.

Linking is the final part of building a program, and it involves a linker and a linker script. A linker script is a description of the program’s sections in memory along with their attributes.

Here is a simple linker script:

ENTRY(main)

SECTIONS
{
  . = 0x10000;
  .text : { *(.text) }
  . = 0x8000000;
  .data : { *(.data) }
  .bss : { *(.bss) }
}

The dot is the current location. What that script tells us is that the .text section starts at address 0x10000, and then starting from 0x8000000 we have 2 consecutive sections, .data and .bss. The entry point is main.

Nice and sweet, but it will not work for any useful application. The reason is that the main function you write in a C program is not actually the first function being called. There is a whole lot of initialization and cleanup code. That code is provided by the C runtime (shortened to crt) and is spread across the crt*.o object files in /usr/lib.

You can see the exact details if you launch gcc with the -v option. You’ll see that it first invokes cc1 to create assembly, then translates it to an object file with as, and finally combines everything into an ELF file with collect2. collect2 is a wrapper around ld. It takes your object file and 5 additional object files to create the final binary image:

/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crt1.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crti.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtbegin.o
/tmp/ccEZwSgF.o <-- This one is our program object file
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtend.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crtn.o

It’s really complicated, so instead of writing my own script I’ll modify the default linker script. Get the default linker script by passing -Wl,-verbose to gcc:

gcc big_alloc.c -o big_alloc -Wl,-verbose

Now let’s figure out how to modify it. Let’s see how our binary is built by default. Compile it and look for the .data section address. Here is the objdump -h big_alloc output:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
...
12 .text         000002e4  080483e0  080483e0  000003e0  2**4
                 CONTENTS, ALLOC, LOAD, READONLY, CODE
...
23 .data         00000004  0804a028  0804a028  00001028  2**2
                 CONTENTS, ALLOC, LOAD, DATA
24 .bss          00000004  0804a02c  0804a02c  0000102c  2**2
                 ALLOC

The .text, .data and .bss sections are located near 128 MiB.

Now, let’s see where the stack is with the help of gdb:

[restrict-memory]$ gdb big_alloc
...
Reading symbols from big_alloc...done.
(gdb) break main
Breakpoint 1 at 0x80484fa: file big_alloc.c, line 12.
(gdb) r
Starting program: /home/avd/dev/restrict-memory/big_alloc 

Breakpoint 1, main (argc=1, argv=0xbffff164) at big_alloc.c:12
12              int i = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
(gdb) info registers 
eax            0x1      1
ecx            0x9a8fc98f       -1701852785
edx            0xbffff0f4       -1073745676
ebx            0x42427000       1111650304
esp            0xbffff0a0       0xbffff0a0
ebp            0xbffff0c8       0xbffff0c8
esi            0x0      0
edi            0x0      0
eip            0x80484fa        0x80484fa <main+10>
eflags         0x286    [ PF SF IF ]
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51

esp points to 0xbffff0a0, which is near 3 GiB. So we have ~2.9 GiB for the heap.

In the real world the stack top address is randomized; you can see it, for example, in the output of

# cat /proc/self/maps

As we all know, the heap grows up from the end of the data segment (.data followed by .bss) towards the stack. What if we move the .data section to the highest possible address?

Let’s put the data segment 2 MiB below the stack. Take the stack top and subtract 2 MiB:

0xbffff0a0 - 0x200000 = 0xbfdff0a0

Now shift all sections starting with .data to that address:

. = 0xbfdff0a0;
.data           :
{
  *(.data .data.* .gnu.linkonce.d.*)
  SORT(CONSTRUCTORS)
}

Compile it:

$ gcc big_alloc.c -o big_alloc -Wl,-T hack.lst

-Wl passes an option through to the linker, and -T hack.lst is the linker option itself. It tells the linker to use hack.lst as the linker script.

Now, if we look at the section headers we’ll see this:

Sections:
Idx Name          Size      VMA       LMA       File off  Algn

 ...

 23 .data         00000004  bfdff0a0  bfdff0a0  000010a0  2**2
                  CONTENTS, ALLOC, LOAD, DATA
 24 .bss          00000004  bfdff0a4  bfdff0a4  000010a4  2**2
                  ALLOC

But nevertheless, it successfully allocates. How? That’s really neat. When I looked at the pointer values that malloc returns, I saw that allocation starts somewhere near the .data section, at an address like 0xbf8b7000, continues for some time with increasing pointers and then jumps to a lower address like 0xb7676000. From that address it allocates for some time with increasing pointers and then jumps again to an even lower address like 0xb5e76000. Overall, it looks like the heap is growing down!

But if you think about it for a minute, it isn’t really that strange. I’ve examined the glibc sources and found out that when brk fails, malloc uses mmap instead. So glibc asks the kernel to map some pages, the kernel sees that the process has lots of holes in its virtual address space and maps pages from there, and finally glibc returns a pointer into those pages.

Running big_alloc under strace confirmed the theory. Just look at the normal binary:

brk(0)                                  = 0x8135000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77df000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77c7000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c6000
mprotect(0x42425000, 8192, PROT_READ)   = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0x42269000, 4096, PROT_READ)   = 0
munmap(0xb77c7000, 95800)               = 0
brk(0)                                  = 0x8135000
brk(0x8156000)                          = 0x8156000
brk(0)                                  = 0x8156000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77de000
brk(0)                                  = 0x8156000
brk(0x8188000)                          = 0x8188000
brk(0)                                  = 0x8188000
brk(0x81ba000)                          = 0x81ba000
brk(0)                                  = 0x81ba000
brk(0x81ec000)                          = 0x81ec000
...
brk(0)                                  = 0x9c19000
brk(0x9c4b000)                          = 0x9c4b000
brk(0)                                  = 0x9c4b000
brk(0x9c7d000)                          = 0x9c7d000
brk(0)                                  = 0x9c7d000
brk(0x9caf000)                          = 0x9caf000
...
brk(0)                                  = 0xe29c000
brk(0xe2ce000)                          = 0xe2ce000
brk(0)                                  = 0xe2ce000
brk(0xe300000)                          = 0xe300000
brk(0)                                  = 0xe300000
brk(0)                                  = 0xe300000
brk(0x8156000)                          = 0x8156000
brk(0)                                  = 0x8156000
+++ exited with 0 +++

And now the modified binary:

brk(0)                                  = 0xbf896000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778f000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7777000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7776000
mprotect(0x42425000, 8192, PROT_READ)   = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0x42269000, 4096, PROT_READ)   = 0
munmap(0xb7777000, 95800)               = 0
brk(0)                                  = 0xbf896000
brk(0xbf8b7000)                         = 0xbf8b7000
brk(0)                                  = 0xbf8b7000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778e000
brk(0)                                  = 0xbf8b7000
brk(0xbf8e9000)                         = 0xbf8e9000
brk(0)                                  = 0xbf8e9000
brk(0xbf91b000)                         = 0xbf91b000
brk(0)                                  = 0xbf91b000
brk(0xbf94d000)                         = 0xbf94d000
brk(0)                                  = 0xbf94d000
brk(0xbf97f000)                         = 0xbf97f000
...
brk(0)                                  = 0xbff8e000
brk(0xbffc0000)                         = 0xbffc0000
brk(0)                                  = 0xbffc0000
brk(0xbfff2000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7676000
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7576000
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7476000
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7376000
...
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1c76000
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1b76000
brk(0)                                  = 0xbffc0000
brk(0xbfffa000)                         = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1a76000
brk(0)                                  = 0xbffc0000
brk(0)                                  = 0xbffc0000
brk(0)                                  = 0xbffc0000
...
brk(0)                                  = 0xbffc0000
brk(0)                                  = 0xbffc0000
brk(0)                                  = 0xbffc0000
+++ exited with 0 +++

That being said, shifting the .data section up towards the stack (thus reducing the space for the heap) is pointless, because the kernel will map pages for malloc from the empty areas of virtual memory.

Sandbox

Another way to restrict program memory is sandboxing. The difference from emulation is that we’re not really emulating anything; instead, we track and control certain aspects of the program’s behavior. Sandboxing is usually used in security research, when you have some kind of malware and need to analyze it without harming your system.

I’ve come up with several sandboxing methods and implemented the most promising ones.

LD_PRELOAD trick

LD_PRELOAD is a special environment variable that, when set, makes the dynamic linker load the “preloaded” library before any other library, including libc. It’s used in a lot of scenarios, from debugging to, well, sandboxing.

This trick is also infamously used by some malware.

I have written a simple memory management sandbox that intercepts malloc/free calls, does memory usage accounting and fails with ENOMEM if the memory limit is exceeded.

To do this I have written a shared library with my own malloc/free wrappers that increment a counter by the malloc size and decrement it when free is called. This library is preloaded with LD_PRELOAD when running the application under test.

Here is my malloc implementation.

void *malloc(size_t size)
{
    void *p = NULL;

    if (libc_malloc == NULL) 
        save_libc_malloc();

    if (mem_allocated <= MEM_THRESHOLD)
    {
        p = libc_malloc(size);
    }
    else
    {
        errno = ENOMEM;
        return NULL;
    }

    if (p != NULL && !no_hook) // account only for allocations that succeeded
    {
        no_hook = 1;
        account(p, size);
        no_hook = 0;
    }

    return p;
}

libc_malloc is a pointer to the original malloc from libc. no_hook is a thread-local flag. It is used so that malloc can be called inside the malloc hooks without recursing, an idea taken from Tetsuyuki Kobayashi’s presentation.

malloc is used implicitly in the account function by the uthash hash table library. Why use a hash table? Because when you call free you pass it only the pointer, and inside free you don’t know how much memory was allocated. So I have a hash table with the pointer as the key and the allocated size as the value. Here is what I do on malloc:

struct malloc_item *item, *out;

item = malloc(sizeof(*item));
item->p = ptr;
item->size = size;

HASH_ADD_PTR(HT, p, item);

mem_allocated += size;

fprintf(stderr, "Alloc: %p -> %zu\n", ptr, size);

mem_allocated is the static variable that is compared against the threshold in malloc.

Now, when free is called, here is what happens:

struct malloc_item *found;

HASH_FIND_PTR(HT, &ptr, found);
if (found)
{
    mem_allocated -= found->size;
    fprintf(stderr, "Free: %p -> %zu\n", found->p, found->size);
    HASH_DEL(HT, found);
    free(found);
}
else
{
    fprintf(stderr, "Freeing unaccounted allocation %p\n", ptr);
}

Yep, just decrement mem_allocated. It’s that simple.

But the really cool thing is that it works rock solid².

[restrict-memory]$ LD_PRELOAD=./libmemrestrict.so ./big_alloc
pp[0] = 0x25ac210
pp[1] = 0x25c5270
pp[2] = 0x25de2d0
pp[3] = 0x25f7330
pp[4] = 0x2610390
pp[5] = 0x26293f0
pp[6] = 0x2642450
pp[7] = 0x265b4b0
pp[8] = 0x2674510
pp[9] = 0x268d570
pp[10] = 0x26a65d0
pp[11] = 0x26bf630
pp[12] = 0x26d8690
pp[13] = 0x26f16f0
pp[14] = 0x270a750
pp[15] = 0x27237b0
pp[16] = 0x273c810
pp[17] = 0x2755870
pp[18] = 0x276e8d0
pp[19] = 0x2787930
pp[20] = 0x27a0990
malloc: Cannot allocate memory
Failed after 21 allocations

The full source code for the library is on GitHub.

So, LD_PRELOAD is a great way to restrict memory!

ptrace

ptrace is another feature that can be used to build a memory sandbox. ptrace is a system call that allows you to control the execution of another process. It’s available on various Unix-like operating systems including, of course, Linux.

ptrace is the foundation of tracers like strace and ltrace, of almost every sandboxing tool such as systrace, sydbox and mbox, and of all debuggers, including gdb itself.

I have built a custom tool with ptrace. It traces brk calls and measures the distance between the initial program break value and the new value set by each subsequent brk call.

This tool forks and becomes 2 processes. The parent process is the tracer and the child process is the tracee. In the child process I call ptrace(PTRACE_TRACEME) and then execv. In the parent I use ptrace(PTRACE_SYSCALL) to stop on syscall entry and filter brk calls from the child, and then another ptrace(PTRACE_SYSCALL) to get the brk return value.

When brk exceeds the threshold I set -ENOMEM as the brk return value. The return value lives in the eax register, so I just overwrite it with ptrace(PTRACE_SETREGS). Here is the meaty part:

// Get return value
if (!syscall_trace(pid, &state))
{
    dbg("brk return: 0x%08X, brk_start 0x%08X\n", state.eax, brk_start);

    if (brk_start) // We have start of brk
    {
        diff = state.eax - brk_start;

        // If child process exceeded threshold 
        // replace brk return value with -ENOMEM
        if (diff > THRESHOLD || threshold) 
        {
            dbg("THRESHOLD!\n");
            threshold = true;
            state.eax = -ENOMEM;
            ptrace(PTRACE_SETREGS, pid, 0, &state);
        }
        else
        {
            dbg("diff 0x%08X\n", diff);
        }
    }
    else
    {
        dbg("Assigning 0x%08X to brk_start\n", state.eax);
        brk_start = state.eax;
    }
}

I also intercept mmap/mmap2 calls, because libc is smart enough to call them when brk fails. So once the threshold is exceeded and I see an mmap call, I just fail it with ENOMEM.

It works!

[restrict-memory]$ ./ptrace-restrict ./big_alloc
pp[0] = 0x8958fb0
pp[1] = 0x8971fb8
pp[2] = 0x898afc0
pp[3] = 0x89a3fc8
pp[4] = 0x89bcfd0
pp[5] = 0x89d5fd8
pp[6] = 0x89eefe0
pp[7] = 0x8a07fe8
pp[8] = 0x8a20ff0
pp[9] = 0x8a39ff8
pp[10] = 0x8a53000
pp[11] = 0x8a6c008
pp[12] = 0x8a85010
pp[13] = 0x8a9e018
pp[14] = 0x8ab7020
pp[15] = 0x8ad0028
pp[16] = 0x8ae9030
pp[17] = 0x8b02038
pp[18] = 0x8b1b040
pp[19] = 0x8b34048
pp[20] = 0x8b4d050
malloc: Cannot allocate memory
Failed after 21 allocations

But… I don’t really like it. It’s ABI specific, i.e. it has to use rax instead of eax on a 64-bit machine, so I’d either have to make a different version of the tool, use #ifdef to cope with ABI differences, or make you build it with the -m32 option. That’s not really usable. It also probably won’t work on other Unix-like systems, because they might have a different ABI.

Other

There are also other things one might try, which I rejected for different reasons:

malloc hooks. Deprecated, according to the man page, so I didn’t bother trying them.
Seccomp and prctl with PR_SET_MM_START_BRK. This might work, but as the seccomp filtering kernel documentation says, it’s not a sandbox but a “mechanism for minimizing the exposed kernel surface”. So I guess it would be even more awkward than using ptrace by hand. Though I might look at it sometime.
libvirt-sandbox. Nope, it’s just a wrapper over lxc and qemu.
SELinux sandbox. Nope. It just doesn’t work, though it uses cgroups.

Recap

In the end, I’d like to recap:

There are a lot of ways to restrict memory:
- Resource limiting with ulimit and cgroups
- Running under an emulator like QEMU
- Sandboxing with LD_PRELOAD and ptrace
- Modifying segments in the binary image
But not all of them work:
- ulimit doesn’t work.
- cgroups kinda work, but they crash the application.
- Emulation works, but it crashes the application.
- LD_PRELOAD works amazingly!
- ptrace works well enough but is ABI dependent.
- Linker magic doesn’t work because the ingenious libc calls mmap.

References

¹ I think I’ve just invented a new term for QA guys.
² Unless the application itself uses LD_PRELOAD :-\