It was a nice, calm workday when suddenly a wild colleague appeared in front of my desk and asked:
"Hey, uhmm, could you help me with a strange thing?"
"Yeah, sure, what’s the matter?"
"I have data corruption and it’s happening in a really crazy manner."
In case you don’t know, data or memory corruption is about the nastiest bug that can happen in your program, especially when you are a storage developer.
So here was the case. We had a RAID calculation algorithm. Nothing fancy: just a bunch of functions that get a pointer to a buffer, do some math over that buffer and then return it. Initially the algorithm was written in userspace, which made debugging, correctness proofs and profiling simpler, and then it was ported to kernel space. And that’s where the problems started.
Firstly, when building from kbuild, gcc was just crashing, eating all the available memory. I was not surprised at all considering the file sizes: a dozen files of about 10 megabytes each. Yes, 10 MB. That was not surprising to me either: those sources were generated from assembly and were actually a bunch of intrinsics. Anyway, it would have been much better if gcc had not just crashed.
So we just wrote a separate Makefile to build the object files that would later be linked into the kernel module.
Secondly, the data was not corrupted every time. When you read 1 GB from the disks it was fine. When you read 2 GB, sometimes it was OK and sometimes not.
Thorough source code reading led nowhere. We saw that the memory buffer was being corrupted exactly in the calculation functions. But those functions were pure math: just calculation with no side effects. They didn’t call any library functions and they didn’t change anything except the passed buffer and local variables. And the changes they made to the buffer were right, while the corruption was really corruption: the calc functions simply could not generate such data.
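To make "pure math" concrete, here is a minimal sketch of what one such calc function could look like. It is not the real generated code: the name `calc_xor_block`, its signature and the plain 64-bit XOR loop are assumptions for illustration (the real sources used SSE intrinsics). The point is only that such a function writes to nothing except the buffer it was given and its own locals.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one generated calc function.
 * The name, signature and block size are assumptions for illustration. */
enum { CALC_BLOCK_BYTES = 128 };

/* XOR two source blocks into a destination block.
 * Touches only the passed buffers and local variables. */
int calc_xor_block(const uint64_t *a, const uint64_t *b, uint64_t *out)
{
    size_t words = CALC_BLOCK_BYTES / sizeof(uint64_t);

    for (size_t i = 0; i < words; i++)
        out[i] = a[i] ^ b[i];

    return CALC_BLOCK_BYTES; /* number of bytes processed */
}
```

A function like this has no obvious way to write garbage anywhere, which is exactly why the corruption looked so crazy.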
And then we saw pure magic. If we added a single printk(""); to the calc function, the data was not corrupted at all. I thought such things only happened in DailyWTF stories or developers’ jokes. We checked everything several times on different hosts, and the data was correct. Well, there was nothing left for us but to disassemble the object files to determine what was so special about printk.
So we did a diff between the 2 object files, with and without printk:
--- Calculation.s 2014-01-27 15:52:11.581387291 +0300
+++ Calculation_printk.s 2014-01-27 15:51:50.109512524 +0300
@@ -1,10 +1,15 @@
     .file "Calculation.c"
+    .section .rodata.str1.1,"aMS",@progbits,1
+.LC0:
+    .string ""
     .text
     .p2align 4,,15
 .globl Calculation_5d
     .type Calculation_5d, @function
 Calculation_5d:
 .LFB20:
+    subq $24, %rsp
+.LCFI0:
     movq (%rdi), %rax
     movslq %ecx, %rcx
     movdqa (%rax,%rcx), %xmm4
@@ -46,7 +51,7 @@
     pxor %xmm2, %xmm6
     movdqa 96(%rax,%rcx), %xmm2
     pxor %xmm5, %xmm1
-    movdqa %xmm14, -24(%rsp)
+    movdqa %xmm14, (%rsp)
     pxor %xmm15, %xmm2
     pxor %xmm5, %xmm0
     movdqa 112(%rax,%rcx), %xmm14
@@ -108,11 +113,16 @@
     movq 24(%rdi), %rax
     movdqa %xmm6, 80(%rax,%rcx)
     movq 24(%rdi), %rax
-    movdqa -24(%rsp), %xmm0
+    movdqa (%rsp), %xmm0
     movdqa %xmm0, 96(%rax,%rcx)
     movq 24(%rdi), %rax
+    movl $.LC0, %edi
     movdqa %xmm14, 112(%rax,%rcx)
+    xorl %eax, %eax
+    call printk
     movl $128, %eax
+    addq $24, %rsp
+.LCFI1:
     ret
 .LFE20:
     .size Calculation_5d, .-Calculation_5d
@@ -143,6 +153,14 @@
     .long .LFB20
     .long .LFE20-.LFB20
     .uleb128 0x0
+    .byte 0x4
+    .long .LCFI0-.LFB20
+    .byte 0xe
+    .uleb128 0x20
+    .byte 0x4
+    .long .LCFI1-.LCFI0
+    .byte 0xe
+    .uleb128 0x8
     .align 8
 .LEFDE1:
     .ident "GCC: (GNU) 4.4.5 20110214 (Red Hat 4.4.5-6)"
OK, it looks like not much changed: a string declaration in the .rodata section and a call to printk at the end. But what looked really strange to me was the change in the %rsp manipulations. Both versions seem to be doing the same thing, but in the printk version the offsets are shifted by 24 bytes, because at the start the function does subq $24, %rsp.
We didn’t care much about it at first. On the x86 architecture the stack grows down, i.e. to smaller addresses. So to access local variables (which live on the stack) you create a new stack frame by saving the current %rbp and shifting %rsp, thus allocating space on the stack. This is called the function prologue, and it was absent in our assembly function without printk.
You need this stack manipulation later to access your local vars by subtracting from %rbp. But we were subtracting from %rsp. Isn’t that strange?
Wait a minute… I decided to draw the stack frame and got it!
Holy shucks! We are working with memory we never allocated. Every instruction like movdqa -24(%rsp), %xmm0, which moves aligned data from address %rsp - 24 into %xmm0, is actually an access beyond the top of the stack!
I was really shocked. So shocked that I even asked on Stack Overflow. And the answer was: the red zone.
In short, the red zone is a 128-byte piece of memory beyond the stack top (that is, below %rsp) that, according to the amd64 ABI, must not be touched by any interrupt or signal handlers. And that is rock solid true, but only for userspace. When you are in kernel space, give up any hope of extra memory: stack is worth its weight in gold here. And you get a whole lot of interrupt handling here.
When an interrupt occurs, the interrupt handler uses the stack of the current kernel thread, and to avoid corrupting the thread’s data it keeps its own data beyond the stack top. But when our code was compiled with red zone support, the thread’s data was located beyond the stack top too, in the very same place as the interrupt handler’s data.
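The collision can be shown with a toy model. This is not kernel code: the `toy_stack` structure, the `toy_interrupt` function and the 40-byte handler frame are all made up for illustration. The "stack" is just a byte array that grows toward index 0, and an "interrupt" overwrites whatever lies directly below the current stack pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: sizes and names are assumptions, not real kernel values. */
enum { STACK_BYTES = 256, IRQ_FRAME_BYTES = 40, SPILL_BYTES = 16 };

/* The stack grows toward index 0; sp is the index of the current top. */
struct toy_stack {
    uint8_t mem[STACK_BYTES];
    size_t sp;
};

/* An interrupt arrives: the handler puts its data right below sp,
 * so everything in [sp - IRQ_FRAME_BYTES, sp) gets overwritten. */
static void toy_interrupt(struct toy_stack *s)
{
    memset(&s->mem[s->sp - IRQ_FRAME_BYTES], 0xAA, IRQ_FRAME_BYTES);
}

/* Red zone style: spill at sp - 24 without moving sp.
 * Returns 1 if the spilled value is intact after the interrupt. */
int spill_survives_with_red_zone(void)
{
    struct toy_stack s = { .sp = 128 };
    uint8_t value[SPILL_BYTES];
    memset(value, 0x5C, sizeof value);

    memcpy(&s.mem[s.sp - 24], value, SPILL_BYTES); /* movdqa %xmm14, -24(%rsp) */
    toy_interrupt(&s);                             /* lands between spill and reload */
    return memcmp(&s.mem[s.sp - 24], value, SPILL_BYTES) == 0;
}

/* No red zone: move sp first, then spill at sp. */
int spill_survives_without_red_zone(void)
{
    struct toy_stack s = { .sp = 128 };
    uint8_t value[SPILL_BYTES];
    memset(value, 0x5C, sizeof value);

    s.sp -= 24;                                    /* subq $24, %rsp */
    memcpy(&s.mem[s.sp], value, SPILL_BYTES);      /* movdqa %xmm14, (%rsp) */
    toy_interrupt(&s);
    int ok = memcmp(&s.mem[s.sp], value, SPILL_BYTES) == 0;
    s.sp += 24;                                    /* addq $24, %rsp */
    return ok;
}
```

In this model the first function loses its spilled value and the second keeps it. It also hints at why the corruption was intermittent: nothing goes wrong unless an interrupt happens to land between the spill and the reload.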
That’s why kernel compilation is done with the -mno-red-zone gcc flag. It’s set implicitly by kbuild. But remember that we weren’t able to build with kbuild because gcc was crashing every time due to the huge files.
Anyway, we just added EXTRA_CFLAGS += -mno-red-zone to our Makefile and it’s working now. But I still have a question: why does adding printk("") prevent gcc from using the red zone and make it allocate space for local variables with subq $24, %rsp?
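A likely answer to that question: the amd64 ABI only lets leaf functions keep their data in the red zone. As soon as a function makes a call, the call instruction pushes the return address right below %rsp and the callee builds its own frame there, so the compiler has to reserve stack space explicitly. Below is a minimal sketch to try this out. The function names are hypothetical, `log_marker` is only a stand-in for printk, and `__attribute__((noinline))` is a GCC/Clang extension. The exact assembly depends on the compiler version and flags, so treat the comments as what gcc may do rather than a guarantee.

```c
/* Counter that gives the call below a visible effect, so it cannot be removed. */
static volatile int marker_calls;

/* Stand-in for printk(""): any real call makes the caller a non-leaf function. */
__attribute__((noinline)) void log_marker(void)
{
    marker_calls++;
}

/* Leaf: makes no calls, so on x86-64 gcc may keep tmp in the red zone and
 * address it at negative offsets from %rsp, with no subq in the prologue. */
int sum_leaf(const int *src)
{
    volatile int tmp[4]; /* volatile keeps the array in memory */

    for (int i = 0; i < 4; i++)
        tmp[i] = src[i] * 2;
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

/* Non-leaf: same body plus one call, so the red zone cannot be used
 * and the compiler has to move %rsp to make room for tmp. */
int sum_nonleaf(const int *src)
{
    volatile int tmp[4];

    for (int i = 0; i < 4; i++)
        tmp[i] = src[i] * 2;
    int sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];
    log_marker();
    return sum;
}
```

To compare, compile the file with gcc -O2 -S, once as is and once with -mno-red-zone added, and look at how tmp is addressed in each function.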
So, that day I learned about a really tricky optimization that, at the cost of potential memory corruption, can save you a couple of instructions in every leaf function.
That’s all, folks!