Now that we have gone over the needed definitions and number notations, we can begin to look at the assembly conversions of some code in C.

Let's start with the simplest possible function, one that does nothing.

void f(){
    return;
}

Let's now look at this code when compiled to assembly for different platforms.

x86

f:
        ret

There is only one instruction here, RET, which returns execution to the caller.

ARM

f       PROC
        BX      lr
        ENDP      

In ARM, the return address isn’t saved on the local stack, but rather in the link register, shown here as lr, and so the instruction BX LR causes execution to jump to the address stored in lr, effectively returning execution to the caller of the function.

Let's look at another simple function, which simply returns a constant value.

int f(){
    return 123;
}

Here are the results of compiling this code with optimisation for the various platforms.

x86

f:
        mov     eax, 123
        ret

There are only 2 instructions here. The first one, MOV copies the value 123 into the register EAX, which is typically used to store return values. The next instruction RET returns execution to the caller of the function f. The caller can then take the value from the EAX register.

ARM

f       PROC
        MOV     r0,#0x7b
        BX      lr
        ENDP

ARM uses the R0 register for returning values, and so 123 is copied into the register R0. Note that 0x7b is hexadecimal for 123 as mentioned in the previous post. It is worth mentioning at this point that in both the ISAs for x86 and ARM, the instruction MOV is misleading, as the value is copied rather than moved.
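To make the role of the return register a little more concrete, here is a minimal sketch of a caller. The name caller is hypothetical and not part of the original example. Assuming the call is not inlined, the compiler would typically read the result of f() from EAX on x86 (R0 on ARM) immediately after the call.

```c
/* Same function as above: the constant ends up in EAX (x86) or R0 (ARM). */
int f(void)
{
    return 123;
}

/* Hypothetical caller, added for illustration only. */
int caller(void)
{
    int value = f();   /* the result is picked up from the return register */
    return value + 1;
}
```

Compiling this without optimisation and looking at caller should show a call to f followed by an instruction that uses EAX (or R0), although the exact output depends on the compiler and flags.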

Now that we’ve got the hang of reading the assembly generated from C code, let’s move on to some more useful and well-known examples. We’ll use the famous Hello, world example from The C Programming Language, 2nd edition, by Brian W. Kernighan and Dennis M. Ritchie.

main.c

#include <stdio.h>

int main()
{
    printf("hello, world\n");
    return 0;
}

Now, I know that I said that this guide was going to be done on Ubuntu, but for the sake of brevity, the next example will be compiled on Windows with MSVC, as it makes the next section easy to visualise. We’re now going to generate an assembly listing file with the following command:

cl main.c /Famain.asm

From this, we get the file main.asm, generated with Intel-style syntax. Here is a section from this generated file.

_DATA	SEGMENT
$SG7334	DB	'hello, world', 0aH, 00H
	ORG $+2
_DATA	ENDS
PUBLIC	___local_stdio_printf_options
PUBLIC	__vfprintf_l
PUBLIC	_printf
PUBLIC	_main
EXTRN	___acrt_iob_func:PROC
EXTRN	___stdio_common_vfprintf:PROC
EXTRN	___asm__:PROC
_DATA	SEGMENT
COMM	?_OptionsStorage@?1??__local_stdio_printf_options@@9@9:QWORD							; `__local_stdio_printf_options'::`2'::_OptionsStorage
_DATA	ENDS
; Function compile flags: /Odtp
_TEXT	SEGMENT
_main	PROC
; File c:\users\leo\source\main.c
; Line 4
	push	ebp
	mov	ebp, esp
; Line 5
	push	OFFSET $SG7333
	call	___asm__
	add	esp, 4
; Line 6
	push	OFFSET $SG7334
	call	_printf
	add	esp, 4
; Line 7
	push	OFFSET $SG7335
	call	___asm__
	add	esp, 4
; Line 8
	xor	eax, eax
; Line 9
	pop	ebp
	ret	0
_main	ENDP
_TEXT	ENDS

In main.c, the string “hello, world\n” is a constant array of characters (of type const char[] in C++ terms), but it is not given a name. It is therefore up to the compiler to name it, and so it is given the internal name $SG7334, as shown in the segment _DATA.

Following this logic, main.c can be rewritten to behave exactly the same way as follows (a $ in an identifier is not standard C, but MSVC and gcc both accept it as an extension).

#include <stdio.h>

const char $SG7334[] = "hello, world\n";

int main()
{
        printf($SG7334);
        return 0;
}

If we look back at the assembly listing, you can see that the string "hello, world" is terminated by a zero-byte character, which is the standard for C/C++ strings.

$SG7334	DB	'hello, world', 0aH, 00H 

0aH (or 0xa) is the newline character, and 00H (or 0x0) is the zero byte that terminates the string.
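The byte layout can also be checked from C itself. The following is a minimal sketch; the names message, message_size and message_byte are hypothetical helpers introduced only for this illustration.

```c
#include <stddef.h>

/* Same bytes as the DB line: 'hello, world', 0aH, 00H */
static const char message[] = "hello, world\n";

/* Size of the array in bytes, including the terminating zero byte. */
size_t message_size(void)
{
    return sizeof message;
}

/* Byte stored at the given index, as an unsigned value. */
unsigned char message_byte(size_t index)
{
    return (unsigned char)message[index];
}
```

The 12 visible characters are followed by the newline at index 12 and the zero byte at index 13, so the array occupies 14 bytes in total.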

In the code segment _TEXT, there is only one function, main(). Like almost any function, main() starts with prologue instructions and ends with epilogue instructions. After the prologue, we can see a call to the printf() function: call _printf. (The two calls to ___asm__ around it appear to come from the __asm__("nop") markers used in the modified main.c further down, which MSVC treats as calls to an ordinary function; they can be ignored here.) Before the call is made, a pointer to the first character of $SG7334 is placed on the stack with the PUSH instruction: push OFFSET $SG7334. When printf() returns execution to main(), the pointer is still on the stack. Because it is no longer needed, the stack pointer (the ESP register) needs to be adjusted: add esp, 4 adds 4 to the value of ESP.

4 is added because this is a 32-bit program, in which a pointer takes up exactly 4 bytes on the stack. In an x86_64 program the same pointer would take 8 bytes.

Now we are going to do the same thing to main.c on an Ubuntu system, with some slight modifications to main.c:

main.c

#include <stdio.h>

int main()
{
    __asm__("nop");
    printf("hello, world\n");
    __asm__("nop");
    return 0;
}

The only difference here is that we’ve added the statement __asm__("nop"); on either side of our printf() call in order to keep track of where it is in the assembly output. The nop instruction does not do anything.

Compile the source and disassemble it with gdb (make sure that gdb is installed first):

gcc -m32 -std=c89 main.c -o main
gdb main
(gdb) set disassembly-flavor intel
(gdb) disassemble main

Output:

Dump of assembler code for function main:
   0x0804840b <+0>:     lea    ecx,[esp+0x4]
   0x0804840f <+4>:     and    esp,0xfffffff0
   0x08048412 <+7>:     push   DWORD PTR [ecx-0x4]
   0x08048415 <+10>:    push   ebp
   0x08048416 <+11>:    mov    ebp,esp
   0x08048418 <+13>:    push   ecx
   0x08048419 <+14>:    sub    esp,0x4
   0x0804841c <+17>:    nop
   0x0804841d <+18>:    sub    esp,0xc
   0x08048420 <+21>:    push   0x80484c0
   0x08048425 <+26>:    call   0x80482e0 <puts@plt>
   0x0804842a <+31>:    add    esp,0x10
   0x0804842d <+34>:    nop
   0x0804842e <+35>:    mov    eax,0x0
   0x08048433 <+40>:    mov    ecx,DWORD PTR [ebp-0x4]
   0x08048436 <+43>:    leave
   0x08048437 <+44>:    lea    esp,[ecx-0x4]
   0x0804843a <+47>:    ret
End of assembler dump.

The result of this gcc assembly dump is pretty much the same as the result we got from MSVC. The instruction and esp,0xfffffff0 aligns the ESP register to a 16 byte boundary, resulting in all values in the stack being aligned the same way. sub esp,0x4 allocates 4 bytes on the stack.
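The effect of the and instruction is just a bit mask that clears the low four bits. A minimal sketch in C, where align_down_16 is a hypothetical helper and the stack pointer values used with it are made up for illustration:

```c
#include <stdint.h>

/* Mirrors "and esp,0xfffffff0": clear the low 4 bits so that the
   value becomes a multiple of 16 (rounding downwards). */
uint32_t align_down_16(uint32_t esp)
{
    return esp & 0xfffffff0u;
}
```

For example, an ESP of 0xffffd12c would become 0xffffd120, while a value that is already a multiple of 16 is left unchanged.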

One thing that you’ll notice is the following section of instructions:

   0x0804841c <+17>:    nop
   0x0804841d <+18>:    sub    esp,0xc
   0x08048420 <+21>:    push   0x80484c0
   0x08048425 <+26>:    call   0x80482e0 <puts@plt>
   0x0804842a <+31>:    add    esp,0x10
   0x0804842d <+34>:    nop

This section has nop on either side. Because of this, you know that the instructions in between the two occurrences of nop are directly related to the printf() call in main.c. The three instructions that we are mostly interested in are push 0x80484c0, call 0x80482e0 <puts@plt> and add esp,0x10, which are involved in printing ‘hello, world’.

push 0x80484c0 pushes the value 0x80484c0, the address of the string, onto the top of the stack. The instruction call 0x80482e0 <puts@plt> pushes the return address (the address of the instruction following the call, in this case 0x0804842a) onto the stack and sets the EIP register to the call destination, 0x80482e0, transferring control to the target so that execution continues there. Once the function returns, add esp,0x10 adds 10h (16 in decimal) to the stack pointer, which discards the 4-byte argument together with the 12 bytes reserved by sub esp,0xc.
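The stack pointer arithmetic around the call can be followed step by step. The sketch below simply replays the changes to ESP in C; esp_after_call_sequence is a hypothetical helper, the starting value passed to it is arbitrary, and it assumes that the called function returns with a plain ret that pops only the return address.

```c
#include <stdint.h>

/* Replays the effect on ESP of the instructions between the two nops. */
uint32_t esp_after_call_sequence(uint32_t esp)
{
    esp -= 0xc;   /* sub esp,0xc : 12 bytes of padding         */
    esp -= 4;     /* push        : 4-byte pointer argument     */
    esp -= 4;     /* call        : pushes the return address   */
    esp += 4;     /* ret (in the callee) pops the return address */
    esp += 0x10;  /* add esp,0x10: drops argument and padding  */
    return esp;
}
```

The padding and the argument add up to 16 bytes, so ESP ends up exactly where it started and stays on the same 16-byte boundary.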

One thing that you may notice is that the function being called is puts() rather than printf(). Because puts() appends a newline itself, the string constant stored in memory is hello, world rather than hello, world\n. gcc recognises that this printf() call can be replaced with a simpler function, and it does so for a small set of patterns. printf() is replaced with puts() when the format string is exactly "%s\n", or when the only argument is a string constant that ends with a newline and contains no format specifiers, as in our main.c. Similarly, it is replaced with putchar() when the format string is "%c", or when the string constant is a single character.
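As a rough illustration of these patterns, each printf() call below is paired with the simpler call that produces the same output. This is only a sketch: print_pairs is a hypothetical function, and whether gcc actually performs each substitution depends on the compiler version and flags.

```c
#include <stdio.h>

/* Each line prints the same bytes twice: once through printf()
   and once through the simpler function it may be replaced with. */
void print_pairs(const char *s, char c)
{
    printf("hello, world\n");  puts("hello, world");  /* constant ending in newline */
    printf("%s\n", s);         puts(s);               /* format is exactly "%s\n"   */
    printf("%c", c);           putchar(c);            /* format is exactly "%c"     */
    printf("x");               putchar('x');          /* single-character constant  */
}
```

Note that the return values are ignored here. printf() returns the number of characters written, which puts() and putchar() do not, so the replacement only makes sense when the result of printf() is unused.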

We can see the difference that this optimization makes by explicitly disabling it on compilation as such:

gcc main.c -std=c89 -m32 -fno-builtin-printf -o main
Dump of assembler code for function main:
   0x0804840b <+0>:     lea    ecx,[esp+0x4]
   0x0804840f <+4>:     and    esp,0xfffffff0
   0x08048412 <+7>:     push   DWORD PTR [ecx-0x4]
   0x08048415 <+10>:    push   ebp
   0x08048416 <+11>:    mov    ebp,esp
   0x08048418 <+13>:    push   ecx
   0x08048419 <+14>:    sub    esp,0x4
   0x0804841c <+17>:    nop
   0x0804841d <+18>:    sub    esp,0xc
   0x08048420 <+21>:    push   0x80484c0
   0x08048425 <+26>:    call   0x80482e0 <printf@plt>
   0x0804842a <+31>:    add    esp,0x10
   0x0804842d <+34>:    nop
   0x0804842e <+35>:    mov    eax,0x0
   0x08048433 <+40>:    mov    ecx,DWORD PTR [ebp-0x4]
   0x08048436 <+43>:    leave
   0x08048437 <+44>:    lea    esp,[ecx-0x4]
   0x0804843a <+47>:    ret
End of assembler dump.

As you can see, printf() is now called rather than puts().

Now that we’ve gone over some basic disassembling and debugging of simple programs, we will be able to go over some more advanced techniques in the next post.


x89k

Python, C, Reverse Engineering, Security