Welcome to the first lesson in the ‘Reverse Engineering Basics’ series. You should be working on Ubuntu 16.04 or later, or any *NIX platform that you are confident with. Ensure that you have gcc, g++ and appropriate compilers for 64 and 32 bit programs. I will mention some more specific programs needed when they come up in the series.
Let's first go over some definitions and a brief introduction to the CPU.
Machine code: Code consisting of a series of instructions that is directly processed by the CPU.
Instruction: A basic command for the CPU. Simple examples include moving data between registers, working with memory, and performing basic arithmetic operations. As a general rule, each CPU has its own Instruction Set Architecture (ISA).
Assembly language: A low-level symbolic programming language whose statements correspond closely to the machine code instructions of a given architecture. An assembler converts assembly code into machine code, which makes programming easier than writing machine code by hand.
CPU register: Each CPU has a fixed number of general purpose registers. x86 typically has 8, while x86_64 and ARM typically have 16. The simplest way to understand a register is to think of it as a temporary untyped variable.
Because higher level languages are easy for people to understand and low level native machine code is easy for CPUs to understand, most modern programming is done through a higher level programming language which is converted to machine code through a compiler.
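To make the "temporary untyped variable" description of a register a little more concrete, here is a minimal Python sketch. It assumes a 32-bit value and uses the standard struct module to stand in for a register; the variable names are only illustrative. The same bits give different numbers depending on how the reading code chooses to interpret them.

```python
import struct

# The same 32 bits, as they might sit in a 32-bit register.
raw = bytes.fromhex("ffffffff")

# Interpreted as an unsigned 32-bit integer.
as_unsigned = struct.unpack("<I", raw)[0]

# Interpreted as a signed 32-bit integer (two's complement).
as_signed = struct.unpack("<i", raw)[0]

print(as_unsigned)
print(as_signed)
```

The register itself carries no type; the instruction that uses it decides whether those bits mean 4294967295 or -1.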
People are typically accustomed to the decimal number system, base 10, likely because humans have 10 fingers. However, "10" has no inherent significance in mathematics, and so it makes sense that computers deal with numbers in binary, with a 1 or 0 representing whether or not electricity is flowing in a wire. If a number system has 10 digits, it has a radix (or base) of 10. Binary, with 2 digits, has a radix of 2.
There are two important notes to remember:
A number is a value, while a digit is a symbol from a writing system, typically a single character.
The value of a number does not change at all when converted to a different radix, only the notation used is changed.
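As a quick check of the second note, the sketch below writes one value in four notations that Python happens to accept as integer literals. The variable names are illustrative only.

```python
# Four spellings of the same number in Python source code.
decimal = 22
binary = 0b10110
octal = 0o26
hexadecimal = 0x16

# All four names hold exactly the same value.
same_value = decimal == binary == octal == hexadecimal
print(same_value)

# Asking for a different radix only changes the notation.
print(bin(decimal), oct(decimal), hex(decimal))
```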
Positional notation is used by the vast majority of number systems: each digit has a weight determined by its position within the number.
1234 really stands for:
10³ × 1 + 10² × 2 + 10¹ × 3 + 10⁰ × 4 = 1234
Following this logic, it’s easy to show how Binary works in a similar fashion.
0b10110 really stands for:
2⁴ × 1 + 2³ × 0 + 2² × 1 + 2¹ × 1 + 2⁰ × 0 = 22
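The two expansions above follow the same rule, which can be sketched in a few lines of Python. The function name positional_value is hypothetical and the digits are passed as a list, most significant first.

```python
def positional_value(digits, radix):
    """Sum digit * radix**position, counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(digits)):
        total += digit * radix ** position
    return total

# The decimal and binary examples from the text.
print(positional_value([1, 2, 3, 4], 10))
print(positional_value([1, 0, 1, 1, 0], 2))
```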
With a similar system, base 16, or hexadecimal, numbers can be used to express larger numbers with fewer digits. Hexadecimal uses the digits 0 through 9 followed by the letters A through F.
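Going the other way, from a value to its digits, can be done by repeatedly dividing by the radix and collecting the remainders. The sketch below is one simple way to do it, not the only one. The names DIGITS and to_radix are illustrative, and it only handles non-negative integers and radixes from 2 to 16.

```python
DIGITS = "0123456789ABCDEF"

def to_radix(number, radix):
    """Write a non-negative integer in the given radix (2 to 16)."""
    if not 2 <= radix <= len(DIGITS):
        raise ValueError("radix must be between 2 and 16")
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return "0"
    out = []
    while number > 0:
        # The remainder is the next digit, least significant first.
        number, remainder = divmod(number, radix)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))

value = 17135
for radix in (2, 10, 16):
    text = to_radix(value, radix)
    print(radix, text, len(text))
```

Running it shows how many digits the same value needs in each radix, which is why hexadecimal is popular for writing large binary values compactly.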
Numbers written in different radixes can look identical (10 could mean ten, two, or sixteen), and so conventions exist to tell the notations apart.
Decimal numbers are typically written without any prefix or suffix, e.g. 1234. Some assemblers allow a d suffix on a decimal number, in which case the number would be written 1234d.
Binary numbers are often given the 0b prefix, e.g. 0b10110. Occasionally, binary numbers are instead marked with a b postfix, e.g. 10110b.
Hexadecimal numbers are typically prefixed with 0x, e.g. 0x42EF. Sometimes they are given the postfix h instead, e.g. 42EFh. In most assemblers that use the h postfix, a leading 0 is added if the number begins with a letter digit (A through F), e.g. 0ABCDh, so that it cannot be mistaken for a name.
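To see how these conventions can be told apart mechanically, here is a rough Python sketch. The function parse_literal is hypothetical and does not reproduce any particular assembler's rules; it just checks for the markers described above in a fixed order.

```python
def parse_literal(text):
    """Return the integer value of a literal written with a common radix marker."""
    s = text.strip().lower()
    if s.startswith("0x"):          # 0x prefix: hexadecimal
        return int(s[2:], 16)
    if s.endswith("h"):             # h postfix: hexadecimal
        return int(s[:-1], 16)
    if s.startswith("0b"):          # 0b prefix: binary
        return int(s[2:], 2)
    if s.endswith("b"):             # b postfix: binary
        return int(s[:-1], 2)
    if s.endswith("d"):             # d postfix: decimal
        return int(s[:-1], 10)
    return int(s, 10)               # no marker: decimal

for literal in ("1234", "1234d", "0b10110", "10110b", "0x42EF", "42EFh", "0ABCDh"):
    print(literal, "=", parse_literal(literal))
```

The order of the checks matters in this sketch: the h postfix is tested before the 0b prefix so that something like 0B1h is read as hexadecimal, not as a malformed binary number.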
One other number system often used in computer programming is octal, with a radix of 8. In octal, there are 8 digits, 0 through 7, and each digit maps to exactly 3 bits of data. One interesting modern use of octal is in the *NIX utility chmod, whose mode argument can be written as 3 octal digits: one for the owner, one for the group, and one for everyone else. The three bits of each digit represent the read, write and execute permissions, so each digit can be read directly in binary form, e.g. 7 is 111 (rwx) and 5 is 101 (r-x).
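As an illustration of the mapping between octal digits and permission bits, the sketch below decodes a three-digit mode into the familiar rwx string. The function name describe_mode is hypothetical, and it ignores the extra setuid, setgid and sticky bits that real modes can carry.

```python
def describe_mode(mode):
    """Render a three-digit octal mode such as 0o754 as an rwx string."""
    out = ""
    for shift in (6, 3, 0):             # owner, group, others
        bits = (mode >> shift) & 0b111  # one octal digit is three bits
        out += "r" if bits & 0b100 else "-"
        out += "w" if bits & 0b010 else "-"
        out += "x" if bits & 0b001 else "-"
    return out

print(describe_mode(0o754))
print(describe_mode(0o644))
```

So a command like chmod 754 gives the owner read, write and execute, the group read and execute, and everyone else read only.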
In a similar manner, floating point numbers can be distinguished from integers by appending .0 to the end, e.g. 123.0 rather than 123.
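The distinction is more than cosmetic. As a small sketch, assuming 32-bit storage for both values and using Python's struct module, the integer 123 and the floating point 123.0 compare as equal but are stored as quite different bit patterns.

```python
import struct

as_int = 123
as_float = 123.0

# Python tracks the two as different types, although the values are equal.
print(type(as_int).__name__, type(as_float).__name__)
print(as_int == as_float)

# Packed into 32 bits (big-endian), the stored bytes differ.
int_bytes = struct.pack(">i", as_int).hex()
float_bytes = struct.pack(">f", as_float).hex()
print(int_bytes)
print(float_bytes)
```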
Now that we've gotten through some basic definitions and number systems, we can start getting into more technical material in the next post.