diff --git a/content/CSE4303/CSE4303_L17.md b/content/CSE4303/CSE4303_L17.md index 67cccc5..0be9613 100644 --- a/content/CSE4303/CSE4303_L17.md +++ b/content/CSE4303/CSE4303_L17.md @@ -1,3 +1,430 @@ # CSE4303 Introduction to Computer Security (Lecture 17) -## Software security \ No newline at end of file +> Due to lack of my attention, this lecture note is generated by AI to create continuations of the previous lecture note. I kept this warning because the note was created by AI. + +#### Software security + +### Administrative notes + +#### Project details + +- Project plan + - Thursday, `4/9` at the end of class + - `5%` +- Written document and presentation recording + - Thursday, `4/30` at `11:30 AM` + - `15%` +- View peer presentations and provide feedback + - Wednesday, `5/6` at `11:59 PM` + - `5%` + +#### Upcoming schedule + +- This week (`3/20`) + - software security lecture + - studio + - some time for studio on Tuesday +- Next week (`4/6`) + - fuzzing + - some time to discuss project ideas +- `4/13` + - Web security +- `4/20` + - Privacy and ethics overview + - time to work on projects + - course wrap-up + +### Overview + +#### Outline + +- Context +- Prominent software vulnerabilities and exploits +- Buffer overflows + - Background: C code, compilation, memory layout, execution + - Baseline exploit + - Challenges + - Defenses, countermeasures, counter-countermeasures + +Sources: +- SEED lab book +- Gilbert/Tamassia book +- Slides from Bryant/O'Hallaron (CMU), Dan Boneh (Stanford), Michael Hicks (UMD) + +### Context + +#### Context: computing stack (informal) + +| Layer | Example | +| --- | --- | +| Application | web server, standalone app | +| Compiler / assembler | `gcc`, `clang` | +| OS: syscalls | `execve()`, `setuid()`, `write()`, `open()`, `fork()` | +| OS: processes, mem layout | Linux virtual memory layout | +| Architecture (ISA, execution) | x86, x86_64, ARM | +| Hardware | Intel Skylake processor | + +- User control is strongest near the 
application / compiler level. +- System control becomes more important as we move down toward OS, architecture, and hardware. + +### Prominent software vulnerabilities and exploits + +#### Software security: categories + +- Race conditions +- Privilege escalation +- Path traversal +- Environment variable modification +- Language-specific vulnerabilities + - Format string attack + - Buffer overflows + +#### Buffer Overflows (BoFs) + +- A buffer overflow is a bug that affects low-level code, typically in C and C++, with significant security implications. +- Normally, a program with this bug will simply crash. +- But an attacker can craft input that makes the program do something much worse than crash. + - Steal private information + - e.g. Heartbleed + - Corrupt valuable information + - Run code of the attacker's choice + +#### Application behavior + +- Slide contains a figure only. +- Intended point: normal application behavior can become attacker-controlled if input handling is unsafe. + +#### BoFs: why do we care? + +- Reference from slide: [IEEE Spectrum top programming languages 2025](https://spectrum.ieee.org/top-programming-languages-2025) + +#### Critical systems in C/C++ + +- Most OS kernels and utilities + - `fingerd` + - X Window server + - shell +- Many high-performance servers + - Microsoft IIS + - Apache `httpd` + - `nginx` + - Microsoft SQL Server + - MySQL + - `redis` + - `memcached` +- Many embedded systems + - Mars rover + - industrial control systems + - automobiles + +A successful attack on these systems can be particularly dangerous. + +#### Morris Worm + +- Slide contains a figure / historical reference only. +- It is included as an example of how memory-corruption vulnerabilities mattered in practice. + +#### Why do we still care? + +- The slide references the NVD search page: [NVD vulnerability search](https://nvd.nist.gov/vuln/search) +- Why the drop? 
+ - Memory-safe languages + - Rust + - Go + - Stronger defenses + - Fuzzing + - find bugs before release + - Change in development practices + - code review + - static analysis tools + - related engineering improvements + +#### MITRE Top 25 2025 + +- Reference from slide: [MITRE CWE Top 25](http://cwe.mitre.org/top25/) + +### Buffer overflows + +#### Outline + +- System Basics + - Application memory layout + - How does function call work under the hood + - `32-bit x86` only + - `64-bit x86_64` similar, but with important differences +- Buffer overflow + - Overwriting the return address pointer + - Point it to shell code injected + +#### Buffer Overflows (BoFs) + +- 2-minute version first, then all background / full version + +#### Process memory layout: virtual address space + +- Slide reference: [virtual address space reference](https://hungys.xyz/unix-prog-process-environment/) + +#### Process memory layout: function calls + +- Slide reference: [Tenouk function call figure 1](http://www.tenouk.com/Bufferoverflowc/Bufferoverflow2.html) +- Slide reference: [Tenouk function call figure 2](http://www.tenouk.com/Bufferoverflowc/Bufferoverflow4.html) + +#### Process memory layout: compromised frame + +- Slide reference: [Tenouk compromised frame figure](http://www.tenouk.com/Bufferoverflowc/Bufferoverflow4.html) + +#### Computer System + +High-level examples used in the slide: + +```c +car *c = malloc(sizeof(car)); +c->miles = 100; +c->gals = 17; +float mpg = get_mpg(c); +free(c); +``` + +```java +Car c = new Car(); +c.setMiles(100); +c.setGals(17); +float mpg = c.getMPG(); +``` + +Assembly-language example used in the slide: + +```asm +get_mpg: + pushq %rbp + movq %rsp, %rbp + ... + popq %rbp + ret +``` + +- The same computation can be viewed at multiple levels: + - C / Java source + - assembly language + - machine code + - operating system context + +#### Little Theme 1: Representation + +- All digital systems represent everything as `0`s and `1`s. 
+ - The `0` and `1` are really two different voltage ranges in wires. + - Or magnetic positions on a disk, hole depths on a DVD, or even DNA. +- "Everything" includes: + - numbers + - integers and floating point + - characters + - building blocks of strings + - instructions + - directives to the CPU that make up a program + - pointers + - addresses of data objects stored in memory +- These encodings are stored throughout the computer system. + - registers + - caches + - memories + - disks +- They all need addresses. + - find an item + - find a place for a new item + - reclaim memory when data is no longer needed + +#### Little Theme 2: Translation + +- There is a big gap between how we think about programs / data and the `0`s and `1`s of computers. +- We need languages to describe what we mean. +- These languages must be translated one level at a time. +- Example point from the slide: + - we know Java as a programming language + - but we must work down to the `0`s and `1`s of computers + - we try not to lose anything in translation + - we encounter Java bytecode, C, assembly, and machine code + +#### Little Theme 3: Control Flow + +- How do computers orchestrate everything they are doing? +- Within one program: + - How are `if/else`, loops, and switches implemented? + - How do we track nested procedure calls? + - How do we know what to do upon `return`? 
+- At the operating-system level: + - library loading + - sharing system resources + - memory + - I/O + - disks + +#### HW/SW Interface: Code / Compile / Run Times + +- Code time + - user program in C + - `.c` file +- Compile time + - C compiler + - assembler +- Run time + - executable `.exe` file + - hardware executes it +- Note from slide: + - the compiler and assembler are themselves just programs developed using this same process + +#### Assembly Programmer's View + +- Programmer-visible CPU / memory state + - Program counter + - address of next instruction + - called `RIP` in x86-64 + - Named registers + - heavily used program data + - together called the register file + - Condition codes + - store status information about most recent arithmetic operation + - used for conditional branching +- Memory + - byte-addressable array + - contains code and user data + - includes the stack for supporting procedures + +#### Turning C into Object Code + +- Code in files `p1.c` and `p2.c` +- Compile with: + +```bash +gcc -Og p1.c p2.c -o p +``` + +- Notes from the slide + - `-Og` uses basic optimizations + - resulting machine code goes into file `p` +- Translation chain + - C program -> assembly program -> object program -> executable program +- Associated tools + - compiler + - assembler + - linker + - static libraries (`.a`) + +#### Machine Instruction Example + +- C code + +```c +*dest = t; +``` + +- Meaning + - store value `t` where designated by `dest` +- Assembly + +```asm +movq %rsi, (%rdx) +``` + +- Interpretation + - move 8-byte value to memory + - operands + - `t` is in register `%rsi` + - `dest` is in register `%rdx` + - `*dest` means memory `M[%rdx]` +- Object code + +```text +0x400539: 48 89 32 +``` + +- It is a 3-byte instruction stored at address `0x400539`. 
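
The three object-code bytes `48 89 32` can be read field by field. The sketch below is illustrative only and assumes the standard x86-64 encoding: `0x48` is the REX.W prefix (64-bit operand size), `0x89` is the `MOV r/m, r` opcode, and `0x32` is the ModRM byte. The names `decode_modrm` and `store` are hypothetical helpers, not part of the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an x86 ModRM byte: mod (2 bits), reg (3 bits), rm (3 bits). */
struct modrm {
    unsigned mod;
    unsigned reg;
    unsigned rm;
};

/* Split a ModRM byte into its three bit fields. */
struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m;
    m.mod = (byte >> 6) & 0x3;  /* 0 = memory operand, no displacement */
    m.reg = (byte >> 3) & 0x7;  /* register number of the source operand */
    m.rm  = byte & 0x7;         /* register number holding the address */
    return m;
}

/* The C statement from the slide, wrapped in a function (hypothetical name). */
void store(long t, long *dest)
{
    assert(dest != NULL);
    *dest = t;
}
```

For `0x32` the fields come out as `mod = 0`, `reg = 6`, `rm = 2`. In the x86-64 register numbering, 6 is `%rsi` and 2 is `%rdx`, which matches `movq %rsi, (%rdx)`.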
+ +#### IA32 Registers - 32 bits wide + +- General-purpose register families shown in the slide + - `%eax`, `%ax`, `%ah`, `%al` + - `%ecx`, `%cx`, `%ch`, `%cl` + - `%edx`, `%dx`, `%dh`, `%dl` + - `%ebx`, `%bx`, `%bh`, `%bl` + - `%esi`, `%si` + - `%edi`, `%di` + - `%esp`, `%sp` + - `%ebp`, `%bp` +- Roles highlighted in the slide + - accumulate + - counter + - data + - base + - source index + - destination index + - stack pointer + - base pointer + +#### Data Sizes + +- Slide is primarily a figure summarizing common integer widths and sizes. + +#### Assembly Data Types + +- "Integer" data of `1`, `2`, `4`, or `8` bytes + - data values + - addresses / untyped pointers +- No aggregate types such as arrays or structures at the assembly level + - just contiguous bytes in memory +- Two common syntaxes + - `AT&T` + - used in the course, slides, textbook, GNU tools + - `Intel` + - used in Intel documentation and Intel tools +- Need to know which syntax you are reading because operand order may be reversed. 
+ +#### Three Basic Kinds of Instructions + +- Transfer data between memory and register + - load + - `%reg = Mem[address]` + - store + - `Mem[address] = %reg` +- Perform arithmetic on register or memory data + - examples: addition, shifting, bitwise operations +- Control flow + - unconditional jumps to / from procedures + - conditional branches + +#### Abstract Memory Layout + +```text +High addresses +Stack <- local variables, procedure context +Dynamic Data <- heap, new / malloc +Static Data <- globals / static variables +Literals <- large constants such as strings +Instructions +Low addresses +``` + +#### The ELF File Format + +- ELF = Executable and Linkable Format +- One of the most widely used binary object formats +- ELF is architecture-independent +- ELF file types + - Relocatable + - must be fixed by the linker before execution + - Executable + - ready for execution + - Shared + - shared libraries with linking information + - Core + - core dumps created when a program terminates with a fault +- Tools mentioned on slide + - `readelf` + - `file` + - `objdump -D` + +#### Process Memory Layout (32-bit x86 machine) + +- This slide is primarily a diagram. +- Key idea: a `32-bit x86` process has a standard virtual memory layout with code, static data, heap, and stack arranged in distinct regions. + +We continue with the concrete runtime layout and the actual overflow mechanics in Lecture 18. diff --git a/content/CSE4303/CSE4303_L18.md b/content/CSE4303/CSE4303_L18.md new file mode 100644 index 0000000..d398332 --- /dev/null +++ b/content/CSE4303/CSE4303_L18.md @@ -0,0 +1,594 @@ +# CSE4303 Introduction to Computer Security (Lecture 18) + +> Due to lack of my attention, this lecture note is generated by AI to create continuations of the previous lecture note. I kept this warning because the note was created by AI. 
+ +#### Software security + +### Overview + +#### Outline + +- Context +- Prominent software vulnerabilities and exploits +- Buffer overflows + - Background: C code, compilation, memory layout, execution + - Baseline exploit + - Challenges + - Defenses, countermeasures, counter-countermeasures + +### Buffer overflows + +#### All programs are stored in memory + +- The process's view of memory is that it owns all of it. +- For a `32-bit` process, the virtual address space runs from: + - `0x00000000` + - to `0xffffffff` +- In reality, these are virtual addresses. + - The OS and CPU map them to physical addresses. + +#### The instructions themselves are in memory + +- Program text is also stored in memory. +- The slide shows instructions such as: + +```asm +0x4c2 sub $0x224,%esp +0x4c1 push %ecx +0x4bf mov %esp,%ebp +0x4be push %ebp +``` + +- Important point: + - code and data are both memory-resident + - control flow therefore depends on values stored in memory + +#### Data's location depends on how it's created + +- Static initialized data example + +```c +static const int y = 10; +``` + +- Static uninitialized data example + +```c +static int x; +``` + +- Command-line arguments and environment are set when the process starts. +- Stack data appears when functions run. + +```c +int f() { + int x; + ... +} +``` + +- Heap data appears at runtime. + +```c +malloc(sizeof(long)); +``` + +- Summary from the slide + - Known at compile time + - text + - initialized data + - uninitialized data + - Set when process starts + - command line and environment + - Runtime + - stack + - heap + +#### We are going to focus on runtime attacks + +- Stack and heap grow in opposite directions. +- Compiler-generated instructions adjust the stack size at runtime. +- The stack pointer tracks the active top of the stack. +- Repeated `push` instructions place values onto the stack. 
+- The slides use the sequence: + - `push 1` + - `push 2` + - `push 3` + - `return` +- Heap allocation is apportioned by the OS and managed in-process by `malloc`. +- The lecture says: focusing on the stack for now. + +```text +0x00000000 0xffffffff +Heap ---------------------------------> <--------------------------------- Stack +``` + +#### Stack layout when calling functions + +Questions asked on the slide: + +- What do we do when we call a function? + - What data need to be stored? + - Where do they go? +- How do we return from a function? + - What data need to be restored? + - Where do they come from? + +Example used in the slide: + +```c +void func(char *arg1, int arg2, int arg3) +{ + char loc1[4]; + int loc2; + int loc3; +} +``` + +Important layout points: + +- Arguments are pushed in reverse order of code. +- Local variables are pushed in the same order as they appear in the code. +- The slide then introduces two unknown slots between locals and arguments. + +#### Accessing variables + +Example: + +```c +void func(char *arg1, int arg2, int arg3) +{ + char loc1[4]; + int loc2; + int loc3; + ... + loc2++; + ... +} +``` + +Question from the slide: +- Where is `loc2`? + +Step-by-step answer developed in the slides: + +- Its absolute address is undecidable at compile time. +- We do not know exactly where `loc2` is in absolute memory. +- We do not know how many arguments there are in general. +- But `loc2` is always a fixed offset before the frame metadata. +- This motivates the frame pointer. 
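
To make the fixed-offset idea concrete, here is a toy model of a 32-bit stack, written as a sketch under simplifying assumptions: every slot is one 4-byte word, the stack grows toward index 0, and `ebp` / `esp` are array indices rather than real addresses. All names (`toy_stack`, `toy_push`, `toy_enter`, `toy_local_index`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64

/* A toy stack: one array slot per 4-byte word, growing toward index 0. */
struct toy_stack {
    uint32_t mem[STACK_WORDS];
    int esp;  /* index of the current top-of-stack word */
    int ebp;  /* index of the word holding the saved frame pointer */
};

void toy_init(struct toy_stack *s)
{
    s->esp = STACK_WORDS;  /* empty stack: nothing pushed yet */
    s->ebp = STACK_WORDS;
}

void toy_push(struct toy_stack *s, uint32_t value)
{
    assert(s->esp > 0);
    s->mem[--s->esp] = value;
}

/* Mimic a callee prologue: save the old frame pointer, set the new one,
 * then reserve one word per local in source order. */
void toy_enter(struct toy_stack *s, int local_words)
{
    toy_push(s, (uint32_t)s->ebp);
    s->ebp = s->esp;
    for (int i = 0; i < local_words; i++)
        toy_push(s, 0);
}

/* Index of the local at byte offset -4*k from the frame pointer. */
int toy_local_index(const struct toy_stack *s, int k)
{
    return s->ebp - k;
}
```

However many words the caller pushed before the call, the second local always sits two words below the frame pointer, and the return address sits one word above it. Only the frame pointer itself changes from call to call.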
+ +Definitions from the slide: + +- Stack frame + - the current function call's region on the stack +- Frame pointer + - `%ebp` +- Example answer + - `loc2` is at `-8(%ebp)` + +#### Notation + +- `%ebp` + - a memory address stored in the frame-pointer register +- `(%ebp)` + - the value at memory address `%ebp` + - like dereferencing a pointer + +The slide sequence then shows: + +```asm +pushl %ebp +movl %esp, %ebp +``` + +- Meaning: + - first save the old frame pointer on the stack + - then set the new frame pointer to the current stack pointer + +#### Returning from functions + +Example caller: + +```c +int main() +{ + ... + func("Hey", 10, -3); + ... +} +``` + +Questions from the slides: + +- How do we restore `%ebp`? +- How do we resume execution at the correct place? + +Slide answers: + +- Push `%ebp` before locals. +- Set `%ebp` to current `%esp`. +- Set `%ebp` to `(%ebp)` at return. +- Push next `%eip` before `call`. +- Set `%eip` to `4(%ebp)` at return. + +#### Stack and functions: Summary + +- Calling function + - push arguments onto the stack in reverse order + - push the return address + - the address of the instruction that should run after control returns + - jump to the function's address +- Called function + - push old frame pointer `%ebp` onto the stack + - set frame pointer `%ebp` to current `%esp` + - push local variables onto the stack + - access locals as offsets from `%ebp` +- Returning function + - reset previous stack frame + - `%ebp = (%ebp)` + - jump back to return address + - `%eip = 4(%ebp)` + +#### Quick overview (again) + +- Buffer + - contiguous set of a given data type + - common in C + - all strings are buffers of `char` +- Overflow + - put more into the buffer than it can hold +- Question + - where does the extra data go? 
+- Slide answer + - now that we know memory layouts, we can reason about where the overwrite lands + +#### A buffer overflow example + +Example 1 from the slide: + +```c +void func(char *arg1) +{ + char buffer[4]; + strcpy(buffer, arg1); + ... +} + +int main() +{ + char *mystr = "AuthMe!"; + func(mystr); + ... +} +``` + +Step-by-step effect shown in the slides: + +- Initial stack region includes: + - `buffer` + - saved `%ebp` + - saved `%eip` + - `&arg1` +- First 4 bytes copied: + - `A u t h` +- Remaining bytes continue writing: + - `M e ! \0` +- Because `strcpy` keeps copying until it sees `\0`, bytes go past the end of the buffer. +- In the example, upon return: + - `%ebp` becomes `0x0021654d` +- Result: + - segmentation fault + - shown as `SEGFAULT (0x00216551)` in the slide sequence + +#### A buffer overflow example: changing control data vs. changing program data + +Example 2 from the slide: + +```c +void func(char *arg1) +{ + int authenticated = 0; + char buffer[4]; + strcpy(buffer, arg1); + if (authenticated) { ... } +} + +int main() +{ + char *mystr = "AuthMe!"; + func(mystr); + ... +} +``` + +Step-by-step effect shown in the slides: + +- Initial stack contains: + - `buffer` + - `authenticated` + - saved `%ebp` + - saved `%eip` + - `&arg1` +- Overflow writes: + - `A u t h` into `buffer` + - `M e ! \0` into `authenticated` +- Result: + - code still runs + - user now appears "authenticated" + +Important lesson: +- A buffer overflow does not need to crash. +- It may silently change program data or logic. 
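
The silent data overwrite in Example 2 can be reproduced without undefined behavior by modeling the two locals as adjacent fields of a struct. This is a sketch, assuming a 4-byte `int` with no padding after `buffer` (true on common platforms, and checked with an assertion); `frame_model`, `model_strcpy`, and `model_bounded_copy` are hypothetical names, and a real stack frame is laid out by the compiler rather than by a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Models the two locals from Example 2 as adjacent bytes in memory. */
struct frame_model {
    char buffer[4];
    int authenticated;
};

/* strcpy-like copy that starts at buffer[0] and ignores the buffer size.
 * It is clamped to the size of the model so the demo stays well defined. */
void model_strcpy(struct frame_model *f, const char *src)
{
    size_t n = strlen(src) + 1;  /* include the terminating '\0' */
    assert(offsetof(struct frame_model, authenticated) == sizeof f->buffer);
    if (n > sizeof *f)
        n = sizeof *f;
    memcpy((unsigned char *)f, src, n);
}

/* Bounded copy: at most sizeof buffer - 1 characters plus a terminator. */
void model_bounded_copy(struct frame_model *f, const char *src)
{
    size_t n = strlen(src);
    if (n > sizeof f->buffer - 1)
        n = sizeof f->buffer - 1;
    memcpy(f->buffer, src, n);
    f->buffer[n] = '\0';
}
```

With the input `"AuthMe!"`, `model_strcpy` leaves `Auth` in `buffer` and the bytes `M e ! \0` in `authenticated`, so the flag becomes nonzero without any crash. `model_bounded_copy` truncates the string and leaves the flag at `0`.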
+ +#### `gets` vs `fgets` + +Unsafe function shown in the slide: + +```c +void vulnerable() +{ + char buf[80]; + gets(buf); +} +``` + +Safer version shown in the slide: + +```c +void safe() +{ + char buf[80]; + fgets(buf, 64, stdin); +} +``` + +Even safer pattern from the next slide: + +```c +void safer() +{ + char buf[80]; + fgets(buf, sizeof(buf), stdin); +} +``` + +Reference from slide: +- [List of vulnerable C functions](https://security.web.cern.ch/security/recommendations/en/codetools/c.shtml) + +#### User-supplied strings + +- In the toy examples, the strings are constant. +- In reality they come from users in many ways: + - text input + - packets + - environment variables + - file input +- Validating assumptions about user input is extremely important. + +#### What's the worst that could happen? + +Using: + +```c +char buffer[4]; +strcpy(buffer, arg1); +``` + +- `strcpy` will let you write as much as you want until a `\0`. +- If attacker-controlled input is long enough, the memory past the buffer becomes "all ours" from the attacker's perspective. +- That raises the key question from the slide: + - what could you write to memory to wreak havoc? + +#### Code injection + +- Title-only transition slide. +- It introduces the move from accidental overwrite to deliberate attacker payloads. + +#### High-level idea + +Example used in the slide: + +```c +void func(char *arg1) +{ + char buffer[4]; + sprintf(buffer, arg1); + ... +} +``` + +Two-step plan shown in the slides: + +- 1. Load my own code into memory. +- 2. Somehow get `%eip` to point to it. + +The slide sequence draws this as: +- vulnerable buffer on stack +- attacker-controlled bytes placed in memory +- `%eip` redirected toward those bytes + +#### This is nontrivial + +- Pulling off this attack requires getting a few things really right, and some things only sorta right. +- The lecture says to think about what is tricky about the attack. 
+- Main security idea: + - the key to defending against it is to make the hard parts really hard + +#### Challenge 1: Loading code into memory + +- The attacker payload must be machine-code instructions. + - already compiled + - ready to run +- We have to be careful in how we construct it. + - It cannot contain any zero (`\0`) bytes. + - otherwise `sprintf`, `gets`, `scanf`, and similar routines stop copying + - It cannot make use of the loader. + - because we are injecting the bytes directly + - It cannot use the stack. + - because we are in the process of smashing it +- The lecture then gives the name: + - shellcode + +#### What kind of code would we want to run? + +- Goal: full-purpose shell + - code to launch a shell is called shellcode + - it is nontrivial to write shellcode that works as injected code + - no zeroes + - cannot use the stack + - no loader dependence + - there are many shellcodes already written + - there are even competitions for writing the smallest shellcode +- Goal: privilege escalation + - ideally, attacker goes from guest or non-user to root + +#### Shellcode + +High-level C version shown in the slides: + +```c +#include <unistd.h> +int main() { + char *name[2]; + name[0] = "/bin/sh"; + name[1] = NULL; + execve(name[0], name, NULL); +} +``` + +Assembly version shown in the slides: + +```asm +xorl %eax, %eax +pushl %eax +pushl $0x68732f2f +pushl $0x6e69622f +movl %esp, %ebx +pushl %eax +... +``` + +Machine-code bytes shown in the slides: + +```text +"\x31\xc0" +"\x50" +"\x68""//sh" +"\x68""/bin" +"\x89\xe3" +"\x50" +... +``` + +Important point from the slide: +- those machine-code bytes can become part of the attacker's input + +#### Challenge 2: Getting our injected code to run + +- We cannot insert a fresh "jump into my code" instruction. +- We must use whatever code is already running. 
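
Returning to the zero-byte constraint from Challenge 1: it explains why the slide's assembly begins with `xorl %eax, %eax` rather than the more obvious `movl $0, %eax`. The sketch below compares the two encodings; `has_zero_byte`, `mov_zero`, and `xor_zero` are hypothetical names introduced for illustration, and the byte values are the standard 32-bit x86 encodings of those two instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if any of the n bytes is 0x00, otherwise 0. */
int has_zero_byte(const unsigned char *payload, size_t n)
{
    assert(payload != NULL);
    for (size_t i = 0; i < n; i++) {
        if (payload[i] == 0x00)
            return 1;
    }
    return 0;
}

/* movl $0, %eax: the 32-bit immediate contributes four zero bytes. */
const unsigned char mov_zero[] = { 0xb8, 0x00, 0x00, 0x00, 0x00 };

/* xorl %eax, %eax: also leaves %eax at zero, with no zero bytes. */
const unsigned char xor_zero[] = { 0x31, 0xc0 };
```

A string-copy routine would stop at the second byte of `mov_zero`, truncating the payload, while `xor_zero` passes through intact. The doubled slash in `"//sh"` appears to serve the same purpose: it pads the string to exactly four bytes so that the pushed word contains no zero.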
+ +#### Hijacking the saved `%eip` + +- Strategy: + - overwrite the saved return address + - make it point into the injected bytes +- Core idea: + - when the function returns, the CPU loads the overwritten return address into `%eip` + +Question raised by the slides: +- But how do we know the address? + +Failure mode shown in the slide sequence: +- if the guessed address is wrong, the CPU tries to execute data bytes +- this is most likely not valid code +- result: + - invalid instruction + - CPU "panic" / crash + +#### Challenge 3: Finding the return address + +- If we do not have the code, we may not know how far the buffer is from the saved `%ebp`. +- One approach: + - try many different values +- Worst case: + - `2^32` possible addresses on `32-bit` + - `2^64` possible addresses on `64-bit` +- But without address randomization: + - the stack always starts from the same fixed address + - the stack grows, but usually not very deeply unless heavily recursive + +#### Improving our chances: nop sleds + +- `nop` is a single-byte instruction. 
+- Definition: + - it does nothing except move execution to the next instruction +- NOP sled idea: + - put a long sequence of `nop` bytes before the real malicious code + - now jumping anywhere in that region still works + - execution slides down into the payload + +Why this helps: +- it increases the chance that an approximate address guess still succeeds +- the slides explicitly state: + - now we improve our chances of guessing by a factor of `#nops` + +```text +[padding][saved return address guess][nop nop nop ...][malicious code] +``` + +#### Putting it all together + +- Payload components shown in the slides: + - padding + - guessed return address + - NOP sled + - malicious code +- Constraint noted by the lecture: + - input has to start wherever the vulnerable `gets` / similar function begins writing + +#### Buffer overflow defense #1: use secure bounds-checking functions + +- User-level protection +- Replace unbounded routines with bounded ones. +- Prefer secure languages where possible: + - Java + - Rust + - etc. + +#### Buffer overflow defense #2: Address Space Layout Randomization (ASLR) + +- Randomize starting address of program regions. +- Goal: + - prevent attacker from guessing / finding the correct address to put in the return-address slot +- OS-level protection + +#### Buffer overflow counter-technique: NOP sled + +- Counter-technique against uncertain addresses +- By jumping somewhere into a wide sled, exact address knowledge becomes less necessary + +#### Buffer overflow defense #3: Canary + +- Put a guard value between vulnerable local data and control-flow data. +- If overflow changes the canary, the program can detect corruption before returning. +- OS-level / compiler-assisted protection in the lecture framing + +#### Buffer overflow defense #4: No-execute bits (NX) + +- Mark the stack as not executable. +- Requires hardware support. 
+- OS / hardware-level protection + +#### Buffer overflow counter-technique: ret-to-libc and ROP + +- Code in the C library is already stored at consistent addresses. +- Attacker can find code in the C library that has the desired effect. + - possibly heavily fragmented +- Then return to the necessary address or addresses in the proper order. +- This is the motivation behind: + - `ret-to-libc` + - Return-Oriented Programming (ROP) + +We will continue from defenses / exploitation follow-ups in the next lecture. diff --git a/content/CSE4303/_meta.js b/content/CSE4303/_meta.js index ca3154f..d29475b 100644 --- a/content/CSE4303/_meta.js +++ b/content/CSE4303/_meta.js @@ -20,5 +20,7 @@ export default { CSE4303_L13: "Introduction to Computer Security (Lecture 13)", CSE4303_L14: "Introduction to Computer Security (Lecture 14)", CSE4303_L15: "Introduction to Computer Security (Lecture 15)", - CSE4303_L16: "Introduction to Computer Security (Lecture 16)" + CSE4303_L16: "Introduction to Computer Security (Lecture 16)", + CSE4303_L17: "Introduction to Computer Security (Lecture 17)", + CSE4303_L18: "Introduction to Computer Security (Lecture 18)" }