# OZZ: Identifying Kernel Out-of-Order Concurrency Bugs with In-Vivo Memory Access Reordering

Authors: Dae R. Jeong, Yewon Cho, Byoungyoung Lee, Insik Shin, Youngjin Kwon

Presented by Jiyang Wang

#### □ Why does out-of-order execution exist?

- ❖ It reduces pipeline stalls
- ❖ It improves cache utilization

#### □ How it differs from thread interleaving

Thread interleaving changes the order in which different threads run; out-of-order execution changes the order of memory accesses within a single thread.

[Figure: side-by-side comparison of "Thread interleaving" and "Out-of-order execution" for Thread 1 and Thread 2.]

#### □ How to prevent out-of-order execution?

Memory barriers!

| Type    | Memory barrier API       | Preceding accesses | Subsequent accesses |
|---------|--------------------------|--------------------|---------------------|
| Full    | smp_mb()                 | loads/stores       | loads/stores        |
| Load    | smp_rmb()                | loads              | loads               |
| Store   | smp_wmb()                | stores             | stores              |
| Release | smp_store_release(&a, v) | loads/stores       | store to &a         |
| Acquire | smp_load_acquire(&a)     | load from &a       | loads/stores        |
| Relaxed | READ_ONCE()/WRITE_ONCE() | none               | none                |

#### □ How to prevent out-of-order execution?

**Obey the Linux Kernel Memory Model!**

- ➤ It defines reordering cases that will not occur

- ❖ **Data dependency** (load-store)

```
int r1;
r1 = X;
Y = r1 + 5;
```

- ❖ **Control dependency** (load-store)
- ❖ **Address dependency** (load-load / load-store)
  - X: load the value `i` (should use `READ_ONCE()` or `atomic_read()`)
  - Y: load or store the value `arr[i]`

#### □ Harm of out-of-order execution

```
/****** Thread A ******/
/* kernel/watch_queue.c */
void post_one_notification()
{
    buf = &pipe->bufs[head];
    buf->len = len;
    buf->ops = &wq_pipe_ops;
+   smp_wmb();
    head += 1;
}
```

```
/****** Thread B ******/
/* fs/pipe.c */
void pipe_read() {
    if (head > tail) {
+       smp_rmb();
        buf = &pipe->bufs[tail];
        len = buf->len;
        buf->ops->confirm();
    }
}
```

Without the barriers (the lines marked `+`), `pipe_read()` can call an uninitialized function pointer! This happens in two ways: Thread A's stores are reordered, so `head` is incremented before `buf->ops` is written, or Thread B's loads are reordered, so `buf->ops` is read before `head`.

#### □ Out-of-order bugs are hard to identify

- ❖ Manually investigating kernel code is difficult
- ❖ Different processors reorder differently (ARM reorders more aggressively than x86\_64)
- ❖ Existing testing tools (e.g. concurrency fuzzers) are impractical
  - ➤ They assume memory accesses happen in order
  - ➤ Controlling thread interleaving may impose an ordered execution
- ❖ In-vitro testing is insufficient
  - ➤ Runtime context is lost when analyzing behavior
- ❖ Most data race detectors fall short in comprehending out-of-order execution
  - ➤ Which memory accesses should not be reordered?
  - ➤ What will be the result of reordering?

#### □ How processors execute out of order

- ❖ store-store, store-load, load-load reordering
- ❖ load-store reordering
  - ➤ Possible in theory, but brings little improvement in practice

Processors reorder in two ways:

- ❖ Delay committing a store operation
- ❖ Run a load operation too early

Can a testing tool reproduce this?

- ❖ Exactly emulate the processor's behavior?
  - ➤ Requires simulating the full architecture, which is too expensive!
- ❖ Control out-of-order execution explicitly and deterministically?
  - ➤ **The execution order of instructions is decided inside the processor!**
- ❖ We can change the order of memory accesses in the instructions instead

[Figure: a delayed store operation and a versioned load operation, applied to a variable on the stack.]

| System call interface | Description |
|-----------------------|-------------|
| delay_store_at(I)     | When an instruction I is executed, its store operation will be delayed. |
| read_old_value_at(I)  | When an instruction I is executed, its load operation will read an old value. |

System calls that instruct OEMU to control out-of-order execution.

#### Key idea of in-vivo OEMU

#### □ In-vivo emulation

- ➤ Functions are executed sequentially
- ➤ Functions communicate with OEMU to reorder memory accesses

Reordering happens while the kernel is running. Thus, all bug-detecting oracles can be used.

### **OEMU: Delayed store operation**

- The virtual store buffer is a **per-thread**, **temporary** storage
- It is flushed when it is full, when a memory barrier is encountered, on an interrupt in the kernel, ...
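The delayed store mechanism can be illustrated with a small user-space sketch. This is not OEMU's implementation: the names `vsb`, `vsb_delay_store`, `vsb_load`, and `vsb_store_barrier`, as well as the capacity of 4, are hypothetical, and forwarding a buffered value to the owning thread's own loads is an assumption made so that the delaying thread still observes its own writes.

```c
#include <assert.h>
#include <stddef.h>

#define VSB_CAPACITY 4 /* hypothetical capacity, chosen for the demo */

/* One delayed store: the target address and the value to commit later. */
struct vsb_entry {
    int *addr;
    int val;
};

/* Per-thread virtual store buffer, kept in program (FIFO) order. */
struct vsb {
    struct vsb_entry entries[VSB_CAPACITY];
    size_t len;
};

/* Commit all buffered stores to memory in program order, then empty the buffer. */
void vsb_flush(struct vsb *b)
{
    for (size_t i = 0; i < b->len; i++)
        *b->entries[i].addr = b->entries[i].val;
    b->len = 0;
}

/* Delay a store: keep it in the buffer; flush first if the buffer is full. */
void vsb_delay_store(struct vsb *b, int *addr, int val)
{
    if (b->len == VSB_CAPACITY)
        vsb_flush(b);
    assert(b->len < VSB_CAPACITY);
    b->entries[b->len].addr = addr;
    b->entries[b->len].val = val;
    b->len++;
}

/* Load as seen by the owning thread: the newest buffered store wins,
 * otherwise the value comes from memory (assumed forwarding behavior). */
int vsb_load(const struct vsb *b, const int *addr)
{
    for (size_t i = b->len; i > 0; i--)
        if (b->entries[i - 1].addr == addr)
            return b->entries[i - 1].val;
    return *addr;
}

/* A store barrier drains the buffer. */
void vsb_store_barrier(struct vsb *b)
{
    vsb_flush(b);
}

/* Thread A delays "data = 42" but commits "ready = 1" directly.
 * Returns the value of data that an observer sees once ready == 1. */
int demo_store_store_reorder(void)
{
    int data = 0, ready = 0;
    struct vsb a;
    a.len = 0;

    vsb_delay_store(&a, &data, 42); /* delayed: stays in A's buffer */
    ready = 1;                      /* not delayed: visible immediately */

    int seen = ready ? data : -1;   /* observer: ready is set, data is stale */

    vsb_store_barrier(&a);          /* the barrier finally commits data */
    assert(data == 42);
    return seen;
}
```

Calling `demo_store_store_reorder()` returns 0: the observer sees `ready == 1` while `data` still holds its old value, which is the store-store reordering that a missing `smp_wmb()` would permit.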
### **OEMU: Versioned load operation**

[Figure: versioned load example. User space calls `read_old_value_at(I1)` and `read_old_value_at(I3)`, then issues Syscall A. Syscall A executes `I_mb1: smp_rmb()`, `I1: r1 = X`, `I_mb2: smp_rmb()`, `I2: r2 = Y`, `I3: r3 = Z`. Syscall B (thread B) executes `I1: Y = 1`, `I2: Z = 2`. OEMU keeps a global store history with the entries `<&Y, 3 ⇒ 1, t7>` and `<&Z, 4 ⇒ 2, t8>`, a versioning window `(t6, t_cur]`, and a virtual store buffer entry `<I_p, &X, 1, t_p>`.]

### Ozz: Key idea

- ❖ Assume a hypothetical memory barrier is missing

[Figure: accesses R(a), R(b), R(c), R(d) on CPU 1 and CPU 2, shown for the store-store and store-load cases. Solid line: real memory barrier. Dashed line: hypothetical memory barrier. Steps: ① reordering, ② interleaving.]

- ① Profile memory accesses and memory barriers
- ② Calculate where the hypothetical memory barrier is, where to switch threads, and which memory accesses to reorder
- ③ Use the result of ② to test for and observe out-of-order bugs

#### Single Thread Input

- ❖ **Ozz** makes STIs (single-thread inputs) preserve the necessary resource dependencies
- ❖ **Ozz** inserts callback functions during an LLVM compiler pass
- ❖ **Ozz** executes and profiles each STI to collect the information

### **Ozz: Scheduling**

#### Algorithm 1: Calculating scheduling hints

```
Input : S_i, S_j: Sequences of memory accesses and memory barriers
        executed by two system calls
Output: H_ij = {h_1, h_2, ..., h_n}: A set of scheduling hints

▷ Step 1: Filter out memory accesses
S_i, S_j = filter_out(S_i, S_j)
for k ∈ {i, j} do
    for barrier_type ∈ {st, ld} do
        ▷ Step 2: Group memory accesses between memory barriers of the same type
        G_t, g = ∅, ∅
        for s ∈ S_k do
            if s is a memory access then
                g = g ∪ {s}
            else if s is a barrier & type of s = barrier_type then
                G_t = G_t ∪ {g}
                g = ∅
        ▷ Step 3: Construct scheduling hints
        H_ij = ∅
        for g ∈ G_t do
            if barrier_type = st then sched = g.last
            else sched = g.first
            while g ≠ ∅ do
                h.sched = sched
                h.reorder = g \ sched
                H_ij = H_ij ∪ {h}
                if barrier_type = st then g = g \ {g.last}
                else g = g \ {g.first}
H_ij.sort(key: len(h.reorder))
return H_ij
```

Greedy heuristic: the more an execution deviates from program order, the more likely a bug is to manifest.

```
Algorithm 2: Algorithm of the filter_out() function
Input : S_i, S_j: Sequences of memory accesses and memory barriers
        executed by two system calls.
Output: S'_i, S'_j: Sequences of memory accesses and memory barriers
        in which irrelevant memory accesses are filtered.

shared_mem = ∅
for (a_i, a_j) ∈ S_i × S_j do
    if either a_i or a_j is not a memory access then
        continue
    o = shared_memory_location(a_i, a_j)
    if o ≠ ∅ then
        shared_mem = shared_mem ∪ {o}
for k ∈ {i, j} do
    for a ∈ S_k do
        if a is not a memory access then
            continue
        if a.addr ∉ shared_mem then
            S_k = S_k \ {a}
S'_i, S'_j = S_i, S_j
return S'_i, S'_j
```

- ① Ozz finds the memory locations shared between the memory accesses of the two system calls
- ② Ozz excludes the memory accesses that do not touch `shared_mem`

Step 2 walkthrough (store barriers):

```
k = i, barrier_type = st

Syscall S_i: st_barrier1, m1, m2, m3, ld_barrier1, m4, m5,
             st_barrier2, m6, m7, st_barrier3

Initial:            G_t = ∅,         g = ∅
After m1..m3:       G_t = ∅,         g = {m1, m2, m3}
After st_barrier2:  G_t = {g1},      g = ∅,  g1 = {m1, m2, m3, m4, m5}
After st_barrier3:  G_t = {g1, g2},  g = ∅,  g2 = {m6, m7}
```

Step 3 walkthrough (store barriers):

```
barrier_type = st
g = {m1, m2, m3, m4, m5}
sched = m5

First test:  H_ij = H_ij ∪ {h1}
             h1.sched = m5
             h1.reorder = {m1, m2, m3, m4}
Second test: H_ij = H_ij ∪ {h2}
             h2.sched = m5
             h2.reorder = {m1, m2, m3}
...
```

Step 3 walkthrough (load barriers):

```
barrier_type = ld

Syscall S_i: ld_barrier1, m1, m2, m3, st_barrier1, m4, m5,
             ld_barrier2, m6, m7, st_barrier3

g = {m1, m2, m3, m4, m5}
sched = m1

First test:  H_ij = H_ij ∪ {h1}
             h1.sched = m1
             h1.reorder = {m2, m3, m4, m5}
Second test: H_ij = H_ij ∪ {h2}
             h2.sched = m1
             h2.reorder = {m3, m4, m5}
...
```

- ❖ **Each STI is translated into multiple MTIs (multi-thread inputs)**
  - ➤ An MTI has the same set of syscalls as its STI
  - ➤ Each MTI is annotated with a pair of syscalls to run concurrently and with scheduling hints
- ❖ **Ozz runs the MTIs and monitors for bugs**
  - ➤ Ozz leverages bug-detecting oracles at runtime
  - ➤ The report gives the reordered accesses and the hypothetical memory barrier

- □ Hardware
  - ❖ Two-socket, 32-physical-core Intel Xeon CPU E5-2683 v4 operating at 2.1 GHz
  - ❖ 512 GB of RAM
- □ Host operating system
  - ❖ Ubuntu 20.04.4, kernel 5.4.143
- □ OZZ
  - ❖ Based on SYZKALLER (a state-of-the-art fuzzer developed by Google)
  - ❖ 32 VMs, each equipped with 4 vCPUs and 8 GB of memory
  - ❖ Kernel: 6.5-rc6 to 6.8 (SYZKALLER uses the same kernels)

| ID      | Kernel version | Subsystem  | Summary | Status |
|---------|----------------|------------|---------|--------|
| Bug #1  | v6.7-rc8 | RDS        | KASAN: slab-out-of-bounds Read in rds_loop_xmit | |
| Bug #2  | v6.5-rc6 | watchqueue | BUG: unable to handle kernel NULL pointer dereference in _find_first_bit | |
| Bug #3  | v6.5-rc6 | VMCI       | general protection fault in add_wait_queue | Reported |
| Bug #4  | v6.6-rc2 | XDP        | BUG: unable to handle kernel NULL pointer dereference in xsk_poll | |
| Bug #5  |          |            | BUG: unable to handle kernel NULL pointer dereference in tls_getsockopt | |
| Bug #6  |          |            | BUG: unable to handle kernel NULL pointer dereference in sk_psock_verdict_data_ready | |
| Bug #7  | v6.5-rc7 | XDP        | BUG: unable to handle kernel NULL pointer dereference in xsk_generic_xmit | Fixed |
| Bug #8  | v6.7-rc8 | SMC        | BUG: unable to handle kernel NULL pointer dereference in connect | Confirmed |
| Bug #9  | v6.7-rc2 | TLS        | BUG: unable to handle kernel NULL pointer dereference in tls_setsockopt | Fixed |
| Bug #10 | v6.8-rc1 | SMC        | KASAN: null-ptr-deref Write in fput | Confirmed |
| Bug #11 | v6.8     | GSM        | BUG: unable to handle kernel NULL pointer dereference in gsm_dlci_config | Confirmed |

- □ Ozz discovers 61 unique crashes and 11 new OoO bugs
- □ SYZKALLER is impractical for identifying them
  - ➤ x86-64 does not reorder store-store or load-load
  - ➤ TCG does not reorder memory accesses

#### □ Improper adoption of memory barrier

```
/****** Thread A ******/
/* net/tls/tls_main.c */
int tls_init() {
    ctx = kzalloc();
    sk->data = ctx;
    ctx->sk_proto =
        READ_ONCE(sk->sk_prot);
+   smp_wmb();
    WRITE_ONCE(sk->sk_prot,
               &tls_prots);
}

struct proto_ops tls_prots = {
    .setsockopt = tls_setsockopt,
};
```

```
/****** Thread B ******/
/* net/core/socket.c */
int sock_common_setsockopt() {
    struct sock *sk = sock->sk;
    return READ_ONCE(sk->sk_prot)
        ->setsockopt(sk);
}

/* net/tls/tls_main.c */
int tls_setsockopt() {
    struct tls_context *ctx =
        sk->data;
    return ctx->sk_proto
        ->setsockopt(sk);
}
```

Without the `smp_wmb()`, `ctx->sk_proto` can still be uninitialized when `tls_setsockopt()` dereferences it!
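The ordering requirement behind this bug (initialize first, publish second) can be sketched in user space with C11 atomics. This is an analogy under stated assumptions, not the kernel fix: `struct ctx`, `publish_ctx`, `call_setsockopt`, and `base_setsockopt` are hypothetical stand-ins, and `memory_order_release`/`memory_order_acquire` play the roles that `smp_store_release()`/`smp_load_acquire()` play in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tls_context and its saved callback. */
struct ctx {
    int (*setsockopt)(int); /* plays the role of ctx->sk_proto->setsockopt */
};

static int base_setsockopt(int v)
{
    return v + 1;
}

static struct ctx the_ctx;              /* zero-initialized: callback is NULL */
static _Atomic(struct ctx *) published; /* NULL until the context is ready */

/* Writer (compare tls_init): initialize the object, then publish it.
 * The release store keeps the initialization ordered before the publication. */
void publish_ctx(void)
{
    the_ctx.setsockopt = base_setsockopt;
    atomic_store_explicit(&published, &the_ctx, memory_order_release);
}

/* Reader (compare tls_setsockopt): the acquire load pairs with the release
 * store, so a non-NULL pointer implies an initialized callback. */
int call_setsockopt(int v)
{
    struct ctx *c = atomic_load_explicit(&published, memory_order_acquire);

    if (c == NULL)
        return -1; /* not published yet */
    return c->setsockopt(v);
}
```

If the release store were downgraded to `memory_order_relaxed`, the publication could become visible before the callback assignment, which mirrors the missing `smp_wmb()` above.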
In `tls_init()`, developers had caught the data race (load/store tearing) and annotated the accesses to `sk->sk_prot` with `READ_ONCE()`/`WRITE_ONCE()`. However, these functions impose no ordering, and they also suppress reports from a data race detector.

#### □ Incorrect customized lock

```
/* net/rds/send.c */
int acquire_in_xmit()
{
    int acquired = !test_and_set_bit
        (IN_XMIT, &cp_flags);
    return acquired;
}

/* net/rds/send.c */
void release_in_xmit()
{
-   clear_bit(IN_XMIT, &cp_flags);
+   clear_bit_unlock(IN_XMIT, &cp_flags);
}
```

```
if (acquire_in_xmit()) {
    /* critical section */
    clear_bit();
}
```

`clear_bit()` provides no ordering, so accesses inside the critical section can be reordered past the unlock; `clear_bit_unlock()` has release semantics.

### **Evaluation: Known OoO bugs**

| # | Subsystem  | Version  | Reproduced? | # of tests | Type |
|---|------------|----------|-------------|------------|------|
| 1 | vlan       | 5.12-rc7 | ✓           | 342        | S-S  |
| 2 | watchqueue | 5.17-rc7 | ✓           | 23         | S-S  |
| 3 | xsk        | 4.17-rc4 | ✓           | 47         | S-S  |
| 4 | xsk        | 5.3-rc3  | ✓           | 12         | S-S  |
| 5 | fs         | 6.1-rc1  | ✓           | 17         | L-L  |
| 6 | sbitmap    | 5.1-rc1  | ×           | -          | S-S  |
| 7 | nbd        | 6.7-rc1  | ✓           | 17         | L-L  |
| 8 | tls        | 6.7-rc1  | ✓*          | 42         | S-S  |
| 9 | unix       | 5.0-rc7  | ✓           | 23         | L-L  |

- □ 8 of the 9 known OoO bugs can be reproduced, within tens of test runs on average
- □ #6 is not reproduced because it is caused by thread migration
  - ➤ Ozz pins concurrent threads to specific CPUs

### **Evaluation: Performance overhead**

**Benchmark suite:** LMBench (evaluates various OS operations)

| Tests       | Linux (µs) | Linux w/ OEMU (µs) | Overhead |
|-------------|------------|--------------------|----------|
| null        | 1.74       | 43.3               | 24.9×    |
| stat        | 75.64      | 859.6              | 11.4×    |
| open/close  | 128        | 1369.2             | 10.7×    |
| File create | 403.3      | 5623.5             | 13.9×    |
| File delete | 207.8      | 3363               | 16.2×    |
| ctxsw 2p/0k | 23.8       | 71.5               | 3.0×     |
| pipe        | 59.3       | 610.1              | 10.3×    |
| unix        | 173.8      | 2567.6             | 14.8×    |
| fork        | 7590       | 145.6k             | 19.2×    |
| mmap        | 133.8k     | 7896.1k            | 59.0×    |

- □ Developers can selectively enable OEMU (e.g. only for lockless implementations)
- □ Ozz with OEMU has 7.9× lower throughput than SYZKALLER (0.92 tests/s vs. 7.33 tests/s)
  - ➤ In return, OEMU can control out-of-order execution
  - ➤ It saves the cost of buying new machines (ARM)

## **Evaluation: Compared with OFence**

#### □ OFence

- ❖ Predefines likely-buggy patterns (memory barriers that are not paired)
- ❖ Uses static pattern-matching analysis

#### □ Result

- ❖ Ozz (and SYZKALLER) is limited in generating inputs for the bugs found by OFence
  - ➤ Some submodules require specific hardware to run, which inhibits dynamic testing of those submodules
- ❖ **Only 3** out of **11** OoO bugs found by Ozz fit the patterns of OFence

#### □ Pros:

- ❖ The problem of out-of-order execution is interesting
- ❖ Changing the order of memory accesses is a practical way to emulate out-of-order execution

#### □ Cons:

- ❖ Assumes two threads, with only one thread executing out of order
- ❖ Cannot deal with load-store reordering
- ❖ Cannot tell which type of memory barrier is best to insert

# Q&A
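As a reference for Step 3 of Algorithm 1, the hint construction for a single group can be sketched in C. This follows the slide walkthrough (the reorder set shrinks by one access per test) rather than a literal transcription of the pseudocode; `struct hint`, `make_hints`, and the index-range encoding of `h.reorder` are hypothetical, and emitting the largest reorder set first is an assumption based on the greedy remark.

```c
#include <assert.h>
#include <stddef.h>

enum barrier_type { BARRIER_ST, BARRIER_LD };

/* One scheduling hint: the access at which to switch threads (sched) and the
 * accesses to reorder, given as the half-open index range [first, last)
 * into the group. */
struct hint {
    int sched;
    size_t first;
    size_t last;
};

/* Build hints for one group of n access IDs. The caller must provide room
 * for at least n - 1 hints in out. Returns the number of hints written,
 * with the largest reorder set first. */
size_t make_hints(const int *group, size_t n, enum barrier_type type,
                  struct hint *out)
{
    size_t count = 0;

    if (n < 2)
        return 0; /* a single access has nothing to be reordered against */

    for (size_t k = n - 1; k >= 1; k--) {
        struct hint h;

        if (type == BARRIER_ST) {
            /* Stores: schedule at the last access, delay a prefix of k. */
            h.sched = group[n - 1];
            h.first = 0;
            h.last = k;
        } else {
            /* Loads: schedule at the first access, version a suffix of k. */
            h.sched = group[0];
            h.first = n - k;
            h.last = n;
        }
        out[count++] = h;
    }
    return count;
}
```

For the group {m1, ..., m5} from the walkthrough, `make_hints(g, 5, BARRIER_ST, out)` yields four hints with `sched = m5` and reorder ranges m1..m4, m1..m3, m1..m2, and m1; with `BARRIER_LD` it yields `sched = m1` and the ranges m2..m5, m3..m5, m4..m5, and m5.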