This topic is way too vast to cover in one lecture
So today we will cover:
Donald Knuth 1974 - "Structured Programming With Go To Statements"
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
Takeaway?
CPU data access
Typical Intel Xeon
Example:
This is what is called a "cache miss"!
So how do you choose what goes to which cache?
Well, you don't - the CPU/motherboard does
But you can help it along - contiguous memory and frequent use!
For more info - see videos in handout!
Also - the same way memory usage is "predicted"
so are code "branches"!
(for more, see video in handout!)
For example
#include <stdio.h>
#include <stdlib.h>
#define SIZE 1000000
int main() {
int *arr = malloc(SIZE * sizeof(int));
if (!arr) return 1;
srand(1234);
for (int i = 0; i < SIZE; ++i) {
arr[i] = rand() % 100;
}
int count = 0;
for (int i = 0; i < SIZE; ++i) {
if (arr[i] > 50) { // unpredictable branch
count++;
}
}
printf("Count: %d\n", count);
free(arr);
return 0;
}
And the branch-free version:
#include <stdio.h>
#include <stdlib.h>
#define SIZE 1000000
int main() {
int *arr = malloc(SIZE * sizeof(int));
if (!arr) return 1;
srand(1234);
for (int i = 0; i < SIZE; ++i) {
arr[i] = rand() % 100;
}
int count = 0;
for (int i = 0; i < SIZE; ++i) {
count += (arr[i] > 50); // no actual branch
}
printf("Count: %d\n", count);
free(arr);
return 0;
}
Benchmarking both:
$ hyperfine -N ./branch_miss.o
Benchmark 1: ./branch_miss.o
Time (mean ± σ): 20.1 ms ± 1.1 ms [User: 19.0 ms, System: 0.8 ms]
Range (min … max): 19.0 ms … 28.4 ms 143 runs
$ hyperfine -N ./no_branch_miss.o
Benchmark 1: ./no_branch_miss.o
Time (mean ± σ): 15.5 ms ± 0.8 ms [User: 14.5 ms, System: 0.8 ms]
Range (min … max): 14.7 ms … 23.3 ms 190 runs
A 1.3× speedup just from that!
Why? Let's try perf stat
$ perf stat ./branch_miss.o
Count: 489858
Performance counter stats for './branch_miss.o':
19,08 msec task-clock:u # 0,982 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
527 page-faults:u # 27,627 K/sec
89 598 854 instructions:u # 1,08 insn per cycle
83 305 829 cycles:u # 4,367 GHz
18 029 879 branches:u # 945,179 M/sec
537 691 branch-misses:u # 2,98% of all branches
0,019415949 seconds time elapsed
0,019496000 seconds user
0,000000000 seconds sys
Why? Let's try perf stat
$ perf stat ./no_branch_miss.o
Count: 489858
Performance counter stats for './no_branch_miss.o':
15,18 msec task-clock:u # 0,979 CPUs utilized
0 context-switches:u # 0,000 /sec
0 cpu-migrations:u # 0,000 /sec
527 page-faults:u # 34,717 K/sec
91 108 993 instructions:u # 1,40 insn per cycle
64 947 318 cycles:u # 4,279 GHz
17 029 876 branches:u # 1,122 G/sec
34 884 branch-misses:u # 0,20% of all branches
0,015498122 seconds time elapsed
0,015587000 seconds user
0,000000000 seconds sys
Takeaways
Data vectorization
An example: Mandelbrot set calculation
import time
import numpy as np
def iter_fun(zn, C):
return zn**2 + C
def iter_n_loop(z0, C, max_num, limit):
z = z0 * np.ones(C.shape, dtype=C.dtype)
mat_shape = z.shape
mat_size = z.size
    # we only iterate until the limit is passed, then it is considered divergent
    # use "iters" as an indication of this
iters = np.zeros((mat_size,), dtype=np.int32)
# flatten for only one loop
z.shape = (mat_size,)
C.shape = (mat_size,)
for it in range(mat_size):
for ind in range(max_num):
z[it] = iter_fun(z[it], C[it])
            if (z[it] * np.conj(z[it])).real > limit**2:
iters[it] = ind
break
    # all iterations that survived without passing the limit went through the entire loop
iters[iters == 0] = max_num
# change shape back to original
z.shape = mat_shape
C.shape = mat_shape
iters.shape = mat_shape
return z, iters
def iter_n_vec(z0, C, max_num, limit):
z = z0 * np.ones(C.shape, dtype=C.dtype)
    # we only iterate until the limit is passed, then it is considered divergent
    # use "iters" as an indication of this
iters = np.zeros(C.shape, dtype=np.int32)
for ind in range(max_num):
z[iters == 0] = iter_fun(z[iters == 0], C[iters == 0])
# if the norm of z passes the limit, stop iteration by setting the value to the iteration integer
        z_norm = (z * np.conj(z)).real
z_norm_check = np.logical_and(z_norm > limit**2, iters == 0)
iters[z_norm_check] = ind
    # all iterations that survived so far went through the entire loop
iters[iters == 0] = max_num
return z, iters
def mandelbrot_set(C_mat, max_num, limit, iter_fun):
z0 = 0.0 + 0.0j
Z, iters = iter_fun(z0, C_mat, max_num=max_num, limit=limit)
Z_norm = np.real(np.sqrt(Z * np.conj(Z)))
return Z_norm, iters
if __name__ == "__main__":
n_base = 300
xmin, xmax, xn = -0.45, -0.42, n_base
ymin, ymax, yn = -0.59, -0.55, n_base
xv = np.linspace(xmin, xmax, xn, dtype=np.float32)
yv = np.linspace(ymin, ymax, yn, dtype=np.float32)
X, Y = np.meshgrid(xv, yv)
max_num = 120
limit = 4.0
t0 = time.time()
Z_norm, iters = mandelbrot_set(
C_mat=X + Y * 1.0j, max_num=max_num, limit=limit, iter_fun=iter_n_loop
)
dtl = time.time() - t0
print(f"loop execution time {dtl} sec")
t0 = time.time()
Z_norm, iters = mandelbrot_set(
C_mat=X + Y * 1.0j, max_num=max_num, limit=limit, iter_fun=iter_n_vec
)
dtv = time.time() - t0
print(f"vectorized execution time {dtv} sec")
print(f"speedup {dtl/dtv}")
$ python vectorization.py
loop execution time 6.237718343734741 sec
vectorized execution time 0.07863926887512207 sec
speedup 79.3206553540082
What about the second vectorization?
Single instruction, multiple data (SIMD)!
Usually these are done for you!
numpy automatically uses SIMD, and -O3 with -march in gcc does it too
Anyway, here is an example!
void add_float_arrays(float* a, float* b, float* result, int size) {
for (int i = 0; i < size; ++i) {
result[i] = a[i] + b[i];
}
}
int main() {
int size = 100000;
int steps = 1000;
float x[size], y[size], res[size];
for (int i = 0; i < size; i++) {
x[i] = (float)i;
y[i] = x[i] + 1;
}
add_float_arrays(x, y, res, size);
for (int i = 0; i < steps; i++) {
add_float_arrays(res, y, res, size);
}
return 0;
}
#include <stdio.h>
#include <immintrin.h>
void add_float_arrays_avx512(const float* a, const float* b, float* result, int size) {
int i;
for (i = 0; i <= size - 16; i += 16) {
__m512 va = _mm512_loadu_ps(&a[i]);
__m512 vb = _mm512_loadu_ps(&b[i]);
__m512 vsum = _mm512_add_ps(va, vb);
_mm512_storeu_ps(&result[i], vsum);
}
for (; i < size; ++i) {
result[i] = a[i] + b[i];
}
}
int main() {
int size = 100000;
int steps = 1000;
float x[size], y[size], res[size];
for (int i = 0; i < size; i++) {
x[i] = (float)i;
y[i] = x[i] + 1;
}
add_float_arrays_avx512(x, y, res, size);
for (int i = 0; i < steps; i++) {
add_float_arrays_avx512(res, y, res, size);
}
return 0;
}
$ hyperfine -N ./without_simd.o
Benchmark 1: ./without_simd.o
Time (mean ± σ): 120.5 ms ± 1.7 ms [User: 118.8 ms, System: 1.0 ms]
Range (min … max): 118.3 ms … 124.2 ms 24 runs
$ hyperfine -N ./with_simd.o
Benchmark 1: ./with_simd.o
Time (mean ± σ): 36.6 ms ± 2.4 ms [User: 35.2 ms, System: 1.1 ms]
Range (min … max): 33.1 ms … 43.9 ms 78 runs
But this is not fair - let's have a look at compiler optimization
So there is a reason most examples today are in C
Last example: let's try -O3
$ hyperfine -N ./without_simd_o3.o
Benchmark 1: ./without_simd_o3.o
Time (mean ± σ): 498.4 µs ± 107.0 µs [User: 266.8 µs, System: 183.2 µs]
Range (min … max): 236.5 µs … 2306.9 µs 2866 runs
$ hyperfine -N ./with_simd_o3.o
Benchmark 1: ./with_simd_o3.o
Time (mean ± σ): 499.1 µs ± 91.7 µs [User: 267.6 µs, System: 181.9 µs]
Range (min … max): 239.2 µs … 1698.2 µs 3592 runs
So usually you do not need to hand-roll these, only in extreme scenarios
But now you know it exists and is a thing!
Throughout the lecture today - compiled with and without -O
and more detail at the end
This is a quick one:
An example!
#include <stdio.h>
#include <math.h>
int main() {
int size = 10000;
double x[size], y[size];
FILE *file = fopen("data.csv", "r");
if (!file) return 1;
double value;
int count = 0;
while (fscanf(file, "%lf,", &value) == 1) {
    if (count < size) {
        x[count] = value;
    }
    else if (count < 2 * size) {
        y[count - size] = value;
    }
    count++;
}
fclose(file);
double r;
double dx, dy;
int total = 0;
for(int i = 0; i < size; i++) {
for(int j = 0; j < size; j++) {
dx = x[i] - x[j];
dy = y[i] - y[j];
r = sqrt(dx*dx + dy*dy);
if (r < 1.0) {
total++;
}
}
}
printf("Prob %f\n", ((float)total)/(size*size));
return 0;
}
#include <stdio.h>
#include <math.h>
int main() {
// load data...
double r;
double dx, dy;
int total = 0;
for(int i = 0; i < size; i++) {
for(int j = 0; j < size; j++) {
dx = x[i] - x[j];
dy = y[i] - y[j];
r = sqrt(dx*dx + dy*dy);
if (r < 1.0) {
total++;
}
}
}
printf("Prob %f\n", ((float)total)/(size*size));
return 0;
}
#include <stdio.h>
#include <math.h>
#include "data.h"
int main() {
// load data...
double r;
int total = 0;
for(int i = 0; i < size; i++) {
for(int j = 0; j < size; j++) {
r = (x[i] - x[j])*(x[i] - x[j]) + (y[i] - y[j])*(y[i] - y[j]); // squared distance
if (r < 1.0) { // r < 1.0 iff sqrt(r) < 1.0, so the sqrt can be skipped
total++;
}
}
}
printf("Prob %f\n", ((float)total)/(size*size));
return 0;
}
$ hyperfine ./shaving_op_with_o0.o -N -r 5
Benchmark 1: ./shaving_op_with_o0.o
Time (mean ± σ): 889.1 ms ± 3.4 ms [User: 883.9 ms, System: 0.6 ms]
Range (min … max): 886.3 ms … 894.0 ms 5 runs
$ hyperfine ./shaving_op_without_o0.o -N -r 5
Benchmark 1: ./shaving_op_without_o0.o
Time (mean ± σ): 591.2 ms ± 3.3 ms [User: 587.3 ms, System: 1.0 ms]
Range (min … max): 587.8 ms … 596.2 ms 5 runs
A 1.5× speedup!
$ hyperfine -N ./shaving_op_with_o3.o
Benchmark 1: ./shaving_op_with_o3.o
Time (mean ± σ): 96.4 ms ± 1.4 ms [User: 94.6 ms, System: 0.9 ms]
Range (min … max): 93.7 ms … 99.9 ms 30 runs
$ hyperfine -N ./shaving_op_without_o3.o
Benchmark 1: ./shaving_op_without_o3.o
Time (mean ± σ): 27.7 ms ± 1.1 ms [User: 26.7 ms, System: 0.7 ms]
Range (min … max): 26.2 ms … 30.9 ms 98 runs
A 21.3× speedup from -O3, and 3.48× vs. the version with sqrt!
Example: 15 Sorting Algorithms in 6 Minutes
Data representation we already covered a bit
As a recap example:
Algorithms on a chess board - board representation?
class Board:
pieces: list[PlacedPiece]
class PlacedPiece:
kind: Piece
position: tuple[int, int]
class Piece(Enum):
rook = 1
...
VS.
uint64_t board[N_PIECES];
How?
0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
= 1152921504606846976
This you already know:
For example:
$$ Ei(x) = -\int_{-x}^{\infty} \frac{e^{-t}}{t} \mathrm{d}t $$

VS.

$$ Ei(x) \approx -(A^{-7.7} + B)^{-0.13} $$
$$ A = \ln \left ( \left ( \frac{0.56146}{-x} + 0.65 \right )( 1 - x) \right ) $$
$$ B = (-x)^4 e^{-7.7x} (2 - x)^{3.7} $$

Again - do not optimize prematurely - but write performant code
To optimize you usually need to try A LOT of approaches
For digging deep, there is really only one recommendation I have:
https://godbolt.org/ - let's have a look