Linked handout: 13. Optimization

Optimization

This topic is way too vast to cover in one lecture

So today we will cover:

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • When to optimize

Donald Knuth 1974 - "Structured Programming With Go To Statements"

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

Takeaway?

Optimization

  • When to optimize

Takeaway?

  • Optimize when you ACTUALLY need to
  • Then optimize the important parts
    - use profiling, don't guess!
  • Highly optimized code: usually harder to read and slower to write
  • Instead learn to write performant enough code by default
    - not fully optimized code
  • But write your code to allow for future optimization
    - don't paint yourself into a corner
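
The "use profiling, don't guess" point can be sketched with Python's standard-library cProfile (the pipeline functions here are invented for illustration):

```python
import cProfile
import io
import pstats

# A hypothetical two-stage pipeline; we suspect one stage dominates.
def parse(data):
    return [int(x) for x in data]

def crunch(values):
    # Deliberately quadratic: the likely hot spot.
    return sum(a * b for a in values for b in values)

def pipeline(data):
    return crunch(parse(data))

profiler = cProfile.Profile()
profiler.enable()
pipeline(["3", "1", "2"] * 100)
profiler.disable()

# Rank the five most expensive calls instead of guessing.
stats = pstats.Stats(profiler, stream=io.StringIO())
stats.sort_stats("cumulative").print_stats(5)
```

The report should point at crunch, not parse; that is where optimization effort pays off.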

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Computer architecture

CPU data access

Optimization

Optimization

  • Computer architecture

    Typical Intel Xeon

  • L1 cache - Up to 80 KB per core
  • L2 cache - Up to 2 MB per core
  • L3 cache - Up to 320 MB per socket

Example:

Optimization

  • Computer architecture

Example:

  1. CPU needs data
  2. looks in L1
  3. does not find
  4. looks in L2
  5. finds data
  6. loads data from L2 to L1
  7. CPU uses data

This is what is called a "cache miss"!

So how do you choose what goes to which cache?

Optimization

  • Computer architecture

So how do you choose what goes to which cache?

Well, you don't; the CPU's cache hardware decides for you

But you can help it along - contiguous memory and frequent use!

For more info - see videos in handout!
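
What "contiguous" buys you can be seen in NumPy's strides (a small sketch; the exact byte counts assume float64 and C order):

```python
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float64)

# strides = bytes moved per step along each axis: 8000 bytes to the
# next row, 8 bytes to the next element within a row (C order).
print(a.strides)

# a.T shares the same buffer with swapped strides, so walking a "row"
# of a.T hops 8000 bytes per element: far less cache-friendly.
print(a.T.strides)
```

Summing a row by row versus a.T row by row over large arrays typically shows the difference in wall time.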

Optimization

  • Computer architecture

Also - the same way memory accesses are "predicted",

so are code "branches"!

(for more, see video in handout!)

For example

Optimization

  • Computer architecture

For example

                    
                        #include <stdio.h>
                        #include <stdlib.h>

                        #define SIZE 1000000

                        int main() {
                            int *arr = malloc(SIZE * sizeof(int));
                            if (!arr) return 1;
                            srand(1234);
                            for (int i = 0; i < SIZE; ++i) {
                                arr[i] = rand() % 100;
                            }

                            int count = 0;
                            for (int i = 0; i < SIZE; ++i) {
                                if (arr[i] > 50) { // unpredictable branch
                                    count++;
                                }
                            }

                            printf("Count: %d\n", count);
                            free(arr);
                            return 0;
                        }
                    
                

Optimization

  • Computer architecture

For example

                    
                        #include <stdio.h>
                        #include <stdlib.h>

                        #define SIZE 1000000

                        int main() {
                            int *arr = malloc(SIZE * sizeof(int));
                            if (!arr) return 1;
                            srand(1234);
                            for (int i = 0; i < SIZE; ++i) {
                                arr[i] = rand() % 100;
                            }

                            int count = 0;
                            for (int i = 0; i < SIZE; ++i) {
                                count += (arr[i] > 50);  // no actual branch
                            }

                            printf("Count: %d\n", count);
                            free(arr);
                            return 0;
                        }
                    
                

Optimization

  • Computer architecture

For example

                    
                        $ hyperfine -N ./branch_miss.o
                        Benchmark 1: ./branch_miss.o
                          Time (mean ± σ):      20.1 ms ±   1.1 ms    [User: 19.0 ms, System: 0.8 ms]
                          Range (min … max):    19.0 ms …  28.4 ms    143 runs
                         
                        $ hyperfine -N ./no_branch_miss.o
                        Benchmark 1: ./no_branch_miss.o
                          Time (mean ± σ):      15.5 ms ±   0.8 ms    [User: 14.5 ms, System: 0.8 ms]
                          Range (min … max):    14.7 ms …  23.3 ms    190 runs
                    
                

A 1.3× speedup just from that!

Why? Let's try perf stat

Optimization

  • Computer architecture

Why? Let's try perf stat

                    
                    $ perf stat ./branch_miss.o
                    Count: 489858

                     Performance counter stats for './branch_miss.o':

                                 19,08 msec task-clock:u                     #    0,982 CPUs utilized             
                                     0      context-switches:u               #    0,000 /sec                      
                                     0      cpu-migrations:u                 #    0,000 /sec                      
                                   527      page-faults:u                    #   27,627 K/sec                     
                            89 598 854      instructions:u                   #    1,08  insn per cycle            
                            83 305 829      cycles:u                         #    4,367 GHz                       
                            18 029 879      branches:u                       #  945,179 M/sec                     
                               537 691      branch-misses:u                  #    2,98% of all branches           

                           0,019415949 seconds time elapsed

                           0,019496000 seconds user
                           0,000000000 seconds sys
                    
                

Optimization

  • Computer architecture

Why? Let's try perf stat

                    
                    $  perf stat ./no_branch_miss.o 
                    Count: 489858

                     Performance counter stats for './no_branch_miss.o':

                                 15,18 msec task-clock:u                     #    0,979 CPUs utilized             
                                     0      context-switches:u               #    0,000 /sec                      
                                     0      cpu-migrations:u                 #    0,000 /sec                      
                                   527      page-faults:u                    #   34,717 K/sec                     
                            91 108 993      instructions:u                   #    1,40  insn per cycle            
                            64 947 318      cycles:u                         #    4,279 GHz                       
                            17 029 876      branches:u                       #    1,122 G/sec                     
                                34 884      branch-misses:u                  #    0,20% of all branches           

                           0,015498122 seconds time elapsed

                           0,015587000 seconds user
                           0,000000000 seconds sys
                    
                

Optimization

  • Computer architecture

    Takeaways

  • If you can load working data into RAM - do it
  • Work with the CPU optimizer rather than against it
    • Use contiguous memory
    • Process data all at once
    • Try to not have the CPU wait for work
    • Avoid branching
    • ...
  • We don't have time to cover GPU now - links in handout

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Vectorization vs Vectorization

Data vectorization

An example: Mandelbrot set calculation

$$ z_{n+1} = z_n^2 + c $$ $$ z_0 = 0,\; z_n \in \mathbb{C},\; z_n \to \infty? $$
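
For one point c, the escape-time iteration above can be sketched in standalone Python (max_num and limit mirror the script's parameters):

```python
def escape_time(c, max_num=120, limit=4.0):
    """Iterate z <- z**2 + c from z0 = 0; return the step at which |z|
    first exceeds limit, or max_num if the orbit never escapes."""
    z = 0.0 + 0.0j
    for n in range(max_num):
        z = z * z + c
        if abs(z) > limit:
            return n
    return max_num

print(escape_time(0.0 + 0.0j))  # never escapes: c = 0 is in the set
print(escape_time(1.0 + 1.0j))  # escapes after a few steps
```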

Optimization

  • Vectorization vs Vectorization

Optimization

                    
                        import time
                        import numpy as np


                        def iter_fun(zn, C):
                            return zn**2 + C


                        def iter_n_loop(z0, C, max_num, limit):
                            z = z0 * np.ones(C.shape, dtype=C.dtype)

                            mat_shape = z.shape
                            mat_size = z.size

                            # we only iterate until the limit is passed, then the point is
                            # considered divergent; use "iters" as an indication of this
                            iters = np.zeros((mat_size,), dtype=np.int32)

                            # flatten for only one loop
                            z.shape = (mat_size,)
                            C.shape = (mat_size,)

                            for it in range(mat_size):
                                for ind in range(max_num):
                                    z[it] = iter_fun(z[it], C[it])

                                    # take .real: complex values do not support ordered comparison
                                    if (z[it] * np.conj(z[it])).real > limit**2:
                                        iters[it] = ind
                                        break

                            # all iterations that survived without passing the limit went through the entire loop
                            iters[iters == 0] = max_num

                            # change shape back to original
                            z.shape = mat_shape
                            C.shape = mat_shape
                            iters.shape = mat_shape
                            return z, iters


                        def iter_n_vec(z0, C, max_num, limit):
                            z = z0 * np.ones(C.shape, dtype=C.dtype)

                            # we only iterate until the limit is passed, then the point is
                            # considered divergent; use "iters" as an indication of this
                            iters = np.zeros(C.shape, dtype=np.int32)
                            for ind in range(max_num):
                                z[iters == 0] = iter_fun(z[iters == 0], C[iters == 0])

                                # if the norm of z passes the limit, stop iteration by setting the value to the iteration integer
                                z_norm = (z * np.conj(z)).real
                                z_norm_check = np.logical_and(z_norm > limit**2, iters == 0)
                                iters[z_norm_check] = ind

                            # all iterations that survived so far went through the entire loop
                            iters[iters == 0] = max_num
                            return z, iters


                        def mandelbrot_set(C_mat, max_num, limit, iter_fun):
                            z0 = 0.0 + 0.0j
                            Z, iters = iter_fun(z0, C_mat, max_num=max_num, limit=limit)
                            Z_norm = np.real(np.sqrt(Z * np.conj(Z)))
                            return Z_norm, iters


                        if __name__ == "__main__":
                            n_base = 300

                            xmin, xmax, xn = -0.45, -0.42, n_base
                            ymin, ymax, yn = -0.59, -0.55, n_base
                            xv = np.linspace(xmin, xmax, xn, dtype=np.float32)
                            yv = np.linspace(ymin, ymax, yn, dtype=np.float32)
                            X, Y = np.meshgrid(xv, yv)

                            max_num = 120
                            limit = 4.0

                            t0 = time.time()
                            Z_norm, iters = mandelbrot_set(
                                C_mat=X + Y * 1.0j, max_num=max_num, limit=limit, iter_fun=iter_n_loop
                            )
                            dtl = time.time() - t0
                            print(f"loop execution time {dtl} sec")

                            t0 = time.time()
                            Z_norm, iters = mandelbrot_set(
                                C_mat=X + Y * 1.0j, max_num=max_num, limit=limit, iter_fun=iter_n_vec
                            )
                            dtv = time.time() - t0
                            print(f"vectorized execution time {dtv} sec")
                            print(f"speedup {dtl/dtv}")
                    
                

Optimization

                    
                        def iter_n_loop(z0, C, max_num, limit):
                            z = z0 * np.ones(C.shape, dtype=C.dtype)

                            mat_shape = z.shape
                            mat_size = z.size

                            # we only iterate until the limit is passed, then the point is
                            # considered divergent; use "iters" as an indication of this
                            iters = np.zeros((mat_size,), dtype=np.int32)

                            # flatten for only one loop
                            z.shape = (mat_size,)
                            C.shape = (mat_size,)

                            for it in range(mat_size):
                                for ind in range(max_num):
                                    z[it] = iter_fun(z[it], C[it])

                                    # take .real: complex values do not support ordered comparison
                                    if (z[it] * np.conj(z[it])).real > limit**2:
                                        iters[it] = ind
                                        break

                            # all iterations that survived without passing the limit went through the entire loop
                            iters[iters == 0] = max_num

                            # change shape back to original
                            z.shape = mat_shape
                            C.shape = mat_shape
                            iters.shape = mat_shape
                            return z, iters
                    
                

Optimization

                    
                        def iter_n_vec(z0, C, max_num, limit):
                            z = z0 * np.ones(C.shape, dtype=C.dtype)

                            # we only iterate until the limit is passed, then the point is
                            # considered divergent; use "iters" as an indication of this
                            iters = np.zeros(C.shape, dtype=np.int32)
                            for ind in range(max_num):
                                z[iters == 0] = iter_fun(z[iters == 0], C[iters == 0])

                                # if the norm of z passes the limit, stop iteration by setting the value to the iteration integer
                                z_norm = (z * np.conj(z)).real
                                z_norm_check = np.logical_and(z_norm > limit**2, iters == 0)
                                iters[z_norm_check] = ind

                            # all iterations that survived so far went through the entire loop
                            iters[iters == 0] = max_num
                            return z, iters
                    
                

Optimization

                    
                        def iter_n_vec(z0, C, max_num, limit):
                            z = z0 * np.ones(C.shape, dtype=C.dtype)

                            # we only iterate until the limit is passed, then the point is
                            # considered divergent; use "iters" as an indication of this
                            iters = np.zeros(C.shape, dtype=np.int32)
                            for ind in range(max_num):
                                z[iters == 0] = iter_fun(z[iters == 0], C[iters == 0])

                                # if the norm of z passes the limit, stop iteration by setting the value to the iteration integer
                                z_norm = (z * np.conj(z)).real
                                z_norm_check = np.logical_and(z_norm > limit**2, iters == 0)
                                iters[z_norm_check] = ind

                            # all iterations that survived so far went through the entire loop
                            iters[iters == 0] = max_num
                            return z, iters
                    
                
                    
                        $ python vectorization.py
                        loop execution time 6.237718343734741 sec
                        vectorized execution time 0.07863926887512207 sec
                        speedup 79.3206553540082
                    
                

Optimization

  • Vectorization vs Vectorization

What about the second vectorization?

Single instruction, multiple data (SIMD)!

Usually these are done for you!

  • NumPy automatically uses SIMD
  • -O3 and -march=native in GCC also enable it
  • ...

Optimization

  • Vectorization vs Vectorization

Usually these are done for you!

Anyway, here is an example!

                    
                        void add_float_arrays(float* a, float* b, float* result, int size) {
                            for (int i = 0; i < size; ++i) {
                                result[i] = a[i] + b[i];
                            }
                        }

                        int main() {
                            int size = 100000;
                            int steps = 1000;
                            float x[size], y[size], res[size];

                            for (int i = 0; i < size; i++) {
                                x[i] = (float)i;
                                y[i] = x[i] + 1;
                            }

                            add_float_arrays(x, y, res, size);
                            for (int i = 0; i < steps; i++) {
                                add_float_arrays(res, y, res, size);
                            }
                            return 0;
                        }
                    
                

Optimization

  • Vectorization vs Vectorization
                    
                        void add_float_arrays(float* a, float* b, float* result, int size) {
                            for (int i = 0; i < size; ++i) {
                                result[i] = a[i] + b[i];
                            }
                        }

                        int main() {
                            int size = 100000;
                            int steps = 1000;
                            float x[size], y[size], res[size];

                            for (int i = 0; i < size; i++) {
                                x[i] = (float)i;
                                y[i] = x[i] + 1;
                            }

                            add_float_arrays(x, y, res, size);
                            for (int i = 0; i < steps; i++) {
                                add_float_arrays(res, y, res, size);
                            }
                            return 0;
                        }
                    
                

Optimization

  • Vectorization vs Vectorization
                    
                        #include <stdio.h>
                        #include <immintrin.h>

                        void add_float_arrays_avx512(const float* a, const float* b, float* result, int size) {
                            int i;
                            for (i = 0; i <= size - 16; i += 16) {
                                __m512 va = _mm512_loadu_ps(&a[i]);
                                __m512 vb = _mm512_loadu_ps(&b[i]);
                                __m512 vsum = _mm512_add_ps(va, vb);
                                _mm512_storeu_ps(&result[i], vsum);
                            }
                            for (; i < size; ++i) {
                                result[i] = a[i] + b[i];
                            }
                        }

                        int main() {
                            int size = 100000;
                            int steps = 1000;
                            float x[size], y[size], res[size];

                            for (int i = 0; i < size; i++) {
                                x[i] = (float)i;
                                y[i] = x[i] + 1;
                            }

                            add_float_arrays_avx512(x, y, res, size);
                            for (int i = 0; i < steps; i++) {
                                add_float_arrays_avx512(res, y, res, size);
                            }
                            return 0;
                        }
                    
                

Optimization

  • Vectorization vs Vectorization
                    
                        $ hyperfine -N ./without_simd.o 
                        Benchmark 1: ./without_simd.o
                          Time (mean ± σ):     120.5 ms ±   1.7 ms    [User: 118.8 ms, System: 1.0 ms]
                          Range (min … max):   118.3 ms … 124.2 ms    24 runs
                         
                        $ hyperfine -N ./with_simd.o 
                        Benchmark 1: ./with_simd.o
                          Time (mean ± σ):      36.6 ms ±   2.4 ms    [User: 35.2 ms, System: 1.1 ms]
                          Range (min … max):    33.1 ms …  43.9 ms    78 runs
                    
                

But this is not fair, let's have a look at compiler optimization

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Compiled languages and automatic optimizers

So there is a reason most examples today are in C

  • Usually you need to go to a "low-level" compiled language to optimize
  • Full control over memory and CPU instructions
  • When the language is compiled - automatic optimization is easier!

Optimization

  • Compiled languages and automatic optimizers

Last example: let's try -O3

                    
                        $ hyperfine -N ./without_simd_o3.o 
                        Benchmark 1: ./without_simd_o3.o
                          Time (mean ± σ):     498.4 µs ± 107.0 µs    [User: 266.8 µs, System: 183.2 µs]
                          Range (min … max):   236.5 µs … 2306.9 µs    2866 runs
                         
                        $ hyperfine -N ./with_simd_o3.o
                        Benchmark 1: ./with_simd_o3.o
                          Time (mean ± σ):     499.1 µs ±  91.7 µs    [User: 267.6 µs, System: 181.9 µs]
                          Range (min … max):   239.2 µs … 1698.2 µs    3592 runs
                    
                

So usually you do not need to hand-roll these, only in extreme scenarios (note: at -O3 the compiler may even elide work whose result is never used, which can flatter both benchmarks)

But now you know it exists and is a thing!

Optimization

  • Compiled languages and automatic optimizers

Throughout the lecture today - compiled with and without -O

and more detail at the end

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Shaving operations

This is a quick one:

  • Some operations take many cycles
  • Some operations can be skipped
  • Some operations can be exchanged to faster ones

An example!

Optimization

                    
                        #include <stdio.h>
                        #include <math.h>

                        int main() {
                            int size = 10000;
                            double x[size], y[size];

                            FILE *file = fopen("data.csv", "r");
                            if (!file) return 1;
                            double value;
                            int count = 0;
                            while (fscanf(file, "%lf,", &value) == 1) {
                                if (count < size) {
                                    x[count] = value;
                                }
                                else {
                                    y[count - size] = value;
                                }
                                count++;
                            }
                            fclose(file);

                            double r;
                            double dx, dy;
                            int total = 0;
                            for(int i = 0; i < size; i++) {
                                for(int j = 0; j < size; j++) {
                                    dx = x[i] - x[j];
                                    dy = y[i] - y[j];
                                    r = sqrt(dx*dx + dy*dy);
                                    if (r < 1.0) {
                                        total++;
                                    }
                                }
                            }

                            printf("Prob %f\n", ((float)total)/(size*size));
                            return 0;
                        }
                    
                

Optimization

                    
                        #include <stdio.h>
                        #include <math.h>

                        int main() {
                            //  load data... 
                            double r;
                            double dx, dy;
                            int total = 0;
                            for(int i = 0; i < size; i++) {
                                for(int j = 0; j < size; j++) {
                                    dx = x[i] - x[j];
                                    dy = y[i] - y[j];
                                    r = sqrt(dx*dx + dy*dy);
                                    if (r < 1.0) {
                                        total++;
                                    }
                                }
                            }

                            printf("Prob %f\n", ((float)total)/(size*size));
                            return 0;
                        }
                    
                

Optimization

                    
                        #include <stdio.h>
                        #include <math.h>
                        #include "data.h"

                        int main() {
                            //  load data... 
                            double r;
                            int total = 0;
                            for(int i = 0; i < size; i++) {
                                for(int j = 0; j < size; j++) {
                                    r = (x[i] - x[j])*(x[i] - x[j]) + (y[i] - y[j])*(y[i] - y[j]);
                                    if (r < 1.0) {
                                        total++;
                                    }
                                }
                            }

                            printf("Prob %f\n", ((float)total)/(size*size));
                            return 0;
                        }
                    
                

Optimization

  • Shaving operations
                    
                        $ hyperfine ./shaving_op_with_o0.o -N -r 5
                        Benchmark 1: ./shaving_op_with_o0.o
                          Time (mean ± σ):     889.1 ms ±   3.4 ms    [User: 883.9 ms, System: 0.6 ms]
                          Range (min … max):   886.3 ms … 894.0 ms    5 runs
                         
                        $ hyperfine ./shaving_op_without_o0.o -N -r 5
                        Benchmark 1: ./shaving_op_without_o0.o
                          Time (mean ± σ):     591.2 ms ±   3.3 ms    [User: 587.3 ms, System: 1.0 ms]
                          Range (min … max):   587.8 ms … 596.2 ms    5 runs
                    
                

A 1.5× speedup!

Optimization

  • Shaving operations
                    
                        $ hyperfine -N ./shaving_op_with_o3.o
                        Benchmark 1: ./shaving_op_with_o3.o
                          Time (mean ± σ):      96.4 ms ±   1.4 ms    [User: 94.6 ms, System: 0.9 ms]
                          Range (min … max):    93.7 ms …  99.9 ms    30 runs
                         
                        $ hyperfine -N ./shaving_op_without_o3.o
                        Benchmark 1: ./shaving_op_without_o3.o
                          Time (mean ± σ):      27.7 ms ±   1.1 ms    [User: 26.7 ms, System: 0.7 ms]
                          Range (min … max):    26.2 ms …  30.9 ms    98 runs
                    
                

A 21.3× speedup from -O3, and 3.48× versus the version with sqrt!

Optimization

  • Algorithms and data representation
  • Are you noticing almost all time is spent in one algorithm?
  • Have a look if that one is optimal - maybe there are faster ones!

Example: 15 Sorting Algorithms in 6 Minutes
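
Beyond sorting, a tiny illustration of "maybe there are faster ones": three algorithms answering the same membership queries at very different cost per query.

```python
import bisect

data = list(range(100_000))
queries = [0, 54_321, 99_999]

# O(n) per query: scan the whole list.
scan_hits = [q in data for q in queries]

# O(log n) per query once the data is sorted: binary search.
def contains_sorted(sorted_list, q):
    i = bisect.bisect_left(sorted_list, q)
    return i < len(sorted_list) and sorted_list[i] == q

bisect_hits = [contains_sorted(data, q) for q in queries]

# O(1) on average after building a hash set once.
lookup = set(data)
set_hits = [q in lookup for q in queries]
```

Same answers, wildly different scaling; profiling tells you when the O(n) version has become the bottleneck.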

Optimization

  • Algorithms and data representation

Data representation we have already covered a bit

As a recap example:

Algorithms on a chess board - board representation?

Optimization

  • Algorithms and data representation

Algorithms on a chess board - board representation?

                    
                        class Board:
                            pieces: list[PlacedPiece]

                        class PlacedPiece:
                            kind: Piece
                            position: tuple[int, int]

                        class Piece(Enum):
                            rook = 1
                        ...
                    
                

VS.

                    
                        uint64_t board[N_PIECES];
                    
                

How?

Optimization

  • Algorithms and data representation
                    
                        uint64_t board[N_PIECES];
                    
                

How?

                    
                        0 0 1 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                        0 0 0 0 0 0 0 0
                    
                

= 1152921504606846976
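
A Python sketch of the bitboard idea (the square-to-bit mapping is a convention that differs between engines; here square index = row * 8 + col, and the piece is placed so the integer matches the value above):

```python
def square_bit(row, col):
    # One bit per square of an 8x8 board, squares numbered 0..63.
    return 1 << (row * 8 + col)

board = square_bit(7, 4)      # a single piece on bit 60
print(board)                  # 1152921504606846976 == 2**60

# Whole-board questions become single integer operations:
occupied = board | square_bit(0, 0)       # union of two boards
n_pieces = bin(occupied).count("1")       # popcount
on_top_rank = bool(board & (0xFF << 56))  # mask test against row 7
```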

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Approximations

This you already know:

  • Find out the accuracy you need
  • Can you use an approximation that's faster and sufficient?

For example:

Optimization

  • Approximations

For example:

$$ Ei(x) = -\int_{-x}^{\infty} \frac{e^{-t}}{t} \mathrm{d}t $$

VS.

$$ Ei(x) \approx -(A^{-7.7} + B)^{-0.13} \\ A = \ln \left ( \left ( \frac{0.56146}{-x} + 0.65 \right )( 1 - x) \right ) \\ B = (-x)^4 e^{-7.7x} (2 - x)^{3.7} $$
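
The approximation above, coded up (a sketch; it applies for x < 0, and the reference values Ei(-1) ≈ -0.2194 and Ei(-2) ≈ -0.0489 are the exact ones):

```python
import math

def ei_approx(x):
    # Closed-form approximation from the slide; expects x < 0.
    A = math.log((0.56146 / (-x) + 0.65) * (1 - x))
    B = (-x) ** 4 * math.exp(-7.7 * x) * (2 - x) ** 3.7
    return -(A ** -7.7 + B) ** -0.13

print(ei_approx(-1.0))  # about -0.217; exact Ei(-1) is about -0.2194
```

One log, one exp and a few powers replace numerical quadrature of the integral; whether roughly 1% error is acceptable is exactly the "find out the accuracy you need" question.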

Optimization

  • When to optimize
  • Computer architecture
  • Vectorization vs Vectorization
  • Compiled languages and automatic optimizers
  • Shaving operations
  • Algorithms and data representation
  • Approximations
  • Digging deep (compiled assembly)

Optimization

  • Digging deep (compiled assembly)

Again - do not optimize prematurely - but write performant code

To optimize you usually need to try A LOT of approaches

For digging deep, there is really only one recommendation I have:

https://godbolt.org/ - let's have a look