Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
275 changes: 258 additions & 17 deletions lectures/numba.md
Original file line number Diff line number Diff line change
Expand Up @@ -436,6 +436,17 @@ For GPU-based parallelization, see our {doc}`lectures on JAX <jax_intro>`.

## Exercises

{ref}`speed_ex1` and {ref}`numba_ex3` both estimate $\pi$ by Monte Carlo from random samples in the unit square.

We generate them here and store them in `u_draws` and `v_draws` so that we can use them in both exercises and compare results

```{code-cell} ipython3
n = 1_000_000
rng = np.random.default_rng()
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)
```

```{exercise}
:label: speed_ex1

Expand All @@ -455,10 +466,11 @@ Here is one solution:

```{code-cell} ipython3
@jit
def calculate_pi(n=1_000_000):
def calculate_pi(u_draws, v_draws):
n = len(u_draws)
count = 0
for i in range(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
u, v = u_draws[i], v_draws[i]
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
Expand All @@ -471,19 +483,61 @@ Now let's see how fast it runs:

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

If we switch off JIT compilation by removing `@jit`, the code takes around
150 times as long on our machine.

So we get a speed gain of 2 orders of magnitude by adding four characters.

The solution above takes one of two natural approaches: it *draws all the
random points first*, stores them in `u_draws` and `v_draws`, and then lets the
jitted function loop over them.

The other approach is to *draw each point inside the loop*.

To do this with a NumPy `Generator`, we pass `rng` in as an argument and call `rng.uniform()` inside the loop body

```{code-cell} ipython3
@jit
def calculate_pi_in_loop(rng, n):
count = 0
for i in range(n):
u, v = rng.uniform(), rng.uniform()
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi_in_loop(rng, n)
```

In this serial setting the two approaches give equally good estimates and run at
similar speed, but they are not equivalent in *memory use*.

The first approach
must hold all $2n$ draws in memory at once --- two arrays of `n` floating point
numbers, or about `16n` bytes (around $1.6$ GB when `n = 100_000_000`).

The
second draws each point on demand and discards it, so its memory footprint does
not grow with `n`.

This might suggest that drawing inside the loop is the better default.

But as we
will see in {ref}`numba_ex_race`, drawing inside the loop interacts
badly with parallelization.

```{solution-end}
```

Expand Down Expand Up @@ -550,10 +604,13 @@ p, q = 0.1, 0.2 # Prob of leaving low and high state respectively
Here's a pure Python version of the function

```{code-cell} ipython3
def compute_series(n):
n = 1_000_000
rng = np.random.default_rng()
U = rng.uniform(0, 1, size=n)

def compute_series(n, U):
x = np.empty(n, dtype=np.int64)
x[0] = 1 # Start in state 1
U = np.random.uniform(0, 1, size=n)
for t in range(1, n):
current_x = x[t-1]
if current_x == 0:
Comment thread
HumphreyYang marked this conversation as resolved.
Expand All @@ -567,8 +624,7 @@ Let's run this code and check that the fraction of time spent in the low
state is about 0.666

```{code-cell} ipython3
n = 1_000_000
x = compute_series(n)
x = compute_series(n, U)
print(np.mean(x == 0)) # Fraction of time x is in state 0
```

Expand All @@ -578,7 +634,7 @@ Now let's time it:

```{code-cell} ipython3
with qe.Timer():
compute_series(n)
compute_series(n, U)
```

Next let's implement a Numba version, which is easy
Expand All @@ -590,15 +646,15 @@ compute_series_numba = jit(compute_series)
Let's check we still get the right numbers

```{code-cell} ipython3
x = compute_series_numba(n)
x = compute_series_numba(n, U)
print(np.mean(x == 0))
```

Let's see the time

```{code-cell} ipython3
with qe.Timer():
compute_series_numba(n)
compute_series_numba(n, U)
```

This is a nice speed improvement for one line of code!
Expand Down Expand Up @@ -637,10 +693,11 @@ Here is one solution:

```{code-cell} ipython3
@jit(parallel=True)
def calculate_pi(n=1_000_000):
def calculate_pi(u_draws, v_draws):
n = len(u_draws)
count = 0
for i in prange(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
u, v = u_draws[i], v_draws[i]
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
Expand All @@ -653,12 +710,12 @@ Now let's see how fast it runs:

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

By switching parallelization on and off (selecting `True` or
Expand All @@ -671,6 +728,187 @@ a factor of 2 or 3.
(If you are executing locally, you will get different numbers, depending mainly
on the number of CPUs on your machine.)

Notice that we drew all of the random points *before* the loop and passed them in
as arrays, so the parallel loop only *reads* from memory.

Drawing the points *inside* the parallel loop instead is surprisingly delicate.


We investigate why, and how to do it safely, in
{ref}`numba_ex_race`.

```{solution-end}
```


```{exercise}
:label: numba_ex_race

In {ref}`numba_ex3` we drew all of the random points *before* the parallel loop.

It is tempting to instead draw each point *inside* the `prange` loop, by passing a generator `rng` in as an argument and calling `rng.uniform()` in the loop body.

Try it: the code should run and return a number close to $\pi$, yet there is a subtle bug in this approach.

Investigate as follows:

1. Call your function a few times with the *same* seed and check whether the result is reproducible.
2. Repeat the estimate many times across a range of sample sizes and compare its spread against a correct parallel version.

Then explain what goes wrong and give a correct way to draw inside a parallel loop.

Hint: try to use a legacy random function such as `np.random.uniform()` instead of a `Generator` and see what happens.
```

```{solution-start} numba_ex_race
:class: dropdown
```

Here is the tempting version.

We pass `rng` in as an argument and call it inside the `prange` loop.

```{code-cell} ipython3
n = 1_000_000
rng = np.random.default_rng()

@jit(parallel=True)
def calculate_pi_in_loop(rng, n):
count = 0
for i in prange(n):
u, v = rng.uniform(), rng.uniform()
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4

calculate_pi_in_loop(rng, n)
```

The code runs without error and returns something close to $\pi$.

But something is silently wrong with the results.

Here, every thread draws from the *same* generator `rng`.

A generator produces each number by updating an internal state.

Under `prange`, many threads read and update that single state at once, with no coordination between them.

This is a [**data race**](https://docs.oracle.com/cd/E19205-01/820-0619/geojs/index.html).

It creates correlations across the draws and can even cause some draws to be duplicated in an
unpredictable way.

Two symptoms reveal the problem.

*Symptom 1: the result is no longer reproducible.*

A correct generator returns the same answer whenever it is given the same seed.

Because of the data race, the order in which threads happen to touch the shared state affects the stream of draws, so the answer is not reproducible even when the seed is fixed.

```{code-cell} ipython3
for seed in (1, 1, 1):
print(calculate_pi_in_loop(np.random.default_rng(seed), n))
```

Each call uses the same seed, yet the answers differ.

*Symptom 2: the estimator is far noisier than it should be.*

The duplicated and correlated draws carry less information than $n$ independent draws, so the *effective* sample size is much smaller than $n$.

The fix is to give each thread its own random state, which NumPy's legacy functions such as `np.random.uniform()` do automatically under Numba.

```{code-cell} ipython3
@jit(parallel=True)
def calculate_pi_legacy(n):
count = 0
for i in prange(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4
```

To see the cost of the race, we repeat each estimate many times and plot its spread against the correct version as the sample size grows.

```{code-cell} ipython3
sample_sizes = np.logspace(3, 6, 10).astype(int)
num_reps = 20

methods = [("per-thread state (correct)",
lambda n: calculate_pi_legacy(n), 'C0'),
("shared generator in prange (data race)",
lambda n: calculate_pi_in_loop(np.random.default_rng(), n), 'C1')]

fig, ax = plt.subplots()
for label, estimate, color in methods:
draws = np.array([[estimate(int(m)) for _ in range(num_reps)]
for m in sample_sizes])
means, stds = draws.mean(axis=1), draws.std(axis=1)
ax.plot(sample_sizes, means, color=color, marker='o', ms=3, label=label)
ax.fill_between(sample_sizes, means - 2 * stds, means + 2 * stds,
color=color, alpha=0.2)
ax.axhline(np.pi, color='k', lw=0.8, ls='--', label=r'$\pi$')
ax.set_xscale('log')
ax.set_xlabel('number of samples')
ax.set_ylabel(r'estimate of $\pi$')
ax.legend()
plt.show()
```

Both bands are centered on $\pi$, but the band associated with the data race is much wider than the other one and narrows very slowly as the sample size grows.

The other safe option is the one from {ref}`numba_ex3`: draw the points before the loop so that the parallel loop only reads from memory.

```{solution-end}
```


```{exercise}
:label: numba_ex_draw_speed

We now have two correct ways to estimate $\pi$ in parallel.

One draws all the points *before* the loop, as in {ref}`numba_ex3`.

The other draws them *inside* the loop with legacy functions, as in {ref}`numba_ex_race`.

Compare their speed at `n = 100_000_000`, including the time spent generating the random points.
```

```{solution-start} numba_ex_draw_speed
:class: dropdown
```

We time each approach from start to finish, so the pre-draw version pays for building its arrays.

```{code-cell} ipython3
n = 100_000_000
rng = np.random.default_rng()

with qe.Timer():
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)
calculate_pi(u_draws, v_draws)
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi_legacy(n)
```

Drawing inside the loop is much faster.

The pre-draw version generates its two arrays on a single thread before the loop begins.

The in-loop version spreads the random number generation across all threads instead.

It also avoids allocating two arrays of `n` numbers, so it saves both time and memory.

```{solution-end}
```

Expand All @@ -694,8 +932,8 @@ where

1. $\beta$ is a discount factor,
2. $n$ is the expiry date,
2. $K$ is the strike price and
3. $\{S_t\}$ is the price of the underlying asset at each time $t$.
3. $K$ is the strike price and
4. $\{S_t\}$ is the price of the underlying asset at each time $t$.

Suppose that `n, β, K = 20, 0.99, 100`.

Expand Down Expand Up @@ -750,6 +988,9 @@ $$

Using this fact, the solution can be written as follows.

Note that random draws are kept inside the inner loop rather than pre-allocated,
to avoid creating large shock arrays of size `M * n`.

Comment thread
HumphreyYang marked this conversation as resolved.

```{code-cell} ipython3
M = 10_000_000
Expand Down
Loading