Acceleration with Numba

We explore how the computation of cost functions can be dramatically accelerated with numba’s JIT compiler.

The run-time of iminuit is usually dominated by the execution time of the cost function. To get good performance, it is recommended to use array arithmetic and scipy and numpy functions in the body of the cost function. Python loops should be avoided, but if they are unavoidable, numba can help. Numba can also parallelize numerical calculations to make full use of multi-core CPUs and even do computations on the GPU.
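
As a minimal sketch of that last point (sum_squares is an illustrative name, not part of iminuit): putting numba's njit decorator on a function compiles its Python loop to machine code on the first call, so subsequent calls run at native speed.

import numba as nb
import numpy as np

@nb.njit
def sum_squares(a):
    # explicit loop: slow in pure Python, fast once compiled by numba
    total = 0.0
    for i in range(len(a)):
        total += a[i] ** 2
    return total

sum_squares(np.arange(10.0))  # first call triggers compilation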

Note: This tutorial shows how one can generate faster pdfs with Numba. Before you start to write your own pdf, please check whether one is already implemented in the numba_stats library. If you have a pdf that is not included there, please consider contributing it to numba_stats.

[1]:
# !pip install matplotlib numpy numba scipy iminuit
from iminuit import Minuit
import numpy as np
import numba as nb
import math
from scipy.stats import expon, norm
from matplotlib import pyplot as plt
from argparse import Namespace

The standard fit in particle physics is the fit of a peak over some smooth background. We generate a Gaussian peak over exponential background, using scipy.

[2]:
np.random.seed(1)  # fix seed

# true parameters for signal and background
truth = Namespace(n_sig=2000, f_bkg=10, sig=(5.0, 0.5), bkg=(0.0, 4.0))
n_bkg = truth.n_sig * truth.f_bkg

# make a data set
x = np.empty(truth.n_sig + n_bkg)

# fill x with signal and background values
x[: truth.n_sig] = norm(*truth.sig).rvs(truth.n_sig)
x[truth.n_sig :] = expon(*truth.bkg).rvs(n_bkg)

# cut a range in x
xrange = np.array((1.0, 9.0))
ma = (xrange[0] < x) & (x < xrange[1])
x = x[ma]

plt.hist(
    (x[truth.n_sig :], x[: truth.n_sig]),
    bins=50,
    stacked=True,
    label=("background", "signal"),
)
plt.xlabel("x")
plt.legend();
[figure: stacked histogram of the generated data, showing the background and signal components in x]
[3]:
# ideal starting values for iminuit
start = np.array((truth.n_sig, n_bkg, truth.sig[0], truth.sig[1], truth.bkg[1]))


# iminuit instance factory, will be called a lot in the benchmarks below
def m_init(fcn):
    m = Minuit(fcn, start, name=("ns", "nb", "mu", "sigma", "lambd"))
    m.limits = ((0, None), (0, None), None, (0, None), (0, None))
    m.errordef = Minuit.LIKELIHOOD
    return m
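
The cost function in the next cell implements the extended unbinned negative log-likelihood. Up to additive constants that do not depend on the parameters, it reads

$$-\ln L = (n_s + n_b) - \sum_i \ln\left( n_s \, \frac{f_s(x_i)}{F_s} + n_b \, \frac{f_b(x_i)}{F_b} \right),$$

where f_s and f_b are the signal and background densities, and F_s and F_b are their integrals over the restricted x range, which enter as normalisation factors.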
[4]:
# extended likelihood (https://doi.org/10.1016/0168-9002(90)91334-8)
# this version uses numpy and scipy and array arithmetic
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    s = norm(mu, sigma)
    b = expon(0, lambd)
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = s.cdf(xrange)
    bn = b.cdf(xrange)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    return (n_sig + n_bkg) - np.sum(
        np.log(s.pdf(x) / sn * n_sig + b.pdf(x) / bn * n_bkg)
    )


nll(start)
[4]:
-103168.78482586428
[5]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();
916 ms ± 231 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Let’s see whether we can beat that. The code above is already pretty fast, because numpy and scipy routines are fast, and we spend most of the time in those. But these implementations do not parallelize the execution and are not optimised for this particular CPU, unlike numba-jitted functions.

To use numba, in theory we just need to put the njit decorator on top of the function, but often that doesn’t work out of the box. numba understands many numpy functions, but not scipy. We must evaluate the code that uses scipy in ‘object mode’, which is numba-speak for calling into the Python interpreter.

[6]:
# first attempt to use numba
@nb.njit(parallel=True)
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
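    # objmode requires annotating the numba type of every variable that leaves the block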
    with nb.objmode(spdf="float64[:]", bpdf="float64[:]", sn="float64", bn="float64"):
        s = norm(mu, sigma)
        b = expon(0, lambd)
        # normalisation factors are needed for pdfs, since x range is restricted
        sn = np.diff(s.cdf(xrange))[0]
        bn = np.diff(b.cdf(xrange))[0]
        spdf = s.pdf(x)
        bpdf = b.pdf(x)
    no = n_sig + n_bkg
    return no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))


nll(start)  # test and warm-up JIT
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[6]:
-103168.78482586429
[7]:
%%timeit -r 3 -n 1 m = m_init(nll)
m.migrad()
398 ms ± 23.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

That is faster, but only by roughly a factor of two, even though we asked numba to parallelize the computation. Let’s break the original function down by parts to see why.

[8]:
# let's time the body of the function
n_sig, n_bkg, mu, sigma, lambd = start
s = norm(mu, sigma)
b = expon(0, lambd)
# normalisation factors are needed for pdfs, since x range is restricted
sn = np.diff(s.cdf(xrange))[0]
bn = np.diff(b.cdf(xrange))[0]
spdf = s.pdf(x)
bpdf = b.pdf(x)
no = n_sig + n_bkg
# no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))

%timeit -r 3 -n 100 norm(*start[2:4]).pdf(x)
%timeit -r 3 -n 500 expon(0, start[4]).pdf(x)
%timeit -r 3 -n 1000 np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
1.37 ms ± 69.4 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
1.63 ms ± 114 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
224 µs ± 25 µs per loop (mean ± std. dev. of 3 runs, 1,000 loops each)

Most of the time is spent in norm and expon, which numba could not accelerate, and the total time is dominated by these slowest parts.

This, unfortunately, means we have to do much more manual work to make the function faster, since we have to replace the scipy routines with Python code that numba can accelerate and run in parallel.

[9]:
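# fastmath=True lets the compiler reorder floating-point operations for speed,
# parallel=True distributes the work over multiple threads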
kwd = {"parallel": True, "fastmath": True}


@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
    return np.sum(np.log(fs * spdf + fb * bpdf))


@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm


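# math.erf only accepts scalars, so we vectorize it ourselves with an explicit parallel loop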
@nb.njit(**kwd)
def nb_erf(x):
    y = np.empty_like(x)
    for i in nb.prange(len(x)):
        y[i] = math.erf(x[i])
    return y


@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
    invs = 1.0 / (sigma * np.sqrt(2))
    z = (x - mu) * invs
    return 0.5 * (1 + nb_erf(z))


@nb.njit(**kwd)
def expon_pdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return inv_lambd * np.exp(-inv_lambd * x)


@nb.njit(**kwd)
def expon_cdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return 1.0 - np.exp(-inv_lambd * x)


def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = norm_cdf(xrange, mu, sigma)
    bn = expon_cdf(xrange, lambd)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    spdf = norm_pdf(x, mu, sigma)
    bpdf = expon_pdf(x, lambd)
    no = n_sig + n_bkg
    return no - sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)


nll(start)  # test and warm-up JIT
[9]:
-103168.78482586428

Let’s see how well these versions do:

[10]:
%timeit -r 5 -n 100 norm_pdf(x, *start[2:4])
%timeit -r 5 -n 500 expon_pdf(x, start[4])
%timeit -r 5 -n 1000 sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)
126 µs ± 13.9 µs per loop (mean ± std. dev. of 5 runs, 100 loops each)
107 µs ± 2.08 µs per loop (mean ± std. dev. of 5 runs, 500 loops each)
94.2 µs ± 499 ns per loop (mean ± std. dev. of 5 runs, 1,000 loops each)

Only a minor improvement for sum_log, but the pdf calculation was drastically accelerated. Since this was the bottleneck before, we expect Migrad to finish faster now, too.

[11]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();
40.7 ms ± 858 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)

Success! We managed to get a big speed improvement over the initial code. This is impressive, but it cost us a lot of developer time. This is not always a good trade-off, especially if you consider that library routines are heavily tested, while you always need to test your own code in addition to writing it.

By putting these faster functions into a library, however, we would only have to pay the developer cost once. You can find those in the numba_stats library.
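
For illustration, here is a sketch of how the cost function could be written with numba_stats instead of our hand-rolled kernels; it assumes the numba_stats convention of passing the parameters explicitly, e.g. norm.pdf(x, loc, scale), so please check the numba_stats documentation for the exact signatures.

from numba_stats import norm as ns_norm, expon as ns_expon

def nll_stats(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors for the restricted x range, as before
    sn = np.diff(ns_norm.cdf(xrange, mu, sigma))[0]
    bn = np.diff(ns_expon.cdf(xrange, 0.0, lambd))[0]
    spdf = ns_norm.pdf(x, mu, sigma)
    bpdf = ns_expon.pdf(x, 0.0, lambd)
    return (n_sig + n_bkg) - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))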

Try to compile the functions again with parallel=False to see how much of the speed increase came from the parallelization and how much from the generally optimized code that numba generated for our specific CPU. On my machine, the gain was entirely due to numba.
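
A sketch of that experiment, recompiling just one kernel without parallelization and timing both versions:

@nb.njit(parallel=False, fastmath=True)
def norm_pdf_serial(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm

norm_pdf_serial(x, *start[2:4])  # warm-up JIT
# compare: %timeit norm_pdf_serial(x, *start[2:4])
#     vs.: %timeit norm_pdf(x, *start[2:4])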

In general, it is good advice to not automatically add parallel=True, because parallelization comes with the overhead of breaking the data into chunks, distributing the chunks to the individual CPU cores, and finally merging everything back together. For large arrays, this overhead is negligible, but for small arrays, it can be a net loss.
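
To see the cross-over, one can time the parallel pdf on arrays of different sizes (the sizes below are arbitrary):

small = np.linspace(*xrange, 100)
large = np.linspace(*xrange, 1_000_000)
norm_pdf(small, 5.0, 0.5)  # warm-up for this signature
# %timeit norm_pdf(small, 5.0, 0.5)  # chunking overhead can dominate here
# %timeit norm_pdf(large, 5.0, 0.5)  # overhead is negligible here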

So why is numba so fast even without parallelization? We can look at the assembly code generated.

[12]:
for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}\n{code}")
signature: (array(float64, 1d, C), float64, float64)
----------------------------------------------------
        .section        __TEXT,__text,regular,pure_instructions
        .build_version macos, 12, 0
        .section        __TEXT,__literal8,8byte_literals
        .p2align        3
LCPI0_0:
        .quad   0x3ff0000000000000
LCPI0_1:
        .quad   0x3fd9884533d43651
        .section        __TEXT,__literal16,16byte_literals
        .p2align        4
LCPI0_2:
        .quad   8
        .quad   8
        .section        __TEXT,__text,regular,pure_instructions
        .globl  __ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        subq    $632, %rsp
        .cfi_def_cfa_offset 688
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        movq    $0, 104(%rsp)
        movq    $0, 96(%rsp)
        movq    $0, 496(%rsp)
        movq    $0, 208(%rsp)
        movq    $0, 88(%rsp)
        movq    $0, 80(%rsp)
        movq    $0, 152(%rsp)
        movq    $0, 304(%rsp)
        movq    $0, 72(%rsp)
        movq    $0, 64(%rsp)
        movq    $0, 368(%rsp)
        movq    $0, 176(%rsp)
        movq    $0, 56(%rsp)
        movq    $0, 128(%rsp)
        movq    $0, 248(%rsp)
        vxorpd  %xmm2, %xmm2, %xmm2
        vucomisd        %xmm2, %xmm1
        je      LBB0_1
        movq    696(%rsp), %r14
        testq   %r14, %r14
        js      LBB0_3
        imulq   $8, %r14, %r12
        jo      LBB0_5
        movq    %rdi, %r13
        vmovsd  %xmm1, 32(%rsp)
        vmovsd  %xmm0, 120(%rsp)
        movq    %rsi, 40(%rsp)
        movabsq $_NRT_MemInfo_alloc_safe_aligned, %rax
        movq    %r12, %rdi
        movl    $32, %esi
        callq   *%rax
        movq    %rax, %rbp
        movq    24(%rax), %rax
        movq    %rax, 48(%rsp)
        leaq    -1(%r14), %r15
        movq    $0, 104(%rsp)
        movq    %r15, 96(%rsp)
        movabsq $_get_num_threads, %rax
        callq   *%rax
        movq    %rax, %rbx
        testq   %rax, %rax
        jle     LBB0_9
        movq    %rbp, 112(%rsp)
        movabsq $LCPI0_0, %rax
        vmovsd  (%rax), %xmm0
        vdivsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 32(%rsp)
        movabsq $_do_scheduling_unsigned, %rax
        leaq    104(%rsp), %rsi
        leaq    96(%rsp), %rdx
        leaq    496(%rsp), %rbp
        movl    $1, %edi
        movq    %rbx, %rcx
        movq    %rbp, %r8
        xorl    %r9d, %r9d
        callq   *%rax
        movq    %rbp, 208(%rsp)
        vmovsd  32(%rsp), %xmm0
        vmovsd  %xmm0, 88(%rsp)
        leaq    88(%rsp), %rax
        movq    %rax, 216(%rsp)
        vmovsd  120(%rsp), %xmm0
        vmovsd  %xmm0, 80(%rsp)
        leaq    80(%rsp), %rax
        movq    %rax, 224(%rsp)
        movq    688(%rsp), %rax
        movq    %rax, 232(%rsp)
        movq    48(%rsp), %rax
        movq    %rax, 240(%rsp)
        movq    %rbx, 152(%rsp)
        movq    $2, 160(%rsp)
        movq    %r14, 168(%rsp)
        movq    $16, 304(%rsp)
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 312(%rsp)
        movq    $8, 344(%rsp)
        movq    704(%rsp), %rax
        movq    %rax, 352(%rsp)
        movq    $8, 360(%rsp)
        movl    $0, 28(%rsp)
        movabsq $_numba_gil_ensure, %rax
        leaq    28(%rsp), %r14
        movq    %r14, %rdi
        vzeroupper
        callq   *%rax
        movabsq $_PyEval_SaveThread, %rax
        callq   *%rax
        movq    %rax, %rbp
        movabsq $_get_num_threads, %rbx
        callq   *%rbx
        movq    %rax, 8(%rsp)
        movq    $6, (%rsp)
        movabsq $___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE, %rdi
        movabsq $_numba_parallel_for, %rax
        leaq    208(%rsp), %rsi
        leaq    152(%rsp), %rdx
        leaq    304(%rsp), %rcx
        movl    $2, %r9d
        xorl    %r8d, %r8d
        callq   *%rax
        movabsq $_PyEval_RestoreThread, %rax
        movq    %rbp, %rdi
        callq   *%rax
        movabsq $_numba_gil_release, %rax
        movq    %r14, %rdi
        callq   *%rax
        movq    %r12, %rdi
        movl    $32, %esi
        movabsq $_NRT_MemInfo_alloc_safe_aligned, %rax
        callq   *%rax
        movq    %rax, %r12
        movq    24(%rax), %r14
        movq    $0, 72(%rsp)
        movq    %r15, 64(%rsp)
        callq   *%rbx
        movq    %rax, %rbx
        testq   %rax, %rax
        jle     LBB0_13
        movabsq $LCPI0_1, %rax
        vmovsd  32(%rsp), %xmm0
        vmulsd  (%rax), %xmm0, %xmm0
        vmovsd  %xmm0, 32(%rsp)
        leaq    72(%rsp), %rsi
        leaq    64(%rsp), %rdx
        leaq    368(%rsp), %rbp
        movl    $1, %edi
        movq    %rbx, %rcx
        movq    %rbp, %r8
        xorl    %r9d, %r9d
        movabsq $_do_scheduling_unsigned, %rax
        callq   *%rax
        movq    %rbp, 176(%rsp)
        vmovsd  32(%rsp), %xmm0
        vmovsd  %xmm0, 56(%rsp)
        leaq    56(%rsp), %rax
        movq    %rax, 184(%rsp)
        movq    48(%rsp), %rax
        movq    %rax, 192(%rsp)
        movq    %r14, 200(%rsp)
        movq    %rbx, 128(%rsp)
        movq    $2, 136(%rsp)
        movq    696(%rsp), %rbx
        movq    %rbx, 144(%rsp)
        movq    $16, 248(%rsp)
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %xmm0, 256(%rsp)
        movq    $0, 272(%rsp)
        movabsq $LCPI0_2, %rax
        vmovaps (%rax), %xmm0
        vmovups %xmm0, 280(%rsp)
        movq    $8, 296(%rsp)
        movl    $0, 28(%rsp)
        leaq    28(%rsp), %r15
        movq    %r15, %rdi
        movabsq $_numba_gil_ensure, %rax
        callq   *%rax
        movabsq $_PyEval_SaveThread, %rax
        callq   *%rax
        movq    %rax, %rbp
        movabsq $_get_num_threads, %rax
        callq   *%rax
        movq    %rax, 8(%rsp)
        movq    $5, (%rsp)
        movabsq $___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE, %rdi
        leaq    176(%rsp), %rsi
        leaq    128(%rsp), %rdx
        leaq    248(%rsp), %rcx
        movl    $2, %r9d
        xorl    %r8d, %r8d
        movabsq $_numba_parallel_for, %rax
        callq   *%rax
        movq    %rbp, %rdi
        movabsq $_PyEval_RestoreThread, %rax
        callq   *%rax
        movq    %r15, %rdi
        movabsq $_numba_gil_release, %rax
        callq   *%rax
        movq    %r12, (%r13)
        movq    $0, 8(%r13)
        movq    %rbx, 16(%r13)
        movq    $8, 24(%r13)
        movq    %r14, 32(%r13)
        movq    %rbx, 40(%r13)
        movq    $8, 48(%r13)
        movabsq $_NRT_decref, %rax
        movq    112(%rsp), %rdi
        callq   *%rax
        xorl    %eax, %eax
LBB0_8:
        addq    $632, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
LBB0_1:
        movabsq $_.const.picklebuf.5129668672, %rax
        jmp     LBB0_6
LBB0_3:
        movabsq $_.const.picklebuf.5131375424, %rax
        jmp     LBB0_6
LBB0_5:
        movabsq $_.const.picklebuf.5131380544, %rax
LBB0_6:
        movq    %rax, (%rsi)
        jmp     LBB0_7
LBB0_9:
        movabsq $_printf_format, %rdi
        jmp     LBB0_10
LBB0_13:
        movabsq $_printf_format.1, %rdi
LBB0_10:
        movabsq $_printf, %rcx
        movq    %rbx, %rsi
        xorl    %eax, %eax
        callq   *%rcx
        movabsq $_.const.picklebuf.5129402880, %rax
        movq    40(%rsp), %rcx
        movq    %rax, (%rcx)
LBB0_7:
        movl    $1, %eax
        jmp     LBB0_8
        .cfi_endproc

        .globl  __ZN7cpython8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
__ZN7cpython8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $384, %rsp
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        movq    %rsi, %rdi
        subq    $8, %rsp
        leaq    112(%rsp), %r10
        movabsq $_.const.norm_pdf, %rsi
        movabsq $_PyArg_UnpackTuple, %rbx
        leaq    128(%rsp), %r8
        leaq    120(%rsp), %r9
        movl    $3, %edx
        movl    $3, %ecx
        xorl    %eax, %eax
        pushq   %r10
        callq   *%rbx
        addq    $16, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 128(%rsp)
        vmovups %ymm0, 152(%rsp)
        vmovaps %ymm0, 192(%rsp)
        vmovups %ymm0, 216(%rsp)
        movq    $0, 24(%rsp)
        vmovaps %ymm0, 288(%rsp)
        vmovups %ymm0, 312(%rsp)
        testl   %eax, %eax
        je      LBB1_1
        movabsq $__ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %rax
        movq    (%rax), %rbx
        testq   %rbx, %rbx
        je      LBB1_4
        movq    120(%rsp), %rdi
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 128(%rsp)
        vmovups %ymm0, 152(%rsp)
        movabsq $_NRT_adapt_ndarray_from_python, %rax
        leaq    128(%rsp), %rsi
        vzeroupper
        callq   *%rax
        testl   %eax, %eax
        jne     LBB1_8
        cmpq    $8, 152(%rsp)
        jne     LBB1_8
        movq    %rbx, 56(%rsp)
        movq    128(%rsp), %rax
        movq    %rax, 16(%rsp)
        movq    136(%rsp), %rax
        movq    %rax, 32(%rsp)
        movq    144(%rsp), %rax
        movq    %rax, 88(%rsp)
        movq    160(%rsp), %rax
        movq    %rax, 256(%rsp)
        movq    168(%rsp), %rax
        movq    %rax, 96(%rsp)
        movq    176(%rsp), %rax
        movq    %rax, 80(%rsp)
        movq    112(%rsp), %rdi
        movabsq $_PyNumber_Float, %r13
        callq   *%r13
        movq    %rax, %rbx
        movabsq $_PyFloat_AsDouble, %r14
        movq    %rax, %rdi
        callq   *%r14
        vmovsd  %xmm0, 72(%rsp)
        movabsq $_Py_DecRef, %r15
        movq    %rbx, %rdi
        callq   *%r15
        movabsq $_PyErr_Occurred, %r12
        callq   *%r12
        testq   %rax, %rax
        jne     LBB1_10
        movq    104(%rsp), %rdi
        callq   *%r13
        movq    %rax, %rbx
        movq    %rax, %rdi
        callq   *%r14
        vmovsd  %xmm0, 64(%rsp)
        movq    %rbx, %rdi
        callq   *%r15
        callq   *%r12
        testq   %rax, %rax
        jne     LBB1_10
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 216(%rsp)
        vmovaps %ymm0, 192(%rsp)
        subq    $8, %rsp
        movabsq $__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %rax
        leaq    200(%rsp), %rdi
        leaq    32(%rsp), %rsi
        movl    $8, %r9d
        movq    96(%rsp), %r8
        movq    24(%rsp), %rbx
        movq    %rbx, %rdx
        movq    40(%rsp), %rcx
        vmovsd  80(%rsp), %xmm0
        vmovsd  72(%rsp), %xmm1
        pushq   88(%rsp)
        pushq   112(%rsp)
        pushq   280(%rsp)
        vzeroupper
        callq   *%rax
        addq    $32, %rsp
        movl    %eax, %r12d
        movq    24(%rsp), %r13
        movq    192(%rsp), %r14
        vmovups 200(%rsp), %ymm0
        vmovaps %ymm0, 256(%rsp)
        vmovups 232(%rsp), %xmm0
        vmovaps %xmm0, 32(%rsp)
        movabsq $_NRT_decref, %r15
        movq    %rbx, %rdi
        vzeroupper
        callq   *%r15
        cmpl    $-2, %r12d
        je      LBB1_17
        testl   %r12d, %r12d
        jne     LBB1_14
LBB1_17:
        movq    56(%rsp), %rax
        movq    24(%rax), %rdi
        testq   %rdi, %rdi
        je      LBB1_19
        movabsq $_PyList_GetItem, %rax
        xorl    %esi, %esi
        callq   *%rax
        movq    %rax, %rbx
        jmp     LBB1_20
LBB1_14:
        jle     LBB1_21
        movabsq $_PyErr_Clear, %rax
        callq   *%rax
        movq    16(%r13), %rdx
        movl    8(%r13), %esi
        movq    (%r13), %rdi
        movabsq $_numba_unpickle, %rax
        callq   *%rax
        testq   %rax, %rax
        je      LBB1_1
        movabsq $_numba_do_raise, %rcx
        movq    %rax, %rdi
        callq   *%rcx
        jmp     LBB1_1
LBB1_19:
        movabsq $_PyExc_RuntimeError, %rdi
        movabsq $"_.const.`env.consts` is NULL in `read_const`", %rsi
        movabsq $_PyErr_SetString, %rax
        callq   *%rax
        xorl    %ebx, %ebx
LBB1_20:
        movabsq $_.const.pickledata.4576487424, %rdi
        movabsq $_.const.pickledata.4576487424.sha1, %rdx
        movabsq $_numba_unpickle, %rax
        movl    $32, %esi
        callq   *%rax
        movq    %r14, 288(%rsp)
        vmovaps 256(%rsp), %ymm0
        vmovups %ymm0, 296(%rsp)
        vmovaps 32(%rsp), %xmm0
        vmovups %xmm0, 328(%rsp)
        movabsq $_NRT_adapt_ndarray_to_python_acqref, %r9
        leaq    288(%rsp), %rdi
        movq    %rax, %rsi
        movl    $1, %edx
        movl    $1, %ecx
        movq    %rbx, %r8
        vzeroupper
        callq   *%r9
        movq    %rax, %rbx
        movq    %r14, %rdi
        callq   *%r15
        movq    %rbx, %rax
        jmp     LBB1_2
LBB1_21:
        movabsq $_PyExc_SystemError, %rdi
        movabsq $"_.const.unknown error when calling native function", %rsi
LBB1_5:
        movabsq $_PyErr_SetString, %rax
        vzeroupper
        callq   *%rax
LBB1_1:
        xorl    %eax, %eax
LBB1_2:
        leaq    -40(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
LBB1_10:
        movabsq $_NRT_decref, %rax
        movq    16(%rsp), %rdi
        callq   *%rax
        jmp     LBB1_1
LBB1_4:
        movabsq $_PyExc_RuntimeError, %rdi
        movabsq $"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd", %rsi
        jmp     LBB1_5
LBB1_8:
        movabsq $_PyExc_TypeError, %rdi
        movabsq $"_.const.can't unbox array from PyObject into native value.  The object maybe of a different type", %rsi
        jmp     LBB1_5
        .cfi_endproc

        .globl  _cfunc._ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
_cfunc._ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $192, %rsp
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        movq    %r8, %rax
        movq    %rcx, %r8
        movq    %rdx, %rcx
        movq    %rsi, %rdx
        movq    %rdi, %rbx
        vmovaps 16(%rbp), %xmm2
        vxorps  %xmm3, %xmm3, %xmm3
        vmovups %ymm3, 120(%rsp)
        vmovaps %ymm3, 96(%rsp)
        movq    $0, 48(%rsp)
        vmovups %xmm2, 8(%rsp)
        movq    %r9, (%rsp)
        movabsq $__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %r10
        leaq    96(%rsp), %rdi
        leaq    48(%rsp), %rsi
        movq    %rax, %r9
        vzeroupper
        callq   *%r10
        movl    %eax, %r14d
        movq    48(%rsp), %rdi
        movq    96(%rsp), %rax
        movq    104(%rsp), %rcx
        movq    112(%rsp), %rdx
        movq    120(%rsp), %rsi
        movq    128(%rsp), %r12
        movq    136(%rsp), %r13
        movq    144(%rsp), %r15
        movl    $0, 44(%rsp)
        testl   %r14d, %r14d
        jne     LBB2_1
LBB2_4:
        movq    %r15, 48(%rbx)
        movq    %r13, 40(%rbx)
        movq    %r12, 32(%rbx)
        movq    %rsi, 24(%rbx)
        movq    %rdx, 16(%rbx)
        movq    %rcx, 8(%rbx)
        movq    %rax, (%rbx)
        movq    %rbx, %rax
        leaq    -40(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
LBB2_1:
        movq    %rdi, 56(%rsp)
        movq    %rsi, 64(%rsp)
        movq    %rdx, 72(%rsp)
        movq    %rcx, 80(%rsp)
        movq    %rax, 88(%rsp)
        movabsq $_numba_gil_ensure, %rax
        leaq    44(%rsp), %rdi
        callq   *%rax
        testl   %r14d, %r14d
        jle     LBB2_6
        movabsq $_PyErr_Clear, %rax
        callq   *%rax
        movq    56(%rsp), %rax
        movq    16(%rax), %rdx
        movl    8(%rax), %esi
        movq    (%rax), %rdi
        movabsq $_numba_unpickle, %rax
        callq   *%rax
        testq   %rax, %rax
        je      LBB2_3
        movabsq $_numba_do_raise, %rcx
        movq    %rax, %rdi
        callq   *%rcx
        jmp     LBB2_3
LBB2_6:
        movabsq $_PyExc_SystemError, %rdi
        movabsq $"_.const.unknown error when calling native function.1", %rsi
        movabsq $_PyErr_SetString, %rax
        callq   *%rax
LBB2_3:
        movabsq $"_.const.<numba.core.cpu.CPUContext object at 0x131b86580>", %rdi
        movabsq $_PyUnicode_FromString, %rax
        callq   *%rax
        movq    %rax, %r14
        movabsq $_PyErr_WriteUnraisable, %rax
        movq    %r14, %rdi
        callq   *%rax
        movabsq $_Py_DecRef, %rax
        movq    %r14, %rdi
        callq   *%rax
        movabsq $_numba_gil_release, %rax
        leaq    44(%rsp), %rdi
        callq   *%rax
        movq    88(%rsp), %rax
        movq    80(%rsp), %rcx
        movq    72(%rsp), %rdx
        movq    64(%rsp), %rsi
        jmp     LBB2_4
        .cfi_endproc

        .globl  ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
        .weak_definition        ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
        .p2align        4, 0x90
___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        movq    (%rsi), %rcx
        testq   %rcx, %rcx
        jle     LBB3_16
        movq    (%rdx), %rsi
        movq    8(%rdx), %r10
        movq    8(%rdi), %r11
        movq    16(%rdi), %r14
        movq    16(%rdx), %r15
        movq    24(%rdx), %r12
        movq    32(%rdx), %rax
        movq    %rax, -8(%rsp)
        movq    32(%rdi), %rax
        movq    %rax, -16(%rsp)
        movq    (%rdi), %r13
        movq    24(%rdi), %rax
        movq    %rax, -24(%rsp)
        xorl    %r8d, %r8d
        movq    %rcx, -32(%rsp)
        movq    %rsi, -40(%rsp)
        movq    %r10, -48(%rsp)
        movq    %r11, -56(%rsp)
        movq    %r14, -64(%rsp)
        movq    %r15, -72(%rsp)
        movq    %r12, -80(%rsp)
        jmp     LBB3_2
        .p2align        4, 0x90
LBB3_15:
        incq    %r8
        cmpq    %rcx, %r8
        je      LBB3_16
LBB3_2:
        movq    %r8, %rax
        imulq   %rsi, %rax
        movq    (%rax,%r13), %rbx
        movq    8(%rax,%r13), %rdi
        subq    %rbx, %rdi
        incq    %rdi
        testq   %rdi, %rdi
        jle     LBB3_15
        movq    %r8, %rax
        imulq   %r10, %rax
        vmovsd  (%r11,%rax), %xmm0
        movq    %r8, %rax
        imulq   %r15, %rax
        vmovsd  (%r14,%rax), %xmm1
        movq    %r8, %rbp
        imulq   %r12, %rbp
        addq    -24(%rsp), %rbp
        movq    %r8, %rdx
        imulq   -8(%rsp), %rdx
        addq    -16(%rsp), %rdx
        cmpq    $8, %rdi
        jb      LBB3_13
        movq    %rdi, %r9
        andq    $-8, %r9
        vbroadcastsd    %xmm1, %ymm2
        vbroadcastsd    %xmm0, %ymm3
        leaq    -8(%r9), %rax
        movq    %rax, %r10
        shrq    $3, %r10
        incq    %r10
        movl    %r10d, %r11d
        andl    $3, %r11d
        cmpq    $24, %rax
        jae     LBB3_6
        xorl    %r14d, %r14d
        jmp     LBB3_8
LBB3_6:
        leaq    (%rdx,%rbx,8), %r15
        addq    $224, %r15
        leaq    224(,%rbx,8), %r12
        addq    %rbp, %r12
        andq    $-4, %r10
        negq    %r10
        xorl    %r14d, %r14d
        .p2align        4, 0x90
LBB3_7:
        vmovupd -224(%r12,%r14,8), %ymm4
        vmovupd -192(%r12,%r14,8), %ymm5
        vsubpd  %ymm2, %ymm4, %ymm4
        vsubpd  %ymm2, %ymm5, %ymm5
        vmulpd  %ymm3, %ymm4, %ymm4
        vmulpd  %ymm3, %ymm5, %ymm5
        vmovupd %ymm4, -224(%r15,%r14,8)
        vmovupd %ymm5, -192(%r15,%r14,8)
        vmovupd -160(%r12,%r14,8), %ymm4
        vmovupd -128(%r12,%r14,8), %ymm5
        vsubpd  %ymm2, %ymm4, %ymm4
        vsubpd  %ymm2, %ymm5, %ymm5
        vmulpd  %ymm3, %ymm4, %ymm4
        vmulpd  %ymm3, %ymm5, %ymm5
        vmovupd %ymm4, -160(%r15,%r14,8)
        vmovupd %ymm5, -128(%r15,%r14,8)
        vmovupd -96(%r12,%r14,8), %ymm4
        vmovupd -64(%r12,%r14,8), %ymm5
        vsubpd  %ymm2, %ymm4, %ymm4
        vsubpd  %ymm2, %ymm5, %ymm5
        vmulpd  %ymm3, %ymm4, %ymm4
        vmulpd  %ymm3, %ymm5, %ymm5
        vmovupd %ymm4, -96(%r15,%r14,8)
        vmovupd %ymm5, -64(%r15,%r14,8)
        vmovupd -32(%r12,%r14,8), %ymm4
        vmovupd (%r12,%r14,8), %ymm5
        vsubpd  %ymm2, %ymm4, %ymm4
        vsubpd  %ymm2, %ymm5, %ymm5
        vmulpd  %ymm3, %ymm4, %ymm4
        vmulpd  %ymm3, %ymm5, %ymm5
        vmovupd %ymm4, -32(%r15,%r14,8)
        vmovupd %ymm5, (%r15,%r14,8)
        addq    $32, %r14
        addq    $4, %r10
        jne     LBB3_7
LBB3_8:
        testq   %r11, %r11
        movq    -72(%rsp), %r15
        movq    -80(%rsp), %r12
        je      LBB3_11
        addq    %rbx, %r14
        shlq    $6, %r11
        leaq    (%rdx,%r14,8), %rcx
        addq    $32, %rcx
        leaq    32(,%r14,8), %rax
        addq    %rbp, %rax
        xorl    %esi, %esi
        .p2align        4, 0x90
LBB3_10:
        vmovupd -32(%rax,%rsi), %ymm4
        vmovupd (%rax,%rsi), %ymm5
        vsubpd  %ymm2, %ymm4, %ymm4
        vsubpd  %ymm2, %ymm5, %ymm5
        vmulpd  %ymm3, %ymm4, %ymm4
        vmulpd  %ymm3, %ymm5, %ymm5
        vmovupd %ymm4, -32(%rcx,%rsi)
        vmovupd %ymm5, (%rcx,%rsi)
        addq    $64, %rsi
        cmpq    %rsi, %r11
        jne     LBB3_10
LBB3_11:
        cmpq    %r9, %rdi
        movq    -32(%rsp), %rcx
        movq    -40(%rsp), %rsi
        movq    -48(%rsp), %r10
        movq    -56(%rsp), %r11
        movq    -64(%rsp), %r14
        je      LBB3_15
        andl    $7, %edi
        addq    %r9, %rbx
LBB3_13:
        incq    %rdi
        shlq    $3, %rbx
        .p2align        4, 0x90
LBB3_14:
        vmovsd  (%rbp,%rbx), %xmm2
        vsubsd  %xmm1, %xmm2, %xmm2
        vmulsd  %xmm0, %xmm2, %xmm2
        vmovsd  %xmm2, (%rdx,%rbx)
        decq    %rdi
        addq    $8, %rbx
        cmpq    $1, %rdi
        jg      LBB3_14
        jmp     LBB3_15
LBB3_16:
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        .cfi_endproc

        .globl  _NRT_decref
        .weak_def_can_be_hidden _NRT_decref
        .p2align        4, 0x90
_NRT_decref:
        .cfi_startproc
        testq   %rdi, %rdi
        je      LBB4_2
        ##MEMBARRIER
        lock            decq    (%rdi)
        je      LBB4_3
LBB4_2:
        retq
LBB4_3:
        ##MEMBARRIER
        movabsq $_NRT_MemInfo_call_dtor, %rax
        jmpq    *%rax
        .cfi_endproc

        .section        __TEXT,__literal8,8byte_literals
        .p2align        3
LCPI5_0:
        .quad   0xbfe0000000000000
        .section        __TEXT,__text,regular,pure_instructions
        .globl  ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
        .weak_definition        ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
        .p2align        4, 0x90
___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        subq    $360, %rsp
        .cfi_def_cfa_offset 416
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        movq    (%rsi), %rax
        movq    %rax, 264(%rsp)
        testq   %rax, %rax
        jle     LBB5_17
        movq    (%rdx), %rsi
        movq    8(%rdx), %rax
        movq    %rax, 248(%rsp)
        movq    8(%rdi), %rax
        movq    %rax, 240(%rsp)
        movq    16(%rdx), %rax
        movq    %rax, 232(%rsp)
        movq    24(%rdx), %rax
        movq    %rax, 224(%rsp)
        movq    24(%rdi), %rax
        movq    %rax, 216(%rsp)
        movq    (%rdi), %rax
        movq    %rax, 256(%rsp)
        movq    16(%rdi), %rax
        movq    %rax, 208(%rsp)
        xorl    %ecx, %ecx
        movabsq $LCPI5_0, %rax
        vmovsd  (%rax), %xmm1
        movabsq $_exp, %r15
        vbroadcastsd    (%rax), %ymm0
        vmovupd %ymm0, 128(%rsp)
        movq    %rsi, %rbx
        movq    %rsi, 184(%rsp)
        vmovsd  %xmm1, 120(%rsp)
        jmp     LBB5_2
        .p2align        4, 0x90
LBB5_16:
        movq    112(%rsp), %rcx
        incq    %rcx
        cmpq    264(%rsp), %rcx
        je      LBB5_17
LBB5_2:
        movq    %rcx, 112(%rsp)
        movq    %rcx, %rax
        imulq   %rbx, %rax
        movq    256(%rsp), %rcx
        movq    (%rax,%rcx), %r13
        movq    8(%rax,%rcx), %r14
        subq    %r13, %r14
        incq    %r14
        testq   %r14, %r14
        jle     LBB5_16
        movq    112(%rsp), %rdx
        movq    %rdx, %rax
        movq    248(%rsp), %rcx
        imulq   %rcx, %rax
        movq    240(%rsp), %rcx
        vmovsd  (%rcx,%rax), %xmm0
        movq    %rdx, %r12
        imulq   232(%rsp), %r12
        addq    208(%rsp), %r12
        imulq   224(%rsp), %rdx
        addq    216(%rsp), %rdx
        cmpq    $4, %r14
        vmovapd %xmm0, 336(%rsp)
        jae     LBB5_5
        movq    %rdx, %rbp
        jmp     LBB5_14
        .p2align        4, 0x90
LBB5_5:
        movq    %r14, %rsi
        andq    $-4, %rsi
        vbroadcastsd    %xmm0, %ymm1
        vmovupd %ymm1, 288(%rsp)
        leaq    -4(%rsi), %rax
        movq    %rax, %rdi
        shrq    $2, %rdi
        incq    %rdi
        movl    %edi, %ebp
        andl    $3, %ebp
        cmpq    $12, %rax
        movq    %r12, 104(%rsp)
        movq    %rdx, 96(%rsp)
        movq    %rsi, 200(%rsp)
        movq    %rbp, 192(%rsp)
        jae     LBB5_7
        xorl    %ebp, %ebp
        vmovupd 128(%rsp), %ymm1
        jmp     LBB5_9
LBB5_7:
        leaq    (%rdx,%r13,8), %rax
        addq    $96, %rax
        movq    %rax, 280(%rsp)
        leaq    (%r12,%r13,8), %rax
        addq    $96, %rax
        movq    %rax, 272(%rsp)
        andq    $-4, %rdi
        negq    %rdi
        xorl    %ebp, %ebp
        vmovupd 128(%rsp), %ymm1
        .p2align        4, 0x90
LBB5_8:
        movq    %rdi, 80(%rsp)
        movq    272(%rsp), %r12
        vmovupd -96(%r12,%rbp,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  %ymm1, %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 16(%rsp)
        vzeroupper
        callq   *%r15
        vmovapd %xmm0, (%rsp)
        vpermilpd       $1, 16(%rsp), %xmm0
        callq   *%r15
        vmovapd (%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, (%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r15
        vmovaps %xmm0, 16(%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r15
        vmovapd 16(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, (%rsp), %ymm0, %ymm0
        vmulpd  288(%rsp), %ymm0, %ymm0
        movq    280(%rsp), %rbx
        vmovupd %ymm0, -96(%rbx,%rbp,8)
        vmovupd -64(%r12,%rbp,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  128(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 16(%rsp)
        vzeroupper
        callq   *%r15
        vmovapd %xmm0, (%rsp)
        vpermilpd       $1, 16(%rsp), %xmm0
        callq   *%r15
        vmovapd (%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, (%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r15
        vmovaps %xmm0, 16(%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r15
        vmovapd 16(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, (%rsp), %ymm0, %ymm0
        vmulpd  288(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, -64(%rbx,%rbp,8)
        vmovupd -32(%r12,%rbp,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  128(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 16(%rsp)
        vzeroupper
        callq   *%r15
        vmovapd %xmm0, (%rsp)
        vpermilpd       $1, 16(%rsp), %xmm0
        callq   *%r15
        vmovapd (%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, (%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r15
        vmovaps %xmm0, 16(%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r15
        vmovapd 16(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, (%rsp), %ymm0, %ymm0
        vmulpd  288(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, -32(%rbx,%rbp,8)
        vmovupd (%r12,%rbp,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  128(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 16(%rsp)
        vzeroupper
        callq   *%r15
        vmovapd %xmm0, (%rsp)
        vpermilpd       $1, 16(%rsp), %xmm0
        callq   *%r15
        vmovapd (%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, (%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r15
        vmovaps %xmm0, 16(%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r15
        movq    80(%rsp), %rdi
        vmovupd 128(%rsp), %ymm1
        vmovapd 16(%rsp), %xmm2
        vunpcklpd       %xmm0, %xmm2, %xmm0
        vinsertf128     $1, (%rsp), %ymm0, %ymm0
        vmulpd  288(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, (%rbx,%rbp,8)
        addq    $16, %rbp
        addq    $4, %rdi
        jne     LBB5_8
LBB5_9:
        movq    192(%rsp), %r12
        testq   %r12, %r12
        je      LBB5_12
        addq    %r13, %rbp
        shlq    $5, %r12
        movq    96(%rsp), %rax
        leaq    (%rax,%rbp,8), %rax
        movq    %rax, 16(%rsp)
        movq    104(%rsp), %rax
        leaq    (%rax,%rbp,8), %rbp
        xorl    %ebx, %ebx
        .p2align        4, 0x90
LBB5_11:
        vmovupd (%rbp,%rbx), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  %ymm1, %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, (%rsp)
        vzeroupper
        callq   *%r15
        vmovapd %xmm0, 80(%rsp)
        vpermilpd       $1, (%rsp), %xmm0
        callq   *%r15
        vmovapd 80(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 80(%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r15
        vmovaps %xmm0, (%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r15
        vmovupd 128(%rsp), %ymm1
        vmovapd (%rsp), %xmm2
        vunpcklpd       %xmm0, %xmm2, %xmm0
        vinsertf128     $1, 80(%rsp), %ymm0, %ymm0
        vmulpd  288(%rsp), %ymm0, %ymm0
        movq    16(%rsp), %rax
        vmovupd %ymm0, (%rax,%rbx)
        addq    $32, %rbx
        cmpq    %rbx, %r12
        jne     LBB5_11
LBB5_12:
        movq    200(%rsp), %rax
        cmpq    %rax, %r14
        movq    184(%rsp), %rbx
        vmovsd  120(%rsp), %xmm1
        movq    104(%rsp), %r12
        movq    96(%rsp), %rbp
        je      LBB5_16
        andl    $3, %r14d
        addq    %rax, %r13
LBB5_14:
        incq    %r14
        shlq    $3, %r13
        .p2align        4, 0x90
LBB5_15:
        vmovsd  (%r12,%r13), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        callq   *%r15
        vmovsd  120(%rsp), %xmm1
        vmulsd  336(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, (%rbp,%r13)
        decq    %r14
        addq    $8, %r13
        cmpq    $1, %r14
        jg      LBB5_15
        jmp     LBB5_16
LBB5_17:
        addq    $360, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        .cfi_endproc

        .globl  _NRT_incref
        .weak_def_can_be_hidden _NRT_incref
        .p2align        4, 0x90
_NRT_incref:
        testq   %rdi, %rdi
        je      LBB6_1
        lock            incq    (%rdi)
        retq
LBB6_1:
        retq

        .comm   __ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd,8,3
        .section        __DATA,__const
        .p2align        4
_.const.picklebuf.5129668672:
        .quad   _.const.pickledata.5129668672
        .long   69
        .space  4
        .quad   _.const.pickledata.5129668672.sha1

        .section        __TEXT,__const
        .p2align        4
_printf_format:
        .asciz  "num_threads: %d\n"

        .section        __DATA,__const
        .p2align        4
_.const.picklebuf.5129402880:
        .quad   _.const.pickledata.5129402880
        .long   112
        .space  4
        .quad   _.const.pickledata.5129402880.sha1

        .section        __TEXT,__const
        .p2align        4
_printf_format.1:
        .asciz  "num_threads: %d\n"

        .p2align        4
_.const.pickledata.5129402880:
        .ascii  "\200\004\225e\000\000\000\000\000\000\000\214\bbuiltins\224\214\fRuntimeError\224\223\224\214@Invalid number of threads. This likely indicates a bug in Numba.\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.5129402880.sha1:
        .ascii  "\235\213\326\325A\263\3436\375y\027\231I@x\033\306\212:\212"

        .p2align        4
_.const.pickledata.5129668672:
        .ascii  "\200\004\225:\000\000\000\000\000\000\000\214\bbuiltins\224\214\021ZeroDivisionError\224\223\224\214\020division by zero\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.5129668672.sha1:
        .ascii  "\262\200\b\240\370\213\255_\360\360$>\204\332\271\f\253\031\263f"

_.const.norm_pdf:
        .asciz  "norm_pdf"

        .p2align        4
"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd":
        .asciz  "missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd"

        .p2align        4
"_.const.can't unbox array from PyObject into native value.  The object maybe of a different type":
        .asciz  "can't unbox array from PyObject into native value.  The object maybe of a different type"

        .p2align        4
"_.const.`env.consts` is NULL in `read_const`":
        .asciz  "`env.consts` is NULL in `read_const`"

        .p2align        4
_.const.pickledata.4576487424:
        .ascii  "\200\004\225\025\000\000\000\000\000\000\000\214\005numpy\224\214\007ndarray\224\223\224."

        .p2align        4
_.const.pickledata.4576487424.sha1:
        .ascii  "\337\274\375\323\237\313&\364\320\306\200\225D\207\270\300\265;\270\243"

        .p2align        4
"_.const.unknown error when calling native function":
        .asciz  "unknown error when calling native function"

        .p2align        4
"_.const.<numba.core.cpu.CPUContext object at 0x131b86580>":
        .asciz  "<numba.core.cpu.CPUContext object at 0x131b86580>"

        .p2align        4
"_.const.unknown error when calling native function.1":
        .asciz  "unknown error when calling native function"

        .comm   __ZN08NumbaEnv13_3cdynamic_3e42jit_wrapper__built_in_function_empty__2416B66c8tJTIeFCjyCbUFRqqOAK_2f6h0jAX2aI7qVodJKVeqwlUg4hHqC7MmIRJFEEM1gQAEx18class_28float64_29,8,3
        .comm   __ZN08NumbaEnv5numba2np8arrayobj19_call_allocator_247B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,3
        .comm   __ZN08NumbaEnv5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_248B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,3
        .section        __DATA,__const
        .p2align        4
_.const.picklebuf.5131375424:
        .quad   _.const.pickledata.5131375424
        .long   77
        .space  4
        .quad   _.const.pickledata.5131375424.sha1

        .p2align        4
_.const.picklebuf.5131380544:
        .quad   _.const.pickledata.5131380544
        .long   137
        .space  4
        .quad   _.const.pickledata.5131380544.sha1

        .section        __TEXT,__const
        .p2align        4
_.const.pickledata.5131380544:
        .ascii  "\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.5131380544.sha1:
        .ascii  "X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"

        .p2align        4
_.const.pickledata.5131375424:
        .ascii  "\200\004\225B\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214\037negative dimensions not allowed\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.5131375424.sha1:
        .ascii  "3\033\205c\275\271\332\310\0338B\"s\005,Ho\301pk"

        .comm   __ZN08NumbaEnv13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE,8,3
        .comm   __ZN08NumbaEnv13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE,8,3
        .comm   __ZN08NumbaEnv5numba7cpython7numbers14int_power_impl12_3clocals_3e14int_power_2424B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdx,8,3
.subsections_via_symbols

This code section is very long, but the assembly grammar is very simple. Constants start with a dot, and SOMETHING: is a jump label, the target for the assembly equivalent of a goto. Everything else is an instruction, with its name on the left and its arguments on the right.

You can google all the commands. The interesting ones are those that end in pd: these are SIMD instructions, which operate on several doubles at once (four per 256-bit ymm register in the listing above). This is where the speed comes from. There is a lot of repetition, because the optimizer partially unrolled some loops to make them faster. Unrolled loops only work if the remaining chunk of data is large enough; since the compiler does not know the length of the incoming array, it also generates sections that handle shorter chunks, plus the code to select which section to use. Finally, there is some code which does the translation from and to Python objects, with the corresponding error handling.
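
As a rough sanity check, one can count the packed-double instructions in the assembly we just printed:

asm = next(iter(norm_pdf.inspect_asm().values()))
ops = [line.split()[0] for line in asm.splitlines() if line.strip().startswith("v")]
simd = sorted({op for op in ops if op.endswith("pd")})
print(sum(op.endswith("pd") for op in ops), "packed-double instructions, e.g.", simd[:5])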

We don’t need to write SIMD instructions by hand; the optimizer does it for us, and in a very sophisticated way.