Acceleration with Numba

We show how the computation of cost functions can be dramatically accelerated with numba’s JIT compiler.

The run-time of iminuit is usually dominated by the execution time of the cost function. To get good performance, it is recommended to use array arithmetic and numpy and scipy functions in the body of the cost function. Python loops should be avoided, but if they are unavoidable, numba can help. Numba can also parallelize numerical calculations to make full use of multi-core CPUs, and it can even run computations on the GPU.
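
To make the idea concrete, here is a minimal sketch of the numba workflow (illustrative only; the function and array here are made up and not part of the benchmark below):

import numpy as np
import numba as nb


def py_sum_squares(a):
    # plain Python loop: slow when interpreted
    total = 0.0
    for v in a:
        total += v * v
    return total


# the same code, compiled to machine code on the first call
nb_sum_squares = nb.njit(py_sum_squares)

a = np.random.rand(100_000)
nb_sum_squares(a)  # first call triggers JIT compilation

After this warm-up call, the compiled version typically runs much faster than the interpreted loop, because the loop body has been translated into optimized machine code.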

[1]:
# !pip install matplotlib numpy numba scipy iminuit
from iminuit import Minuit
import numpy as np
import numba as nb
import math
from scipy.stats import expon, norm
from matplotlib import pyplot as plt
from argparse import Namespace

The standard fit in particle physics is the fit of a peak over some smooth background. We generate a Gaussian peak over an exponential background, using scipy.

[2]:
np.random.seed(1)  # fix seed

# true parameters for signal and background
truth = Namespace(n_sig=2000, f_bkg=10, sig=(5.0, 0.5), bkg=(0.0, 4.0))
n_bkg = truth.n_sig * truth.f_bkg

# make a data set
x = np.empty(truth.n_sig + n_bkg)

# fill x with signal and background values
x[: truth.n_sig] = norm(*truth.sig).rvs(truth.n_sig)
x[truth.n_sig :] = expon(*truth.bkg).rvs(n_bkg)

# cut a range in x
xrange = np.array((1.0, 9.0))
ma = (xrange[0] < x) & (x < xrange[1])
x = x[ma]

plt.hist(
    (x[truth.n_sig :], x[: truth.n_sig]),
    bins=50,
    stacked=True,
    label=("background", "signal"),
)
plt.xlabel("x")
plt.legend();
[Figure: stacked histogram of the generated background and signal samples]
[3]:
# ideal starting values for iminuit
start = np.array((truth.n_sig, n_bkg, truth.sig[0], truth.sig[1], truth.bkg[1]))


# iminuit instance factory, will be called a lot in the benchmarks below
def m_init(fcn):
    m = Minuit(fcn, start, name=("ns", "nb", "mu", "sigma", "lambd"))
    m.limits = ((0, None), (0, None), None, (0, None), (0, None))
    m.errordef = Minuit.LIKELIHOOD
    return m
[4]:
# extended likelihood (https://doi.org/10.1016/0168-9002(90)91334-8)
# this version uses numpy and scipy and array arithmetic
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    s = norm(mu, sigma)
    b = expon(0, lambd)
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = s.cdf(xrange)
    bn = b.cdf(xrange)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    return (n_sig + n_bkg) - np.sum(
        np.log(s.pdf(x) / sn * n_sig + b.pdf(x) / bn * n_bkg)
    )


nll(start)
[4]:
-103168.78482586428
[5]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();
304 ms ± 1.96 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Let’s see whether we can beat that. The code above is already pretty fast, because numpy and scipy routines are fast, and we spend most of the time in those. But these implementations do not parallelize the execution and are not optimised for this particular CPU, unlike numba-jitted functions.

To use numba, in theory we just need to put the njit decorator on top of the function, but often that does not work out of the box. numba understands many numpy functions, but not scipy. We must evaluate the code that uses scipy in ‘object mode’, which is numba-speak for calling into the Python interpreter.

[6]:
# first attempt to use numba
@nb.njit(parallel=True)
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    with nb.objmode(spdf="float64[:]", bpdf="float64[:]", sn="float64", bn="float64"):
        s = norm(mu, sigma)
        b = expon(0, lambd)
        # normalisation factors are needed for pdfs, since x range is restricted
        sn = np.diff(s.cdf(xrange))[0]
        bn = np.diff(b.cdf(xrange))[0]
        spdf = s.pdf(x)
        bpdf = b.pdf(x)
    no = n_sig + n_bkg
    return no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))


nll(start)  # test and warm-up JIT
[6]:
-103168.78482586429
[7]:
%%timeit -r 3 -n 1 m = m_init(nll)
m.migrad()
347 ms ± 18.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

It is even a bit slower. :( Let’s break the original function down by parts to see why.

[8]:
# let's time the body of the function
n_sig, n_bkg, mu, sigma, lambd = start
s = norm(mu, sigma)
b = expon(0, lambd)
# normalisation factors are needed for pdfs, since x range is restricted
sn = np.diff(s.cdf(xrange))[0]
bn = np.diff(b.cdf(xrange))[0]
spdf = s.pdf(x)
bpdf = b.pdf(x)
no = n_sig + n_bkg
# no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))

%timeit -r 3 -n 100 norm(*start[2:4]).pdf(x)
%timeit -r 3 -n 500 expon(0, start[4]).pdf(x)
%timeit -r 3 -n 1000 np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
1.29 ms ± 66 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
1.14 ms ± 46.2 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
134 µs ± 2.94 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)

Most of the time is spent in norm and expon, which numba could not accelerate, and the total time is dominated by this slowest part.

This, unfortunately, means we have to do much more manual work to make the function faster, since we have to replace the scipy routines with Python code that numba can accelerate and run in parallel.
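
Before diving in, it is worth noting where the scipy time goes. Much of it is Python-level overhead (constructing the frozen distribution, validating arguments) rather than the floating-point math itself. A quick sketch to check this (illustrative, not part of the original benchmark) compares the frozen distribution with the same Gaussian formula written directly in numpy:

# frozen scipy distribution: flexible, but with per-call Python overhead
%timeit -r 3 -n 100 norm(mu, sigma).pdf(x)

# the same Gaussian pdf as a plain numpy expression
%timeit -r 3 -n 100 np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)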

[9]:
kwd = {"parallel": True, "fastmath": True}


@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
    return np.sum(np.log(fs * spdf + fb * bpdf))


@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm


@nb.njit(**kwd)
def nb_erf(x):
    # math.erf is scalar-only, so apply it element-wise in a (potentially parallel) loop
    y = np.empty_like(x)
    for i in nb.prange(len(x)):
        y[i] = math.erf(x[i])
    return y


@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
    invs = 1.0 / (sigma * np.sqrt(2))
    z = (x - mu) * invs
    return 0.5 * (1 + nb_erf(z))


@nb.njit(**kwd)
def expon_pdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return inv_lambd * np.exp(-inv_lambd * x)


@nb.njit(**kwd)
def expon_cdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return 1.0 - np.exp(-inv_lambd * x)


def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = norm_cdf(xrange, mu, sigma)
    bn = expon_cdf(xrange, lambd)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    spdf = norm_pdf(x, mu, sigma)
    bpdf = expon_pdf(x, lambd)
    no = n_sig + n_bkg
    return no - sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)


nll(start)  # test and warm-up JIT
[9]:
-103168.78482586428

Let’s see how well these versions do:

[10]:
%timeit -r 3 -n 100 norm_pdf(x, *start[2:4])
%timeit -r 3 -n 500 expon_pdf(x, start[4])
%timeit -r 3 -n 1000 sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)
209 µs ± 13.5 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
115 µs ± 2.57 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
115 µs ± 2.01 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)

Basically no improvement for sum_log, but the pdf calculation was drastically accelerated. Since that was the bottleneck before, we expect Migrad to finish faster now as well.

[11]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();
96.1 ms ± 1.13 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Success! We managed to get a roughly 3x speed improvement over the initial code. This is impressive, but it cost us a lot of developer time. That is not always a good trade-off, especially if you consider that library routines are heavily tested, while you have to test your own code in addition to writing it.

By putting these faster functions into a library, however, we would only have to pay the developer cost once.
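
In fact, such a library exists: the numba-stats package provides numba-accelerated pdfs and cdfs for common distributions. As a sketch of how the cost function could look with it (the calls below assume numba-stats' norm.pdf(x, loc, scale) convention, which differs from scipy's frozen distributions; check the package documentation before relying on this):

# !pip install numba-stats
from numba_stats import norm as ns_norm, expon as ns_expon


def nll_numba_stats(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors for the restricted x range
    sn = np.diff(ns_norm.cdf(xrange, mu, sigma))[0]
    bn = np.diff(ns_expon.cdf(xrange, 0.0, lambd))[0]
    spdf = ns_norm.pdf(x, mu, sigma)
    bpdf = ns_expon.pdf(x, 0.0, lambd)
    return (n_sig + n_bkg) - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))

The heavy per-element work then happens in pre-compiled, tested code, and we only pay the cost of writing the likelihood itself.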

The final question is how much of the speed increase came from the parallelization and how much from the generally optimized code that numba generated for our specific CPU. Let’s turn off parallelization and see how fast the functions are then.

[12]:
kwd = {"parallel": False, "fastmath": True}


@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
    return np.sum(np.log(fs * spdf + fb * bpdf))


@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm


@nb.njit(**kwd)
def nb_erf(x):
    y = np.empty_like(x)
    for i in nb.prange(len(x)):
        y[i] = math.erf(x[i])
    return y


@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
    invs = 1.0 / (sigma * np.sqrt(2))
    z = (x - mu) * invs
    return 0.5 * (1 + nb_erf(z))


@nb.njit(**kwd)
def expon_pdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return inv_lambd * np.exp(-inv_lambd * x)


@nb.njit(**kwd)
def expon_cdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return 1.0 - np.exp(-inv_lambd * x)


nll(start)  # test and warm-up JIT
[12]:
-103168.78482586423
[13]:
%%timeit -r 3 -n 1
m = m_init(nll)  # setup time is negligible
m.migrad();
35.2 ms ± 1.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)

Migrad now runs about 3x faster than the parallelized version, or 9x faster than the original code! I hope you are surprised; this just shows how difficult it is to reason about performance.

Why was parallelization bad for performance? The arrays in this example are too small to benefit from parallel execution: the overhead of breaking the data into chunks, processing them on separate cores, and merging the results back together is too large. This should change in favour of the parallel version as the arrays grow.
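
A quick way to check this (a sketch, not part of the original benchmark; the array size is chosen arbitrarily) is to compile a parallel twin of the serial norm_pdf from above via its py_func attribute and time both on a much larger array:

# compile a parallel version of the same Python source
norm_pdf_par = nb.njit(parallel=True, fastmath=True)(norm_pdf.py_func)

big_x = np.random.rand(10_000_000)
norm_pdf(big_x, 0.5, 0.1)      # warm-up serial JIT
norm_pdf_par(big_x, 0.5, 0.1)  # warm-up parallel JIT

%timeit -r 3 -n 10 norm_pdf(big_x, 0.5, 0.1)
%timeit -r 3 -n 10 norm_pdf_par(big_x, 0.5, 0.1)

On arrays of this size, the parallel version should start to pull ahead on a multi-core machine.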

So why is numba so fast even without parallelization? We can look at the generated assembly code to find out.

[14]:
for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}\n{code}")
signature: (array(float64, 1d, C), float64, float64)
----------------------------------------------------
        .section        __TEXT,__text,regular,pure_instructions
        .macosx_version_min 10, 15
        .section        __TEXT,__literal8,8byte_literals
        .p2align        3
LCPI0_0:
        .quad   4607182418800017408
LCPI0_1:
        .quad   -4620693217682128896
LCPI0_2:
        .quad   4600858325139338833
        .section        __TEXT,__text,regular,pure_instructions
        .globl  __ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        pushq   %r15
        .cfi_def_cfa_offset 24
        pushq   %r14
        .cfi_def_cfa_offset 32
        pushq   %r13
        .cfi_def_cfa_offset 40
        pushq   %r12
        .cfi_def_cfa_offset 48
        pushq   %rbx
        .cfi_def_cfa_offset 56
        subq    $264, %rsp
        .cfi_def_cfa_offset 320
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        .cfi_offset %rbp, -16
        vmovsd  %xmm1, 16(%rsp)
        vmovapd %xmm0, 64(%rsp)
        movq    %rdx, %r12
        movq    %rsi, %r14
        movq    %rdi, %rbx
        movabsq $_NRT_incref, %rax
        movq    %rdx, %rdi
        callq   *%rax
        vmovsd  16(%rsp), %xmm1
        vxorpd  %xmm0, %xmm0, %xmm0
        vucomisd        %xmm0, %xmm1
        je      LBB0_59
        movq    328(%rsp), %rbp
        imulq   $8, %rbp, %rdi
        jo      LBB0_58
        movq    %rbx, 256(%rsp)
        movabsq $LCPI0_0, %rax
        vmovsd  (%rax), %xmm0
        vdivsd  %xmm1, %xmm0, %xmm0
        vmovapd %xmm0, 16(%rsp)
        movabsq $_NRT_MemInfo_alloc_safe_aligned, %r13
        movl    $32, %esi
        callq   *%r13
        vmovapd 16(%rsp), %xmm6
        movq    %rax, %rbx
        movq    24(%rax), %r15
        testq   %rbp, %rbp
        vmovapd 64(%rsp), %xmm7
        jle     LBB0_26
        movq    320(%rsp), %rax
        cmpq    $1, %rbp
        jne     LBB0_6
        leaq    -1(%rbp), %rdx
        movl    %ebp, %ecx
        andl    $7, %ecx
        cmpq    $7, %rdx
        jae     LBB0_8
        xorl    %edx, %edx
        jmp     LBB0_10
LBB0_6:
        cmpq    $16, %rbp
        jb      LBB0_7
        leaq    (%rax,%rbp,8), %rcx
        cmpq    %rcx, %r15
        jae     LBB0_16
        leaq    (%r15,%rbp,8), %rcx
        cmpq    %rax, %rcx
        jbe     LBB0_16
LBB0_7:
        xorl    %ecx, %ecx
LBB0_22:
        movq    %rcx, %rdx
        notq    %rdx
        addq    %rbp, %rdx
        movq    %rbp, %rsi
        andq    $7, %rsi
        je      LBB0_24
        .p2align        4, 0x90
LBB0_23:
        vmovsd  (%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, (%r15,%rcx,8)
        incq    %rcx
        decq    %rsi
        jne     LBB0_23
LBB0_24:
        cmpq    $7, %rdx
        jb      LBB0_26
        .p2align        4, 0x90
LBB0_25:
        vmovsd  (%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, (%r15,%rcx,8)
        vmovsd  8(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 8(%r15,%rcx,8)
        vmovsd  16(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 16(%r15,%rcx,8)
        vmovsd  24(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 24(%r15,%rcx,8)
        vmovsd  32(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 32(%r15,%rcx,8)
        vmovsd  40(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 40(%r15,%rcx,8)
        vmovsd  48(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 48(%r15,%rcx,8)
        vmovsd  56(%rax,%rcx,8), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 56(%r15,%rcx,8)
        addq    $8, %rcx
        cmpq    %rcx, %rbp
        jne     LBB0_25
        jmp     LBB0_26
LBB0_8:
        movq    %rbp, %rsi
        subq    %rcx, %rsi
        xorl    %edx, %edx
        .p2align        4, 0x90
LBB0_9:
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, (%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 8(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 16(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 24(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 32(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 40(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 48(%r15,%rdx,8)
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, 56(%r15,%rdx,8)
        addq    $8, %rdx
        cmpq    %rdx, %rsi
        jne     LBB0_9
LBB0_10:
        testq   %rcx, %rcx
        je      LBB0_26
        leaq    (%r15,%rdx,8), %rdx
        xorl    %esi, %esi
        .p2align        4, 0x90
LBB0_12:
        vmovsd  (%rax), %xmm0
        vsubsd  %xmm7, %xmm0, %xmm0
        vmulsd  %xmm6, %xmm0, %xmm0
        vmovsd  %xmm0, (%rdx,%rsi,8)
        incq    %rsi
        cmpq    %rsi, %rcx
        jne     LBB0_12
LBB0_26:
        movabsq $_NRT_decref, %rax
        movq    %r12, %rdi
        vzeroupper
        callq   *%rax
        imulq   $8, %rbp, %rdi
        jo      LBB0_58
        movq    %rbx, 248(%rsp)
        movl    $32, %esi
        callq   *%r13
        movq    %rax, 240(%rsp)
        movq    24(%rax), %r12
        testq   %rbp, %rbp
        vmovapd 16(%rsp), %xmm0
        jle     LBB0_55
        movq    328(%rsp), %rbp
        cmpq    $1, %rbp
        jne     LBB0_31
        leaq    -1(%rbp), %rax
        movl    %ebp, %r14d
        andl    $3, %r14d
        cmpq    $3, %rax
        jae     LBB0_33
        xorl    %ebx, %ebx
        jmp     LBB0_35
LBB0_31:
        cmpq    $4, %rbp
        jb      LBB0_32
        leaq    (%r15,%rbp,8), %rax
        cmpq    %rax, %r12
        jae     LBB0_41
        leaq    (%r12,%rbp,8), %rax
        cmpq    %r15, %rax
        jbe     LBB0_41
LBB0_32:
        xorl    %ebx, %ebx
LBB0_49:
        movq    %rbx, %r14
        notq    %r14
        movq    328(%rsp), %rbp
        addq    %rbp, %r14
        andq    $3, %rbp
        je      LBB0_52
        movabsq $LCPI0_1, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 64(%rsp)
        movabsq $_exp, %r13
        movabsq $LCPI0_2, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 32(%rsp)
        .p2align        4, 0x90
LBB0_51:
        vmovsd  (%r15,%rbx,8), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        vzeroupper
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, (%r12,%rbx,8)
        incq    %rbx
        decq    %rbp
        jne     LBB0_51
LBB0_52:
        cmpq    $3, %r14
        jb      LBB0_55
        movabsq $LCPI0_1, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 64(%rsp)
        movabsq $_exp, %r14
        movabsq $LCPI0_2, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 32(%rsp)
        .p2align        4, 0x90
LBB0_54:
        vmovsd  (%r15,%rbx,8), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        vzeroupper
        callq   *%r14
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, (%r12,%rbx,8)
        vmovsd  8(%r15,%rbx,8), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r14
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 8(%r12,%rbx,8)
        vmovsd  16(%r15,%rbx,8), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r14
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 16(%r12,%rbx,8)
        vmovsd  24(%r15,%rbx,8), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r14
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 24(%r12,%rbx,8)
        addq    $4, %rbx
        cmpq    %rbx, 328(%rsp)
        jne     LBB0_54
        jmp     LBB0_55
LBB0_33:
        subq    %r14, %rbp
        xorl    %ebx, %ebx
        movabsq $LCPI0_1, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 64(%rsp)
        movabsq $_exp, %r13
        movabsq $LCPI0_2, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 32(%rsp)
        .p2align        4, 0x90
LBB0_34:
        vmovsd  (%r15), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, (%r12,%rbx,8)
        vmovsd  (%r15), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 8(%r12,%rbx,8)
        vmovsd  (%r15), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 16(%r12,%rbx,8)
        vmovsd  (%r15), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, 24(%r12,%rbx,8)
        addq    $4, %rbx
        cmpq    %rbx, %rbp
        jne     LBB0_34
LBB0_35:
        testq   %r14, %r14
        je      LBB0_55
        leaq    (%r12,%rbx,8), %rbx
        xorl    %ebp, %ebp
        movabsq $LCPI0_1, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 64(%rsp)
        movabsq $_exp, %r13
        movabsq $LCPI0_2, %rax
        vmovsd  (%rax), %xmm0
        vmovsd  %xmm0, 32(%rsp)
        .p2align        4, 0x90
LBB0_37:
        vmovsd  (%r15), %xmm0
        vmulsd  %xmm0, %xmm0, %xmm0
        vmulsd  64(%rsp), %xmm0, %xmm0
        callq   *%r13
        vmulsd  16(%rsp), %xmm0, %xmm0
        vmulsd  32(%rsp), %xmm0, %xmm0
        vmovsd  %xmm0, (%rbx,%rbp,8)
        incq    %rbp
        cmpq    %rbp, %r14
        jne     LBB0_37
LBB0_55:
        movq    248(%rsp), %rdi
        movabsq $_NRT_decref, %rax
        vzeroupper
        callq   *%rax
        movq    256(%rsp), %rax
        movq    240(%rsp), %rcx
        movq    %rcx, (%rax)
        movq    $0, 8(%rax)
        movq    328(%rsp), %rcx
        movq    %rcx, 16(%rax)
        movq    $8, 24(%rax)
        movq    %r12, 32(%rax)
        movq    %rcx, 40(%rax)
        movq    $8, 48(%rax)
        xorl    %eax, %eax
LBB0_56:
        addq    $264, %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
LBB0_16:
        movq    %rbp, %rcx
        andq    $-16, %rcx
        vbroadcastsd    %xmm7, %ymm1
        vbroadcastsd    %xmm6, %ymm0
        leaq    -16(%rcx), %rdx
        movq    %rdx, %rdi
        shrq    $4, %rdi
        incq    %rdi
        movl    %edi, %esi
        andl    $1, %esi
        testq   %rdx, %rdx
        je      LBB0_57
        subq    %rsi, %rdi
        xorl    %edx, %edx
        .p2align        4, 0x90
LBB0_18:
        vmovupd (%rax,%rdx,8), %ymm2
        vmovupd 32(%rax,%rdx,8), %ymm3
        vmovupd 64(%rax,%rdx,8), %ymm4
        vmovupd 96(%rax,%rdx,8), %ymm5
        vsubpd  %ymm1, %ymm2, %ymm2
        vsubpd  %ymm1, %ymm3, %ymm3
        vsubpd  %ymm1, %ymm4, %ymm4
        vsubpd  %ymm1, %ymm5, %ymm5
        vmulpd  %ymm0, %ymm2, %ymm2
        vmulpd  %ymm0, %ymm3, %ymm3
        vmulpd  %ymm0, %ymm4, %ymm4
        vmulpd  %ymm0, %ymm5, %ymm5
        vmovupd %ymm2, (%r15,%rdx,8)
        vmovupd %ymm3, 32(%r15,%rdx,8)
        vmovupd %ymm4, 64(%r15,%rdx,8)
        vmovupd %ymm5, 96(%r15,%rdx,8)
        vmovupd 128(%rax,%rdx,8), %ymm2
        vmovupd 160(%rax,%rdx,8), %ymm3
        vmovupd 192(%rax,%rdx,8), %ymm4
        vmovupd 224(%rax,%rdx,8), %ymm5
        vsubpd  %ymm1, %ymm2, %ymm2
        vsubpd  %ymm1, %ymm3, %ymm3
        vsubpd  %ymm1, %ymm4, %ymm4
        vsubpd  %ymm1, %ymm5, %ymm5
        vmulpd  %ymm0, %ymm2, %ymm2
        vmulpd  %ymm0, %ymm3, %ymm3
        vmulpd  %ymm0, %ymm4, %ymm4
        vmulpd  %ymm0, %ymm5, %ymm5
        vmovupd %ymm2, 128(%r15,%rdx,8)
        vmovupd %ymm3, 160(%r15,%rdx,8)
        vmovupd %ymm4, 192(%r15,%rdx,8)
        vmovupd %ymm5, 224(%r15,%rdx,8)
        addq    $32, %rdx
        addq    $-2, %rdi
        jne     LBB0_18
        testq   %rsi, %rsi
        je      LBB0_21
LBB0_20:
        vmovupd (%rax,%rdx,8), %ymm2
        vmovupd 32(%rax,%rdx,8), %ymm3
        vmovupd 64(%rax,%rdx,8), %ymm4
        vmovupd 96(%rax,%rdx,8), %ymm5
        vsubpd  %ymm1, %ymm2, %ymm2
        vsubpd  %ymm1, %ymm3, %ymm3
        vsubpd  %ymm1, %ymm4, %ymm4
        vsubpd  %ymm1, %ymm5, %ymm1
        vmulpd  %ymm0, %ymm2, %ymm2
        vmulpd  %ymm0, %ymm3, %ymm3
        vmulpd  %ymm0, %ymm4, %ymm4
        vmulpd  %ymm0, %ymm1, %ymm0
        vmovupd %ymm2, (%r15,%rdx,8)
        vmovupd %ymm3, 32(%r15,%rdx,8)
        vmovupd %ymm4, 64(%r15,%rdx,8)
        vmovupd %ymm0, 96(%r15,%rdx,8)
LBB0_21:
        cmpq    %rbp, %rcx
        je      LBB0_26
        jmp     LBB0_22
LBB0_41:
        movq    %rbp, %rbx
        andq    $-4, %rbx
        vbroadcastsd    %xmm0, %ymm0
        leaq    -4(%rbx), %rax
        movq    %rax, %rbp
        shrq    $2, %rbp
        incq    %rbp
        movl    %ebp, %ecx
        andl    $3, %ecx
        movabsq $LCPI0_1, %rdx
        movabsq $_exp, %r13
        movabsq $LCPI0_2, %rsi
        cmpq    $12, %rax
        movq    %rcx, 232(%rsp)
        vmovupd %ymm0, 64(%rsp)
        jae     LBB0_43
        xorl    %r14d, %r14d
        jmp     LBB0_45
LBB0_43:
        subq    %rcx, %rbp
        xorl    %r14d, %r14d
        vbroadcastsd    (%rdx), %ymm1
        vmovups %ymm1, 32(%rsp)
        vbroadcastsd    (%rsi), %ymm1
        vmovupd %ymm1, 192(%rsp)
        .p2align        4, 0x90
LBB0_44:
        vmovupd (%r15,%r14,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  32(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 96(%rsp)
        vzeroupper
        callq   *%r13
        vmovapd %xmm0, 128(%rsp)
        vpermilpd       $1, 96(%rsp), %xmm0
        callq   *%r13
        vmovapd 128(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 128(%rsp)
        vmovups 160(%rsp), %ymm0
        vzeroupper
        callq   *%r13
        vmovaps %xmm0, 96(%rsp)
        vpermilpd       $1, 160(%rsp), %xmm0
        callq   *%r13
        vmovapd 96(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, 128(%rsp), %ymm0, %ymm0
        vmulpd  64(%rsp), %ymm0, %ymm0
        vmulpd  192(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, (%r12,%r14,8)
        vmovupd 32(%r15,%r14,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  32(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 96(%rsp)
        vzeroupper
        callq   *%r13
        vmovapd %xmm0, 128(%rsp)
        vpermilpd       $1, 96(%rsp), %xmm0
        callq   *%r13
        vmovapd 128(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 128(%rsp)
        vmovups 160(%rsp), %ymm0
        vzeroupper
        callq   *%r13
        vmovaps %xmm0, 96(%rsp)
        vpermilpd       $1, 160(%rsp), %xmm0
        callq   *%r13
        vmovapd 96(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, 128(%rsp), %ymm0, %ymm0
        vmulpd  64(%rsp), %ymm0, %ymm0
        vmulpd  192(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 32(%r12,%r14,8)
        vmovupd 64(%r15,%r14,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  32(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 96(%rsp)
        vzeroupper
        callq   *%r13
        vmovapd %xmm0, 128(%rsp)
        vpermilpd       $1, 96(%rsp), %xmm0
        callq   *%r13
        vmovapd 128(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 128(%rsp)
        vmovups 160(%rsp), %ymm0
        vzeroupper
        callq   *%r13
        vmovaps %xmm0, 96(%rsp)
        vpermilpd       $1, 160(%rsp), %xmm0
        callq   *%r13
        vmovapd 96(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vinsertf128     $1, 128(%rsp), %ymm0, %ymm0
        vmulpd  64(%rsp), %ymm0, %ymm0
        vmulpd  192(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 64(%r12,%r14,8)
        vmovupd 96(%r15,%r14,8), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  32(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 160(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 96(%rsp)
        vzeroupper
        callq   *%r13
        vmovapd %xmm0, 128(%rsp)
        vpermilpd       $1, 96(%rsp), %xmm0
        callq   *%r13
        vmovapd 128(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 128(%rsp)
        vmovups 160(%rsp), %ymm0
        vzeroupper
        callq   *%r13
        vmovaps %xmm0, 96(%rsp)
        vpermilpd       $1, 160(%rsp), %xmm0
        callq   *%r13
        vmovupd 64(%rsp), %ymm1
        vmovapd 96(%rsp), %xmm2
        vunpcklpd       %xmm0, %xmm2, %xmm0
        vinsertf128     $1, 128(%rsp), %ymm0, %ymm0
        vmulpd  %ymm1, %ymm0, %ymm0
        vmulpd  192(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 96(%r12,%r14,8)
        addq    $16, %r14
        addq    $-4, %rbp
        jne     LBB0_44
LBB0_45:
        movq    232(%rsp), %rbp
        testq   %rbp, %rbp
        je      LBB0_48
        shlq    $3, %r14
        negq    %rbp
        movabsq $LCPI0_1, %rax
        vbroadcastsd    (%rax), %ymm0
        vmovups %ymm0, 128(%rsp)
        movabsq $LCPI0_2, %rax
        vbroadcastsd    (%rax), %ymm0
        vmovupd %ymm0, 96(%rsp)
        .p2align        4, 0x90
LBB0_47:
        vmovupd (%r15,%r14), %ymm0
        vmulpd  %ymm0, %ymm0, %ymm0
        vmulpd  128(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, 32(%rsp)
        vextractf128    $1, %ymm0, %xmm0
        vmovapd %xmm0, 160(%rsp)
        vzeroupper
        callq   *%r13
        vmovapd %xmm0, 192(%rsp)
        vpermilpd       $1, 160(%rsp), %xmm0
        callq   *%r13
        vmovapd 192(%rsp), %xmm1
        vunpcklpd       %xmm0, %xmm1, %xmm0
        vmovapd %xmm0, 192(%rsp)
        vmovups 32(%rsp), %ymm0
        vzeroupper
        callq   *%r13
        vmovaps %xmm0, 160(%rsp)
        vpermilpd       $1, 32(%rsp), %xmm0
        callq   *%r13
        vmovupd 64(%rsp), %ymm1
        vmovapd 160(%rsp), %xmm2
        vunpcklpd       %xmm0, %xmm2, %xmm0
        vinsertf128     $1, 192(%rsp), %ymm0, %ymm0
        vmulpd  %ymm1, %ymm0, %ymm0
        vmulpd  96(%rsp), %ymm0, %ymm0
        vmovupd %ymm0, (%r12,%r14)
        addq    $32, %r14
        incq    %rbp
        jne     LBB0_47
LBB0_48:
        cmpq    328(%rsp), %rbx
        je      LBB0_55
        jmp     LBB0_49
LBB0_57:
        xorl    %edx, %edx
        testq   %rsi, %rsi
        jne     LBB0_20
        jmp     LBB0_21
LBB0_58:
        movabsq $_.const.picklebuf.4970103488, %rax
        jmp     LBB0_60
LBB0_59:
        movabsq $_.const.picklebuf.4979668416, %rax
LBB0_60:
        movq    %rax, (%r14)
        movl    $1, %eax
        jmp     LBB0_56
        .cfi_endproc

        .globl  __ZN7cpython8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
__ZN7cpython8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $352, %rsp
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        movq    %rsi, %rdi
        subq    $8, %rsp
        leaq    112(%rsp), %r10
        movabsq $_.const.norm_pdf, %rsi
        movabsq $_PyArg_UnpackTuple, %rbx
        leaq    128(%rsp), %r8
        leaq    120(%rsp), %r9
        movl    $3, %edx
        movl    $3, %ecx
        xorl    %eax, %eax
        pushq   %r10
        callq   *%rbx
        addq    $16, %rsp
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 192(%rsp)
        vmovups %ymm0, 216(%rsp)
        vmovaps %ymm0, 128(%rsp)
        vmovups %ymm0, 152(%rsp)
        movq    $0, 72(%rsp)
        vmovaps %ymm0, 256(%rsp)
        vmovups %ymm0, 280(%rsp)
        testl   %eax, %eax
        je      LBB1_1
        movabsq $__ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rax
        movq    (%rax), %rbx
        testq   %rbx, %rbx
        je      LBB1_4
        movq    120(%rsp), %rdi
        vxorps  %xmm0, %xmm0, %xmm0
        vmovaps %ymm0, 192(%rsp)
        vmovups %ymm0, 216(%rsp)
        movabsq $_NRT_adapt_ndarray_from_python, %rax
        leaq    192(%rsp), %rsi
        vzeroupper
        callq   *%rax
        testl   %eax, %eax
        jne     LBB1_8
        cmpq    $8, 216(%rsp)
        jne     LBB1_8
        movq    %rbx, 80(%rsp)
        movq    192(%rsp), %rax
        movq    %rax, 64(%rsp)
        movq    200(%rsp), %rax
        movq    %rax, 48(%rsp)
        movq    208(%rsp), %rax
        movq    %rax, 32(%rsp)
        movq    224(%rsp), %rax
        movq    %rax, 56(%rsp)
        movq    232(%rsp), %rax
        movq    %rax, 40(%rsp)
        movq    240(%rsp), %rax
        movq    %rax, 24(%rsp)
        movq    112(%rsp), %rdi
        movabsq $_PyNumber_Float, %r13
        callq   *%r13
        movq    %rax, %rbx
        movabsq $_PyFloat_AsDouble, %r14
        movq    %rax, %rdi
        callq   *%r14
        vmovsd  %xmm0, 96(%rsp)
        movabsq $_Py_DecRef, %r15
        movq    %rbx, %rdi
        callq   *%r15
        movabsq $_PyErr_Occurred, %r12
        callq   *%r12
        testq   %rax, %rax
        jne     LBB1_10
        movq    104(%rsp), %rdi
        callq   *%r13
        movq    %rax, %rbx
        movq    %rax, %rdi
        callq   *%r14
        vmovsd  %xmm0, 88(%rsp)
        movq    %rbx, %rdi
        callq   *%r15
        callq   *%r12
        testq   %rax, %rax
        jne     LBB1_10
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 152(%rsp)
        vmovaps %ymm0, 128(%rsp)
        subq    $8, %rsp
        movabsq $__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rax
        leaq    136(%rsp), %rdi
        leaq    80(%rsp), %rsi
        movl    $8, %r9d
        movq    40(%rsp), %r8
        movq    72(%rsp), %rbx
        movq    %rbx, %rdx
        movq    56(%rsp), %rcx
        vmovsd  104(%rsp), %xmm0
        vmovsd  96(%rsp), %xmm1
        pushq   32(%rsp)
        pushq   56(%rsp)
        pushq   80(%rsp)
        vzeroupper
        callq   *%rax
        addq    $32, %rsp
        movl    %eax, %r12d
        movq    72(%rsp), %r14
        movq    128(%rsp), %rax
        movq    %rax, 56(%rsp)
        movq    136(%rsp), %r15
        movq    144(%rsp), %r13
        movq    152(%rsp), %rax
        movq    %rax, 48(%rsp)
        movq    160(%rsp), %rax
        movq    %rax, 40(%rsp)
        movq    168(%rsp), %rax
        movq    %rax, 32(%rsp)
        movq    176(%rsp), %rax
        movq    %rax, 24(%rsp)
        movabsq $_NRT_decref, %rax
        movq    %rbx, %rdi
        callq   *%rax
        cmpl    $-2, %r12d
        je      LBB1_17
        testl   %r12d, %r12d
        jne     LBB1_14
        movq    80(%rsp), %rax
        movq    24(%rax), %rdi
        testq   %rdi, %rdi
        je      LBB1_20
        movabsq $_PyList_GetItem, %rax
        xorl    %esi, %esi
        callq   *%rax
        movq    %rax, %rcx
        jmp     LBB1_21
LBB1_17:
        movabsq $__Py_NoneStruct, %rbx
        movabsq $_Py_IncRef, %rax
        movq    %rbx, %rdi
        callq   *%rax
        movq    %rbx, %rax
        jmp     LBB1_2
LBB1_14:
        jle     LBB1_22
        movabsq $_PyErr_Clear, %rax
        callq   *%rax
        movq    16(%r14), %rdx
        movl    8(%r14), %esi
        movq    (%r14), %rdi
        movabsq $_numba_unpickle, %rax
        callq   *%rax
        testq   %rax, %rax
        je      LBB1_1
        movabsq $_numba_do_raise, %rcx
        movq    %rax, %rdi
        callq   *%rcx
        jmp     LBB1_1
LBB1_20:
        movabsq $_PyExc_RuntimeError, %rdi
        movabsq $"_.const.`env.consts` is NULL in `read_const`", %rsi
        movabsq $_PyErr_SetString, %rax
        callq   *%rax
        xorl    %ecx, %ecx
LBB1_21:
        movq    56(%rsp), %rax
        movq    %rax, 256(%rsp)
        movq    %r15, 264(%rsp)
        movq    %r13, 272(%rsp)
        movq    48(%rsp), %rax
        movq    %rax, 280(%rsp)
        movq    40(%rsp), %rax
        movq    %rax, 288(%rsp)
        movq    32(%rsp), %rax
        movq    %rax, 296(%rsp)
        movq    24(%rsp), %rax
        movq    %rax, 304(%rsp)
        movabsq $_NRT_adapt_ndarray_to_python, %rax
        leaq    256(%rsp), %rdi
        movl    $1, %esi
        movl    $1, %edx
        callq   *%rax
        jmp     LBB1_2
LBB1_22:
        cmpl    $-3, %r12d
        je      LBB1_25
        cmpl    $-1, %r12d
        je      LBB1_1
        movabsq $_PyExc_SystemError, %rdi
        movabsq $"_.const.unknown error when calling native function", %rsi
        jmp     LBB1_5
LBB1_25:
        movabsq $_PyExc_StopIteration, %rdi
        movabsq $_PyErr_SetNone, %rax
        callq   *%rax
        jmp     LBB1_1
LBB1_10:
        movabsq $_NRT_decref, %rax
        movq    64(%rsp), %rdi
        callq   *%rax
        jmp     LBB1_1
LBB1_4:
        movabsq $_PyExc_RuntimeError, %rdi
        movabsq $"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd", %rsi
        jmp     LBB1_5
LBB1_8:
        movabsq $_PyExc_TypeError, %rdi
        movabsq $"_.const.can't unbox array from PyObject into native value.  The object maybe of a different type", %rsi
LBB1_5:
        movabsq $_PyErr_SetString, %rax
        vzeroupper
        callq   *%rax
LBB1_1:
        xorl    %eax, %eax
LBB1_2:
        leaq    -40(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        vzeroupper
        retq
        .cfi_endproc

        .globl  _cfunc._ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
        .p2align        4, 0x90
_cfunc._ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        pushq   %r15
        pushq   %r14
        pushq   %r13
        pushq   %r12
        pushq   %rbx
        andq    $-32, %rsp
        subq    $192, %rsp
        .cfi_offset %rbx, -56
        .cfi_offset %r12, -48
        .cfi_offset %r13, -40
        .cfi_offset %r14, -32
        .cfi_offset %r15, -24
        movq    %r8, %rax
        movq    %rcx, %r8
        movq    %rdx, %rcx
        movq    %rsi, %rdx
        movq    %rdi, %r14
        vmovaps 16(%rbp), %xmm2
        vxorps  %xmm3, %xmm3, %xmm3
        vmovups %ymm3, 120(%rsp)
        vmovaps %ymm3, 96(%rsp)
        movq    $0, 48(%rsp)
        vmovups %xmm2, 8(%rsp)
        movq    %r9, (%rsp)
        movabsq $__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rbx
        leaq    96(%rsp), %rdi
        leaq    48(%rsp), %rsi
        movq    %rax, %r9
        vzeroupper
        callq   *%rbx
        movl    %eax, %ebx
        movq    48(%rsp), %r15
        movq    96(%rsp), %rax
        movq    104(%rsp), %r12
        movq    112(%rsp), %r13
        movq    120(%rsp), %rcx
        movq    128(%rsp), %rdx
        movq    136(%rsp), %rsi
        movq    144(%rsp), %rdi
        movl    $0, 44(%rsp)
        cmpl    $-2, %ebx
        je      LBB2_5
        testl   %ebx, %ebx
        je      LBB2_5
        movq    %rdi, 56(%rsp)
        movq    %rsi, 64(%rsp)
        movq    %rdx, 72(%rsp)
        movq    %rcx, 80(%rsp)
        movq    %rax, 88(%rsp)
        movabsq $_numba_gil_ensure, %rax
        leaq    44(%rsp), %rdi
        callq   *%rax
        testl   %ebx, %ebx
        jle     LBB2_6
        movabsq $_PyErr_Clear, %rax
        callq   *%rax
        movq    16(%r15), %rdx
        movl    8(%r15), %esi
        movq    (%r15), %rdi
        movabsq $_numba_unpickle, %rax
        callq   *%rax
        testq   %rax, %rax
        je      LBB2_4
        movabsq $_numba_do_raise, %rcx
        movq    %rax, %rdi
        callq   *%rcx
LBB2_4:
        movabsq $"_.const.<numba.core.cpu.CPUContext object at 0x12899e6d0>", %rdi
        movabsq $_PyUnicode_FromString, %rax
        callq   *%rax
        movq    %rax, %rbx
        movabsq $_PyErr_WriteUnraisable, %rax
        movq    %rbx, %rdi
        callq   *%rax
        movabsq $_Py_DecRef, %rax
        movq    %rbx, %rdi
        callq   *%rax
        movabsq $_numba_gil_release, %rax
        leaq    44(%rsp), %rdi
        callq   *%rax
        movq    88(%rsp), %rax
        movq    80(%rsp), %rcx
        movq    72(%rsp), %rdx
        movq    64(%rsp), %rsi
        movq    56(%rsp), %rdi
LBB2_5:
        movq    %rax, (%r14)
        movq    %r12, 8(%r14)
        movq    %r13, 16(%r14)
        movq    %rcx, 24(%r14)
        movq    %rdx, 32(%r14)
        movq    %rsi, 40(%r14)
        movq    %rdi, 48(%r14)
        movq    %r14, %rax
        leaq    -40(%rbp), %rsp
        popq    %rbx
        popq    %r12
        popq    %r13
        popq    %r14
        popq    %r15
        popq    %rbp
        retq
LBB2_6:
        cmpl    $-3, %ebx
        je      LBB2_10
        cmpl    $-1, %ebx
        je      LBB2_4
        movabsq $_PyExc_SystemError, %rdi
        movabsq $"_.const.unknown error when calling native function.1", %rsi
        movabsq $_PyErr_SetString, %rax
        callq   *%rax
        jmp     LBB2_4
LBB2_10:
        movabsq $_PyExc_StopIteration, %rdi
        movabsq $_PyErr_SetNone, %rax
        callq   *%rax
        jmp     LBB2_4
        .cfi_endproc

        .globl  _NRT_incref
        .weak_def_can_be_hidden _NRT_incref
        .p2align        4, 0x90
_NRT_incref:
        testq   %rdi, %rdi
        je      LBB3_1
        lock            incq    (%rdi)
        retq
LBB3_1:
        retq

        .globl  _NRT_decref
        .weak_def_can_be_hidden _NRT_decref
        .p2align        4, 0x90
_NRT_decref:
        .cfi_startproc
        testq   %rdi, %rdi
        je      LBB4_2
        ##MEMBARRIER
        lock            decq    (%rdi)
        je      LBB4_3
LBB4_2:
        retq
LBB4_3:
        ##MEMBARRIER
        movabsq $_NRT_MemInfo_call_dtor, %rax
        jmpq    *%rax
        .cfi_endproc

        .comm   __ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd,8,3
        .section        __DATA,__const
        .p2align        4
_.const.picklebuf.4979668416:
        .quad   _.const.pickledata.4979668416
        .long   69
        .space  4
        .quad   _.const.pickledata.4979668416.sha1

        .p2align        4
_.const.picklebuf.4970103488:
        .quad   _.const.pickledata.4970103488
        .long   137
        .space  4
        .quad   _.const.pickledata.4970103488.sha1

        .section        __TEXT,__const
        .p2align        4
_.const.pickledata.4970103488:
        .ascii  "\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.4970103488.sha1:
        .ascii  "X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"

        .p2align        4
_.const.pickledata.4979668416:
        .ascii  "\200\004\225:\000\000\000\000\000\000\000\214\bbuiltins\224\214\021ZeroDivisionError\224\223\224\214\020division by zero\224\205\224N\207\224."

        .p2align        4
_.const.pickledata.4979668416.sha1:
        .ascii  "\262\200\b\240\370\213\255_\360\360$>\204\332\271\f\253\031\263f"

_.const.norm_pdf:
        .asciz  "norm_pdf"

        .p2align        4
"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd":
        .asciz  "missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd"

        .p2align        4
"_.const.can't unbox array from PyObject into native value.  The object maybe of a different type":
        .asciz  "can't unbox array from PyObject into native value.  The object maybe of a different type"

        .p2align        4
"_.const.`env.consts` is NULL in `read_const`":
        .asciz  "`env.consts` is NULL in `read_const`"

        .p2align        4
"_.const.unknown error when calling native function":
        .asciz  "unknown error when calling native function"

        .p2align        4
"_.const.<numba.core.cpu.CPUContext object at 0x12899e6d0>":
        .asciz  "<numba.core.cpu.CPUContext object at 0x12899e6d0>"

        .p2align        4
"_.const.unknown error when calling native function.1":
        .asciz  "unknown error when calling native function"

        .comm   __ZN08NumbaEnv13$3cdynamic$3e35__numba_array_expr_0x12834bb20$2435Eddd,8,3
        .comm   __ZN08NumbaEnv5numba2np7npyimpl20_broadcast_onto$2430Ex8int64$2ax8int64$2a,8,3
        .comm   __ZN08NumbaEnv13$3cdynamic$3e35__numba_array_expr_0x128dc18b0$2436Edd,8,3
        .comm   __ZN08NumbaEnv5numba7cpython7numbers14int_power_impl12$3clocals$3e14int_power$2421Edx,8,3
.subsections_via_symbols

This code section is very long, but the assembly grammar is very simple. Constants start with a dot, SOMETHING: is a jump label for the assembly equivalent of goto, and everything else is an instruction, with its name on the left and its arguments on the right.

You can google all the instructions. The interesting ones are those that end with pd: these are packed SIMD instructions that operate on several doubles at once (four per 256-bit ymm register in this listing). This is where the speed comes from. There is a lot of repetition, because the optimizer partially unrolled some loops to make them faster. Unrolled loops only work if the remaining chunk of data is large enough, and since the compiler does not know the length of the incoming array, it also generates sections that handle shorter chunks, plus the code to select which section to use. Finally, there is some code that does the translation from and to Python objects, with the corresponding error handling.
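
If you do not want to scan the whole listing by eye, a rough way to gauge how much of it is vectorized (a small sketch reusing the inspect_asm output from above) is to count the packed-double instructions:

for signature, code in norm_pdf.inspect_asm().items():
    # packed SIMD instructions on doubles end in "pd", e.g. vsubpd, vmulpd, vmovupd
    ops = [line.split()[0] for line in code.splitlines() if line.strip()]
    n_packed = sum(op.endswith("pd") for op in ops)
    print(signature, "->", n_packed, "packed-double instructions")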

We don’t need to write SIMD instructions by hand; the optimizer does it for us, and in a very sophisticated way.