Acceleration with Numba
We show how the computation of cost functions can be dramatically accelerated with numba’s JIT compiler.
The run-time of iminuit is usually dominated by the execution time of the cost function. To get good performance, it is recommended to use array arithmetic and scipy and numpy functions in the body of the cost function. Python loops should be avoided, but if they are unavoidable, numba can help. Numba can also parallelize numerical calculations to make full use of multi-core CPUs and even do computations on the GPU.
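As a minimal illustration of that point (a sketch that is not part of the fit below; the names py_sum and nb_sum are made up for this example), the same Python loop becomes fast once it is compiled with the njit decorator:

# minimal sketch (not part of the fit below): the same Python loop,
# interpreted vs. JIT-compiled with numba
import numpy as np
import numba as nb

def py_sum(a):
    # plain Python loop, executed by the interpreter
    total = 0.0
    for v in a:
        total += v
    return total

@nb.njit  # identical loop, compiled to machine code on the first call
def nb_sum(a):
    total = 0.0
    for v in a:
        total += v
    return total

a = np.random.rand(1_000_000)
assert np.isclose(py_sum(a), nb_sum(a))  # same result, but nb_sum is much faster after warm-up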
[1]:
# !pip install matplotlib numpy numba scipy iminuit
from iminuit import Minuit
import numpy as np
import numba as nb
import math
from scipy.stats import expon, norm
from matplotlib import pyplot as plt
from argparse import Namespace
The standard fit in particle physics is the fit of a peak over some smooth background. We generate a Gaussian peak over an exponential background, using scipy.
[2]:
np.random.seed(1) # fix seed
# true parameters for signal and background
truth = Namespace(n_sig=2000, f_bkg=10, sig=(5.0, 0.5), bkg=(0.0, 4.0))
n_bkg = truth.n_sig * truth.f_bkg
# make a data set
x = np.empty(truth.n_sig + n_bkg)
# fill x with signal and background events
x[: truth.n_sig] = norm(*truth.sig).rvs(truth.n_sig)
x[truth.n_sig :] = expon(*truth.bkg).rvs(n_bkg)
# cut a range in x
xrange = np.array((1.0, 9.0))
ma = (xrange[0] < x) & (x < xrange[1])
x = x[ma]
plt.hist(
    (x[truth.n_sig :], x[: truth.n_sig]),
    bins=50,
    stacked=True,
    label=("background", "signal"),
)
plt.xlabel("x")
plt.legend();

[3]:
# ideal starting values for iminuit
start = np.array((truth.n_sig, n_bkg, truth.sig[0], truth.sig[1], truth.bkg[1]))

# iminuit instance factory, will be called a lot in the benchmarks below
def m_init(fcn):
    m = Minuit(fcn, start, name=("ns", "nb", "mu", "sigma", "lambd"))
    m.limits = ((0, None), (0, None), None, (0, None), (0, None))
    m.errordef = Minuit.LIKELIHOOD
    return m
[4]:
# extended likelihood (https://doi.org/10.1016/0168-9002(90)91334-8)
# this version uses numpy and scipy and array arithmetic
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    s = norm(mu, sigma)
    b = expon(0, lambd)
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = s.cdf(xrange)
    bn = b.cdf(xrange)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    return (n_sig + n_bkg) - np.sum(
        np.log(s.pdf(x) / sn * n_sig + b.pdf(x) / bn * n_bkg)
    )
nll(start)
[4]:
-103168.78482586428
[5]:
%%timeit -r 3 -n 1
m = m_init(nll) # setup time is negligible
m.migrad();
304 ms ± 1.96 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Let’s see whether we can beat that. The code above is already pretty fast, because numpy and scipy routines are fast, and we spend most of the time in those. But these implementations do not parallelize the execution and are not optimised for this particular CPU, unlike numba-jitted functions.
To use numba, in theory we just need to put the njit decorator on top of the function, but often that does not work out of the box. numba understands many numpy functions, but not scipy. We must evaluate the code that uses scipy in ‘object mode’, which is numba-speak for calling into the Python interpreter.
[6]:
# first attempt to use numba
@nb.njit(parallel=True)
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    with nb.objmode(spdf="float64[:]", bpdf="float64[:]", sn="float64", bn="float64"):
        s = norm(mu, sigma)
        b = expon(0, lambd)
        # normalisation factors are needed for pdfs, since x range is restricted
        sn = np.diff(s.cdf(xrange))[0]
        bn = np.diff(b.cdf(xrange))[0]
        spdf = s.pdf(x)
        bpdf = b.pdf(x)
    no = n_sig + n_bkg
    return no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
nll(start) # test and warm-up JIT
[6]:
-103168.78482586429
[7]:
%%timeit -r 3 -n 1 m = m_init(nll)
m.migrad()
347 ms ± 18.6 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
It is even a bit slower. :( Let’s break the original function down by parts to see why.
[8]:
# let's time the body of the function
n_sig, n_bkg, mu, sigma, lambd = start
s = norm(mu, sigma)
b = expon(0, lambd)
# normalisation factors are needed for pdfs, since x range is restricted
sn = np.diff(s.cdf(xrange))[0]
bn = np.diff(b.cdf(xrange))[0]
spdf = s.pdf(x)
bpdf = b.pdf(x)
no = n_sig + n_bkg
# no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
%timeit -r 3 -n 100 norm(*start[2:4]).pdf(x)
%timeit -r 3 -n 500 expon(0, start[4]).pdf(x)
%timeit -r 3 -n 1000 np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
1.29 ms ± 66 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
1.14 ms ± 46.2 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
134 µs ± 2.94 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
Most of the time is spent in norm and expon, which numba could not accelerate, and the total time is dominated by the slowest part.
This, unfortunately, means we have to do much more manual work to make the function faster, since we have to replace the scipy routines with Python code that numba can accelerate and run in parallel.
[9]:
kwd = {"parallel": True, "fastmath": True}
@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
return np.sum(np.log(fs * spdf + fb * bpdf))
@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
invs = 1.0 / sigma
z = (x - mu) * invs
invnorm = 1 / np.sqrt(2 * np.pi) * invs
return np.exp(-0.5 * z ** 2) * invnorm
@nb.njit(**kwd)
def nb_erf(x):
y = np.empty_like(x)
for i in nb.prange(len(x)):
y[i] = math.erf(x[i])
return y
@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
invs = 1.0 / (sigma * np.sqrt(2))
z = (x - mu) * invs
return 0.5 * (1 + nb_erf(z))
@nb.njit(**kwd)
def expon_pdf(x, lambd):
inv_lambd = 1.0 / lambd
return inv_lambd * np.exp(-inv_lambd * x)
@nb.njit(**kwd)
def expon_cdf(x, lambd):
inv_lambd = 1.0 / lambd
return 1.0 - np.exp(-inv_lambd * x)
def nll(par):
n_sig, n_bkg, mu, sigma, lambd = par
# normalisation factors are needed for pdfs, since x range is restricted
sn = norm_cdf(xrange, mu, sigma)
bn = expon_cdf(xrange, lambd)
sn = sn[1] - sn[0]
bn = bn[1] - bn[0]
spdf = norm_pdf(x, mu, sigma)
bpdf = expon_pdf(x, lambd)
no = n_sig + n_bkg
return no - sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)
nll(start) # test and warm-up JIT
[9]:
-103168.78482586428
Let’s see how well these versions do:
[10]:
%timeit -r 3 -n 100 norm_pdf(x, *start[2:4])
%timeit -r 3 -n 500 expon_pdf(x, start[4])
%timeit -r 3 -n 1000 sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)
209 µs ± 13.5 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
115 µs ± 2.57 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
115 µs ± 2.01 µs per loop (mean ± std. dev. of 3 runs, 1000 loops each)
Basically no improvement for sum_log, but the pdf calculation was drastically accelerated. Since this was the bottleneck before, we expect Migrad to finish faster now as well.
[11]:
%%timeit -r 3 -n 1
m = m_init(nll) # setup time is negligible
m.migrad();
96.1 ms ± 1.13 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Success! We managed to get a roughly 3x speed improvement over the initial code. This is impressive, but it cost us a lot of developer time. This is not always a good trade-off, especially if you consider that library routines are heavily tested, while you always need to test your own code in addition to writing it.
By putting these faster functions into a library, however, we would only have to pay the developer cost once.
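To illustrate the reuse (a sketch, not part of the benchmarks; nll_frac and its signal-fraction parametrization are made up here), the jitted pdfs and cdfs above can serve as building blocks for other cost functions, for example an ordinary (non-extended) likelihood:

# sketch: reuse the jitted pdfs/cdfs in a different cost function
# (nll_frac and its signal-fraction parametrization are illustrative only)
def nll_frac(par):
    f_sig, mu, sigma, lambd = par  # f_sig is the signal fraction
    sn = norm_cdf(xrange, mu, sigma)
    bn = expon_cdf(xrange, lambd)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    spdf = norm_pdf(x, mu, sigma)
    bpdf = expon_pdf(x, lambd)
    return -sum_log(f_sig / sn, spdf, (1.0 - f_sig) / bn, bpdf)

m_frac = Minuit(nll_frac, np.array((0.1, 5.0, 0.5, 4.0)), name=("f_sig", "mu", "sigma", "lambd"))
m_frac.limits = ((0, 1), None, (0, None), (0, None))
m_frac.errordef = Minuit.LIKELIHOOD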
The final question is how much of the speed increase came from the parallelization and how much from the generally optimized code that numba generated for our specific CPU. Let’s turn off parallelization and see how fast the functions are then.
[12]:
kwd = {"parallel": False, "fastmath": True}
@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
return np.sum(np.log(fs * spdf + fb * bpdf))
@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
invs = 1.0 / sigma
z = (x - mu) * invs
invnorm = 1 / np.sqrt(2 * np.pi) * invs
return np.exp(-0.5 * z ** 2) * invnorm
@nb.njit(**kwd)
def nb_erf(x):
y = np.empty_like(x)
for i in nb.prange(len(x)):
y[i] = math.erf(x[i])
return y
@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
invs = 1.0 / (sigma * np.sqrt(2))
z = (x - mu) * invs
return 0.5 * (1 + nb_erf(z))
@nb.njit(**kwd)
def expon_pdf(x, lambd):
inv_lambd = 1.0 / lambd
return inv_lambd * np.exp(-inv_lambd * x)
@nb.njit(**kwd)
def expon_cdf(x, lambd):
inv_lambd = 1.0 / lambd
return 1.0 - np.exp(-inv_lambd * x)
nll(start) # test and warm-up JIT
[12]:
-103168.78482586423
[13]:
%%timeit -r 3 -n 1
m = m_init(nll) # setup time is negligible
m.migrad();
35.2 ms ± 1.4 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Migrad runs about 3x faster now than the version with parallelization, or 9x faster than the original version! I hope you are surprised; this just shows how difficult it is to reason about performance.
Why was parallelization bad for performance? The arrays in this example are too small to benefit from parallel execution: the overhead of splitting the data into chunks, processing them in parallel, and merging the results back together is too large. This should improve as the arrays get larger.
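We can check this empirically (a sketch; norm_pdf_par is compiled here only for the comparison and is not used elsewhere) by timing the serial norm_pdf against a parallel variant for growing array sizes and looking for the crossover point:

# sketch: where does parallelization start to pay off?
# norm_pdf_par is compiled here only for this comparison
import timeit

norm_pdf_par = nb.njit(parallel=True, fastmath=True)(norm_pdf.py_func)

for n in (1_000, 100_000, 10_000_000):
    xl = np.random.rand(n)
    norm_pdf(xl, 0.5, 0.1)      # warm-up (serial)
    norm_pdf_par(xl, 0.5, 0.1)  # warm-up (parallel, triggers compilation)
    t_ser = min(timeit.repeat(lambda: norm_pdf(xl, 0.5, 0.1), number=10, repeat=3))
    t_par = min(timeit.repeat(lambda: norm_pdf_par(xl, 0.5, 0.1), number=10, repeat=3))
    print(f"n={n:>10}: serial {t_ser * 1e3:.2f} ms, parallel {t_par * 1e3:.2f} ms (10 calls each)")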
So why is numba so fast even without parallelization? We can look at the assembly code it generated.
[14]:
for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}\n{code}")
signature: (array(float64, 1d, C), float64, float64)
----------------------------------------------------
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 15
.section __TEXT,__literal8,8byte_literals
.p2align 3
LCPI0_0:
.quad 4607182418800017408
LCPI0_1:
.quad -4620693217682128896
LCPI0_2:
.quad 4600858325139338833
.section __TEXT,__text,regular,pure_instructions
.globl __ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
pushq %r15
.cfi_def_cfa_offset 24
pushq %r14
.cfi_def_cfa_offset 32
pushq %r13
.cfi_def_cfa_offset 40
pushq %r12
.cfi_def_cfa_offset 48
pushq %rbx
.cfi_def_cfa_offset 56
subq $264, %rsp
.cfi_def_cfa_offset 320
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
.cfi_offset %rbp, -16
vmovsd %xmm1, 16(%rsp)
vmovapd %xmm0, 64(%rsp)
movq %rdx, %r12
movq %rsi, %r14
movq %rdi, %rbx
movabsq $_NRT_incref, %rax
movq %rdx, %rdi
callq *%rax
vmovsd 16(%rsp), %xmm1
vxorpd %xmm0, %xmm0, %xmm0
vucomisd %xmm0, %xmm1
je LBB0_59
movq 328(%rsp), %rbp
imulq $8, %rbp, %rdi
jo LBB0_58
movq %rbx, 256(%rsp)
movabsq $LCPI0_0, %rax
vmovsd (%rax), %xmm0
vdivsd %xmm1, %xmm0, %xmm0
vmovapd %xmm0, 16(%rsp)
movabsq $_NRT_MemInfo_alloc_safe_aligned, %r13
movl $32, %esi
callq *%r13
vmovapd 16(%rsp), %xmm6
movq %rax, %rbx
movq 24(%rax), %r15
testq %rbp, %rbp
vmovapd 64(%rsp), %xmm7
jle LBB0_26
movq 320(%rsp), %rax
cmpq $1, %rbp
jne LBB0_6
leaq -1(%rbp), %rdx
movl %ebp, %ecx
andl $7, %ecx
cmpq $7, %rdx
jae LBB0_8
xorl %edx, %edx
jmp LBB0_10
LBB0_6:
cmpq $16, %rbp
jb LBB0_7
leaq (%rax,%rbp,8), %rcx
cmpq %rcx, %r15
jae LBB0_16
leaq (%r15,%rbp,8), %rcx
cmpq %rax, %rcx
jbe LBB0_16
LBB0_7:
xorl %ecx, %ecx
LBB0_22:
movq %rcx, %rdx
notq %rdx
addq %rbp, %rdx
movq %rbp, %rsi
andq $7, %rsi
je LBB0_24
.p2align 4, 0x90
LBB0_23:
vmovsd (%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, (%r15,%rcx,8)
incq %rcx
decq %rsi
jne LBB0_23
LBB0_24:
cmpq $7, %rdx
jb LBB0_26
.p2align 4, 0x90
LBB0_25:
vmovsd (%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, (%r15,%rcx,8)
vmovsd 8(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 8(%r15,%rcx,8)
vmovsd 16(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 16(%r15,%rcx,8)
vmovsd 24(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 24(%r15,%rcx,8)
vmovsd 32(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 32(%r15,%rcx,8)
vmovsd 40(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 40(%r15,%rcx,8)
vmovsd 48(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 48(%r15,%rcx,8)
vmovsd 56(%rax,%rcx,8), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 56(%r15,%rcx,8)
addq $8, %rcx
cmpq %rcx, %rbp
jne LBB0_25
jmp LBB0_26
LBB0_8:
movq %rbp, %rsi
subq %rcx, %rsi
xorl %edx, %edx
.p2align 4, 0x90
LBB0_9:
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, (%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 8(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 16(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 24(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 32(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 40(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 48(%r15,%rdx,8)
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, 56(%r15,%rdx,8)
addq $8, %rdx
cmpq %rdx, %rsi
jne LBB0_9
LBB0_10:
testq %rcx, %rcx
je LBB0_26
leaq (%r15,%rdx,8), %rdx
xorl %esi, %esi
.p2align 4, 0x90
LBB0_12:
vmovsd (%rax), %xmm0
vsubsd %xmm7, %xmm0, %xmm0
vmulsd %xmm6, %xmm0, %xmm0
vmovsd %xmm0, (%rdx,%rsi,8)
incq %rsi
cmpq %rsi, %rcx
jne LBB0_12
LBB0_26:
movabsq $_NRT_decref, %rax
movq %r12, %rdi
vzeroupper
callq *%rax
imulq $8, %rbp, %rdi
jo LBB0_58
movq %rbx, 248(%rsp)
movl $32, %esi
callq *%r13
movq %rax, 240(%rsp)
movq 24(%rax), %r12
testq %rbp, %rbp
vmovapd 16(%rsp), %xmm0
jle LBB0_55
movq 328(%rsp), %rbp
cmpq $1, %rbp
jne LBB0_31
leaq -1(%rbp), %rax
movl %ebp, %r14d
andl $3, %r14d
cmpq $3, %rax
jae LBB0_33
xorl %ebx, %ebx
jmp LBB0_35
LBB0_31:
cmpq $4, %rbp
jb LBB0_32
leaq (%r15,%rbp,8), %rax
cmpq %rax, %r12
jae LBB0_41
leaq (%r12,%rbp,8), %rax
cmpq %r15, %rax
jbe LBB0_41
LBB0_32:
xorl %ebx, %ebx
LBB0_49:
movq %rbx, %r14
notq %r14
movq 328(%rsp), %rbp
addq %rbp, %r14
andq $3, %rbp
je LBB0_52
movabsq $LCPI0_1, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 64(%rsp)
movabsq $_exp, %r13
movabsq $LCPI0_2, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 32(%rsp)
.p2align 4, 0x90
LBB0_51:
vmovsd (%r15,%rbx,8), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
vzeroupper
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, (%r12,%rbx,8)
incq %rbx
decq %rbp
jne LBB0_51
LBB0_52:
cmpq $3, %r14
jb LBB0_55
movabsq $LCPI0_1, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 64(%rsp)
movabsq $_exp, %r14
movabsq $LCPI0_2, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 32(%rsp)
.p2align 4, 0x90
LBB0_54:
vmovsd (%r15,%rbx,8), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
vzeroupper
callq *%r14
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, (%r12,%rbx,8)
vmovsd 8(%r15,%rbx,8), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r14
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 8(%r12,%rbx,8)
vmovsd 16(%r15,%rbx,8), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r14
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 16(%r12,%rbx,8)
vmovsd 24(%r15,%rbx,8), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r14
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 24(%r12,%rbx,8)
addq $4, %rbx
cmpq %rbx, 328(%rsp)
jne LBB0_54
jmp LBB0_55
LBB0_33:
subq %r14, %rbp
xorl %ebx, %ebx
movabsq $LCPI0_1, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 64(%rsp)
movabsq $_exp, %r13
movabsq $LCPI0_2, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 32(%rsp)
.p2align 4, 0x90
LBB0_34:
vmovsd (%r15), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, (%r12,%rbx,8)
vmovsd (%r15), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 8(%r12,%rbx,8)
vmovsd (%r15), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 16(%r12,%rbx,8)
vmovsd (%r15), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 24(%r12,%rbx,8)
addq $4, %rbx
cmpq %rbx, %rbp
jne LBB0_34
LBB0_35:
testq %r14, %r14
je LBB0_55
leaq (%r12,%rbx,8), %rbx
xorl %ebp, %ebp
movabsq $LCPI0_1, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 64(%rsp)
movabsq $_exp, %r13
movabsq $LCPI0_2, %rax
vmovsd (%rax), %xmm0
vmovsd %xmm0, 32(%rsp)
.p2align 4, 0x90
LBB0_37:
vmovsd (%r15), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd 64(%rsp), %xmm0, %xmm0
callq *%r13
vmulsd 16(%rsp), %xmm0, %xmm0
vmulsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, (%rbx,%rbp,8)
incq %rbp
cmpq %rbp, %r14
jne LBB0_37
LBB0_55:
movq 248(%rsp), %rdi
movabsq $_NRT_decref, %rax
vzeroupper
callq *%rax
movq 256(%rsp), %rax
movq 240(%rsp), %rcx
movq %rcx, (%rax)
movq $0, 8(%rax)
movq 328(%rsp), %rcx
movq %rcx, 16(%rax)
movq $8, 24(%rax)
movq %r12, 32(%rax)
movq %rcx, 40(%rax)
movq $8, 48(%rax)
xorl %eax, %eax
LBB0_56:
addq $264, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
LBB0_16:
movq %rbp, %rcx
andq $-16, %rcx
vbroadcastsd %xmm7, %ymm1
vbroadcastsd %xmm6, %ymm0
leaq -16(%rcx), %rdx
movq %rdx, %rdi
shrq $4, %rdi
incq %rdi
movl %edi, %esi
andl $1, %esi
testq %rdx, %rdx
je LBB0_57
subq %rsi, %rdi
xorl %edx, %edx
.p2align 4, 0x90
LBB0_18:
vmovupd (%rax,%rdx,8), %ymm2
vmovupd 32(%rax,%rdx,8), %ymm3
vmovupd 64(%rax,%rdx,8), %ymm4
vmovupd 96(%rax,%rdx,8), %ymm5
vsubpd %ymm1, %ymm2, %ymm2
vsubpd %ymm1, %ymm3, %ymm3
vsubpd %ymm1, %ymm4, %ymm4
vsubpd %ymm1, %ymm5, %ymm5
vmulpd %ymm0, %ymm2, %ymm2
vmulpd %ymm0, %ymm3, %ymm3
vmulpd %ymm0, %ymm4, %ymm4
vmulpd %ymm0, %ymm5, %ymm5
vmovupd %ymm2, (%r15,%rdx,8)
vmovupd %ymm3, 32(%r15,%rdx,8)
vmovupd %ymm4, 64(%r15,%rdx,8)
vmovupd %ymm5, 96(%r15,%rdx,8)
vmovupd 128(%rax,%rdx,8), %ymm2
vmovupd 160(%rax,%rdx,8), %ymm3
vmovupd 192(%rax,%rdx,8), %ymm4
vmovupd 224(%rax,%rdx,8), %ymm5
vsubpd %ymm1, %ymm2, %ymm2
vsubpd %ymm1, %ymm3, %ymm3
vsubpd %ymm1, %ymm4, %ymm4
vsubpd %ymm1, %ymm5, %ymm5
vmulpd %ymm0, %ymm2, %ymm2
vmulpd %ymm0, %ymm3, %ymm3
vmulpd %ymm0, %ymm4, %ymm4
vmulpd %ymm0, %ymm5, %ymm5
vmovupd %ymm2, 128(%r15,%rdx,8)
vmovupd %ymm3, 160(%r15,%rdx,8)
vmovupd %ymm4, 192(%r15,%rdx,8)
vmovupd %ymm5, 224(%r15,%rdx,8)
addq $32, %rdx
addq $-2, %rdi
jne LBB0_18
testq %rsi, %rsi
je LBB0_21
LBB0_20:
vmovupd (%rax,%rdx,8), %ymm2
vmovupd 32(%rax,%rdx,8), %ymm3
vmovupd 64(%rax,%rdx,8), %ymm4
vmovupd 96(%rax,%rdx,8), %ymm5
vsubpd %ymm1, %ymm2, %ymm2
vsubpd %ymm1, %ymm3, %ymm3
vsubpd %ymm1, %ymm4, %ymm4
vsubpd %ymm1, %ymm5, %ymm1
vmulpd %ymm0, %ymm2, %ymm2
vmulpd %ymm0, %ymm3, %ymm3
vmulpd %ymm0, %ymm4, %ymm4
vmulpd %ymm0, %ymm1, %ymm0
vmovupd %ymm2, (%r15,%rdx,8)
vmovupd %ymm3, 32(%r15,%rdx,8)
vmovupd %ymm4, 64(%r15,%rdx,8)
vmovupd %ymm0, 96(%r15,%rdx,8)
LBB0_21:
cmpq %rbp, %rcx
je LBB0_26
jmp LBB0_22
LBB0_41:
movq %rbp, %rbx
andq $-4, %rbx
vbroadcastsd %xmm0, %ymm0
leaq -4(%rbx), %rax
movq %rax, %rbp
shrq $2, %rbp
incq %rbp
movl %ebp, %ecx
andl $3, %ecx
movabsq $LCPI0_1, %rdx
movabsq $_exp, %r13
movabsq $LCPI0_2, %rsi
cmpq $12, %rax
movq %rcx, 232(%rsp)
vmovupd %ymm0, 64(%rsp)
jae LBB0_43
xorl %r14d, %r14d
jmp LBB0_45
LBB0_43:
subq %rcx, %rbp
xorl %r14d, %r14d
vbroadcastsd (%rdx), %ymm1
vmovups %ymm1, 32(%rsp)
vbroadcastsd (%rsi), %ymm1
vmovupd %ymm1, 192(%rsp)
.p2align 4, 0x90
LBB0_44:
vmovupd (%r15,%r14,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 32(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 96(%rsp)
vzeroupper
callq *%r13
vmovapd %xmm0, 128(%rsp)
vpermilpd $1, 96(%rsp), %xmm0
callq *%r13
vmovapd 128(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 128(%rsp)
vmovups 160(%rsp), %ymm0
vzeroupper
callq *%r13
vmovaps %xmm0, 96(%rsp)
vpermilpd $1, 160(%rsp), %xmm0
callq *%r13
vmovapd 96(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, 128(%rsp), %ymm0, %ymm0
vmulpd 64(%rsp), %ymm0, %ymm0
vmulpd 192(%rsp), %ymm0, %ymm0
vmovupd %ymm0, (%r12,%r14,8)
vmovupd 32(%r15,%r14,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 32(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 96(%rsp)
vzeroupper
callq *%r13
vmovapd %xmm0, 128(%rsp)
vpermilpd $1, 96(%rsp), %xmm0
callq *%r13
vmovapd 128(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 128(%rsp)
vmovups 160(%rsp), %ymm0
vzeroupper
callq *%r13
vmovaps %xmm0, 96(%rsp)
vpermilpd $1, 160(%rsp), %xmm0
callq *%r13
vmovapd 96(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, 128(%rsp), %ymm0, %ymm0
vmulpd 64(%rsp), %ymm0, %ymm0
vmulpd 192(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 32(%r12,%r14,8)
vmovupd 64(%r15,%r14,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 32(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 96(%rsp)
vzeroupper
callq *%r13
vmovapd %xmm0, 128(%rsp)
vpermilpd $1, 96(%rsp), %xmm0
callq *%r13
vmovapd 128(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 128(%rsp)
vmovups 160(%rsp), %ymm0
vzeroupper
callq *%r13
vmovaps %xmm0, 96(%rsp)
vpermilpd $1, 160(%rsp), %xmm0
callq *%r13
vmovapd 96(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, 128(%rsp), %ymm0, %ymm0
vmulpd 64(%rsp), %ymm0, %ymm0
vmulpd 192(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 64(%r12,%r14,8)
vmovupd 96(%r15,%r14,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 32(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 160(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 96(%rsp)
vzeroupper
callq *%r13
vmovapd %xmm0, 128(%rsp)
vpermilpd $1, 96(%rsp), %xmm0
callq *%r13
vmovapd 128(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 128(%rsp)
vmovups 160(%rsp), %ymm0
vzeroupper
callq *%r13
vmovaps %xmm0, 96(%rsp)
vpermilpd $1, 160(%rsp), %xmm0
callq *%r13
vmovupd 64(%rsp), %ymm1
vmovapd 96(%rsp), %xmm2
vunpcklpd %xmm0, %xmm2, %xmm0
vinsertf128 $1, 128(%rsp), %ymm0, %ymm0
vmulpd %ymm1, %ymm0, %ymm0
vmulpd 192(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 96(%r12,%r14,8)
addq $16, %r14
addq $-4, %rbp
jne LBB0_44
LBB0_45:
movq 232(%rsp), %rbp
testq %rbp, %rbp
je LBB0_48
shlq $3, %r14
negq %rbp
movabsq $LCPI0_1, %rax
vbroadcastsd (%rax), %ymm0
vmovups %ymm0, 128(%rsp)
movabsq $LCPI0_2, %rax
vbroadcastsd (%rax), %ymm0
vmovupd %ymm0, 96(%rsp)
.p2align 4, 0x90
LBB0_47:
vmovupd (%r15,%r14), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 128(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 160(%rsp)
vzeroupper
callq *%r13
vmovapd %xmm0, 192(%rsp)
vpermilpd $1, 160(%rsp), %xmm0
callq *%r13
vmovapd 192(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 192(%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r13
vmovaps %xmm0, 160(%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r13
vmovupd 64(%rsp), %ymm1
vmovapd 160(%rsp), %xmm2
vunpcklpd %xmm0, %xmm2, %xmm0
vinsertf128 $1, 192(%rsp), %ymm0, %ymm0
vmulpd %ymm1, %ymm0, %ymm0
vmulpd 96(%rsp), %ymm0, %ymm0
vmovupd %ymm0, (%r12,%r14)
addq $32, %r14
incq %rbp
jne LBB0_47
LBB0_48:
cmpq 328(%rsp), %rbx
je LBB0_55
jmp LBB0_49
LBB0_57:
xorl %edx, %edx
testq %rsi, %rsi
jne LBB0_20
jmp LBB0_21
LBB0_58:
movabsq $_.const.picklebuf.4970103488, %rax
jmp LBB0_60
LBB0_59:
movabsq $_.const.picklebuf.4979668416, %rax
LBB0_60:
movq %rax, (%r14)
movl $1, %eax
jmp LBB0_56
.cfi_endproc
.globl __ZN7cpython8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
__ZN7cpython8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-32, %rsp
subq $352, %rsp
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
movq %rsi, %rdi
subq $8, %rsp
leaq 112(%rsp), %r10
movabsq $_.const.norm_pdf, %rsi
movabsq $_PyArg_UnpackTuple, %rbx
leaq 128(%rsp), %r8
leaq 120(%rsp), %r9
movl $3, %edx
movl $3, %ecx
xorl %eax, %eax
pushq %r10
callq *%rbx
addq $16, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 192(%rsp)
vmovups %ymm0, 216(%rsp)
vmovaps %ymm0, 128(%rsp)
vmovups %ymm0, 152(%rsp)
movq $0, 72(%rsp)
vmovaps %ymm0, 256(%rsp)
vmovups %ymm0, 280(%rsp)
testl %eax, %eax
je LBB1_1
movabsq $__ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rax
movq (%rax), %rbx
testq %rbx, %rbx
je LBB1_4
movq 120(%rsp), %rdi
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 192(%rsp)
vmovups %ymm0, 216(%rsp)
movabsq $_NRT_adapt_ndarray_from_python, %rax
leaq 192(%rsp), %rsi
vzeroupper
callq *%rax
testl %eax, %eax
jne LBB1_8
cmpq $8, 216(%rsp)
jne LBB1_8
movq %rbx, 80(%rsp)
movq 192(%rsp), %rax
movq %rax, 64(%rsp)
movq 200(%rsp), %rax
movq %rax, 48(%rsp)
movq 208(%rsp), %rax
movq %rax, 32(%rsp)
movq 224(%rsp), %rax
movq %rax, 56(%rsp)
movq 232(%rsp), %rax
movq %rax, 40(%rsp)
movq 240(%rsp), %rax
movq %rax, 24(%rsp)
movq 112(%rsp), %rdi
movabsq $_PyNumber_Float, %r13
callq *%r13
movq %rax, %rbx
movabsq $_PyFloat_AsDouble, %r14
movq %rax, %rdi
callq *%r14
vmovsd %xmm0, 96(%rsp)
movabsq $_Py_DecRef, %r15
movq %rbx, %rdi
callq *%r15
movabsq $_PyErr_Occurred, %r12
callq *%r12
testq %rax, %rax
jne LBB1_10
movq 104(%rsp), %rdi
callq *%r13
movq %rax, %rbx
movq %rax, %rdi
callq *%r14
vmovsd %xmm0, 88(%rsp)
movq %rbx, %rdi
callq *%r15
callq *%r12
testq %rax, %rax
jne LBB1_10
vxorps %xmm0, %xmm0, %xmm0
vmovups %ymm0, 152(%rsp)
vmovaps %ymm0, 128(%rsp)
subq $8, %rsp
movabsq $__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rax
leaq 136(%rsp), %rdi
leaq 80(%rsp), %rsi
movl $8, %r9d
movq 40(%rsp), %r8
movq 72(%rsp), %rbx
movq %rbx, %rdx
movq 56(%rsp), %rcx
vmovsd 104(%rsp), %xmm0
vmovsd 96(%rsp), %xmm1
pushq 32(%rsp)
pushq 56(%rsp)
pushq 80(%rsp)
vzeroupper
callq *%rax
addq $32, %rsp
movl %eax, %r12d
movq 72(%rsp), %r14
movq 128(%rsp), %rax
movq %rax, 56(%rsp)
movq 136(%rsp), %r15
movq 144(%rsp), %r13
movq 152(%rsp), %rax
movq %rax, 48(%rsp)
movq 160(%rsp), %rax
movq %rax, 40(%rsp)
movq 168(%rsp), %rax
movq %rax, 32(%rsp)
movq 176(%rsp), %rax
movq %rax, 24(%rsp)
movabsq $_NRT_decref, %rax
movq %rbx, %rdi
callq *%rax
cmpl $-2, %r12d
je LBB1_17
testl %r12d, %r12d
jne LBB1_14
movq 80(%rsp), %rax
movq 24(%rax), %rdi
testq %rdi, %rdi
je LBB1_20
movabsq $_PyList_GetItem, %rax
xorl %esi, %esi
callq *%rax
movq %rax, %rcx
jmp LBB1_21
LBB1_17:
movabsq $__Py_NoneStruct, %rbx
movabsq $_Py_IncRef, %rax
movq %rbx, %rdi
callq *%rax
movq %rbx, %rax
jmp LBB1_2
LBB1_14:
jle LBB1_22
movabsq $_PyErr_Clear, %rax
callq *%rax
movq 16(%r14), %rdx
movl 8(%r14), %esi
movq (%r14), %rdi
movabsq $_numba_unpickle, %rax
callq *%rax
testq %rax, %rax
je LBB1_1
movabsq $_numba_do_raise, %rcx
movq %rax, %rdi
callq *%rcx
jmp LBB1_1
LBB1_20:
movabsq $_PyExc_RuntimeError, %rdi
movabsq $"_.const.`env.consts` is NULL in `read_const`", %rsi
movabsq $_PyErr_SetString, %rax
callq *%rax
xorl %ecx, %ecx
LBB1_21:
movq 56(%rsp), %rax
movq %rax, 256(%rsp)
movq %r15, 264(%rsp)
movq %r13, 272(%rsp)
movq 48(%rsp), %rax
movq %rax, 280(%rsp)
movq 40(%rsp), %rax
movq %rax, 288(%rsp)
movq 32(%rsp), %rax
movq %rax, 296(%rsp)
movq 24(%rsp), %rax
movq %rax, 304(%rsp)
movabsq $_NRT_adapt_ndarray_to_python, %rax
leaq 256(%rsp), %rdi
movl $1, %esi
movl $1, %edx
callq *%rax
jmp LBB1_2
LBB1_22:
cmpl $-3, %r12d
je LBB1_25
cmpl $-1, %r12d
je LBB1_1
movabsq $_PyExc_SystemError, %rdi
movabsq $"_.const.unknown error when calling native function", %rsi
jmp LBB1_5
LBB1_25:
movabsq $_PyExc_StopIteration, %rdi
movabsq $_PyErr_SetNone, %rax
callq *%rax
jmp LBB1_1
LBB1_10:
movabsq $_NRT_decref, %rax
movq 64(%rsp), %rdi
callq *%rax
jmp LBB1_1
LBB1_4:
movabsq $_PyExc_RuntimeError, %rdi
movabsq $"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd", %rsi
jmp LBB1_5
LBB1_8:
movabsq $_PyExc_TypeError, %rdi
movabsq $"_.const.can't unbox array from PyObject into native value. The object maybe of a different type", %rsi
LBB1_5:
movabsq $_PyErr_SetString, %rax
vzeroupper
callq *%rax
LBB1_1:
xorl %eax, %eax
LBB1_2:
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
.cfi_endproc
.globl _cfunc._ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
_cfunc._ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-32, %rsp
subq $192, %rsp
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
movq %r8, %rax
movq %rcx, %r8
movq %rdx, %rcx
movq %rsi, %rdx
movq %rdi, %r14
vmovaps 16(%rbp), %xmm2
vxorps %xmm3, %xmm3, %xmm3
vmovups %ymm3, 120(%rsp)
vmovaps %ymm3, 96(%rsp)
movq $0, 48(%rsp)
vmovups %xmm2, 8(%rsp)
movq %r9, (%rsp)
movabsq $__ZN8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd, %rbx
leaq 96(%rsp), %rdi
leaq 48(%rsp), %rsi
movq %rax, %r9
vzeroupper
callq *%rbx
movl %eax, %ebx
movq 48(%rsp), %r15
movq 96(%rsp), %rax
movq 104(%rsp), %r12
movq 112(%rsp), %r13
movq 120(%rsp), %rcx
movq 128(%rsp), %rdx
movq 136(%rsp), %rsi
movq 144(%rsp), %rdi
movl $0, 44(%rsp)
cmpl $-2, %ebx
je LBB2_5
testl %ebx, %ebx
je LBB2_5
movq %rdi, 56(%rsp)
movq %rsi, 64(%rsp)
movq %rdx, 72(%rsp)
movq %rcx, 80(%rsp)
movq %rax, 88(%rsp)
movabsq $_numba_gil_ensure, %rax
leaq 44(%rsp), %rdi
callq *%rax
testl %ebx, %ebx
jle LBB2_6
movabsq $_PyErr_Clear, %rax
callq *%rax
movq 16(%r15), %rdx
movl 8(%r15), %esi
movq (%r15), %rdi
movabsq $_numba_unpickle, %rax
callq *%rax
testq %rax, %rax
je LBB2_4
movabsq $_numba_do_raise, %rcx
movq %rax, %rdi
callq *%rcx
LBB2_4:
movabsq $"_.const.<numba.core.cpu.CPUContext object at 0x12899e6d0>", %rdi
movabsq $_PyUnicode_FromString, %rax
callq *%rax
movq %rax, %rbx
movabsq $_PyErr_WriteUnraisable, %rax
movq %rbx, %rdi
callq *%rax
movabsq $_Py_DecRef, %rax
movq %rbx, %rdi
callq *%rax
movabsq $_numba_gil_release, %rax
leaq 44(%rsp), %rdi
callq *%rax
movq 88(%rsp), %rax
movq 80(%rsp), %rcx
movq 72(%rsp), %rdx
movq 64(%rsp), %rsi
movq 56(%rsp), %rdi
LBB2_5:
movq %rax, (%r14)
movq %r12, 8(%r14)
movq %r13, 16(%r14)
movq %rcx, 24(%r14)
movq %rdx, 32(%r14)
movq %rsi, 40(%r14)
movq %rdi, 48(%r14)
movq %r14, %rax
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
LBB2_6:
cmpl $-3, %ebx
je LBB2_10
cmpl $-1, %ebx
je LBB2_4
movabsq $_PyExc_SystemError, %rdi
movabsq $"_.const.unknown error when calling native function.1", %rsi
movabsq $_PyErr_SetString, %rax
callq *%rax
jmp LBB2_4
LBB2_10:
movabsq $_PyExc_StopIteration, %rdi
movabsq $_PyErr_SetNone, %rax
callq *%rax
jmp LBB2_4
.cfi_endproc
.globl _NRT_incref
.weak_def_can_be_hidden _NRT_incref
.p2align 4, 0x90
_NRT_incref:
testq %rdi, %rdi
je LBB3_1
lock incq (%rdi)
retq
LBB3_1:
retq
.globl _NRT_decref
.weak_def_can_be_hidden _NRT_decref
.p2align 4, 0x90
_NRT_decref:
.cfi_startproc
testq %rdi, %rdi
je LBB4_2
##MEMBARRIER
lock decq (%rdi)
je LBB4_3
LBB4_2:
retq
LBB4_3:
##MEMBARRIER
movabsq $_NRT_MemInfo_call_dtor, %rax
jmpq *%rax
.cfi_endproc
.comm __ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd,8,3
.section __DATA,__const
.p2align 4
_.const.picklebuf.4979668416:
.quad _.const.pickledata.4979668416
.long 69
.space 4
.quad _.const.pickledata.4979668416.sha1
.p2align 4
_.const.picklebuf.4970103488:
.quad _.const.pickledata.4970103488
.long 137
.space 4
.quad _.const.pickledata.4970103488.sha1
.section __TEXT,__const
.p2align 4
_.const.pickledata.4970103488:
.ascii "\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."
.p2align 4
_.const.pickledata.4970103488.sha1:
.ascii "X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"
.p2align 4
_.const.pickledata.4979668416:
.ascii "\200\004\225:\000\000\000\000\000\000\000\214\bbuiltins\224\214\021ZeroDivisionError\224\223\224\214\020division by zero\224\205\224N\207\224."
.p2align 4
_.const.pickledata.4979668416.sha1:
.ascii "\262\200\b\240\370\213\255_\360\360$>\204\332\271\f\253\031\263f"
_.const.norm_pdf:
.asciz "norm_pdf"
.p2align 4
"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd":
.asciz "missing Environment: _ZN08NumbaEnv8__main__13norm_pdf$2434E5ArrayIdLi1E1C7mutable7alignedEdd"
.p2align 4
"_.const.can't unbox array from PyObject into native value. The object maybe of a different type":
.asciz "can't unbox array from PyObject into native value. The object maybe of a different type"
.p2align 4
"_.const.`env.consts` is NULL in `read_const`":
.asciz "`env.consts` is NULL in `read_const`"
.p2align 4
"_.const.unknown error when calling native function":
.asciz "unknown error when calling native function"
.p2align 4
"_.const.<numba.core.cpu.CPUContext object at 0x12899e6d0>":
.asciz "<numba.core.cpu.CPUContext object at 0x12899e6d0>"
.p2align 4
"_.const.unknown error when calling native function.1":
.asciz "unknown error when calling native function"
.comm __ZN08NumbaEnv13$3cdynamic$3e35__numba_array_expr_0x12834bb20$2435Eddd,8,3
.comm __ZN08NumbaEnv5numba2np7npyimpl20_broadcast_onto$2430Ex8int64$2ax8int64$2a,8,3
.comm __ZN08NumbaEnv13$3cdynamic$3e35__numba_array_expr_0x128dc18b0$2436Edd,8,3
.comm __ZN08NumbaEnv5numba7cpython7numbers14int_power_impl12$3clocals$3e14int_power$2421Edx,8,3
.subsections_via_symbols
This code section is very long, but the assembly grammar is very simple. Lines that start with . are assembler directives (constant definitions, for example), and SOMETHING: is a jump label, the assembly equivalent of a goto target. Everything else is an instruction, with its name on the left and its arguments on the right.
You can google all the instructions; the interesting ones are those whose names end in pd. These are SIMD instructions that operate on up to eight doubles at once, and this is where the speed comes from. There is a lot of repetition, because the optimizer partially unrolled some loops to make them faster. Unrolled loops only work if the remaining chunk of data is large enough. Since the compiler does not know the length of the incoming array, it also generates sections that handle shorter chunks, plus the code to select which section to use. Finally, there is some code that translates from and to Python objects, with the corresponding error handling.
We don’t need to write SIMD instructions by hand; the optimizer does it for us in a very sophisticated way.
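As a rough sanity check of that statement (a sketch; the counting heuristic below is an assumption, not an exact instruction parser), we can count how many of the emitted mnemonics end in pd:

# sketch: rough count of packed-double (SIMD) instructions in the generated assembly
# (simple heuristic: skip directives starting with '.', take the first token as the mnemonic)
for signature, code in norm_pdf.inspect_asm().items():
    mnemonics = [
        line.split()[0]
        for line in code.splitlines()
        if line.strip() and not line.strip().startswith(".")
    ]
    n_simd = sum(m.endswith("pd") for m in mnemonics)
    print(f"{signature}: {n_simd} of {len(mnemonics)} mnemonics end in 'pd'")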