Acceleration with Numba
We explore how the computation of cost functions can be dramatically accelerated with numba’s JIT compiler.
The run-time of iminuit is usually dominated by the execution time of the cost function. To get good performance, it is recommended to use array arithmetic and scipy and numpy functions in the body of the cost function. Python loops should be avoided, but if they are unavoidable, numba can help. Numba can also parallelize numerical calculations to make full use of multi-core CPUs and even do computations on the GPU.
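As a minimal illustration (a toy function, not part of the fit below), putting numba's njit decorator on a plain Python loop compiles it to machine code on the first call:

import numba as nb
import numpy as np

@nb.njit  # compiled to machine code when first called
def sum_of_squares(a):
    total = 0.0
    for i in range(len(a)):  # an explicit loop is fine under njit
        total += a[i] ** 2
    return total

sum_of_squares(np.arange(10.0))  # first call triggers compilation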
Note: This tutorial shows how one can generate faster pdfs with Numba. Before you start to write your own pdf, please check whether one is already implemented in the numba_stats library. If you have a pdf that is not included there, please consider contributing it to numba_stats.
[1]:
# !pip install matplotlib numpy numba scipy iminuit
from iminuit import Minuit
import numpy as np
import numba as nb
import math
from scipy.stats import expon, norm
from matplotlib import pyplot as plt
from argparse import Namespace
The standard fit in particle physics is the fit of a peak over some smooth background. We generate a Gaussian peak over an exponential background, using scipy.
[2]:
np.random.seed(1) # fix seed
# true parameters for signal and background
truth = Namespace(n_sig=2000, f_bkg=10, sig=(5.0, 0.5), bkg=(0.0, 4.0))
n_bkg = truth.n_sig * truth.f_bkg
# make a data set
x = np.empty(truth.n_sig + n_bkg)
# fill x with signal and background values
x[: truth.n_sig] = norm(*truth.sig).rvs(truth.n_sig)
x[truth.n_sig :] = expon(*truth.bkg).rvs(n_bkg)
# cut a range in x
xrange = np.array((1.0, 9.0))
ma = (xrange[0] < x) & (x < xrange[1])
x = x[ma]
plt.hist(
    (x[truth.n_sig :], x[: truth.n_sig]),
    bins=50,
    stacked=True,
    label=("background", "signal"),
)
plt.xlabel("x")
plt.legend();

[3]:
# ideal starting values for iminuit
start = np.array((truth.n_sig, n_bkg, truth.sig[0], truth.sig[1], truth.bkg[1]))
# iminuit instance factory, will be called a lot in the benchmarks below
def m_init(fcn):
    m = Minuit(fcn, start, name=("ns", "nb", "mu", "sigma", "lambd"))
    m.limits = ((0, None), (0, None), None, (0, None), (0, None))
    m.errordef = Minuit.LIKELIHOOD
    return m
[4]:
# extended likelihood (https://doi.org/10.1016/0168-9002(90)91334-8)
# this version uses numpy and scipy and array arithmetic
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    s = norm(mu, sigma)
    b = expon(0, lambd)
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = s.cdf(xrange)
    bn = b.cdf(xrange)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    return (n_sig + n_bkg) - np.sum(
        np.log(s.pdf(x) / sn * n_sig + b.pdf(x) / bn * n_bkg)
    )

nll(start)
[4]:
-103168.78482586428
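For reference, the quantity computed by nll is the extended negative log-likelihood, up to parameter-independent constants:

$$-\ln L = (n_\mathrm{sig} + n_\mathrm{bkg}) - \sum_i \ln\left(n_\mathrm{sig}\,\frac{s(x_i)}{S} + n_\mathrm{bkg}\,\frac{b(x_i)}{B}\right),$$

where s and b are the signal and background pdfs, and S and B are their integrals over the restricted x range, which normalise the pdfs on that range.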
[5]:
%%timeit -r 3 -n 1
m = m_init(nll) # setup time is negligible
m.migrad();
916 ms ± 231 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
Let’s see whether we can beat that. The code above is already pretty fast, because numpy and scipy routines are fast, and we spend most of the time in those. But these implementations do not parallelize the execution and are not optimised for this particular CPU, unlike numba-jitted functions.
To use numba, in theory we just need to put the njit decorator on top of the function, but often that does not work out of the box. numba understands many numpy functions, but not scipy. We must evaluate the code that uses scipy in 'object mode', which is numba-speak for calling into the Python interpreter.
[6]:
# first attempt to use numba
@nb.njit(parallel=True)
def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    with nb.objmode(spdf="float64[:]", bpdf="float64[:]", sn="float64", bn="float64"):
        s = norm(mu, sigma)
        b = expon(0, lambd)
        # normalisation factors are needed for pdfs, since x range is restricted
        sn = np.diff(s.cdf(xrange))[0]
        bn = np.diff(b.cdf(xrange))[0]
        spdf = s.pdf(x)
        bpdf = b.pdf(x)
    no = n_sig + n_bkg
    return no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))

nll(start)  # test and warm-up JIT
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
[6]:
-103168.78482586429
[7]:
%%timeit -r 3 -n 1 m = m_init(nll)
m.migrad()
398 ms ± 23.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
That is only a modest improvement over the plain numpy/scipy version. Let's break the original function down into parts to see why.
[8]:
# let's time the body of the function
n_sig, n_bkg, mu, sigma, lambd = start
s = norm(mu, sigma)
b = expon(0, lambd)
# normalisation factors are needed for pdfs, since x range is restricted
sn = np.diff(s.cdf(xrange))[0]
bn = np.diff(b.cdf(xrange))[0]
spdf = s.pdf(x)
bpdf = b.pdf(x)
no = n_sig + n_bkg
# no - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
%timeit -r 3 -n 100 norm(*start[2:4]).pdf(x)
%timeit -r 3 -n 500 expon(0, start[4]).pdf(x)
%timeit -r 3 -n 1000 np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))
1.37 ms ± 69.4 µs per loop (mean ± std. dev. of 3 runs, 100 loops each)
1.63 ms ± 114 µs per loop (mean ± std. dev. of 3 runs, 500 loops each)
224 µs ± 25 µs per loop (mean ± std. dev. of 3 runs, 1,000 loops each)
Most of the time is spent in norm and expon, which numba could not accelerate, and the total time is dominated by the slowest part.
This, unfortunately, means we have to do much more manual work to make the function faster, since we have to replace the scipy routines with Python code that numba can accelerate and run in parallel.
[9]:
kwd = {"parallel": True, "fastmath": True}

@nb.njit(**kwd)
def sum_log(fs, spdf, fb, bpdf):
    return np.sum(np.log(fs * spdf + fb * bpdf))

@nb.njit(**kwd)
def norm_pdf(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm

@nb.njit(**kwd)
def nb_erf(x):
    y = np.empty_like(x)
    for i in nb.prange(len(x)):
        y[i] = math.erf(x[i])
    return y

@nb.njit(**kwd)
def norm_cdf(x, mu, sigma):
    invs = 1.0 / (sigma * np.sqrt(2))
    z = (x - mu) * invs
    return 0.5 * (1 + nb_erf(z))

@nb.njit(**kwd)
def expon_pdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return inv_lambd * np.exp(-inv_lambd * x)

@nb.njit(**kwd)
def expon_cdf(x, lambd):
    inv_lambd = 1.0 / lambd
    return 1.0 - np.exp(-inv_lambd * x)

def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    # normalisation factors are needed for pdfs, since x range is restricted
    sn = norm_cdf(xrange, mu, sigma)
    bn = expon_cdf(xrange, lambd)
    sn = sn[1] - sn[0]
    bn = bn[1] - bn[0]
    spdf = norm_pdf(x, mu, sigma)
    bpdf = expon_pdf(x, lambd)
    no = n_sig + n_bkg
    return no - sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)

nll(start)  # test and warm-up JIT
[9]:
-103168.78482586428
Let’s see how well these versions do:
[10]:
%timeit -r 5 -n 100 norm_pdf(x, *start[2:4])
%timeit -r 5 -n 500 expon_pdf(x, start[4])
%timeit -r 5 -n 1000 sum_log(n_sig / sn, spdf, n_bkg / bn, bpdf)
126 µs ± 13.9 µs per loop (mean ± std. dev. of 5 runs, 100 loops each)
107 µs ± 2.08 µs per loop (mean ± std. dev. of 5 runs, 500 loops each)
94.2 µs ± 499 ns per loop (mean ± std. dev. of 5 runs, 1,000 loops each)
Only a minor improvement for sum_log, but the pdf calculation was drastically accelerated. Since that was the bottleneck before, we expect Migrad to finish faster now as well.
[11]:
%%timeit -r 3 -n 1
m = m_init(nll) # setup time is negligible
m.migrad();
40.7 ms ± 858 µs per loop (mean ± std. dev. of 3 runs, 1 loop each)
Success! We managed to get a big speed improvement over the initial code. This is impressive, but it cost us a lot of developer time. This is not always a good trade-off, especially if you consider that library routines are heavily tested, while you always need to test your own code in addition to writing it.
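For example, a quick sanity check of the hand-written functions against their scipy counterparts (a minimal sketch, reusing the objects defined above) could look like this:

# spot-check the jitted pdfs/cdfs against scipy
assert np.allclose(norm_pdf(x, *start[2:4]), norm(*start[2:4]).pdf(x))
assert np.allclose(norm_cdf(xrange, *start[2:4]), norm(*start[2:4]).cdf(xrange))
assert np.allclose(expon_pdf(x, start[4]), expon(0, start[4]).pdf(x))
assert np.allclose(expon_cdf(xrange, start[4]), expon(0, start[4]).cdf(xrange))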
By putting these faster functions into a library, however, we would only have to pay the developer cost once. You can find those in the numba_stats library.
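A version of the cost function based on numba_stats could look like the following sketch, assuming that numba_stats exports norm and expon with scipy-like (x, loc, scale) signatures:

from numba_stats import norm, expon  # note: this shadows the scipy imports above

def nll(par):
    n_sig, n_bkg, mu, sigma, lambd = par
    sn = np.diff(norm.cdf(xrange, mu, sigma))[0]
    bn = np.diff(expon.cdf(xrange, 0.0, lambd))[0]
    spdf = norm.pdf(x, mu, sigma)
    bpdf = expon.pdf(x, 0.0, lambd)
    return (n_sig + n_bkg) - np.sum(np.log(spdf / sn * n_sig + bpdf / bn * n_bkg))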
Try to compile the functions again with parallel=False to see how much of the speed increase came from the parallelization and how much from the generally optimized code that numba generated for our specific CPU. On my machine, the gain was entirely due to numba.
In general, it is good advice not to add parallel=True automatically, because it comes with an overhead: data must be broken into chunks, the chunks distributed to the individual CPU cores, and the results merged back together. For large arrays, this overhead is negligible, but for small arrays, it can be a net loss.
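One way to measure this is to recompile one of the functions without parallelization and time it again; a sketch (norm_pdf_serial is just norm_pdf from above with the default parallel=False):

@nb.njit(fastmath=True)  # same body as norm_pdf, but without parallel=True
def norm_pdf_serial(x, mu, sigma):
    invs = 1.0 / sigma
    z = (x - mu) * invs
    invnorm = 1 / np.sqrt(2 * np.pi) * invs
    return np.exp(-0.5 * z ** 2) * invnorm

norm_pdf_serial(x, *start[2:4])  # warm-up JIT
%timeit -r 5 -n 100 norm_pdf_serial(x, *start[2:4])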
So why is numba so fast even without parallelization? We can look at the assembly code it generated.
[12]:
for signature, code in norm_pdf.inspect_asm().items():
    print(f"signature: {signature}\n{'-'*(len(str(signature)) + 11)}\n{code}")
signature: (array(float64, 1d, C), float64, float64)
----------------------------------------------------
.section __TEXT,__text,regular,pure_instructions
.build_version macos, 12, 0
.section __TEXT,__literal8,8byte_literals
.p2align 3
LCPI0_0:
.quad 0x3ff0000000000000
LCPI0_1:
.quad 0x3fd9884533d43651
.section __TEXT,__literal16,16byte_literals
.p2align 4
LCPI0_2:
.quad 8
.quad 8
.section __TEXT,__text,regular,pure_instructions
.globl __ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
pushq %r15
.cfi_def_cfa_offset 24
pushq %r14
.cfi_def_cfa_offset 32
pushq %r13
.cfi_def_cfa_offset 40
pushq %r12
.cfi_def_cfa_offset 48
pushq %rbx
.cfi_def_cfa_offset 56
subq $632, %rsp
.cfi_def_cfa_offset 688
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
.cfi_offset %rbp, -16
movq $0, 104(%rsp)
movq $0, 96(%rsp)
movq $0, 496(%rsp)
movq $0, 208(%rsp)
movq $0, 88(%rsp)
movq $0, 80(%rsp)
movq $0, 152(%rsp)
movq $0, 304(%rsp)
movq $0, 72(%rsp)
movq $0, 64(%rsp)
movq $0, 368(%rsp)
movq $0, 176(%rsp)
movq $0, 56(%rsp)
movq $0, 128(%rsp)
movq $0, 248(%rsp)
vxorpd %xmm2, %xmm2, %xmm2
vucomisd %xmm2, %xmm1
je LBB0_1
movq 696(%rsp), %r14
testq %r14, %r14
js LBB0_3
imulq $8, %r14, %r12
jo LBB0_5
movq %rdi, %r13
vmovsd %xmm1, 32(%rsp)
vmovsd %xmm0, 120(%rsp)
movq %rsi, 40(%rsp)
movabsq $_NRT_MemInfo_alloc_safe_aligned, %rax
movq %r12, %rdi
movl $32, %esi
callq *%rax
movq %rax, %rbp
movq 24(%rax), %rax
movq %rax, 48(%rsp)
leaq -1(%r14), %r15
movq $0, 104(%rsp)
movq %r15, 96(%rsp)
movabsq $_get_num_threads, %rax
callq *%rax
movq %rax, %rbx
testq %rax, %rax
jle LBB0_9
movq %rbp, 112(%rsp)
movabsq $LCPI0_0, %rax
vmovsd (%rax), %xmm0
vdivsd 32(%rsp), %xmm0, %xmm0
vmovsd %xmm0, 32(%rsp)
movabsq $_do_scheduling_unsigned, %rax
leaq 104(%rsp), %rsi
leaq 96(%rsp), %rdx
leaq 496(%rsp), %rbp
movl $1, %edi
movq %rbx, %rcx
movq %rbp, %r8
xorl %r9d, %r9d
callq *%rax
movq %rbp, 208(%rsp)
vmovsd 32(%rsp), %xmm0
vmovsd %xmm0, 88(%rsp)
leaq 88(%rsp), %rax
movq %rax, 216(%rsp)
vmovsd 120(%rsp), %xmm0
vmovsd %xmm0, 80(%rsp)
leaq 80(%rsp), %rax
movq %rax, 224(%rsp)
movq 688(%rsp), %rax
movq %rax, 232(%rsp)
movq 48(%rsp), %rax
movq %rax, 240(%rsp)
movq %rbx, 152(%rsp)
movq $2, 160(%rsp)
movq %r14, 168(%rsp)
movq $16, 304(%rsp)
vxorps %xmm0, %xmm0, %xmm0
vmovups %ymm0, 312(%rsp)
movq $8, 344(%rsp)
movq 704(%rsp), %rax
movq %rax, 352(%rsp)
movq $8, 360(%rsp)
movl $0, 28(%rsp)
movabsq $_numba_gil_ensure, %rax
leaq 28(%rsp), %r14
movq %r14, %rdi
vzeroupper
callq *%rax
movabsq $_PyEval_SaveThread, %rax
callq *%rax
movq %rax, %rbp
movabsq $_get_num_threads, %rbx
callq *%rbx
movq %rax, 8(%rsp)
movq $6, (%rsp)
movabsq $___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE, %rdi
movabsq $_numba_parallel_for, %rax
leaq 208(%rsp), %rsi
leaq 152(%rsp), %rdx
leaq 304(%rsp), %rcx
movl $2, %r9d
xorl %r8d, %r8d
callq *%rax
movabsq $_PyEval_RestoreThread, %rax
movq %rbp, %rdi
callq *%rax
movabsq $_numba_gil_release, %rax
movq %r14, %rdi
callq *%rax
movq %r12, %rdi
movl $32, %esi
movabsq $_NRT_MemInfo_alloc_safe_aligned, %rax
callq *%rax
movq %rax, %r12
movq 24(%rax), %r14
movq $0, 72(%rsp)
movq %r15, 64(%rsp)
callq *%rbx
movq %rax, %rbx
testq %rax, %rax
jle LBB0_13
movabsq $LCPI0_1, %rax
vmovsd 32(%rsp), %xmm0
vmulsd (%rax), %xmm0, %xmm0
vmovsd %xmm0, 32(%rsp)
leaq 72(%rsp), %rsi
leaq 64(%rsp), %rdx
leaq 368(%rsp), %rbp
movl $1, %edi
movq %rbx, %rcx
movq %rbp, %r8
xorl %r9d, %r9d
movabsq $_do_scheduling_unsigned, %rax
callq *%rax
movq %rbp, 176(%rsp)
vmovsd 32(%rsp), %xmm0
vmovsd %xmm0, 56(%rsp)
leaq 56(%rsp), %rax
movq %rax, 184(%rsp)
movq 48(%rsp), %rax
movq %rax, 192(%rsp)
movq %r14, 200(%rsp)
movq %rbx, 128(%rsp)
movq $2, 136(%rsp)
movq 696(%rsp), %rbx
movq %rbx, 144(%rsp)
movq $16, 248(%rsp)
vxorps %xmm0, %xmm0, %xmm0
vmovups %xmm0, 256(%rsp)
movq $0, 272(%rsp)
movabsq $LCPI0_2, %rax
vmovaps (%rax), %xmm0
vmovups %xmm0, 280(%rsp)
movq $8, 296(%rsp)
movl $0, 28(%rsp)
leaq 28(%rsp), %r15
movq %r15, %rdi
movabsq $_numba_gil_ensure, %rax
callq *%rax
movabsq $_PyEval_SaveThread, %rax
callq *%rax
movq %rax, %rbp
movabsq $_get_num_threads, %rax
callq *%rax
movq %rax, 8(%rsp)
movq $5, (%rsp)
movabsq $___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE, %rdi
leaq 176(%rsp), %rsi
leaq 128(%rsp), %rdx
leaq 248(%rsp), %rcx
movl $2, %r9d
xorl %r8d, %r8d
movabsq $_numba_parallel_for, %rax
callq *%rax
movq %rbp, %rdi
movabsq $_PyEval_RestoreThread, %rax
callq *%rax
movq %r15, %rdi
movabsq $_numba_gil_release, %rax
callq *%rax
movq %r12, (%r13)
movq $0, 8(%r13)
movq %rbx, 16(%r13)
movq $8, 24(%r13)
movq %r14, 32(%r13)
movq %rbx, 40(%r13)
movq $8, 48(%r13)
movabsq $_NRT_decref, %rax
movq 112(%rsp), %rdi
callq *%rax
xorl %eax, %eax
LBB0_8:
addq $632, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
LBB0_1:
movabsq $_.const.picklebuf.5129668672, %rax
jmp LBB0_6
LBB0_3:
movabsq $_.const.picklebuf.5131375424, %rax
jmp LBB0_6
LBB0_5:
movabsq $_.const.picklebuf.5131380544, %rax
LBB0_6:
movq %rax, (%rsi)
jmp LBB0_7
LBB0_9:
movabsq $_printf_format, %rdi
jmp LBB0_10
LBB0_13:
movabsq $_printf_format.1, %rdi
LBB0_10:
movabsq $_printf, %rcx
movq %rbx, %rsi
xorl %eax, %eax
callq *%rcx
movabsq $_.const.picklebuf.5129402880, %rax
movq 40(%rsp), %rcx
movq %rax, (%rcx)
LBB0_7:
movl $1, %eax
jmp LBB0_8
.cfi_endproc
.globl __ZN7cpython8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
__ZN7cpython8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-32, %rsp
subq $384, %rsp
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
movq %rsi, %rdi
subq $8, %rsp
leaq 112(%rsp), %r10
movabsq $_.const.norm_pdf, %rsi
movabsq $_PyArg_UnpackTuple, %rbx
leaq 128(%rsp), %r8
leaq 120(%rsp), %r9
movl $3, %edx
movl $3, %ecx
xorl %eax, %eax
pushq %r10
callq *%rbx
addq $16, %rsp
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 128(%rsp)
vmovups %ymm0, 152(%rsp)
vmovaps %ymm0, 192(%rsp)
vmovups %ymm0, 216(%rsp)
movq $0, 24(%rsp)
vmovaps %ymm0, 288(%rsp)
vmovups %ymm0, 312(%rsp)
testl %eax, %eax
je LBB1_1
movabsq $__ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %rax
movq (%rax), %rbx
testq %rbx, %rbx
je LBB1_4
movq 120(%rsp), %rdi
vxorps %xmm0, %xmm0, %xmm0
vmovaps %ymm0, 128(%rsp)
vmovups %ymm0, 152(%rsp)
movabsq $_NRT_adapt_ndarray_from_python, %rax
leaq 128(%rsp), %rsi
vzeroupper
callq *%rax
testl %eax, %eax
jne LBB1_8
cmpq $8, 152(%rsp)
jne LBB1_8
movq %rbx, 56(%rsp)
movq 128(%rsp), %rax
movq %rax, 16(%rsp)
movq 136(%rsp), %rax
movq %rax, 32(%rsp)
movq 144(%rsp), %rax
movq %rax, 88(%rsp)
movq 160(%rsp), %rax
movq %rax, 256(%rsp)
movq 168(%rsp), %rax
movq %rax, 96(%rsp)
movq 176(%rsp), %rax
movq %rax, 80(%rsp)
movq 112(%rsp), %rdi
movabsq $_PyNumber_Float, %r13
callq *%r13
movq %rax, %rbx
movabsq $_PyFloat_AsDouble, %r14
movq %rax, %rdi
callq *%r14
vmovsd %xmm0, 72(%rsp)
movabsq $_Py_DecRef, %r15
movq %rbx, %rdi
callq *%r15
movabsq $_PyErr_Occurred, %r12
callq *%r12
testq %rax, %rax
jne LBB1_10
movq 104(%rsp), %rdi
callq *%r13
movq %rax, %rbx
movq %rax, %rdi
callq *%r14
vmovsd %xmm0, 64(%rsp)
movq %rbx, %rdi
callq *%r15
callq *%r12
testq %rax, %rax
jne LBB1_10
vxorps %xmm0, %xmm0, %xmm0
vmovups %ymm0, 216(%rsp)
vmovaps %ymm0, 192(%rsp)
subq $8, %rsp
movabsq $__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %rax
leaq 200(%rsp), %rdi
leaq 32(%rsp), %rsi
movl $8, %r9d
movq 96(%rsp), %r8
movq 24(%rsp), %rbx
movq %rbx, %rdx
movq 40(%rsp), %rcx
vmovsd 80(%rsp), %xmm0
vmovsd 72(%rsp), %xmm1
pushq 88(%rsp)
pushq 112(%rsp)
pushq 280(%rsp)
vzeroupper
callq *%rax
addq $32, %rsp
movl %eax, %r12d
movq 24(%rsp), %r13
movq 192(%rsp), %r14
vmovups 200(%rsp), %ymm0
vmovaps %ymm0, 256(%rsp)
vmovups 232(%rsp), %xmm0
vmovaps %xmm0, 32(%rsp)
movabsq $_NRT_decref, %r15
movq %rbx, %rdi
vzeroupper
callq *%r15
cmpl $-2, %r12d
je LBB1_17
testl %r12d, %r12d
jne LBB1_14
LBB1_17:
movq 56(%rsp), %rax
movq 24(%rax), %rdi
testq %rdi, %rdi
je LBB1_19
movabsq $_PyList_GetItem, %rax
xorl %esi, %esi
callq *%rax
movq %rax, %rbx
jmp LBB1_20
LBB1_14:
jle LBB1_21
movabsq $_PyErr_Clear, %rax
callq *%rax
movq 16(%r13), %rdx
movl 8(%r13), %esi
movq (%r13), %rdi
movabsq $_numba_unpickle, %rax
callq *%rax
testq %rax, %rax
je LBB1_1
movabsq $_numba_do_raise, %rcx
movq %rax, %rdi
callq *%rcx
jmp LBB1_1
LBB1_19:
movabsq $_PyExc_RuntimeError, %rdi
movabsq $"_.const.`env.consts` is NULL in `read_const`", %rsi
movabsq $_PyErr_SetString, %rax
callq *%rax
xorl %ebx, %ebx
LBB1_20:
movabsq $_.const.pickledata.4576487424, %rdi
movabsq $_.const.pickledata.4576487424.sha1, %rdx
movabsq $_numba_unpickle, %rax
movl $32, %esi
callq *%rax
movq %r14, 288(%rsp)
vmovaps 256(%rsp), %ymm0
vmovups %ymm0, 296(%rsp)
vmovaps 32(%rsp), %xmm0
vmovups %xmm0, 328(%rsp)
movabsq $_NRT_adapt_ndarray_to_python_acqref, %r9
leaq 288(%rsp), %rdi
movq %rax, %rsi
movl $1, %edx
movl $1, %ecx
movq %rbx, %r8
vzeroupper
callq *%r9
movq %rax, %rbx
movq %r14, %rdi
callq *%r15
movq %rbx, %rax
jmp LBB1_2
LBB1_21:
movabsq $_PyExc_SystemError, %rdi
movabsq $"_.const.unknown error when calling native function", %rsi
LBB1_5:
movabsq $_PyErr_SetString, %rax
vzeroupper
callq *%rax
LBB1_1:
xorl %eax, %eax
LBB1_2:
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
LBB1_10:
movabsq $_NRT_decref, %rax
movq 16(%rsp), %rdi
callq *%rax
jmp LBB1_1
LBB1_4:
movabsq $_PyExc_RuntimeError, %rdi
movabsq $"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd", %rsi
jmp LBB1_5
LBB1_8:
movabsq $_PyExc_TypeError, %rdi
movabsq $"_.const.can't unbox array from PyObject into native value. The object maybe of a different type", %rsi
jmp LBB1_5
.cfi_endproc
.globl _cfunc._ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd
.p2align 4, 0x90
_cfunc._ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset %rbp, -16
movq %rsp, %rbp
.cfi_def_cfa_register %rbp
pushq %r15
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-32, %rsp
subq $192, %rsp
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
movq %r8, %rax
movq %rcx, %r8
movq %rdx, %rcx
movq %rsi, %rdx
movq %rdi, %rbx
vmovaps 16(%rbp), %xmm2
vxorps %xmm3, %xmm3, %xmm3
vmovups %ymm3, 120(%rsp)
vmovaps %ymm3, 96(%rsp)
movq $0, 48(%rsp)
vmovups %xmm2, 8(%rsp)
movq %r9, (%rsp)
movabsq $__ZN8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd, %r10
leaq 96(%rsp), %rdi
leaq 48(%rsp), %rsi
movq %rax, %r9
vzeroupper
callq *%r10
movl %eax, %r14d
movq 48(%rsp), %rdi
movq 96(%rsp), %rax
movq 104(%rsp), %rcx
movq 112(%rsp), %rdx
movq 120(%rsp), %rsi
movq 128(%rsp), %r12
movq 136(%rsp), %r13
movq 144(%rsp), %r15
movl $0, 44(%rsp)
testl %r14d, %r14d
jne LBB2_1
LBB2_4:
movq %r15, 48(%rbx)
movq %r13, 40(%rbx)
movq %r12, 32(%rbx)
movq %rsi, 24(%rbx)
movq %rdx, 16(%rbx)
movq %rcx, 8(%rbx)
movq %rax, (%rbx)
movq %rbx, %rax
leaq -40(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
retq
LBB2_1:
movq %rdi, 56(%rsp)
movq %rsi, 64(%rsp)
movq %rdx, 72(%rsp)
movq %rcx, 80(%rsp)
movq %rax, 88(%rsp)
movabsq $_numba_gil_ensure, %rax
leaq 44(%rsp), %rdi
callq *%rax
testl %r14d, %r14d
jle LBB2_6
movabsq $_PyErr_Clear, %rax
callq *%rax
movq 56(%rsp), %rax
movq 16(%rax), %rdx
movl 8(%rax), %esi
movq (%rax), %rdi
movabsq $_numba_unpickle, %rax
callq *%rax
testq %rax, %rax
je LBB2_3
movabsq $_numba_do_raise, %rcx
movq %rax, %rdi
callq *%rcx
jmp LBB2_3
LBB2_6:
movabsq $_PyExc_SystemError, %rdi
movabsq $"_.const.unknown error when calling native function.1", %rsi
movabsq $_PyErr_SetString, %rax
callq *%rax
LBB2_3:
movabsq $"_.const.<numba.core.cpu.CPUContext object at 0x131b86580>", %rdi
movabsq $_PyUnicode_FromString, %rax
callq *%rax
movq %rax, %r14
movabsq $_PyErr_WriteUnraisable, %rax
movq %r14, %rdi
callq *%rax
movabsq $_Py_DecRef, %rax
movq %r14, %rdi
callq *%rax
movabsq $_numba_gil_release, %rax
leaq 44(%rsp), %rdi
callq *%rax
movq 88(%rsp), %rax
movq 80(%rsp), %rcx
movq 72(%rsp), %rdx
movq 64(%rsp), %rsi
jmp LBB2_4
.cfi_endproc
.globl ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
.weak_definition ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
.p2align 4, 0x90
___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
pushq %r15
.cfi_def_cfa_offset 24
pushq %r14
.cfi_def_cfa_offset 32
pushq %r13
.cfi_def_cfa_offset 40
pushq %r12
.cfi_def_cfa_offset 48
pushq %rbx
.cfi_def_cfa_offset 56
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
.cfi_offset %rbp, -16
movq (%rsi), %rcx
testq %rcx, %rcx
jle LBB3_16
movq (%rdx), %rsi
movq 8(%rdx), %r10
movq 8(%rdi), %r11
movq 16(%rdi), %r14
movq 16(%rdx), %r15
movq 24(%rdx), %r12
movq 32(%rdx), %rax
movq %rax, -8(%rsp)
movq 32(%rdi), %rax
movq %rax, -16(%rsp)
movq (%rdi), %r13
movq 24(%rdi), %rax
movq %rax, -24(%rsp)
xorl %r8d, %r8d
movq %rcx, -32(%rsp)
movq %rsi, -40(%rsp)
movq %r10, -48(%rsp)
movq %r11, -56(%rsp)
movq %r14, -64(%rsp)
movq %r15, -72(%rsp)
movq %r12, -80(%rsp)
jmp LBB3_2
.p2align 4, 0x90
LBB3_15:
incq %r8
cmpq %rcx, %r8
je LBB3_16
LBB3_2:
movq %r8, %rax
imulq %rsi, %rax
movq (%rax,%r13), %rbx
movq 8(%rax,%r13), %rdi
subq %rbx, %rdi
incq %rdi
testq %rdi, %rdi
jle LBB3_15
movq %r8, %rax
imulq %r10, %rax
vmovsd (%r11,%rax), %xmm0
movq %r8, %rax
imulq %r15, %rax
vmovsd (%r14,%rax), %xmm1
movq %r8, %rbp
imulq %r12, %rbp
addq -24(%rsp), %rbp
movq %r8, %rdx
imulq -8(%rsp), %rdx
addq -16(%rsp), %rdx
cmpq $8, %rdi
jb LBB3_13
movq %rdi, %r9
andq $-8, %r9
vbroadcastsd %xmm1, %ymm2
vbroadcastsd %xmm0, %ymm3
leaq -8(%r9), %rax
movq %rax, %r10
shrq $3, %r10
incq %r10
movl %r10d, %r11d
andl $3, %r11d
cmpq $24, %rax
jae LBB3_6
xorl %r14d, %r14d
jmp LBB3_8
LBB3_6:
leaq (%rdx,%rbx,8), %r15
addq $224, %r15
leaq 224(,%rbx,8), %r12
addq %rbp, %r12
andq $-4, %r10
negq %r10
xorl %r14d, %r14d
.p2align 4, 0x90
LBB3_7:
vmovupd -224(%r12,%r14,8), %ymm4
vmovupd -192(%r12,%r14,8), %ymm5
vsubpd %ymm2, %ymm4, %ymm4
vsubpd %ymm2, %ymm5, %ymm5
vmulpd %ymm3, %ymm4, %ymm4
vmulpd %ymm3, %ymm5, %ymm5
vmovupd %ymm4, -224(%r15,%r14,8)
vmovupd %ymm5, -192(%r15,%r14,8)
vmovupd -160(%r12,%r14,8), %ymm4
vmovupd -128(%r12,%r14,8), %ymm5
vsubpd %ymm2, %ymm4, %ymm4
vsubpd %ymm2, %ymm5, %ymm5
vmulpd %ymm3, %ymm4, %ymm4
vmulpd %ymm3, %ymm5, %ymm5
vmovupd %ymm4, -160(%r15,%r14,8)
vmovupd %ymm5, -128(%r15,%r14,8)
vmovupd -96(%r12,%r14,8), %ymm4
vmovupd -64(%r12,%r14,8), %ymm5
vsubpd %ymm2, %ymm4, %ymm4
vsubpd %ymm2, %ymm5, %ymm5
vmulpd %ymm3, %ymm4, %ymm4
vmulpd %ymm3, %ymm5, %ymm5
vmovupd %ymm4, -96(%r15,%r14,8)
vmovupd %ymm5, -64(%r15,%r14,8)
vmovupd -32(%r12,%r14,8), %ymm4
vmovupd (%r12,%r14,8), %ymm5
vsubpd %ymm2, %ymm4, %ymm4
vsubpd %ymm2, %ymm5, %ymm5
vmulpd %ymm3, %ymm4, %ymm4
vmulpd %ymm3, %ymm5, %ymm5
vmovupd %ymm4, -32(%r15,%r14,8)
vmovupd %ymm5, (%r15,%r14,8)
addq $32, %r14
addq $4, %r10
jne LBB3_7
LBB3_8:
testq %r11, %r11
movq -72(%rsp), %r15
movq -80(%rsp), %r12
je LBB3_11
addq %rbx, %r14
shlq $6, %r11
leaq (%rdx,%r14,8), %rcx
addq $32, %rcx
leaq 32(,%r14,8), %rax
addq %rbp, %rax
xorl %esi, %esi
.p2align 4, 0x90
LBB3_10:
vmovupd -32(%rax,%rsi), %ymm4
vmovupd (%rax,%rsi), %ymm5
vsubpd %ymm2, %ymm4, %ymm4
vsubpd %ymm2, %ymm5, %ymm5
vmulpd %ymm3, %ymm4, %ymm4
vmulpd %ymm3, %ymm5, %ymm5
vmovupd %ymm4, -32(%rcx,%rsi)
vmovupd %ymm5, (%rcx,%rsi)
addq $64, %rsi
cmpq %rsi, %r11
jne LBB3_10
LBB3_11:
cmpq %r9, %rdi
movq -32(%rsp), %rcx
movq -40(%rsp), %rsi
movq -48(%rsp), %r10
movq -56(%rsp), %r11
movq -64(%rsp), %r14
je LBB3_15
andl $7, %edi
addq %r9, %rbx
LBB3_13:
incq %rdi
shlq $3, %rbx
.p2align 4, 0x90
LBB3_14:
vmovsd (%rbp,%rbx), %xmm2
vsubsd %xmm1, %xmm2, %xmm2
vmulsd %xmm0, %xmm2, %xmm2
vmovsd %xmm2, (%rdx,%rbx)
decq %rdi
addq $8, %rbx
cmpq $1, %rdi
jg LBB3_14
jmp LBB3_15
LBB3_16:
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
.cfi_endproc
.globl _NRT_decref
.weak_def_can_be_hidden _NRT_decref
.p2align 4, 0x90
_NRT_decref:
.cfi_startproc
testq %rdi, %rdi
je LBB4_2
##MEMBARRIER
lock decq (%rdi)
je LBB4_3
LBB4_2:
retq
LBB4_3:
##MEMBARRIER
movabsq $_NRT_MemInfo_call_dtor, %rax
jmpq *%rax
.cfi_endproc
.section __TEXT,__literal8,8byte_literals
.p2align 3
LCPI5_0:
.quad 0xbfe0000000000000
.section __TEXT,__text,regular,pure_instructions
.globl ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
.weak_definition ___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE
.p2align 4, 0x90
___gufunc__._ZN13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
pushq %r15
.cfi_def_cfa_offset 24
pushq %r14
.cfi_def_cfa_offset 32
pushq %r13
.cfi_def_cfa_offset 40
pushq %r12
.cfi_def_cfa_offset 48
pushq %rbx
.cfi_def_cfa_offset 56
subq $360, %rsp
.cfi_def_cfa_offset 416
.cfi_offset %rbx, -56
.cfi_offset %r12, -48
.cfi_offset %r13, -40
.cfi_offset %r14, -32
.cfi_offset %r15, -24
.cfi_offset %rbp, -16
movq (%rsi), %rax
movq %rax, 264(%rsp)
testq %rax, %rax
jle LBB5_17
movq (%rdx), %rsi
movq 8(%rdx), %rax
movq %rax, 248(%rsp)
movq 8(%rdi), %rax
movq %rax, 240(%rsp)
movq 16(%rdx), %rax
movq %rax, 232(%rsp)
movq 24(%rdx), %rax
movq %rax, 224(%rsp)
movq 24(%rdi), %rax
movq %rax, 216(%rsp)
movq (%rdi), %rax
movq %rax, 256(%rsp)
movq 16(%rdi), %rax
movq %rax, 208(%rsp)
xorl %ecx, %ecx
movabsq $LCPI5_0, %rax
vmovsd (%rax), %xmm1
movabsq $_exp, %r15
vbroadcastsd (%rax), %ymm0
vmovupd %ymm0, 128(%rsp)
movq %rsi, %rbx
movq %rsi, 184(%rsp)
vmovsd %xmm1, 120(%rsp)
jmp LBB5_2
.p2align 4, 0x90
LBB5_16:
movq 112(%rsp), %rcx
incq %rcx
cmpq 264(%rsp), %rcx
je LBB5_17
LBB5_2:
movq %rcx, 112(%rsp)
movq %rcx, %rax
imulq %rbx, %rax
movq 256(%rsp), %rcx
movq (%rax,%rcx), %r13
movq 8(%rax,%rcx), %r14
subq %r13, %r14
incq %r14
testq %r14, %r14
jle LBB5_16
movq 112(%rsp), %rdx
movq %rdx, %rax
movq 248(%rsp), %rcx
imulq %rcx, %rax
movq 240(%rsp), %rcx
vmovsd (%rcx,%rax), %xmm0
movq %rdx, %r12
imulq 232(%rsp), %r12
addq 208(%rsp), %r12
imulq 224(%rsp), %rdx
addq 216(%rsp), %rdx
cmpq $4, %r14
vmovapd %xmm0, 336(%rsp)
jae LBB5_5
movq %rdx, %rbp
jmp LBB5_14
.p2align 4, 0x90
LBB5_5:
movq %r14, %rsi
andq $-4, %rsi
vbroadcastsd %xmm0, %ymm1
vmovupd %ymm1, 288(%rsp)
leaq -4(%rsi), %rax
movq %rax, %rdi
shrq $2, %rdi
incq %rdi
movl %edi, %ebp
andl $3, %ebp
cmpq $12, %rax
movq %r12, 104(%rsp)
movq %rdx, 96(%rsp)
movq %rsi, 200(%rsp)
movq %rbp, 192(%rsp)
jae LBB5_7
xorl %ebp, %ebp
vmovupd 128(%rsp), %ymm1
jmp LBB5_9
LBB5_7:
leaq (%rdx,%r13,8), %rax
addq $96, %rax
movq %rax, 280(%rsp)
leaq (%r12,%r13,8), %rax
addq $96, %rax
movq %rax, 272(%rsp)
andq $-4, %rdi
negq %rdi
xorl %ebp, %ebp
vmovupd 128(%rsp), %ymm1
.p2align 4, 0x90
LBB5_8:
movq %rdi, 80(%rsp)
movq 272(%rsp), %r12
vmovupd -96(%r12,%rbp,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd %ymm1, %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 16(%rsp)
vzeroupper
callq *%r15
vmovapd %xmm0, (%rsp)
vpermilpd $1, 16(%rsp), %xmm0
callq *%r15
vmovapd (%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, (%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r15
vmovaps %xmm0, 16(%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r15
vmovapd 16(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, (%rsp), %ymm0, %ymm0
vmulpd 288(%rsp), %ymm0, %ymm0
movq 280(%rsp), %rbx
vmovupd %ymm0, -96(%rbx,%rbp,8)
vmovupd -64(%r12,%rbp,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 128(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 16(%rsp)
vzeroupper
callq *%r15
vmovapd %xmm0, (%rsp)
vpermilpd $1, 16(%rsp), %xmm0
callq *%r15
vmovapd (%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, (%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r15
vmovaps %xmm0, 16(%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r15
vmovapd 16(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, (%rsp), %ymm0, %ymm0
vmulpd 288(%rsp), %ymm0, %ymm0
vmovupd %ymm0, -64(%rbx,%rbp,8)
vmovupd -32(%r12,%rbp,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 128(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 16(%rsp)
vzeroupper
callq *%r15
vmovapd %xmm0, (%rsp)
vpermilpd $1, 16(%rsp), %xmm0
callq *%r15
vmovapd (%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, (%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r15
vmovaps %xmm0, 16(%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r15
vmovapd 16(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vinsertf128 $1, (%rsp), %ymm0, %ymm0
vmulpd 288(%rsp), %ymm0, %ymm0
vmovupd %ymm0, -32(%rbx,%rbp,8)
vmovupd (%r12,%rbp,8), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd 128(%rsp), %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, 16(%rsp)
vzeroupper
callq *%r15
vmovapd %xmm0, (%rsp)
vpermilpd $1, 16(%rsp), %xmm0
callq *%r15
vmovapd (%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, (%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r15
vmovaps %xmm0, 16(%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r15
movq 80(%rsp), %rdi
vmovupd 128(%rsp), %ymm1
vmovapd 16(%rsp), %xmm2
vunpcklpd %xmm0, %xmm2, %xmm0
vinsertf128 $1, (%rsp), %ymm0, %ymm0
vmulpd 288(%rsp), %ymm0, %ymm0
vmovupd %ymm0, (%rbx,%rbp,8)
addq $16, %rbp
addq $4, %rdi
jne LBB5_8
LBB5_9:
movq 192(%rsp), %r12
testq %r12, %r12
je LBB5_12
addq %r13, %rbp
shlq $5, %r12
movq 96(%rsp), %rax
leaq (%rax,%rbp,8), %rax
movq %rax, 16(%rsp)
movq 104(%rsp), %rax
leaq (%rax,%rbp,8), %rbp
xorl %ebx, %ebx
.p2align 4, 0x90
LBB5_11:
vmovupd (%rbp,%rbx), %ymm0
vmulpd %ymm0, %ymm0, %ymm0
vmulpd %ymm1, %ymm0, %ymm0
vmovupd %ymm0, 32(%rsp)
vextractf128 $1, %ymm0, %xmm0
vmovapd %xmm0, (%rsp)
vzeroupper
callq *%r15
vmovapd %xmm0, 80(%rsp)
vpermilpd $1, (%rsp), %xmm0
callq *%r15
vmovapd 80(%rsp), %xmm1
vunpcklpd %xmm0, %xmm1, %xmm0
vmovapd %xmm0, 80(%rsp)
vmovups 32(%rsp), %ymm0
vzeroupper
callq *%r15
vmovaps %xmm0, (%rsp)
vpermilpd $1, 32(%rsp), %xmm0
callq *%r15
vmovupd 128(%rsp), %ymm1
vmovapd (%rsp), %xmm2
vunpcklpd %xmm0, %xmm2, %xmm0
vinsertf128 $1, 80(%rsp), %ymm0, %ymm0
vmulpd 288(%rsp), %ymm0, %ymm0
movq 16(%rsp), %rax
vmovupd %ymm0, (%rax,%rbx)
addq $32, %rbx
cmpq %rbx, %r12
jne LBB5_11
LBB5_12:
movq 200(%rsp), %rax
cmpq %rax, %r14
movq 184(%rsp), %rbx
vmovsd 120(%rsp), %xmm1
movq 104(%rsp), %r12
movq 96(%rsp), %rbp
je LBB5_16
andl $3, %r14d
addq %rax, %r13
LBB5_14:
incq %r14
shlq $3, %r13
.p2align 4, 0x90
LBB5_15:
vmovsd (%r12,%r13), %xmm0
vmulsd %xmm0, %xmm0, %xmm0
vmulsd %xmm1, %xmm0, %xmm0
vzeroupper
callq *%r15
vmovsd 120(%rsp), %xmm1
vmulsd 336(%rsp), %xmm0, %xmm0
vmovsd %xmm0, (%rbp,%r13)
decq %r14
addq $8, %r13
cmpq $1, %r14
jg LBB5_15
jmp LBB5_16
LBB5_17:
addq $360, %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %r15
popq %rbp
vzeroupper
retq
.cfi_endproc
.globl _NRT_incref
.weak_def_can_be_hidden _NRT_incref
.p2align 4, 0x90
_NRT_incref:
testq %rdi, %rdi
je LBB6_1
lock incq (%rdi)
retq
LBB6_1:
retq
.comm __ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd,8,3
.section __DATA,__const
.p2align 4
_.const.picklebuf.5129668672:
.quad _.const.pickledata.5129668672
.long 69
.space 4
.quad _.const.pickledata.5129668672.sha1
.section __TEXT,__const
.p2align 4
_printf_format:
.asciz "num_threads: %d\n"
.section __DATA,__const
.p2align 4
_.const.picklebuf.5129402880:
.quad _.const.pickledata.5129402880
.long 112
.space 4
.quad _.const.pickledata.5129402880.sha1
.section __TEXT,__const
.p2align 4
_printf_format.1:
.asciz "num_threads: %d\n"
.p2align 4
_.const.pickledata.5129402880:
.ascii "\200\004\225e\000\000\000\000\000\000\000\214\bbuiltins\224\214\fRuntimeError\224\223\224\214@Invalid number of threads. This likely indicates a bug in Numba.\224\205\224N\207\224."
.p2align 4
_.const.pickledata.5129402880.sha1:
.ascii "\235\213\326\325A\263\3436\375y\027\231I@x\033\306\212:\212"
.p2align 4
_.const.pickledata.5129668672:
.ascii "\200\004\225:\000\000\000\000\000\000\000\214\bbuiltins\224\214\021ZeroDivisionError\224\223\224\214\020division by zero\224\205\224N\207\224."
.p2align 4
_.const.pickledata.5129668672.sha1:
.ascii "\262\200\b\240\370\213\255_\360\360$>\204\332\271\f\253\031\263f"
_.const.norm_pdf:
.asciz "norm_pdf"
.p2align 4
"_.const.missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd":
.asciz "missing Environment: _ZN08NumbaEnv8__main__13norm_pdf_2421B110c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYedCAs0UPuWp1kJR6LdBgYCxA7AAAE5ArrayIdLi1E1C7mutable7alignedEdd"
.p2align 4
"_.const.can't unbox array from PyObject into native value. The object maybe of a different type":
.asciz "can't unbox array from PyObject into native value. The object maybe of a different type"
.p2align 4
"_.const.`env.consts` is NULL in `read_const`":
.asciz "`env.consts` is NULL in `read_const`"
.p2align 4
_.const.pickledata.4576487424:
.ascii "\200\004\225\025\000\000\000\000\000\000\000\214\005numpy\224\214\007ndarray\224\223\224."
.p2align 4
_.const.pickledata.4576487424.sha1:
.ascii "\337\274\375\323\237\313&\364\320\306\200\225D\207\270\300\265;\270\243"
.p2align 4
"_.const.unknown error when calling native function":
.asciz "unknown error when calling native function"
.p2align 4
"_.const.<numba.core.cpu.CPUContext object at 0x131b86580>":
.asciz "<numba.core.cpu.CPUContext object at 0x131b86580>"
.p2align 4
"_.const.unknown error when calling native function.1":
.asciz "unknown error when calling native function"
.comm __ZN08NumbaEnv13_3cdynamic_3e42jit_wrapper__built_in_function_empty__2416B66c8tJTIeFCjyCbUFRqqOAK_2f6h0jAX2aI7qVodJKVeqwlUg4hHqC7MmIRJFEEM1gQAEx18class_28float64_29,8,3
.comm __ZN08NumbaEnv5numba2np8arrayobj19_call_allocator_247B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,3
.comm __ZN08NumbaEnv5numba2np8arrayobj18_ol_array_allocate12_3clocals_3e8impl_248B44c8tJTIeFCjyCbUFRqqOAK_2f6h0phxApMogijRBAA_3dEN29typeref_5b_3cclass_20_27numba4core5types8npytypes14Array_27_3e_5dExj,8,3
.section __DATA,__const
.p2align 4
_.const.picklebuf.5131375424:
.quad _.const.pickledata.5131375424
.long 77
.space 4
.quad _.const.pickledata.5131375424.sha1
.p2align 4
_.const.picklebuf.5131380544:
.quad _.const.pickledata.5131380544
.long 137
.space 4
.quad _.const.pickledata.5131380544.sha1
.section __TEXT,__const
.p2align 4
_.const.pickledata.5131380544:
.ascii "\200\004\225~\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214[array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.\224\205\224N\207\224."
.p2align 4
_.const.pickledata.5131380544.sha1:
.ascii "X\341N\314\265\007\261\340 i\201t\002#\346\205\313\214<W"
.p2align 4
_.const.pickledata.5131375424:
.ascii "\200\004\225B\000\000\000\000\000\000\000\214\bbuiltins\224\214\nValueError\224\223\224\214\037negative dimensions not allowed\224\205\224N\207\224."
.p2align 4
_.const.pickledata.5131375424.sha1:
.ascii "3\033\205c\275\271\332\310\0338B\"s\005,Ho\301pk"
.comm __ZN08NumbaEnv13_3cdynamic_3e38__numba_parfor_gufunc_0x131bdb2b0_2422B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEdd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE,8,3
.comm __ZN08NumbaEnv13_3cdynamic_3e38__numba_parfor_gufunc_0x131c72370_2423B128c8tJTC_2fWQAkyW1xhBopo9CCDiCFCDMJHDTCIGCy8IDxIcEFloKEF4UEDC8KBhhWIo6mjgJZwoWpwJVOYNCIkbcG2Aq2AhqUtemBWq4Ok1GuBdkFDFObcIohxmgA_3dE5ArrayIyLi1E1C7mutable7alignedEd5ArrayIdLi1E1C7mutable7alignedE5ArrayIdLi1E1C7mutable7alignedE,8,3
.comm __ZN08NumbaEnv5numba7cpython7numbers14int_power_impl12_3clocals_3e14int_power_2424B44c8tJTC_2fWQA9HW1CcAv0EjzIkAdRogEkUlYBZmgA_3dEdx,8,3
.subsections_via_symbols
This code section is very long, but the assembly grammar is very simple. Constants start with . and SOMETHING: is a jump label for the assembly equivalent of goto. Everything else is an instruction, with its name on the left and its arguments on the right.

You can google all the instructions; the interesting ones are those that end with pd, which are SIMD instructions that operate on several doubles at once. This is where the speed comes from. There is a lot of repetition, because the optimizer partially unrolled some loops to make them faster. Unrolled loops only work if the remaining chunk of data is large enough, and since the compiler does not know the length of the incoming array in advance, it also generates sections that handle shorter chunks, plus the code that selects which section to use. Finally, there is some code that translates from and to Python objects, with the corresponding error handling.

We don't need to write SIMD instructions by hand; the optimizer does it for us, and in a very sophisticated way.