Eric Bavier
2018-04-05 22:13:29 UTC
Hello Guix,
I recently discovered that the FFTW library can do runtime CPU
detection. For this to work, the package needs to be configured to
build SIMD "codelets", as our 'fftw-avx' currently does. Then,
based on the instruction support detected at runtime, the library makes
those kernels available to the FFTW "planner" for execution.
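For reference, here is a quick way to see which of these instruction sets a given machine reports. This is only a sketch: it assumes a Linux system exposing /proc/cpuinfo, the helper name cpu_has_flag is made up for illustration, and it is not how FFTW itself performs detection (the library does that internally).

```shell
# Hypothetical helper: succeed if the CPU "flags" line lists the given flag.
# The second argument (cpuinfo path) is optional and defaults to /proc/cpuinfo.
cpu_has_flag() {
  flag=$1
  cpuinfo=${2:-/proc/cpuinfo}
  # Take the first "flags" line, split it into one word per line,
  # and look for an exact match so that "avx" does not match "avx2".
  grep -m1 '^flags' "$cpuinfo" 2>/dev/null | tr ' \t' '\n\n' | grep -qx "$flag"
}

# Report the three instruction sets enabled in the build discussed here.
for f in sse2 avx avx2; do
  if cpu_has_flag "$f"; then
    echo "$f: yes"
  else
    echo "$f: no"
  fi
done
```

On a machine without /proc/cpuinfo the loop simply reports "no" for every flag.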
I tested this on two systems: 1) a system with SSE2, and 2) a system
with AVX2. I configured the library with "--enable-sse2 --enable-avx
--enable-avx2", then ran the following on both systems:
1)
$ ./tests/bench --verbose=3 --verify 'ibcd11x7x6v10'
Planning ibcd11x7x6v10...
using plan_many_dft
estimate-planner time: 0.004355 s
using plan_many_dft
planner time: 0.035684 s
(dft-rank>=2/1
  (dft-vrank>=1-x11/1
    (dft-rank>=2/1
      (dft-vrank>=1-x7/1
        (dft-direct-6-x10 "n1bv_6_sse2"))
      (dft-direct-7-x60 "n1bv_7_sse2")))
  (dft-direct-11-x420 "n1bv_11_sse2"))
flops: 36800 add, 9700 mul, 26260 fma
estimated cost: 99057.699080, pcost = 115706.000000
ibcd11x7x6v10 4.33362e-16 7.27264e-16 8.46842e-16
2)
$ ./tests/bench --verbose=3 --verify 'ibcd11x7x6v10'
Planning ibcd11x7x6v10...
using plan_many_dft
estimate-planner time: 0.001485 s
using plan_many_dft
planner time: 0.025788 s
(dft-rank>=2/1
  (dft-rank>=2/1
    (dft-vrank>=1-x77/1
      (dft-direct-6-x10 "n1bv_6_sse2"))
    (dft-vrank>=1-x11/1
      (dft-direct-7-x60 "n1bv_7_avx")))
  (dft-direct-11-x420 "n1bv_11_avx"))
flops: 12280 add, 2810 mul, 6950 fma
estimated cost: 28996.283180, pcost = 40767.000000
ibcd11x7x6v10 2.24601e-07 3.90447e-07 2.42548e-07
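To compare which codelets each plan picked, the quoted kernel names can be pulled out of the bench output. A small sketch, assuming the plan prints codelet names in double quotes as in the runs above; the helper name extract_codelets is hypothetical.

```shell
# Hypothetical helper: list the unique codelet names (the double-quoted
# strings) found in FFTW bench plan output read from stdin.
extract_codelets() {
  grep -o '"[^"]*"' | tr -d '"' | LC_ALL=C sort -u
}

# Usage (assumes a built FFTW source tree, as in the runs above):
#   ./tests/bench --verbose=3 --verify 'ibcd11x7x6v10' 2>&1 | extract_codelets
```

Applied to the second run above, this lists n1bv_11_avx, n1bv_6_sse2 and n1bv_7_avx; applied to the first run, only the three sse2 codelets.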
The attached patch is a WIP.
--
Eric Bavier, Scientific Libraries, Cray Inc.