High-Performance BLAS by Kazushige Goto
What's New
- Please visit our Software and Tools page to download the latest version of GotoBLAS.
- GOTO BLAS helped lift the LLNL computer from 5.69 TeraFLOPS to 7.63 TeraFLOPS on the LINPACK Benchmark.
- sgemm, dgemm, cgemm, zgemm for the Intel Itanium and Itanium2 (R) architectures.
- Complete BLAS for Pentium architectures.
- UMFPACK likes it!
- Impact of our approach on the Massively Parallel LINPACK benchmark.
- Kazushige Goto and Robert van de Geijn. On Reducing TLB Misses in Matrix Multiplication. FLAME Working Note #9, The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-2002-55. Nov. 2002.
We cannot keep you informed of new developments and faster implementations UNLESS YOU SIGN UP!
- Overview
- About the current implementation
- Obtaining the library
- Performance
- Publications
- Related Projects
- Commonly asked questions
- Get on the FLAME mailing list
- Happy users
- Future directions
- Disclaimer
Overview
During the last decade, a number of projects have pursued the high-performance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner-kernel", C = trans(A) B + C, that keeps one of the operands in the L1 cache while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache, or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The goal is to optimally amortize the cost of moving data between memory layers.
Our approach is fundamentally different. It starts by observing that for current-generation architectures, much of the overhead comes from Translation Look-aside Buffer (TLB) misses. While the importance of caches is also taken into consideration, it is the minimization of such TLB misses that drives the approach. The result is a novel approach that achieves highly competitive performance on a broad spectrum of current high-performance architectures.
In addition, we support a large number of BLAS routines as part of the library.
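The conventional "inner-kernel" organization described above can be sketched in C as follows. This is a minimal illustration of the cache-blocking idea only, not the TLB-driven method or the assembly-coded kernels of this library. The function names (`inner_kernel`, `blocked_dgemm`) and the block sizes `MB` and `KB` are hypothetical, and column-major storage is assumed, as in the BLAS. A block of A is packed in transposed form so that it can stay cache-resident, and the kernel then computes C = trans(At) B + C while streaming B and C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical block sizes for illustration; real values depend on the
   cache sizes of the target processor. 32*32 doubles = 8 KB. */
enum { MB = 32, KB = 32 };

/* Inner kernel: C = trans(At) * B + C.
   At is a packed kb x mb block (column-major, leading dimension kb),
   B is kb x n and C is mb x n (column-major). At is reused for every
   column of B, so it can remain in cache while B and C stream past. */
static void inner_kernel(size_t mb, size_t n, size_t kb,
                         const double *at,
                         const double *b, size_t ldb,
                         double *c, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        const double *bj = b + j * ldb;
        double *cj = c + j * ldc;
        for (size_t i = 0; i < mb; ++i) {
            const double *ai = at + i * kb;  /* contiguous column of At */
            double sum = 0.0;
            for (size_t p = 0; p < kb; ++p)
                sum += ai[p] * bj[p];        /* unit-stride dot product */
            cj[i] += sum;
        }
    }
}

/* C = A * B + C for column-major A (m x k), B (k x n), C (m x n). */
void blocked_dgemm(size_t m, size_t n, size_t k,
                   const double *a, size_t lda,
                   const double *b, size_t ldb,
                   double *c, size_t ldc)
{
    double at[MB * KB];                      /* packed, transposed block of A */
    assert(lda >= m && ldb >= k && ldc >= m);

    for (size_t p0 = 0; p0 < k; p0 += KB) {
        size_t kb = (k - p0 < KB) ? k - p0 : KB;
        for (size_t i0 = 0; i0 < m; i0 += MB) {
            size_t mb = (m - i0 < MB) ? m - i0 : MB;
            /* Pack A(i0:i0+mb, p0:p0+kb) transposed into at. */
            for (size_t i = 0; i < mb; ++i)
                for (size_t p = 0; p < kb; ++p)
                    at[p + i * kb] = a[(i0 + i) + (p0 + p) * lda];
            inner_kernel(mb, n, kb, at, b + p0, ldb, c + i0, ldc);
        }
    }
}
```

Packing the block of A in transposed form is what turns each update of an element of C into a dot product over two contiguous vectors, which is why the kernel is stated in terms of trans(A).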
About the current implementation:
We have implementations for the Intel Pentium (R) III and 4 processors, the Compaq Alpha, and the IBM Power 3. While we support all flavors of arithmetic, we currently only make double precision real available. Contact us (at kgoto@cs.utexas.edu) if you would like to try a complete library.
Obtaining the library
To download the library, please click on the appropriate link:
- Intel Pentium (R) III
- Intel Pentium (R) 4
- IBM Power 3
- IBM Power 4
- Intel Itanium
- Intel Itanium2
- AMD Opteron
Please
- Do not redistribute the library.
- Point others to this web page instead.
- Reference this work when you use it successfully for your own research.
Performance
For additional performance results, see our paper
- Kazushige Goto and Robert van de Geijn. On Reducing TLB Misses in Matrix Multiplication. FLAME Working Note #9, The University of Texas at Austin, Department of Computer Sciences. Technical Report TR-2002-55. Nov. 2002.
Related Publications
For related publications, see the FLAME publication web page.
Related Projects
- ITXGEMM
- FLAME: Formal Linear Algebra Methods Environment.
- FLARE: Formal Linear Algebra Recovery Environment. (A fault-tolerant version of the BLAS fit for space travel)
- PLAPACK: Parallel Linear Algebra Package
Commonly asked questions
- Yes, we rely on assembly-coded kernels.
Get on the FLAME mailing list
Please sign the FLAME guest book so we can keep you informed of new developments regarding this work and related work.
Happy users
- GOTO BLAS helped lift the LLNL computer from 5.69 TeraFLOPS to 7.63 TeraFLOPS on the LINPACK Benchmark.
- Scott Studham at Pacific Northwest National Laboratory made us aware that they used the GOTO BLAS to benchmark their 1540-processor Intel Itanium2 based cluster, an HP RX2600 Itanium Linux Elan3 SuperComputer, achieving 4.881 TeraFLOPS out of a possible 6.160 TeraFLOPS, or 79.2% of peak. This is an astonishingly high percentage of peak for a cluster.
- The Center for Computational Research, University at Buffalo-SUNY
- The performance of the 600-processor (Pentium 4 processor based) cluster at the University at Buffalo-SUNY was increased from roughly 1.5 TeraFLOPS to 2.0 TeraFLOPS (HPL LINPACK benchmark used for the TOP500 list).
- UMFPACK
- "... increased the performance of UMFPACK v4.1 by up to 50% on my Dell Latitude C840 laptop (2GHz Pentium 4M, 512 L2 cache, 1GB memory), from a peak of about 800 mflops to 1.2 Gflops." -- Tim Davis
- More recently: "UMFPACK now peaks at 1.65 Gflops with your BLAS. ... Patrick Amestoy's MA41 (the asymmetrized version) peaks at 1.96 Gflops. These figures include the symbolic pre-analysis & minimum degree ordering phase, which don't do any flops at all." -- Tim Davis
- Are you a happy user? If so, please give us a link to your pages.
Please give us feedback on how this kernel helps or hurts performance for your application by sending mail to flame@cs.utexas.edu.
Disclaimer
WE MAKE THESE LIBRARIES AVAILABLE FOR EVALUATION PURPOSES. IN OTHER WORDS, WE WISH TO EVALUATE THE USEFULNESS OF THESE TECHNIQUES AND THEREFORE NEED YOUR FEEDBACK.
THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT OF INTELLECTUAL PROPERTY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS OR ITS SUPPLIERS BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, LOSS OF INFORMATION) ARISING OUT OF THE USE OF OR INABILITY TO USE THE MATERIALS, EVEN IF THE UNIVERSITY OF TEXAS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU.
The University of Texas further does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within these materials. The University of Texas may make changes to these materials, or to the products described therein, at any time without notice. The University of Texas makes no commitment to update the Materials.



