
Kazushige Goto


Research Associate
High Performance Computing
(512) 471-3864
kgoto@tacc.utexas.edu

High-Performance BLAS by Kazushige Goto

Kazushige Goto (Visiting Scientist, UT-Austin)


What's New


To be kept informed sign the FLAME guest book

We CANNOT keep you informed of new developments and faster implementations UNLESS YOU SIGN UP!



Overview

During the last decade, a number of projects have pursued the high-performance implementation of matrix multiplication. Typically, these projects organize the computation around an "inner-kernel", C = trans(A) B + C, that keeps one of the operands in the L1 cache while streaming parts of the other operands through that cache. Variants include approaches that extend this principle to multiple levels of cache, or that apply the same principle to the L2 cache while essentially ignoring the L1 cache. The goal is to amortize the cost of moving data between memory layers as effectively as possible.
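The loop structure described above can be sketched in a few lines. This is a minimal illustration, not the library's code: the function names (inner_kernel, blocked_gemm_tn) and the block sizes mb, kb, nb are hypothetical, and plain Python lists stand in for cache-resident buffers. It shows only how one block of A is reused across successive panels of B and C.

```python
def inner_kernel(a_blk, b_panel, c_panel):
    """Update c_panel += trans(a_blk) * b_panel in place.

    a_blk is k x m, b_panel is k x n, c_panel is m x n.
    """
    k = len(a_blk)
    m = len(a_blk[0])
    n = len(b_panel[0])
    for i in range(m):
        for j in range(n):
            acc = c_panel[i][j]
            for p in range(k):
                acc += a_blk[p][i] * b_panel[p][j]
            c_panel[i][j] = acc


def blocked_gemm_tn(a, b, c, mb=2, kb=2, nb=2):
    """Compute C += trans(A) * B with A (K x M), B (K x N), C (M x N).

    Block sizes mb, kb, nb are illustrative assumptions; a real
    implementation would choose them from the cache sizes.
    """
    K, M, N = len(a), len(a[0]), len(b[0])
    for p0 in range(0, K, kb):
        for i0 in range(0, M, mb):
            # This block of A plays the role of the operand kept in cache.
            a_blk = [row[i0:i0 + mb] for row in a[p0:p0 + kb]]
            for j0 in range(0, N, nb):
                # Panels of B and C are "streamed" past the resident block.
                b_panel = [row[j0:j0 + nb] for row in b[p0:p0 + kb]]
                c_panel = [row[j0:j0 + nb] for row in c[i0:i0 + mb]]
                inner_kernel(a_blk, b_panel, c_panel)
                # Write the updated panel back into C.
                for di, row in enumerate(c_panel):
                    c[i0 + di][j0:j0 + nb] = row
    return c
```

In this sketch each block of A is extracted once and then used for every column panel of B, which is the amortization the paragraph above refers to.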

Our approach is fundamentally different. It starts by observing that for current generation architectures, much of the overhead comes from Translation Look-aside Buffer (TLB) misses. While the importance of caches is also taken into consideration, it is the minimization of such TLB misses that drives the approach. The result is a novel approach that achieves highly competitive performance on a broad spectrum of current high-performance architectures.
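To give a rough feel for why TLB misses can dominate, the sketch below counts how many distinct memory pages a 64 x 64 block of doubles touches. Everything in it is an assumption made for illustration: the function name pages_touched, the 4096-byte page size, the 8-byte element size, and the column-major layout with a leading dimension. The second call models the same block copied into a contiguous buffer, which is one common way to reduce the page count; this page does not say whether the library does so.

```python
def pages_touched(rows, cols, leading_dim, elem_bytes=8, page_bytes=4096, base=0):
    """Count distinct pages touched by a rows x cols block of a
    column-major matrix whose columns are leading_dim elements apart."""
    pages = set()
    for j in range(cols):
        start = base + j * leading_dim * elem_bytes   # first byte of column j
        end = start + rows * elem_bytes - 1           # last byte of column j
        pages.update(range(start // page_bytes, end // page_bytes + 1))
    return len(pages)


# Block addressed in place inside a matrix with 4096 rows:
# every column of the block lands on a different page.
in_place = pages_touched(64, 64, leading_dim=4096)

# Same block stored contiguously (leading dimension equal to its row count).
packed = pages_touched(64, 64, leading_dim=64)

print(in_place, packed)  # prints "64 8"
```

Under these assumptions the in-place block needs 64 page translations while the contiguous copy needs 8, so a TLB with only a few dozen entries would be under far less pressure in the second case.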

In addition, we support a large number of BLAS routines as part of the library.


About the current implementation:

We have implementations for the Intel Pentium (R) III and 4 processors, the Compaq Alpha, and the IBM Power 3. While we support all flavors of arithmetic, we currently only make double precision real available. Contact us (at kgoto@cs.utexas.edu) if you would like to try a complete library.

Obtaining the library

To download the library, please click on the appropriate link:

Please

  • Do not redistribute the library.
  • Point others to this web page instead.
  • Reference this work when you use it successfully for your own research.


Performance

For additional performance results, see our paper


Related Publications

For related publications, see the FLAME publication web page.

Related Projects


Commonly asked questions

  • Yes, we rely on assembly-coded kernels.


Get on the FLAME mailing list

Please sign the FLAME guest book so we can keep you informed of new developments regarding this work and related work.


Happy users

  • GOTO BLAS helped lift an LLNL computer from 5.69 TeraFLOPS to 7.63 TeraFLOPS on the LINPACK Benchmark.

  • Scott Studham at Pacific Northwest National Laboratory made us aware that they used the GOTO BLAS to benchmark their 1540 Intel Itanium2 processor based cluster, an HP RX2600 Itanium Linux Elan3 SuperComputer, achieving 4.881 TeraFLOPS out of a possible 6.160 TeraFLOPS, or 79.2% of peak. This is an astonishingly high percentage of peak for a cluster.

  • The Center for Computational Research, University at Buffalo-SUNY
    • The performance of the 600 processor (Pentium 4 processor based) cluster at the University at Buffalo-SUNY was increased from roughly 1.5 TeraFLOPS to 2.0 TeraFLOPS (HPL LINPACK benchmark used for the TOP500 list).

  • UMFPACK
    • "... increased the performance of UMFPACK v4.1 by up to 50% on my Dell Latitude C840 laptop (2GHz Pentium 4M, 512 L2 cache, 1GB memory), from a peak of about 800 mflops to 1.2 Gflops." -- Tim Davis.

      More recently: "UMFPACK now peaks at 1.65 Gflops with your BLAS. ... Patrick Amestoy's MA41 (the asymmetrized version) peaks at 1.96 Gflops. These figures include the symbolic pre-analysis & minimum degree ordering phase, which don't do any flops at all."-- Tim Davis

  • Are you a happy user? If so, please give us a link to your pages.



Please give us feedback on how this kernel helps or hurts performance for your application by mailing flame@cs.utexas.edu.


Disclaimer

WE MAKE THESE LIBRARIES AVAILABLE FOR EVALUATION PURPOSES. IN OTHER WORDS, WE WISH TO EVALUATE THE USEFULNESS OF THESE TECHNIQUES AND THEREFORE NEED YOUR FEEDBACK.

THE MATERIALS ARE PROVIDED "AS IS" WITHOUT ANY EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY, NONINFRINGEMENT OF INTELLECTUAL PROPERTY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS OR ITS SUPPLIERS BE LIABLE FOR ANY DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS, BUSINESS INTERRUPTION, LOSS OF INFORMATION) ARISING OUT OF THE USE OF OR INABILITY TO USE THE MATERIALS, EVEN IF THE UNIVERSITY OF TEXAS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY NOT APPLY TO YOU. The University of Texas further does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within these materials. The University of Texas may make changes to these materials, or to the products described therein, at any time without notice. The University of Texas makes no commitment to update the Materials.