#318 Introduction to BLAS
Closed: published 6 months ago by rlengland. Opened 7 months ago by rlengland.

Article Summary:
An article to better explain the BLAS spec with practical examples

Article Description:
I would like to propose an article introducing programmers to the BLAS spec and its implementations, from the Fortran reference implementation to OpenBLAS and the FlexiBLAS wrapper that Fedora uses. It would include some historical context and some examples showing how fast a BLAS library is compared to a custom implementation, and how it can be imported and used when fast matrix operations are needed. From personal experience I can tell the information is currently scattered across various Wikipedia pages and discontinued webpages.
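For a rough idea of what "imported and used" means here, a minimal C sketch might look like the following (this is only an illustration of the kind of example the article could contain; the package names, header location and link flags are my assumption, e.g. cblas.h as shipped by openblas-devel or flexiblas-devel on Fedora):

/*
 * Minimal sketch: C = A * B via the CBLAS interface instead of hand-written
 * loops. Build with something like: gcc dgemm_demo.c -lflexiblas (or
 * -lopenblas); the exact include path and link flag depend on the installed
 * BLAS package.
 */
#include <stdio.h>
#include <cblas.h>

#define N 4

int main(void)
{
    double A[N * N], B[N * N], C[N * N];

    /* fill A and B with something simple, zero C */
    for (int i = 0; i < N * N; i++) {
        A[i] = i;
        B[i] = 1.0;
        C[i] = 0.0;
    }

    /* C = 1.0 * A * B + 0.0 * C, row-major, no transposes */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, 1.0, A, N, B, N, 0.0, C, N);

    printf("C[0][0] = %f\n", C[0]);
    return 0;
}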

https://discussion.fedoraproject.org/t/article-proposal-introduction-to-blas/130901/3


Hi @rlengland

I created a draft here https://fedoramagazine.org/?p=40987&preview=true&preview_id=40987

I will keep updating it over the following weeks, but I think it might take a month for me to finish it.

Thanks @romangherta. Let us know when you feel it is ready for review and we'll move forward with the review/edit.

Good morning, I think I finished at least a first draft. Please take a look and let me know if something is incomplete or doesn't seem right.

Metadata Update from @rlengland:
- Custom field preview-link adjusted to https://fedoramagazine.org/?p=40987&preview=true

7 months ago

Metadata Update from @rlengland:
- Custom field image-editor adjusted to rlengland
- Issue untagged with: needs-image

7 months ago

The following are suggestions. You are welcome to disagree with them or to devise some other phrasing.


... in an effort to standardize and fasten ...

"in an effort" might be seen as implying that there were more efforts/attempts and that this effort was not entirely successful. I would omit that phrase.


... to standardize and fasten ...

I'd drop "and fasten". It's a little redundant and maybe even dated. It means "to attach" or "connect" to many people. The "to establish" sense of the word is not as familiar anymore.


Mostly this involved initially ...

Avoid packing too many adverbs into a short phrase. How about "Over time, BLAS came to support more complex algorithms. BLAS initially supported ..."?


For a general idea, the netlib Quick Reference Guide is enough.

How about "For a general overview, the netlib Quick Reference Guide is a good start."?


If however more explanations or examples are needed, section “References” of pdf contains titles of 3 research papers that ...

How about "Be sure to check out the References section of the Quick References Guide for more detailed explanations and examples. It cites three research papers that ..."


These hardware are programmed using other terminologies and programming paradigms as ones used here but matrix multiplication is still a core operation.

How about "These devices are programmed using different terms and programming paradigms from the ones used here, but matrix multiplication is still a core operation."


... and this is how blas came to being.

The phrasing should either be "came to be" or "came into being" (the latter sounds better to me in this context).


There are also a few capitalization and punctuation issues scattered throughout the text, but I'll correct those after the sentence phrasing has been decided.

Otherwise, this LGTM. 🙂

I also noted that the page linked to in the first paragraph (specification) points to a site (https://www.netlib.org/blas/index.html) that has many broken links (Not Found). This might be out of your control but if there is an alternative it might help the reader.

Hi @glb, corrected, thank you. I'll try to run it through a grammar tool next time.

@rlengland I replaced one link with the Wikipedia page. I would leave the other link, because netlib is still considered a central repository with some history about BLAS/LAPACK; I even found the flexiblas 2013 paper somewhere there, but it's chaotic, I agree.

Give me a few days, until next week, to double-check and get some feedback about flexiblas. There is also the idea of trying to use OpenMP to offload those loops to a GPU and see how much faster it gets. Maybe not a good idea for the current article, but I still need to check the code blocks one more time.

A nice weekend ahead, I'll get back with an update next week.

Good morning

I finished fixing some small mistakes, rephrased a few other sentences, and removed something I wasn't sure of. I also fixed a race condition I had missed in one of the code blocks, because I declared i, j, p in the wrong place and they were implicitly shared between threads...
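For reference, this is roughly what I mean (a simplified sketch, not the exact block from the article): if the loop indices are declared outside the parallel loop they are shared between threads and the iterations race on them; declaring them inside the for statements (or listing them in a private clause) fixes it.

#include <omp.h>

#define N 1000
double A[N][N], B[N][N], C[N][N];

void multiply(void)
{
    /*
     * Broken version (what I had): i, j, p declared here, outside the
     * parallel loop, are shared between all threads, so the loops race:
     *
     *     int i, j, p;
     *     #pragma omp parallel for
     *     for (i = 0; i < N; i++) ...
     *
     * Fixed version: the indices are declared in the for statements, so each
     * thread gets its own copies (a private(i,j,p) clause would also work).
     */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int p = 0; p < N; p++)
                C[i][j] += A[i][p] * B[p][j];
}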

I have two points:

  • flexiblas

I installed a fresh Fedora 40 on VMM and when I double-checked I could not find flexiblas anymore... although I remember I did not have to install it on the host... So I thought it would be nice to mention that Fedora ships with flexiblas by default, but I am not sure this is still the case and I don't know how to check it. I added a dnf install flexiblas line to account for the worst case.

  • gpu

I was thinking it might be great to also show off some GPU testing. I discovered, to my embarrassment, that OpenMP supports GPU offload in its latest major versions via the target directive. But not all compilers support the latest OpenMP... So I installed Fedora on my GPU laptop... Then I was able to install the NVIDIA drivers only after disabling Secure Boot in the BIOS... Finally I was able to run the following lines:

user@fedora:~$ sudo dnf install -y gcc gcc-offload-nvptx

user@fedora:~$ cat <<EOF > main.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 3000

double A[N][N], B[N][N], C[N][N];

int main(void)
{
        // seed A and B with random values, zero C
        for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) {
                        A[i][j] = rand();
                        B[i][j] = rand();
                        C[i][j] = 0;
                }

        // offload the triple-loop matrix multiplication to the GPU
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                        for (int p = 0; p < N; p++)
                                C[i][j] = A[i][p] * B[p][j] + C[i][j];

        return 0;
}
EOF

user@fedora:~$ gcc main.c -fopenmp -foffload=nvptx-none

And while I am pretty sure that compiler directive should be OK, to my surprise this ran in 4 minutes, even though my GPU is an NVIDIA GeForce 1650 Mobile with 800 CUDA cores... So in my opinion the cause is either that OpenMP or the compiler does not support this specific directive, or maybe I should try another compiler, or maybe my drivers are not really working well, or maybe matrix multiplication is slow on the GPU too, which would explain the existence of cuBLAS... In either case, I need to play more with this, so I think it's not worth adding to the article...
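One thing I still want to try on the GPU side (only a guess at this point, I have not verified that it helps) is making the data movement explicit instead of relying on the implicit mapping of the global arrays, and collapsing the two outer loops to expose more parallelism:

/*
 * Sketch of the variant I would try next: same kernel, but with explicit
 * map clauses and collapse(2). Not measured yet, so not in the article.
 */
#pragma omp target teams distribute parallel for collapse(2) \
        map(to: A, B) map(tofrom: C)
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int p = 0; p < N; p++)
            C[i][j] = A[i][p] * B[p][j] + C[i][j];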

Another counterintuitive thing... for the single-core example, if I change N to a power of 2, i.e. 4096, I notice the execution time doubles, and I can't think of a reason for this...

Aside from this, I think I am done editing. Feel free to correct or edit as you see fit.

@romangherta it sounds like you may have material for an article on performance or optimization. It would be welcomed. :-)

I've made some minor changes for readability. Nothing major but if you want to verify that I've not introduced any conflicts or issues, now would be a good time.

When we have your approval we can schedule the article for publication.

Thank you for contributing to the Fedora Magazine.

Metadata Update from @rlengland:
- Custom field editor adjusted to glb rlengland

7 months ago

Hi @rlengland, it looks much better, no objections from my side.
@glb I need to read this carefully tomorrow after a good night's sleep, I already started to question my sanity.

Thank you both for your time

@glb I need to read this carefully tomorrow after a good night's sleep, I already started to question my sanity.

It's my mistake -- when you allocate an array with int A[4096], the array is of size 4096, not 4097. :person_facepalming: I guess I need to do more coding, I'm losing it. 😛

So I have no idea what the reason would be for N=4096 being slower than N=5000. That is confusing. I'd still guess it is something to do with resource allocation since 4095 is so much faster than 4096, but I have no idea why "4096" or other powers of two are the "magic" number. The spikes in overhead should happen at 2^N+1 (e.g., when additional memory pages or disk sectors need to be allocated).
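If it really is cache conflicts caused by the power-of-two stride, one experiment that might confirm it (just a guess on my part, I haven't tried it) is padding each row so that consecutive rows no longer map to the same cache sets:

/*
 * Hypothetical experiment: keep N = 4096 but pad each row by one element.
 * If the power-of-two stride is the culprit, this version should be
 * noticeably faster than the unpadded one.
 */
#define N   4096
#define PAD 1
double A[N][N + PAD], B[N][N + PAD], C[N][N + PAD];

/* ... the triple loop stays the same, still iterating j and p up to N,
 * just over the padded arrays ... */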

Metadata Update from @rlengland:
- Custom field publish adjusted to 2024-09-30

7 months ago

@romangherta your article is scheduled for publication this Monday 30 Sept 0800 UTC

Thank you for your contribution to the Fedora Magazine

Hi @glb, one small update. It does indeed seem to be related to memory alignment, like you said in your initial comment. I tried two small examples, N=1024 and N=1100, and in the end I used perf. The results are below.

romh@fedora:/tmp/newdir$ perf stat -d -d -d -e cache-misses,branch-misses ./simple_mm_1100.o

 Performance counter stats for './simple_mm_1100.o':

        56,941,948      cache-misses:u                                                          (45.45%)
         1,306,779      branch-misses:u                                                         (45.47%)
    18,804,401,365      L1-dcache-loads:u                                                       (45.48%)
     1,507,128,919      L1-dcache-load-misses:u          #    8.01% of all L1-dcache accesses   (45.48%)
   <not supported>      LLC-loads:u                                                           
   <not supported>      LLC-load-misses:u                                                     
        16,143,263      L1-icache-loads:u                                                       (45.48%)
           101,420      L1-icache-load-misses:u          #    0.63% of all L1-icache accesses   (45.46%)
     1,336,993,336      dTLB-loads:u                                                            (45.44%)
         5,336,821      dTLB-load-misses:u               #    0.40% of all dTLB cache accesses  (45.44%)
                 0      iTLB-loads:u                                                            (45.43%)
                 0      iTLB-load-misses:u                                                      (45.44%)
       168,732,329      L1-dcache-prefetches:u                                                  (45.44%)
   <not supported>      L1-dcache-prefetch-misses:u                                           

       5.789180084 seconds time elapsed

       5.777109000 seconds user
       0.010998000 seconds sys


romh@fedora:/tmp/newdir$ perf stat -d -d -d -e cache-misses,branch-misses ./simple_mm_1024.o

 Performance counter stats for './simple_mm_1024.o':

     1,091,484,061      cache-misses:u                                                          (45.45%)
         1,142,442      branch-misses:u                                                         (45.45%)
    15,321,560,005      L1-dcache-loads:u                                                       (45.46%)
     1,211,167,695      L1-dcache-load-misses:u          #    7.90% of all L1-dcache accesses   (45.46%)
   <not supported>      LLC-loads:u                                                           
   <not supported>      LLC-load-misses:u                                                     
        59,840,146      L1-icache-loads:u                                                       (45.46%)
           514,357      L1-icache-load-misses:u          #    0.86% of all L1-icache accesses   (45.46%)
     1,085,775,241      dTLB-loads:u                                                            (45.46%)
        36,329,162      dTLB-load-misses:u               #    3.35% of all dTLB cache accesses  (45.46%)
                 2      iTLB-loads:u                                                            (45.45%)
               211      iTLB-load-misses:u               # 10550.00% of all iTLB cache accesses  (45.45%)
       881,108,974      L1-dcache-prefetches:u                                                  (45.45%)
   <not supported>      L1-dcache-prefetch-misses:u                                           

      13.743535143 seconds time elapsed

      13.707549000 seconds user
       0.022985000 seconds sys

So the 1024 example was about 2.4 times slower, and the counters that jump out at me are the cache-misses and the dTLB-load-misses... I will assume this is a CPU cache issue, and fully understanding it would require more knowledge of the cache's inner workings than I have right now.

Amusingly, BLAS libraries don't have this problem. I remember reading they use some kind of technique called cache blocking... anyway, maybe I will get back to this in the future; now I need a break from the laptop. A nice week ahead to all.
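My rough understanding of the idea, for whoever is curious (this is only a naive sketch, real BLAS kernels are far more sophisticated), is to work on small tiles so the data being reused stays resident in cache:

/*
 * Naive sketch of cache blocking (loop tiling): multiply BS x BS tiles so
 * each tile of A, B and C fits in cache while it is being reused.
 * BS would need tuning; N is assumed to be a multiple of BS here.
 */
#define N  1024
#define BS 64

double A[N][N], B[N][N], C[N][N];

void blocked_mm(void)
{
    for (int ii = 0; ii < N; ii += BS)
        for (int pp = 0; pp < N; pp += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply the (ii,pp) tile of A by the (pp,jj) tile of B
                 * and accumulate into the (ii,jj) tile of C */
                for (int i = ii; i < ii + BS; i++)
                    for (int p = pp; p < pp + BS; p++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][p] * B[p][j];
}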

Issue status updated to: Closed (was: Open)
Issue close_status updated to: published

6 months ago

