TN-Grid on AMD GPUs

Message boards : Number crunching : TN-Grid on AMD GPUs

Author Message
Alessio Susi
Avatar
Send message
Joined: 7 Jun 15
Posts: 15
Credit: 661,981
RAC: 0
Italy
Message 1739 - Posted: 3 Apr 2020, 12:42:01 UTC

Hi. Will you create an application for AMD GPUs? At the moment mine is working on Folding@Home (when work is available) or MilkyWay@Home, but I'd like to help medical research.
____________
ASUS X570 E-Gaming
AMD Ryzen 9 3950X, 16 core / 32 thread 4.4 GHz
AMD Radeon Sapphire RX 480 4GB Nitro+
Nvidia GTX 1080 Ti Gaming X Trio
4x16 GB Corsair Vengeance RGB 3466 MHz

Profile Buro87 [Lombardia]
Send message
Joined: 23 Nov 16
Posts: 100
Credit: 4,000,541
RAC: 0
Italy
Message 1741 - Posted: 3 Apr 2020, 14:42:34 UTC - in response to Message 1739.

I remember that they said their algorithm and simulations are sequential and would not benefit from a GPU app, because GPUs are suited to highly parallelized work.

L.Gilson
Volunteer developer
Send message
Joined: 25 Mar 20
Posts: 5
Credit: 596,499
RAC: 0
Germany
Message 1766 - Posted: 10 Apr 2020, 17:05:02 UTC - in response to Message 1739.

Will you create an application for AMD GPUs?


I'm actually on it.

The client code is open source and anyone willing can improve it. Several people from outside the research group have already added very clever stuff (see the posts by "[B@P] Daniel"), but so far no GPU version has made it to a stable release. I'm now focusing on a new OpenCL (Nvidia and AMD GPU) implementation.

The current code cannot be ported 1:1, but the idea behind it can be. The switch from sequential to parallel will have two effects:

1. The raw number of calculations goes up by a factor of 1.5-2. That isn't too bad, since most computers have a far faster GPU than CPU.

2. The results will change slightly. I contacted the team about this and they seem willing to at least consider the option. The change in results means the CPU and GPU versions cannot crunch the same WU without major changes on the server side, but the scientific output will be comparable.

TN-Grid recovers links between gene expressions. Nobody knows for sure what the true result is: there is uncertainty from the measurement itself, from the statistics, and so on. There are also test data sets generated by a human (ground-truth data), and even for those the algorithms don't hit the target perfectly. Getting the main parts right is good enough, for various liberal definitions of "main", "right" and "good enough".

Porting the code takes time, so don't hold your breath. I'm not part of the team, I'm not affiliated with the university, the scientific team may not like the result in the end, and I do this on my own in my free time. It currently looks doable without major gotchas. It will be faster than the CPU version, but not by an exorbitant factor: expect 25x or maybe 100x faster than a single CPU core, not some magical 10,000x. Given that modern CPUs with 8 cores / 16 threads already run 16 WUs in parallel, the impact will not be a total game changer. The current code is already good and highly optimized.

ETA-wise, think months, not days. And the results will need to be verified, which will also take time.

Falconet
Send message
Joined: 21 Dec 16
Posts: 105
Credit: 3,078,410
RAC: 4
Portugal
Message 1767 - Posted: 10 Apr 2020, 18:38:27 UTC - in response to Message 1766.
Last modified: 10 Apr 2020, 18:38:34 UTC

Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's.

Profile Buro87 [Lombardia]
Send message
Joined: 23 Nov 16
Posts: 100
Credit: 4,000,541
RAC: 0
Italy
Message 1768 - Posted: 10 Apr 2020, 19:59:37 UTC - in response to Message 1767.

Thank you very much L.Gilson :)

Jim1348
Send message
Joined: 29 Dec 16
Posts: 87
Credit: 21,013,002
RAC: 0
United States
Message 1769 - Posted: 10 Apr 2020, 21:07:50 UTC - in response to Message 1767.

Regardless of how it turns out, you have my thanks and, I'm sure, everyone else's.

Yes, certainly. Most projects I have seen get maybe a 15X - 20X speedup with a GPU.
But TN-Grid had better start planning for more work. This will keep them busy.

italianpower18
Send message
Joined: 18 Oct 18
Posts: 6
Credit: 16,414,273
RAC: 0
Italy
Message 1770 - Posted: 12 Apr 2020, 16:54:57 UTC - in response to Message 1766.

Thank you, Mr. Gilson.

L.Gilson
Volunteer developer
Send message
Joined: 25 Mar 20
Posts: 5
Credit: 596,499
RAC: 0
Germany
Message 1774 - Posted: 15 Apr 2020, 20:10:18 UTC

Thanks for the positive feedback. I have good and bad news.

Good news first:
the OpenCL version is running and produces roughly the same results. "Roughly" means an 8% difference on the test data included in the repo (8% at the level of individual links). I have not yet tested on current production data.

Bad news:
The OpenCL port is in large parts pretty much a 1:1 copy of the CPU code, which means it is still very, very slow: memory access is all over the place and warp divergence must be huge. Long story short, it's about as fast as a single CPU core. I sort of expected that; 1:1 ports never end up looking good. I'm now looking at which factors dominate the slowdown and trying to eliminate them (where possible).
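For illustration, here is a minimal OpenCL sketch of the kind of access-pattern problem I mean (hypothetical buffer and variable names, not the actual TN-Grid kernels). Both kernels compute the same per-variable sum, but only the second one lets neighbouring work-items read neighbouring addresses, so only the second one gets coalesced memory transactions:

/* Assumed layout: an n_vars x n_samples expression matrix, one work-item per variable. */

/* "1:1 port" style: each work-item walks its own row, so at every loop
   iteration the work-items of a wavefront touch addresses n_samples apart
   (uncoalesced: many separate memory transactions). */
__kernel void sum_per_variable_strided(__global const float *data,
                                       const int n_samples,
                                       __global float *out)
{
    const int v = get_global_id(0);
    float acc = 0.0f;
    for (int s = 0; s < n_samples; ++s)
        acc += data[v * n_samples + s];   /* neighbouring work-items: stride n_samples */
    out[v] = acc;
}

/* GPU-friendly version: store the matrix transposed (sample-major), so at
   every loop iteration neighbouring work-items read neighbouring floats. */
__kernel void sum_per_variable_coalesced(__global const float *data_t,
                                         const int n_vars,
                                         const int n_samples,
                                         __global float *out)
{
    const int v = get_global_id(0);
    float acc = 0.0f;
    for (int s = 0; s < n_samples; ++s)
        acc += data_t[s * n_vars + v];    /* neighbouring work-items: stride 1 */
    out[v] = acc;
}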

I will keep you informed ...

Profile [B@P] Daniel
Volunteer developer
Send message
Joined: 19 Oct 16
Posts: 90
Credit: 2,205,103
RAC: 0
Poland
Message 1779 - Posted: 17 Apr 2020, 20:12:53 UTC - in response to Message 1774.
Last modified: 17 Apr 2020, 20:13:51 UTC

The OpenCL port is in large parts pretty much a 1:1 copy of the CPU code [...] Long story short, it's about as fast as a single CPU core.


I am the author of the optimized apps TN-Grid uses now. I also tried to port the app to the GPU, but faced the same problem: global memory access was too slow. If I remember correctly, I tried to port the app version which does not use Gray codes (the predecessor of the current version); it looked more GPU-friendly to me. I was looking for a potential solution to this slow memory access and asked for an algorithm on Stack Overflow. I got an answer, but I never actually tried to implement it; I was busy with other things at the time. Here is a link to that question, I hope it will be useful for you. Good luck!
https://stackoverflow.com/questions/46635137/how-to-generate-combinations-in-chunks
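For reference, one common way to "chunk" combinations (not necessarily what the accepted answer there proposes) is to unrank them: compute the combination that sits at a given lexicographic index directly, so every chunk or work-item can jump straight to its own starting rank without enumerating everything before it. A small, self-contained C sketch with made-up names:

#include <stdint.h>
#include <stdio.h>

/* Binomial coefficient C(n, k), exact in 64-bit for the sizes used here. */
static uint64_t binom(int n, int k)
{
    if (k < 0 || k > n) return 0;
    uint64_t r = 1;
    for (int i = 1; i <= k; ++i)
        r = r * (uint64_t)(n - k + i) / (uint64_t)i;
    return r;
}

/* Write the 'rank'-th k-combination of {0, ..., n-1} (lexicographic order) into c[]. */
static void unrank_combination(uint64_t rank, int n, int k, int *c)
{
    int x = 0;
    for (int i = 0; i < k; ++i) {
        while (binom(n - x - 1, k - i - 1) <= rank) {
            rank -= binom(n - x - 1, k - i - 1);
            ++x;
        }
        c[i] = x++;
    }
}

int main(void)
{
    /* Start a "chunk" at rank 7 of the 3-combinations of 5 elements: prints 1 2 4. */
    int c[3];
    unrank_combination(7, 5, 3, c);
    printf("%d %d %d\n", c[0], c[1], c[2]);
    return 0;
}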
____________

L.Gilson
Volunteer developer
Send message
Joined: 25 Mar 20
Posts: 5
Credit: 596,499
RAC: 0
Germany
Message 1781 - Posted: 18 Apr 2020, 9:48:52 UTC - in response to Message 1779.

I was looking for a potential solution to this slow memory access and asked for an algorithm on Stack Overflow. [...] Here is a link to that question, I hope it will be useful for you.
https://stackoverflow.com/questions/46635137/how-to-generate-combinations-in-chunks


Looking at the first answer, I will not implement it either.

Yes, I also used the iterative version as a starting point and just started one thread per link. That backfires at l > 2 because the matrix is already sparse and most threads do nothing (an 80% empty matrix means 80% of the threads are dead on arrival).

I want to run it as a 3D grid, with the first two dimensions scanning the matrix and the last dimension splitting up the combinations to test. Chunking them is actually easy if you allow for some work duplication: the first three members of each combination group are fixed per thread (two from the matrix row and column, one additional fixed by the thread ID). Long-running combinations (the ones that stay true in the matrix) will distribute better across the cores that way, while warps hitting a false entry in the matrix can quit altogether. Matrix compaction, or just writing the still-active combinations into a flat list, is probably a good idea at some point. The more complex schemes detailed in the first SO answer would result in warp divergence; just accepting some work duplication avoids that.
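To make the "flat list" idea concrete, here is a tiny host-side C sketch (hypothetical names, not the actual data structures): collect the (row, column) pairs that are still true in the adjacency matrix into a dense array, and size the GPU launch from that count instead of from n*n, so no work-item is dead on arrival:

#include <stdio.h>

typedef struct { int row, col; } pair_t;

/* Compact the (row, col) pairs still marked true in an n x n adjacency matrix
   into a dense list; the GPU launch size then becomes 'count' instead of n*n. */
static int compact_active(const unsigned char *adj, int n, pair_t *out)
{
    int count = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (adj[r * n + c])
                out[count++] = (pair_t){ r, c };
    return count;
}

int main(void)
{
    /* 3x3 example: only 3 of the 9 links are still active (~67% empty matrix). */
    const unsigned char adj[9] = { 1, 0, 0,
                                   0, 0, 1,
                                   0, 1, 0 };
    pair_t active[9];
    const int n_active = compact_active(adj, 3, active);
    for (int i = 0; i < n_active; ++i)
        printf("active link: (%d, %d)\n", active[i].row, active[i].col);
    return 0;
}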

The second idea relates to p, the mini-rho matrix that is re-calculated for each combination. Applying the Gray code idea and recalculating only the changed parts (when the last element changes, organise the matrix so that this element is also the last one calculated in p) will reduce the number of float ops by a lot. That will be a bit trickier, but it is probably needed to get a viable solution.

The Gray code would also run on the GPU if it weren't recursive. Turning it into an iterative version is the other path to a fast GPU implementation.
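For what it's worth, the binary reflected Gray code itself is easy to walk iteratively; whether the combination Gray code used by the app maps onto this directly is something I still have to check, so treat this as a sketch of the enumeration only:

#include <assert.h>
#include <stdio.h>

/* Index of the lowest set bit (x must be non-zero). */
static unsigned trailing_zeros(unsigned x)
{
    unsigned n = 0;
    while ((x & 1u) == 0u) { x >>= 1; ++n; }
    return n;
}

int main(void)
{
    const unsigned n_bits = 4;
    unsigned prev = 0;                              /* g(0) = 0 */
    for (unsigned i = 1; i < (1u << n_bits); ++i) {
        const unsigned g = i ^ (i >> 1);            /* i-th Gray code word, no recursion */
        const unsigned flipped = trailing_zeros(i); /* the single bit that changed */
        assert(g == (prev ^ (1u << flipped)));      /* exactly one flip per step */
        printf("step %2u: gray = %2u (flipped bit %u)\n", i, g, flipped);
        prev = g;
    }
    return 0;
}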

Profile Bill F
Avatar
Send message
Joined: 3 Apr 17
Posts: 40
Credit: 1,149,046
RAC: 194
United States
Message 1956 - Posted: 19 Sep 2020, 2:15:01 UTC - in response to Message 1766.

Will you create an application for AMD GPUs?


I'm actually on it. [...] I'm now focusing on a new OpenCL (Nvidia and AMD GPU) implementation.



L.Gilson,
your possible OpenCL port was mentioned in the news thread about the AMD COVID-19 HPC Fund.

Bill

