
Performance comparisons: nc mesh vs conf. mesh #18

Draft · wants to merge 1 commit into base: master
Conversation

@artv3 (Collaborator) commented May 23, 2022

Folks, we seem to lose quite a bit of performance when going through the nc-mesh machinery.

This PR will generate the data for the plot below:

[plot: nc-mesh performance]

when using a conforming mesh, we see the following performance:

[screenshot: conforming-mesh performance]

With the hypre branch parcsr_local_trans and

  • a change in ex1

[plot: parcsr_local_trans performance]

@artv3 artv3 marked this pull request as draft May 23, 2022 16:28
@artv3 artv3 changed the title Performance Comparisons: nc mesh vs conf. mesh Performance comparisons: nc mesh vs conf. mesh May 23, 2022
@artv3 artv3 requested review from v-dobrev, camierjs and YohannDudouit and removed request for v-dobrev May 23, 2022 19:43
@artv3 (Collaborator, Author) commented May 23, 2022

Turning on the profiler for the following run:
./ex1 -p 0 -o 2 -l 14 -d cuda

shows that most of the run-time for the non-conforming mesh is spent in sorting algorithms:

[screenshot: nc-mesh profile, dominated by sorting kernels]

While there are no sorting algorithms in the conforming mesh case:

[screenshot: conforming-mesh profile, no sorting kernels]

I have the profiler data if you guys want to look at it with me.

@tzanio (Member) commented May 23, 2022

Do you have the stack trace for stable_sort_by_key_domino_phase1? I am surprised a sort is being called during Matvec...

@v-dobrev (Member):

I see that HYPRE creates and destroys a cuSparse matrix every time the matvec function is called -- maybe that's where these sorts come from?

@v-dobrev (Member):

Or maybe when we do MultTranspose, the transpose is created and destroyed every time?

@YohannDudouit:

> I see that HYPRE creates and destroys a cuSparse matrix every time the matvec function is called -- maybe that's where these sorts come from?

That would explain why I see:

8.82326s  8.6400us            (533 1 1)       (256 1 1)        38  8.5000KB        0B         -           -           -           -  Tesla V100-SXM2         1        17  void stable_sort_by_key_local_core<int=256, int=4>(int, int, int*, int*, int*, int*) [15956]
8.82327s  1.4080us                    -               -         -         -         -      128B  86.698MB/s      Device           -  Tesla V100-SXM2         1        17  [CUDA memset]
8.82327s  1.4080us                    -               -         -         -         -        4B  2.7093MB/s      Device           -  Tesla V100-SXM2         1        17  [CUDA memset]
8.82328s  4.4160us              (2 1 1)       (256 1 1)        22      256B        0B         -           -           -           -  Tesla V100-SXM2         1        17  void cusparseIinclusive_localscan_core<int=256, int=4>(int, int*, int*, int*) [15961]
8.82329s  4.7360us              (1 1 1)       (256 1 1)        24      264B        0B         -           -           -           -  Tesla V100-SXM2         1        17  void cusparseIinclusive_scan_domino_v1_core<int=256, int=4>(int, int*, int*, int*, int*, int*) [15963]
8.82330s  4.3840us              (2 1 1)       (256 1 1)        16        4B        0B         -           -           -           -  Tesla V100-SXM2         1        17  void cusparseIinclusive_scan_merge_core<int=256, int=4>(int, int, int*, int*, int*) [15965]
8.82331s  12.832us            (533 1 1)       (256 1 1)        28       16B        0B         -           -           -           -  Tesla V100-SXM2         1        17  void stable_sort_by_key_merge_core<int=256, int=4>(int, int*, int*, int*, int*, int*, int*) [15967]
8.82332s  5.6320us            (533 1 1)       (256 1 1)        24        0B        0B         -           -           -           -  Tesla V100-SXM2         1        17  void stable_sort_by_key_stop_core<int=256, int=4>(int, int*, int*) [15969]

being called over and over.

@v-dobrev (Member):

I'm not sure the cuSparse constructor (with external data) or action need to do sorting. The transpose construction on GPU, on the other hand, probably needs some sorting.

@liruipeng:

Do you have $A^T x$ anywhere?

@v-dobrev (Member) commented May 23, 2022

Yes, we call hypre_ParCSRMatrixMatvecT. Does that create and destroy the transpose each time when running on GPU?

@liruipeng:

Yes. You can avoid that with HYPRE_BoomerAMGSetKeepTranspose, but it only makes sense in the context of AMG. I need to make a branch to change this.

@tzanio (Member) commented May 23, 2022

If the transpose will be constructed either way, isn't it better to create and store the $P^T$ and then just multiply with it?

@liruipeng commented May 23, 2022

> If the transpose will be constructed either way, isn't it better to create and store the $P^T$ and then just multiply with it?

If it's worth paying the cost of constructing and storing the global $P^T$, yes.

@tzanio (Member) commented May 23, 2022

What's the difference in storage between that and the HYPRE_BoomerAMGSetKeepTranspose approach?

@liruipeng:

OK, I take part of it back. Storage-wise, local and global $P^T$ are more or less the same, I think (except for some parcsr info other than the matrix itself). But constructing $P^T$ globally is more expensive, since it requires communication.

@tzanio (Member) commented May 26, 2022

> Yes. You can avoid that with HYPRE_BoomerAMGSetKeepTranspose, but it only makes sense in the context of AMG. I need to make a branch to change this.

@liruipeng, let us know when you have this ready for testing

CUDA_MAKE_OPTS=(
"--with-cuda"
"--with-gpu-arch=70"
"--enable-device-memory-pool"
@tomstitt (Collaborator) commented May 26, 2022

Do we know if this is as performant as using Umpire? Also, does this disable the use of unified memory?


No. We suggest using Umpire if you can. We will probably remove our device memory pool implementation in the near future, since even CUDA now has an "official" memory pool.

@artv3 (Collaborator, Author):

The default for hypre is no unified memory (ref. docs).
--enable-unified-memory Use unified memory for allocating the memory
(default is NO).

If I recall correctly, Umpire does improve the performance a bit, but I am not sure by what factor compared to the internal memory pool framework.

@artv3 (Collaborator, Author):

@liruipeng do you know how much faster it is to use umpire?

@liruipeng:

> @liruipeng do you know how much faster it is to use umpire?

Hard to give a number. It highly depends on the application.

@liruipeng:

> Yes. You can avoid that with HYPRE_BoomerAMGSetKeepTranspose, but it only makes sense in the context of AMG. I need to make a branch to change this.
>
> @liruipeng, let us know when you have this ready for testing

@tzanio @v-dobrev
Here, you only do matrix-transpose-vector products, without matrix-transpose-matrix multiplication. Is that correct?

@tzanio (Member) commented May 27, 2022

> Here, you only do matrix-transpose-vector products, without matrix-transpose-matrix multiplication. Is that correct?

Yes, overall we evaluate the action of $P^T A P$ by

  • hypre MatVec with $P$
  • mfem matrix-free partial assembly action of $A$
  • hypre MatVecT with $P$

The discussion here is about the last step.

@liruipeng:

@artv3 @tzanio I created a new branch https://github.com/hypre-space/hypre/tree/parcsr_local_trans

The way I imagine it working is that you call hypre_ParCSRMatrixLocalTranspose on $P$, which generates the local transposes. In subsequent matvecT calls, the transposes should then be reused without being recreated. Of course, you can also do the global transposition with hypre_ParCSRMatrixTranspose, which I expect to be less efficient.

Give it a try and let me know. Thanks!
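A rough pseudocode sketch of how this would fit into the pipeline discussed above (the call shapes are assumptions based on the discussion; check the parcsr_local_trans branch for the actual signatures):

```
// Pseudocode sketch -- call shapes are assumptions, not the branch's exact API.
// Once, after P is assembled:
//   hypre_ParCSRMatrixLocalTranspose(P)   // build and cache local transposes of P
// Then, on every application of the operator:
//   z = P x      // hypre_ParCSRMatrixMatvec
//   w = A z      // mfem matrix-free partial assembly action
//   y = P^T w    // hypre_ParCSRMatrixMatvecT, reusing the cached transposes
```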

@tzanio (Member) commented May 27, 2022

Thanks @liruipeng !

@artv3 (Collaborator, Author) commented May 28, 2022

> @artv3 @tzanio I created a new branch https://github.com/hypre-space/hypre/tree/parcsr_local_trans
>
> The way I imagine it working is that you call hypre_ParCSRMatrixLocalTranspose on P, which generates the local transposes. In subsequent matvecT calls, the transposes should then be reused without being recreated. Of course, you can also do the global transposition with hypre_ParCSRMatrixTranspose, which I expect to be less efficient.
>
> Give it a try and let me know. Thanks!

Improved performance! It is looking much better.

6 participants