From 094331847dafbb8ce1c13f4737d6baaea7093c1d Mon Sep 17 00:00:00 2001 From: Meike Chabowski Date: Tue, 15 Aug 2023 18:12:54 +0200 Subject: [PATCH] Implemented doc review enhancements Did semi-automated stylecheck, spellcheck, linkcheck. Implemented changes according to SUSE documentation Style Guide. Fixed typos, shortened sentences for better readability, adjusted format, changed wording to be more precise, fixed punctuation. --- xml/MAIN-SBP-GCC-12.xml | 114 ++++++++++++++++++++-------------------- 1 file changed, 57 insertions(+), 57 deletions(-) diff --git a/xml/MAIN-SBP-GCC-12.xml b/xml/MAIN-SBP-GCC-12.xml index 148158c7..cccbff3b 100644 --- a/xml/MAIN-SBP-GCC-12.xml +++ b/xml/MAIN-SBP-GCC-12.xml @@ -29,7 +29,7 @@ SUSE Best Practices - Performance + System Tuning and Performance SUSE Linux Enterprise Server 15 SP4 Development Tools Module @@ -140,8 +140,8 @@ took place in May 2022. Later that month, the entire openSUSE Tumbleweed Linux distribution was rebuilt with it and shipped to users. GCC 12.2, with fixes to over 71 bugs, was released in August of the same year. Subsequently, it has replaced the compiler in the SUSE Linux - Enterprise (SLE) Development Tools Module. GCC 12.3 followed in May 2023 and apart from further - bug fixes also introduced support for Zen 4 based CPUs. GCC 12 comes with many new features, such as + Enterprise (SLE) Development Tools Module. GCC 12.3 followed in May 2023. Apart from further + bug fixes, it also introduced support for Zen 4 based CPUs. GCC 12 comes with many new features, such as implementing parts of the most recent versions of specifications of various languages (especially C2X, C++20, C++23) and their extensions (OpenMP, OpenACC), supporting new capabilities of a wide range of computer architectures and @@ -149,11 +149,11 @@ This document gives an overview of GCC 12. It focuses on selecting appropriate optimization options for your application and stresses the benefits of advanced modes of compilation. 
First, - we describe the optimization levels the compiler offers and other important options developers + we describe the optimization levels the compiler offers, and other important options developers often use. We explain when and how you can benefit from using Link Time Optimization (LTO) and Profile Guided Optimization - (PGO) builds. We also detail their effects when building a set of well known CPU - intensive benchmarks, and we look at how these perform on AMD Zen 4 based EPYC 9004 Series + (PGO) builds. We also detail their effects when building a set of well-known CPU + intensive benchmarks. Finally, we look at how these perform on AMD Zen 4 based EPYC 9004 Series Processors. @@ -217,8 +217,8 @@ warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Code using <literal>C++17</literal> features Code using C++17 features should always be compiled with the compiler from the Development Tools Module. Linking two objects, such as an application and a shared - library, which both use C++17, one was built with g++ 8 - or earlier and the other with g++ 9 or later is particularly dangerous + library, which both use C++17, where one was built with g++ 8 + or earlier and the other with g++ 9 or later, is particularly dangerous. This is because C++ STL objects instantiated by the experimental code may provide implementation and even ABI that is different from what the mature implementation expects and vice versa. Issues caused by such a mismatch are difficult to predict and may include silent @@ -237,7 +237,7 @@ warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Proposal P0912R5 are also implemented but require that the source file is compiled with the -⁠fcoroutines switch. GCC 12 also experimentally implements many - C++23 features, if you are interested in the implementation + C++23 features. 
If you are interested in the implementation status of any particular C++ feature in the compiler or the standard library, consult the following pages: @@ -377,8 +377,8 @@ warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. Installing GCC 12 from the Development Tools Module Similar to other modules and extensions for SUSE Linux Enterprise Server 15, you can - activate the Development Tools Module either using the command line tool - SUSEConnect or using the YaST setup and configuration + activate the Development Tools Module using either the command line tool + SUSEConnect or the YaST setup and configuration tool. To use the former, carry out the following steps: @@ -515,7 +515,7 @@ S | Name | Summary -⁠O3 imply anything about the precision and semantics of floating-point operations. Even at the optimization level -⁠O3 GCC implements math operations and functions so that they follow the respective IEEE and/or ISO - rulesWhen when the rounding mode is set to the default round-to-nearest (look up + rules When the rounding mode is set to the default round-to-nearest (look up -⁠frounding-⁠math in the manual). with the exception of allowing floating-point expression contraction, for example when fusing an addition and a multiplication into one operationSee @@ -531,13 +531,13 @@ S | Name | Summary imply -⁠ffast-math along with a few options that disregard strict standard compliance. In GCC 12 this level also means the optimizers may introduce data races when moving memory stores which may not be safe for multithreaded applications and disregards the - possibility of ELF symbol interposition happening at run-time. Additionally, the + possibility of ELF symbol interposition happening at runtime. Additionally, the Fortran compiler can take advantage of associativity of math operations even across parentheses and convert big memory allocations on the heap to allocations on stack. 
The last mentioned transformation may cause the code to violate the maximum stack size allowed by ulimit, which is then reported to the user as a segmentation fault. We often use level -⁠Ofast to build benchmarks. It is a shorthand for the - options on top of -⁠O3 which often make them run faster and most + options on top of -⁠O3 which often make them run faster. Most benchmarks are intentionally written in a way that they run correctly even when these rules are relaxed. @@ -581,7 +581,7 @@ S | Name | Summary therefore be a challenging task but usually is still somewhat possible. The complete list of optimization and other command line switches is available in the - compiler manual, provided in the info format in the package gcc12-info or + compiler manual. The manual is provided in the info format in the package gcc12-info or online at the GCC project Web site. @@ -600,8 +600,8 @@ S | Name | Summary aware that its release optimization level defaults to -⁠O3 which might not be what you want. To change it, you must modify the CMAKE_C_FLAGS_RELEASE, CMAKE_CXX_FLAGS_RELEASE - and/or CMAKE_Fortran_FLAGS_RELEASE variables, since they are appended at the - end of the compilation command lines, thus overwriting any level set in the variables + and/or CMAKE_Fortran_FLAGS_RELEASE variables. Since they are appended at the + end of the compilation command lines, they overwrite any level set in the variables CMAKE_C_FLAGS, CMAKE_CXX_FLAGS, and the like. @@ -621,7 +621,7 @@ S | Name | Summary instruction set extensions, you can specify it on the command line. Their complete list is available in the manual, but the most prominent one is -⁠march which lets you select a CPU model to generate code for.
For example, if you know that your program will - only be executed on AMD EPYC 9004 Series Processors which is based on AMD Zen 4 cores or + only be executed on AMD EPYC 9004 Series Processors based on AMD Zen 4 cores or processors that are compatible with it, you can instruct GCC to take advantage of all the instructions the CPU supports with option -⁠march=znver4. Note that on SUSE Linux Enterprise Server 15, the system compiler does not know this particular value of @@ -640,9 +640,9 @@ S | Name | Summary Running 32-bit code SUSE Linux Enterprise Server does not support compilation of 32-bit applications; it - only offers runtime support for 32-bit binaries. In order to do so, you will need 32-bit + only offers runtime support for 32-bit binaries. To run them, you will need the 32-bit libraries your binary depends on, which likely include at least glibc, which can be found in - package glibc-32bit. See glibc-32bit. See chapter 20 (32-bit and 64-bit applications in a 64-bit system environment) of the Administration Guide for more information. @@ -656,8 +656,8 @@ S | Name | Summary outlines the classic mode of operation of a compiler and a linker. Pieces of a program are compiled and optimized in chunks - defined by the user called compilation units to produce so-called object files which already - contain binary machine instructions and which are combined together by a linker. Because the + defined by the user called compilation units to produce so-called object files. These object files already + contain binary machine instructions and are combined together by a linker. Because the linker works at such a low level, it cannot perform much optimization and the division of the program into compilation units thus presents a profound barrier to optimization.
@@ -676,7 +676,7 @@ S | Name | Summary This limitation can be overcome by rearranging the process so that the linker does not receive as its input the almost finished object files containing machine instructions, but is invoked on files containing so-called intermediate language - (IL) which is a much richer representation of each original compilation unit (see figure ). The linker identifies the input as not yet entirely compiled and invokes a linker plugin which in turn runs the compiler again. But this time it has at its disposal the representation of the entire program or library that is @@ -774,7 +774,7 @@ S | Name | Summary the assembler snippets defining symbols must be placed into a separate assembler source file so that they only participate in the final linking step. Global register variables are not supported by LTO, and programs either must not use this feature or be built the - traditional way. It is also possible to exclude just some compilation units from LTO (simply by + traditional way. It is also possible to exclude some compilation units from LTO (simply by compiling them without -⁠flto or appending -⁠fno-⁠lto to the compilation command line), while the rest of the program can still benefit from using this feature. @@ -799,7 +799,7 @@ int foo_v1 (void) themselves otherwise. Violations of (strict) aliasing rules and the C++ one definition rule tend to cause misbehavior significantly - more often; the latter is fortunately reported by the -Wodr warning which is + more often. The latter is fortunately reported by the -Wodr warning which is on by default and should not be ignored. We have also seen cases where the use of the flatten function attribute led to an unsustainable amount of inlining with LTO.
Furthermore, LTO is not a good fit for code snippets compiled by configure @@ -832,7 +832,7 @@ int foo_v1 (void) then execute the resulting binary in one or multiple train runs during which it will save information about the behavior of the program to special files. Afterward, the project needs to be rebuilt again, this time with the - -⁠fprofile-use option which instructs the compiler to look for the + -⁠fprofile-use option. This instructs the compiler to look for the files with the measurements and use them when making optimization decisions, a process called Profile-Guided Optimization (PGO). @@ -870,8 +870,8 @@ int foo_v1 (void) non-profit corporation that publishes a variety of industry standard benchmarks to evaluate performance and other characteristics of computer systems. Its latest suite of CPU intensive workloads, SPEC CPU 2017, is often used to compare compilers and how well they optimize code with - different settings because the included benchmarks are well known and represent a wide variety of - computation-heavy programs. This section highlights selected results of a GCC 12 evaluation using + different settings. This is because the included benchmarks are well known and represent a wide variety of + computation-heavy programs. The following section highlights selected results of a GCC 12 evaluation using the suite. Note that when we use SPEC to perform compiler comparisons, we are lenient toward some @@ -953,7 +953,7 @@ int foo_v1 (void) faster than when compiled with GCC 11 and the same optimization level. Nevertheless, it still benefits from the more advanced modes of compilation a lot, together with several other benchmarks which are derived from programs that are typically compiled with - -⁠O2, as can be seen in -⁠O2. This is illustrated in .
@@ -975,7 +975,7 @@ int foo_v1 (void) (measured without debug info). Note that it does not depict that the size of benchmark 548.exchange2_r grew to 290% and 200% of the original size when built with PGO or both PGO and LTO respectively, which looks huge but the growth is from a particularly - small base. It is the only Fortran benchmark in the integer suite and, most importantly, the + small base. It is the only Fortran benchmark in the integer suite and, most importantly, the size penalty is offset by significant speed-up, making the trade-off reasonable. For completeness, we show this result in @@ -1062,7 +1062,7 @@ int foo_v1 (void)
Many of the SPEC 2017 floating-point benchmarks measure how well a given system can - optimize and execute a handful of number crunching loops and they often come from performance + optimize and execute a handful of number-crunching loops. They often come from performance-sensitive programs written with the traditional compilation method in mind. Consequently, there are fewer cross-module dependencies, identifying hot paths is less crucial and the overall effect of LTO and PGO suite only improves by 5% (see - Floating-point computations tend to particularly benefit from vectorization advancements - and so it should be no surprise that the FPrate benchmarks also improve substantially when + Floating-point computations tend to particularly benefit from vectorization advancements. + Thus it should be no surprise that the FPrate benchmarks also improve substantially when compiled with GCC 12.3, which also emits AVX512 instructions for a Zen 4 based CPU. The overall boost is shown in whereas @@ -1221,11 +1221,11 @@ int foo_v1 (void) We have built the benchmarking suite using optimization level -⁠O3, LTO (though without PGO) and - -⁠march=native to target the native ISA of our AMD EPYC 9654 Processor - and we compared its runtime score against the suite built with these options and + -⁠march=native to target the native ISA of our AMD EPYC 9654 Processor. + Then we compared its runtime score against the suite built with these options and -⁠ffast-math. As you can see in , the geometric - mean grew by over 13%, but a quick look at will tell you that there are four benchmarks with scores which improved by more than 20% and that of 510.parest_r grew by over 76%. @@ -1333,14 +1333,14 @@ int foo_v1 (void) shows relative rates of integer benchmarks written in C/C++ and the compilers perform fairly similarly there. GCC wins by a large margin on 500.perlbench_r but loses - significantly when compiling 525.x264_r. This is because the compiler + significantly when compiling 525.x264_r.
This is because the compiler chooses a vectorizing factor that is too large for the important loops in this video encoder. It is possible to mitigate the problem using compiler option -⁠mprefer-⁠vector-⁠width=128, with which it is again competitive, as you can see in . This problem is being actively worked on by the upstream - GCC community and we plan to use masked vectorized epilogues to minimize the fallout of - choosing a large vectorizing factor for the principal vector loop. Note that PGO can + GCC community. We plan to use masked vectorized epilogues to minimize the fallout of + choosing a large vectorizing factor for the principal vector loop. Note that PGO can substantially help in this case too. @@ -1361,14 +1361,14 @@ int foo_v1 (void) compiled with LLVM with LTO, we have excluded the benchmark in our comparison of geometric mean of SPEC FPrate 2017 suite depicted in . The floating point benchmark suite contains many more Fortran - benchmarks and it can be seen that GCC has advantage in having a mature optimization pipeline + benchmarks. It can be seen that GCC has advantage in having a mature optimization pipeline for this language as well, especially when compiling 503.bwaves_r, 510.parest_r, 549.fotonik3d_r, 554.roms_r (see ) and the already mentioned 527.cam4_r (see - ). The + ). The comparison also shows that the performance of 538.imagick_r when compiled - with GCC 12.3 is substantially smaller. This is caused by store-to-load + with GCC 12.3 is substantially smaller. This is caused by store-to-load forwarding stall issues, which can be mitigated by relaxing inlining limits, something that GCC 13 does automatically. @@ -1415,19 +1415,19 @@ int foo_v1 (void) Even though ICC is not intended as a compiler for AMD processors, it is known for its - high-level optimization capabilities, especially when it comes to vectorization. Therefore we + high-level optimization capabilities, especially when it comes to vectorization. 
Therefore we have traditionally included it in our comparisons of compilers. Recently, however, Intel has - decided to abandon this compiler and is directing its users towards ICX, a new one built on top - of LLVM. This year we have therefore included not just ICC 2021.9.0 (20230302) but also ICX - 2023.1.0 in our comparison. In order to keep the amount of presented data in the rest of this + decided to abandon this compiler and is directing its users toward ICX, a new one built on top + of LLVM. This year we have therefore included not just ICC 2021.9.0 (20230302) but also ICX + 2023.1.0 in our comparison. To keep the amount of presented data in the rest of this section reasonable, we only compare binaries built with -⁠Ofast and - LTO. We have simply passed -⁠march=native GCC and ICX. On the other + LTO. We have simply passed -⁠march=native to GCC and ICX. On the other hand, we have used the -⁠march=core-avx2 option to specify the target ISA for the old ICC because it is unclear which option is the most appropriate for AMD EPYC 9654 - Processor. This puts this compiler at a disadvantage because it can only emit AVX256 - instructions while the other two can, and GCC does, make use of AVX512. We believe that the + Processor. This puts this compiler at a disadvantage because it can only emit AVX256 + instructions while the other two can, and GCC does, make use of AVX512. We believe that the comparison is still useful as ICC serves mainly as a base and the focus now shifts to ICX but - please keep this in mind when looking at the results below. + keep this in mind when looking at the results below.
Overall performance (bigger is better) of SPEC INTrate 2017 built with ICC 2021.9.0, ICX @@ -1466,7 +1466,7 @@ int foo_v1 (void) </figure> <figure xml:id="fig-gcc12-specint-ofast-vsicc-x264_128"> - <title>Runtime performance (bigger is better) of of 525.x264_r benchmark built with ICC + <title>Runtime performance (bigger is better) of 525.x264_r benchmark built with ICC 2021.9.0, ICX 2023.1.0 and with GCC 12.3 using -mprefer-vector-width=128 @@ -1498,13 +1498,13 @@ int foo_v1 (void)
While GCC achieves the best geometric mean, it is important to look at individual results - too because the overall picture is mixed (see ), each of the three compilers managed to be the fastest in at - least one benchmark. We do not know the reason for rather poor performance of ICX on - 554.roms_r but we have seen a similar issue with the compiler on an Intel + too. The overall picture is mixed (see ), as each of the three compilers managed to be the fastest in at + least one benchmark. We do not know the reason for the rather poor performance of ICX on + 554.roms_r. But we have seen a similar issue with the compiler on an Intel Cascadelake server machine too, so it is not a consequence of using an Intel compiler on an AMD - platform. For completeness, 521.wrf_r results for ICC and ICX are provided - in . In + platform. For completeness, 521.wrf_r results for ICC and ICX are provided + in . In conclusion, GCC manages to perform consistently and competitively against these high-performance compilers.