diff --git a/docs/assets/IRgraph_markstep.png b/docs/_static/img/IRgraph_markstep.png similarity index 100% rename from docs/assets/IRgraph_markstep.png rename to docs/_static/img/IRgraph_markstep.png diff --git a/docs/assets/IRgraph_no_markstep.png b/docs/_static/img/IRgraph_no_markstep.png similarity index 100% rename from docs/assets/IRgraph_no_markstep.png rename to docs/_static/img/IRgraph_no_markstep.png diff --git a/docs/assets/ci_test_dependency.png b/docs/_static/img/ci_test_dependency.png similarity index 100% rename from docs/assets/ci_test_dependency.png rename to docs/_static/img/ci_test_dependency.png diff --git a/docs/assets/ci_test_dependency_gpu.png b/docs/_static/img/ci_test_dependency_gpu.png similarity index 100% rename from docs/assets/ci_test_dependency_gpu.png rename to docs/_static/img/ci_test_dependency_gpu.png diff --git a/docs/assets/ddp_md_mnist_with_real_data.png b/docs/_static/img/ddp_md_mnist_with_real_data.png similarity index 100% rename from docs/assets/ddp_md_mnist_with_real_data.png rename to docs/_static/img/ddp_md_mnist_with_real_data.png diff --git a/docs/assets/dynamic_shape_mlp_perf.png b/docs/_static/img/dynamic_shape_mlp_perf.png similarity index 100% rename from docs/assets/dynamic_shape_mlp_perf.png rename to docs/_static/img/dynamic_shape_mlp_perf.png diff --git a/docs/assets/gpt2_2b_step_time_vs_batch.png b/docs/_static/img/gpt2_2b_step_time_vs_batch.png similarity index 100% rename from docs/assets/gpt2_2b_step_time_vs_batch.png rename to docs/_static/img/gpt2_2b_step_time_vs_batch.png diff --git a/docs/assets/gpt2_v4_8_mfu_batch.png b/docs/_static/img/gpt2_v4_8_mfu_batch.png similarity index 100% rename from docs/assets/gpt2_v4_8_mfu_batch.png rename to docs/_static/img/gpt2_v4_8_mfu_batch.png diff --git a/docs/assets/image-1.png b/docs/_static/img/image-1.png similarity index 100% rename from docs/assets/image-1.png rename to docs/_static/img/image-1.png diff --git a/docs/assets/image-2.png b/docs/_static/img/image-2.png similarity index 100% rename from docs/assets/image-2.png rename to docs/_static/img/image-2.png diff --git a/docs/assets/image-3.png b/docs/_static/img/image-3.png similarity index 100% rename from docs/assets/image-3.png rename to docs/_static/img/image-3.png diff --git a/docs/assets/image-4.png b/docs/_static/img/image-4.png similarity index 100% rename from docs/assets/image-4.png rename to docs/_static/img/image-4.png diff --git a/docs/assets/image.png b/docs/_static/img/image.png similarity index 100% rename from docs/assets/image.png rename to docs/_static/img/image.png diff --git a/docs/assets/llama2_2b_bsz128.png b/docs/_static/img/llama2_2b_bsz128.png similarity index 100% rename from docs/assets/llama2_2b_bsz128.png rename to docs/_static/img/llama2_2b_bsz128.png diff --git a/docs/assets/mesh_spmd2.png b/docs/_static/img/mesh_spmd2.png similarity index 100% rename from docs/assets/mesh_spmd2.png rename to docs/_static/img/mesh_spmd2.png diff --git a/docs/assets/perf_auto_vs_manual.png b/docs/_static/img/perf_auto_vs_manual.png similarity index 100% rename from docs/assets/perf_auto_vs_manual.png rename to docs/_static/img/perf_auto_vs_manual.png diff --git a/docs/_static/img/pytorch-logo-dark.svg b/docs/_static/img/pytorch-logo-dark.svg new file mode 100644 index 00000000000..717a3ce942f --- /dev/null +++ b/docs/_static/img/pytorch-logo-dark.svg @@ -0,0 +1,24 @@ + + + + + + + + + + + + + diff --git a/docs/assets/pytorchXLA_flow.svg b/docs/_static/img/pytorchXLA_flow.svg similarity index 100% rename from 
docs/assets/pytorchXLA_flow.svg rename to docs/_static/img/pytorchXLA_flow.svg diff --git a/docs/assets/spmd_debug_1.png b/docs/_static/img/spmd_debug_1.png similarity index 100% rename from docs/assets/spmd_debug_1.png rename to docs/_static/img/spmd_debug_1.png diff --git a/docs/assets/spmd_debug_1_light.png b/docs/_static/img/spmd_debug_1_light.png similarity index 100% rename from docs/assets/spmd_debug_1_light.png rename to docs/_static/img/spmd_debug_1_light.png diff --git a/docs/assets/spmd_debug_2.png b/docs/_static/img/spmd_debug_2.png similarity index 100% rename from docs/assets/spmd_debug_2.png rename to docs/_static/img/spmd_debug_2.png diff --git a/docs/assets/spmd_debug_2_light.png b/docs/_static/img/spmd_debug_2_light.png similarity index 100% rename from docs/assets/spmd_debug_2_light.png rename to docs/_static/img/spmd_debug_2_light.png diff --git a/docs/assets/spmd_mode.png b/docs/_static/img/spmd_mode.png similarity index 100% rename from docs/assets/spmd_mode.png rename to docs/_static/img/spmd_mode.png diff --git a/docs/assets/torchbench_pjrt_vs_xrt.svg b/docs/_static/img/torchbench_pjrt_vs_xrt.svg similarity index 100% rename from docs/assets/torchbench_pjrt_vs_xrt.svg rename to docs/_static/img/torchbench_pjrt_vs_xrt.svg diff --git a/docs/assets/torchbench_tfrt_vs_se.svg b/docs/_static/img/torchbench_tfrt_vs_se.svg similarity index 100% rename from docs/assets/torchbench_tfrt_vs_se.svg rename to docs/_static/img/torchbench_tfrt_vs_se.svg diff --git a/docs/ddp.md b/docs/ddp.md index 4e17fd2edeb..09e1c12f9d5 100644 --- a/docs/ddp.md +++ b/docs/ddp.md @@ -259,7 +259,7 @@ The following results are collected with the command: `python test/test_train_mp_mnist.py --logdir mnist/` on a TPU VM V3-8 environment with ToT PyTorch and PyTorch/XLA. -![learning_curves](assets/ddp_md_mnist_with_real_data.png) +![learning_curves](_static/img/ddp_md_mnist_with_real_data.png) And we can observe that the DDP wrapper converges slower than the native XLA approach even though it still achieves a high accuracy rate at 97.48% at the diff --git a/docs/dynamic_shape.md b/docs/dynamic_shape.md index 04eff4a9984..6804e0e49fb 100644 --- a/docs/dynamic_shape.md +++ b/docs/dynamic_shape.md @@ -36,7 +36,7 @@ Here are some numbers we get when we run the MLP model for 100 iterations: | Number of compilations | 102 | 49 | | Compilation cache hit | 198 | 1953 | -![Performance comparison (a) without dynamic shape (b) with dynamic shape](assets/dynamic_shape_mlp_perf.png) +![Performance comparison (a) without dynamic shape (b) with dynamic shape](_static/img/dynamic_shape_mlp_perf.png) One of the motivations of the dynamic shape is to reduce the number of excessive recompilation when the shape keeps changing between iterations. From the figure above, you can see the number of compilations reduced by half which results in the drop of the training time. diff --git a/docs/first_steps.md b/docs/first_steps.md index 2658d2d8bb8..772b9fc3a9a 100644 --- a/docs/first_steps.md +++ b/docs/first_steps.md @@ -13,7 +13,7 @@ This section provides a brief overview of the basic details of PyTorch XLA, Unlike regular PyTorch, which executes code line by line and does not block execution until the value of a PyTorch tensor is fetched, PyTorch XLA works differently. It iterates through the python code and records the operations on (PyTorch) XLA tensors in an intermediate representation (IR) graph until it encounters a barrier (discussed below). 
This process of generating the IR graph is referred to as tracing (LazyTensor tracing or code tracing). PyTorch XLA then converts the IR graph to a lower-level machine-readable format called HLO (High-Level Opcodes). HLO is a representation of a computation that is specific to the XLA compiler and allows it to generate efficient code for the hardware that it is running on. HLO is fed to the XLA compiler for compilation and optimization. Compilation is then cached by PyTorch XLA to be reused later if/when needed. The compilation of the graph is done on the host (CPU), which is the machine that runs the Python code. If there are multiple XLA devices, the host compiles the code for each of the devices separately except when using SPMD (single-program, multiple-data). For example, v4-8 has one host machine and [four devices](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_v4). In this case the host compiles the code for each of the four devices separately. In case of pod slices, when there are multiple hosts, each host does the compilation for XLA devices it is attached to. If SPMD is used, then the code is compiled only once (for given shapes and computations) on each host for all the devices. -![img](assets/pytorchXLA_flow.svg) +![img](_static/img/pytorchXLA_flow.svg) For more details and examples, please refer to the [LazyTensor guide](https://pytorch.org/blog/understanding-lazytensor-system-performance-with-pytorch-xla-on-cloud-tpu/). @@ -32,7 +32,7 @@ for x, y in tensors_on_device: Without a barrier, the Python tracing will result in a single graph that wraps the addition of tensors `len(tensors_on_device)` times. This is because the `for` loop is not captured by the tracing, so each iteration of the loop will create a new subgraph corresponding to the computation of `z += x+y` and add it to the graph. Here is an example when `len(tensors_on_device)=3`. -![img](assets/IRgraph_no_markstep.png) +![img](_static/img/IRgraph_no_markstep.png) However, introducing a barrier at the end of the loop will result in a smaller graph that will be compiled once during the first pass inside the `for` loop and will be reused for the next `len(tensors_on_device)-1 ` iterations. The barrier will signal to the tracing that the graph traced so far can be submitted for execution, and if that graph has been seen before, a cached compiled program will be reused. @@ -44,7 +44,7 @@ for x, y in tensors_on_device: In this case there will be a small graph that is used `len(tensors_on_device)=3` times. -![img](assets/IRgraph_markstep.png) +![img](_static/img/IRgraph_markstep.png) It is important to highlight that in PyTorch XLA Python code inside for loops is traced and a new graph is constructed for each iteration if there is a barrier at the end. This can be a significant performance bottleneck. @@ -216,27 +216,27 @@ Starting from Stable Diffusion model version 2.1 If we capture a profile without inserting any traces, we will see the following: -![Alt text](assets/image.png) +![Alt text](_static/img/image.png) The single TPU device on v4-8, which has two cores, appears to be busy. There are no significant gaps in their usage, except for a small one in the middle. If we scroll up to try to find which process is occupying the host machine, we will not find any information. 
Therefore, we will add `xp.traces` to the pipeline [file](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py) as well as the U-net [function](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py). The latter may not be useful for this particular use case, but it does demonstrate how traces can be added in different places and how their information is displayed in TensorBoard. If we add traces and re-capture the profile with the largest batch size that can fit on the device (32 in this case), we will see that the gap in the device is caused by a Python process that is running on the host machine. -![Alt text](assets/image-1.png) -![Alt text](assets/image-2.png) +![Alt text](_static/img/image-1.png) +![Alt text](_static/img/image-2.png) We can use the appropriate tool to zoom in on the timeline and see which process is running during that period. This is when the Python code tracing happens on the host, and we cannot improve the tracing further at this point. Now, let's examine the XL version of the model and do the same thing. We will add traces to the pipeline [file](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py) in the same way that we did for the 2.1 version and capture a profile. -![Alt text](assets/image-4.png) +![Alt text](_static/img/image-4.png) This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830). First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is. Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens. -![Alt text](assets/image-3.png) +![Alt text](_static/img/image-3.png) If we investigate the U-Net function and the scheduler, we can see that the U-Net code does not contain any optimization targets for PyTorch/XLA. However, there are `.item()` and `.nonzero()` calls inside the [scheduler.step](https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L371). We can [rewrite](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/schedulers/scheduling_euler_discrete.py#L310) the function to avoid those calls. If we fix this issue and rerun a profile, we will not see much difference. However, since we have reduced the device-host communication that was introducing smaller graphs, we allowed the compiler to optimize the code better. 
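The exact change is in the linked commit; as a rough, hypothetical sketch of the pattern (the `timesteps`/`sigmas` tensors below are stand-ins, not the diffusers objects or API), the idea is to keep the index lookup on the device instead of forcing a host round trip:

```python
import torch

# Hypothetical stand-ins for scheduler.timesteps / scheduler.sigmas; this
# illustrates the pattern only, not the diffusers implementation itself.
timesteps = torch.arange(1000, 0, -50)
sigmas = torch.rand(timesteps.shape[0])

def lookup_sigma_with_sync(timestep):
    # .nonzero() and .item() materialize values on the host, which cuts the
    # traced graph and shows up as TransferFromDevice in the profile.
    step_index = (timesteps == timestep).nonzero().item()
    return sigmas[step_index]

def lookup_sigma_lazy(timestep):
    # Keeping the index as a tensor lets the lookup stay inside the XLA graph.
    step_index = torch.argmax((timesteps == timestep).int())
    return sigmas[step_index]
```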
The function [scale_model_input](https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L205) has similar issues, and we can fix these by making the changes we made above to the `step` function. Overall, since many of the gaps are caused from python level code tracing and graph building, these gaps are not possible to optimize with the current version of PyTorch XLA, but we may see improvements in the future when dynamo is enabled in PyTorch XLA. diff --git a/docs/pjrt.md b/docs/pjrt.md index 7262339a5b8..c09e42e5de5 100644 --- a/docs/pjrt.md +++ b/docs/pjrt.md @@ -404,7 +404,7 @@ compared to XRT, with an average improvement of over 35% on TPU v4-8. The benefits vary significantly by task and model type, ranging from 0% to 175%. The following chart shows the breakdown by task: -![PJRT vs XRT](assets/torchbench_pjrt_vs_xrt.svg) +![PJRT vs XRT](_static/img/torchbench_pjrt_vs_xrt.svg) ### New TPU runtime @@ -423,7 +423,7 @@ In most cases, we expect performance to be similar between the two runtimes, but in some cases, the new runtime may be up to 30% faster. The following chart shows the breakdown by task: -![TFRT vs StreamExecutor](assets/torchbench_tfrt_vs_se.svg) +![TFRT vs StreamExecutor](_static/img/torchbench_tfrt_vs_se.svg) Note: the improvements shown in this chart are also included in the PJRT vs XRT comparison. diff --git a/docs/pytorch_xla_overview.md b/docs/pytorch_xla_overview.md index da087098cae..afc36f5693d 100644 --- a/docs/pytorch_xla_overview.md +++ b/docs/pytorch_xla_overview.md @@ -15,7 +15,7 @@ This section provides a brief overview of the basic details of PyTorch XLA, Unlike regular PyTorch, which executes code line by line and does not block execution until the value of a PyTorch tensor is fetched, PyTorch XLA works differently. It iterates through the python code and records the operations on (PyTorch) XLA tensors in an intermediate representation (IR) graph until it encounters a barrier (discussed below). This process of generating the IR graph is referred to as tracing (LazyTensor tracing or code tracing). PyTorch XLA then converts the IR graph to a lower-level machine-readable format called HLO (High-Level Opcodes). HLO is a representation of a computation that is specific to the XLA compiler and allows it to generate efficient code for the hardware that it is running on. HLO is fed to the XLA compiler for compilation and optimization. Compilation is then cached by PyTorch XLA to be reused later if/when needed. The compilation of the graph is done on the host (CPU), which is the machine that runs the Python code. If there are multiple XLA devices, the host compiles the code for each of the devices separately except when using SPMD (single-program, multiple-data). For example, v4-8 has one host machine and [four devices](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu_v4). In this case the host compiles the code for each of the four devices separately. In case of pod slices, when there are multiple hosts, each host does the compilation for XLA devices it is attached to. If SPMD is used, then the code is compiled only once (for given shapes and computations) on each host for all the devices. -![img](assets/pytorchXLA_flow.svg) +![img](_static/img/pytorchXLA_flow.svg) For more details and examples, please refer to the [LazyTensor guide](https://pytorch.org/blog/understanding-lazytensor-system-performance-with-pytorch-xla-on-cloud-tpu/). 
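As a minimal sketch of this lazy-tracing behavior (assuming a single XLA device is available; the tensor shapes are arbitrary):

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

# Operations on XLA tensors are only recorded into the IR graph here;
# nothing is compiled or executed yet.
a = torch.randn(2, 2, device=device)
b = torch.randn(2, 2, device=device)
c = a @ b + a

# The barrier cuts the graph: the recorded IR is lowered to HLO, compiled
# (or fetched from the compilation cache), and executed on the device.
xm.mark_step()

# Reading the value (e.g. printing) would also force execution.
print(c)
```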
@@ -34,7 +34,7 @@ for x, y in tensors_on_device: Without a barrier, the Python tracing will result in a single graph that wraps the addition of tensors `len(tensors_on_device)` times. This is because the `for` loop is not captured by the tracing, so each iteration of the loop will create a new subgraph corresponding to the computation of `z += x+y` and add it to the graph. Here is an example when `len(tensors_on_device)=3`. -![img](assets/IRgraph_no_markstep.png) +![img](_static/img/IRgraph_no_markstep.png) However, introducing a barrier at the end of the loop will result in a smaller graph that will be compiled once during the first pass inside the `for` loop and will be reused for the next `len(tensors_on_device)-1 ` iterations. The barrier will signal to the tracing that the graph traced so far can be submitted for execution, and if that graph has been seen before, a cached compiled program will be reused. @@ -46,7 +46,7 @@ for x, y in tensors_on_device: In this case there will be a small graph that is used `len(tensors_on_device)=3` times. -![img](assets/IRgraph_markstep.png) +![img](_static/img/IRgraph_markstep.png) It is important to highlight that in PyTorch XLA Python code inside for loops is traced and a new graph is constructed for each iteration if there is a barrier at the end. This can be a significant performance bottleneck. @@ -218,27 +218,27 @@ Starting from Stable Diffusion model version 2.1 If we capture a profile without inserting any traces, we will see the following: -![Alt text](assets/image.png) +![Alt text](_static/img/image.png) The single TPU device on v4-8, which has two cores, appears to be busy. There are no significant gaps in their usage, except for a small one in the middle. If we scroll up to try to find which process is occupying the host machine, we will not find any information. Therefore, we will add `xp.traces` to the pipeline [file](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py) as well as the U-net [function](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/models/unet_2d_condition.py). The latter may not be useful for this particular use case, but it does demonstrate how traces can be added in different places and how their information is displayed in TensorBoard. If we add traces and re-capture the profile with the largest batch size that can fit on the device (32 in this case), we will see that the gap in the device is caused by a Python process that is running on the host machine. -![Alt text](assets/image-1.png) -![Alt text](assets/image-2.png) +![Alt text](_static/img/image-1.png) +![Alt text](_static/img/image-2.png) We can use the appropriate tool to zoom in on the timeline and see which process is running during that period. This is when the Python code tracing happens on the host, and we cannot improve the tracing further at this point. Now, let's examine the XL version of the model and do the same thing. We will add traces to the pipeline [file](https://github.com/pytorch-tpu/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py) in the same way that we did for the 2.1 version and capture a profile. 
-![Alt text](assets/image-4.png) +![Alt text](_static/img/image-4.png) This time, in addition to the large gap in the middle, which is caused by the `pipe_watermark` tracing, there are many small gaps between the inference steps within [this loop](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py#L814-L830). First look closer into the large gap that is caused by `pipe_watermark`. The gap is preceded with `TransferFromDevice` which indicates that something is happening on the host machine that is waiting for computation to finish before proceeding. Looking into watermark [code](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/pipelines/stable_diffusion_xl/watermark.py#L29), we can see that tensors are transferred to cpu and converted to numpy arrays in order to be processed with `cv2` and `pywt` libraries later. Since this part is not straightforward to optimize, we will leave this as is. Now if we zoom in on the loop, we can see that the graph within the loop is broken into smaller parts because the `TransferFromDevice` operation happens. -![Alt text](assets/image-3.png) +![Alt text](_static/img/image-3.png) If we investigate the U-Net function and the scheduler, we can see that the U-Net code does not contain any optimization targets for PyTorch/XLA. However, there are `.item()` and `.nonzero()` calls inside the [scheduler.step](https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L371). We can [rewrite](https://github.com/pytorch-tpu/diffusers/blob/0243d2ef9c2c7bc06956bb1bcc92c23038f6519d/src/diffusers/schedulers/scheduling_euler_discrete.py#L310) the function to avoid those calls. If we fix this issue and rerun a profile, we will not see much difference. However, since we have reduced the device-host communication that was introducing smaller graphs, we allowed the compiler to optimize the code better. The function [scale_model_input](https://github.com/huggingface/diffusers/blob/15782fd506e8c4a7c2b288fc2e558bd77fdfa51a/src/diffusers/schedulers/scheduling_euler_discrete.py#L205) has similar issues, and we can fix these by making the changes we made above to the `step` function. Overall, since many of the gaps are caused from python level code tracing and graph building, these gaps are not possible to optimize with the current version of PyTorch XLA, but we may see improvements in the future when dynamo is enabled in PyTorch XLA. 
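As a hedged sketch of how such trace annotations are typically added with the PyTorch/XLA profiler (the `run_inference_step` function, its arguments, and port 9012 are placeholders rather than the actual pipeline API):

```python
import torch_xla.debug.profiler as xp

# Start the profiler server once, early in the script; a profile can then be
# captured from TensorBoard (or programmatically) while the model runs.
server = xp.start_server(9012)

def run_inference_step(model, scheduler, latents, t):
    # Placeholder callables: each xp.Trace region appears as a named span on
    # the host timeline, making it easy to attribute gaps between device ops.
    with xp.Trace('unet'):
        noise_pred = model(latents, t)
    with xp.Trace('scheduler_step'):
        latents = scheduler(noise_pred, t, latents)
    return latents
```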
diff --git a/docs/source/_static/img/IRgraph_markstep.png b/docs/source/_static/img/IRgraph_markstep.png new file mode 100644 index 00000000000..2a9ad5ce54f Binary files /dev/null and b/docs/source/_static/img/IRgraph_markstep.png differ diff --git a/docs/source/_static/img/IRgraph_no_markstep.png b/docs/source/_static/img/IRgraph_no_markstep.png new file mode 100644 index 00000000000..282d3907104 Binary files /dev/null and b/docs/source/_static/img/IRgraph_no_markstep.png differ diff --git a/docs/source/_static/img/ci_test_dependency.png b/docs/source/_static/img/ci_test_dependency.png new file mode 100644 index 00000000000..e4b2c397ba0 Binary files /dev/null and b/docs/source/_static/img/ci_test_dependency.png differ diff --git a/docs/source/_static/img/ci_test_dependency_gpu.png b/docs/source/_static/img/ci_test_dependency_gpu.png new file mode 100644 index 00000000000..68cd77ec90c Binary files /dev/null and b/docs/source/_static/img/ci_test_dependency_gpu.png differ diff --git a/docs/source/_static/img/ddp_md_mnist_with_real_data.png b/docs/source/_static/img/ddp_md_mnist_with_real_data.png new file mode 100644 index 00000000000..f83c5182be6 Binary files /dev/null and b/docs/source/_static/img/ddp_md_mnist_with_real_data.png differ diff --git a/docs/source/_static/img/dynamic_shape_mlp_perf.png b/docs/source/_static/img/dynamic_shape_mlp_perf.png new file mode 100644 index 00000000000..109008991f1 Binary files /dev/null and b/docs/source/_static/img/dynamic_shape_mlp_perf.png differ diff --git a/docs/source/_static/img/gpt2_2b_step_time_vs_batch.png b/docs/source/_static/img/gpt2_2b_step_time_vs_batch.png new file mode 100644 index 00000000000..aafa90d6d93 Binary files /dev/null and b/docs/source/_static/img/gpt2_2b_step_time_vs_batch.png differ diff --git a/docs/source/_static/img/gpt2_v4_8_mfu_batch.png b/docs/source/_static/img/gpt2_v4_8_mfu_batch.png new file mode 100644 index 00000000000..0247e85b3c1 Binary files /dev/null and b/docs/source/_static/img/gpt2_v4_8_mfu_batch.png differ diff --git a/docs/source/_static/img/image-1.png b/docs/source/_static/img/image-1.png new file mode 100644 index 00000000000..1eddfc654c5 Binary files /dev/null and b/docs/source/_static/img/image-1.png differ diff --git a/docs/source/_static/img/image-2.png b/docs/source/_static/img/image-2.png new file mode 100644 index 00000000000..1349a8cfda1 Binary files /dev/null and b/docs/source/_static/img/image-2.png differ diff --git a/docs/source/_static/img/image-3.png b/docs/source/_static/img/image-3.png new file mode 100644 index 00000000000..fc3d1112ef5 Binary files /dev/null and b/docs/source/_static/img/image-3.png differ diff --git a/docs/source/_static/img/image-4.png b/docs/source/_static/img/image-4.png new file mode 100644 index 00000000000..0d27d0bd4d4 Binary files /dev/null and b/docs/source/_static/img/image-4.png differ diff --git a/docs/source/_static/img/image.png b/docs/source/_static/img/image.png new file mode 100644 index 00000000000..bf049acdbbc Binary files /dev/null and b/docs/source/_static/img/image.png differ diff --git a/docs/source/_static/img/llama2_2b_bsz128.png b/docs/source/_static/img/llama2_2b_bsz128.png new file mode 100644 index 00000000000..ddf28875a79 Binary files /dev/null and b/docs/source/_static/img/llama2_2b_bsz128.png differ diff --git a/docs/source/_static/img/mesh_spmd2.png b/docs/source/_static/img/mesh_spmd2.png new file mode 100644 index 00000000000..cd7bf793711 Binary files /dev/null and b/docs/source/_static/img/mesh_spmd2.png differ diff --git 
a/docs/source/_static/img/perf_auto_vs_manual.png b/docs/source/_static/img/perf_auto_vs_manual.png new file mode 100644 index 00000000000..4ef5f18c3b2 Binary files /dev/null and b/docs/source/_static/img/perf_auto_vs_manual.png differ diff --git a/docs/source/_static/img/pytorchXLA_flow.svg b/docs/source/_static/img/pytorchXLA_flow.svg new file mode 100644 index 00000000000..3812141ce48 --- /dev/null +++ b/docs/source/_static/img/pytorchXLA_flow.svg @@ -0,0 +1 @@ + diff --git a/docs/source/_static/img/spmd_debug_1.png b/docs/source/_static/img/spmd_debug_1.png new file mode 100644 index 00000000000..21e6f0554ed Binary files /dev/null and b/docs/source/_static/img/spmd_debug_1.png differ diff --git a/docs/source/_static/img/spmd_debug_1_light.png b/docs/source/_static/img/spmd_debug_1_light.png new file mode 100644 index 00000000000..9f2f060b2d0 Binary files /dev/null and b/docs/source/_static/img/spmd_debug_1_light.png differ diff --git a/docs/source/_static/img/spmd_debug_2.png b/docs/source/_static/img/spmd_debug_2.png new file mode 100644 index 00000000000..66e544f3355 Binary files /dev/null and b/docs/source/_static/img/spmd_debug_2.png differ diff --git a/docs/source/_static/img/spmd_debug_2_light.png b/docs/source/_static/img/spmd_debug_2_light.png new file mode 100644 index 00000000000..87deb04ce43 Binary files /dev/null and b/docs/source/_static/img/spmd_debug_2_light.png differ diff --git a/docs/source/_static/img/spmd_mode.png b/docs/source/_static/img/spmd_mode.png new file mode 100644 index 00000000000..dd9b5cc69cc Binary files /dev/null and b/docs/source/_static/img/spmd_mode.png differ diff --git a/docs/source/_static/img/torchbench_pjrt_vs_xrt.svg b/docs/source/_static/img/torchbench_pjrt_vs_xrt.svg new file mode 100644 index 00000000000..effe9b72be8 --- /dev/null +++ b/docs/source/_static/img/torchbench_pjrt_vs_xrt.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/source/_static/img/torchbench_tfrt_vs_se.svg b/docs/source/_static/img/torchbench_tfrt_vs_se.svg new file mode 100644 index 00000000000..161f0433b0a --- /dev/null +++ b/docs/source/_static/img/torchbench_tfrt_vs_se.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/docs/source/debug.rst b/docs/source/debug.rst new file mode 100644 index 00000000000..7c6a6eee671 --- /dev/null +++ b/docs/source/debug.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../TROUBLESHOOTING.md \ No newline at end of file diff --git a/docs/source/gpu.rst b/docs/source/gpu.rst new file mode 100644 index 00000000000..79d8385467a --- /dev/null +++ b/docs/source/gpu.rst @@ -0,0 +1 @@ +.. mdinclude:: ../gpu.md \ No newline at end of file diff --git a/docs/source/index.rst b/docs/source/index.rst index 7823096ae81..fb336975aa8 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -1,3 +1,21 @@ +:github_url: https://github.com/pytorch/xla + +PyTorch/XLA documentation +=================================== +PyTorch/XLA is a Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs. + +.. toctree:: + :hidden: + + self + +.. toctree:: + :glob: + :maxdepth: 1 + :caption: Docs + + * + .. mdinclude:: ../../API_GUIDE.md PyTorch/XLA API @@ -89,13 +107,3 @@ debug .. autofunction:: counter_value .. autofunction:: metric_names .. autofunction:: metric_data - -.. mdinclude:: ../pytorch_xla_overview.md -.. mdinclude:: ../../TROUBLESHOOTING.md -.. mdinclude:: ../pjrt.md -.. mdinclude:: ../dynamo.md -.. mdinclude:: ../fsdp.md -.. mdinclude:: ../ddp.md -.. 
mdinclude:: ../gpu.md -.. mdinclude:: ../spmd.md -.. mdinclude:: ../fsdpv2.md diff --git a/docs/source/multi_process_distributed.rst b/docs/source/multi_process_distributed.rst new file mode 100644 index 00000000000..096d969a6d8 --- /dev/null +++ b/docs/source/multi_process_distributed.rst @@ -0,0 +1,2 @@ +.. mdinclude:: ../ddp.md +.. mdinclude:: ../fsdp.md diff --git a/docs/source/runtime.rst b/docs/source/runtime.rst new file mode 100644 index 00000000000..b9739569fcf --- /dev/null +++ b/docs/source/runtime.rst @@ -0,0 +1 @@ +.. mdinclude:: ../pjrt.md diff --git a/docs/source/spmd.rst b/docs/source/spmd.rst new file mode 100644 index 00000000000..0010244e1eb --- /dev/null +++ b/docs/source/spmd.rst @@ -0,0 +1 @@ +.. mdinclude:: ../spmd.md \ No newline at end of file diff --git a/docs/source/torch_compile.rst b/docs/source/torch_compile.rst new file mode 100644 index 00000000000..4c399f12cf6 --- /dev/null +++ b/docs/source/torch_compile.rst @@ -0,0 +1 @@ +.. mdinclude:: ../dynamo.md diff --git a/docs/spmd.md b/docs/spmd.md index 28da4af0969..4f4ead2146d 100644 --- a/docs/spmd.md +++ b/docs/spmd.md @@ -8,7 +8,7 @@ In this user guide, we discuss how [GSPMD](https://arxiv.org/abs/2105.04663) is [GSPMD](https://arxiv.org/abs/2105.04663) is an automatic parallelization system for common ML workloads. The XLA compiler will transform the single device program into a partitioned one with proper collectives, based on the user provided sharding hints. This feature allows developers to write PyTorch programs as if they are on a single large device without any custom sharded computation ops and/or collective communications to scale. -![alt_text](assets/spmd_mode.png "image_tooltip") +![alt_text](_static/img/spmd_mode.png "image_tooltip") _Figure 1. Comparison of two different execution strategies, (a) for non-SPMD and (b) for SPMD._ To support GSPMD in PyTorch/XLA, we are introducing a new execution mode. Before GSPMD, the execution mode in PyTorch/XLA assumed multiple model replicas, each with a single core (Figure 1.a). This mode of execution, as illustrated in the above suits data parallelism frameworks, like the popular PyTorch [Distributed Data Parallel (DDP)](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or Fully Sharded Data Parallel (FSDP), but is also limited in that a replica can only reside on one device core for execution. PyTorch/XLA SPMD introduces a new execution mode that assumes a single replica with multiple cores (Figure 1.b), allowing a replica to run across multiple device cores. This shift unlocks more advanced parallelism strategies for better large model training performance. @@ -98,7 +98,7 @@ For a given cluster of devices, a physical mesh is a representation of the inter We derive a logical mesh based on this topology to create sub-groups of devices which can be used for partitioning different axes of tensors in a model. -![alt_text](assets/mesh_spmd2.png "image_tooltip") +![alt_text](_static/img/mesh_spmd2.png "image_tooltip") We abstract logical mesh with [Mesh API](https://github.com/pytorch/xla/blob/4e8e5511555073ce8b6d1a436bf808c9333dcac6/torch_xla/distributed/spmd/xla_sharding.py#L17). The axes of the logical Mesh can be named. 
Here is an example: @@ -502,8 +502,8 @@ from torch_xla.distributed.spmd.debugging import visualize_tensor_sharding generated_table = visualize_tensor_sharding(t, use_color=False) ``` - - visualize_tensor_sharding example on TPU v4-8(single-host) + + visualize_tensor_sharding example on TPU v4-8(single-host) - Code snippet used `visualize_sharding` and visualization result: @@ -514,8 +514,8 @@ sharding = '{devices=[2,2]0,1,2,3}' generated_table = visualize_sharding(sharding, use_color=False) ``` - - visualize_sharding example on TPU v4-8(single-host) + + visualize_sharding example on TPU v4-8(single-host) You could use these examples on TPU/GPU/CPU single-host and modify it to run on multi-host. And you could modify it to sharding-style `tiled`, `partial_replication` and `replicated`.
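As a rough end-to-end sketch tying these pieces together (assuming a single host with four devices, e.g. a v4-8, and the SPMD APIs referenced earlier in this guide; adjust `mesh_shape` for other slices):

```python
import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
from torch_xla.distributed.spmd.debugging import visualize_tensor_sharding

xr.use_spmd()

num_devices = xr.global_runtime_device_count()
device_ids = np.arange(num_devices)
# A 2x2 logical mesh on a 4-device host; axis names are user-chosen labels.
mesh = xs.Mesh(device_ids, mesh_shape=(2, num_devices // 2), axis_names=('x', 'y'))

t = torch.randn(8, 8).to(xm.xla_device())
# Tile the tensor across both mesh axes; a None entry in the partition spec
# would leave that tensor dimension replicated instead.
xs.mark_sharding(t, mesh, ('x', 'y'))

generated_table = visualize_tensor_sharding(t, use_color=False)
```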