PyTorch/XLA usability progress tracking #7739

Open
zpcore opened this issue Jul 24, 2024 · 0 comments
Assignees
Labels
usability Bugs/features related to improving the usability of PyTorch/XLA

Comments

@zpcore
Collaborator

zpcore commented Jul 24, 2024

Description

Status tracking for the progress of improving the usability.

Action items for APIs

Below are the APIs we plan to clean up to improve usability. The list is mostly based on the doc from @will-cromar.

| APIs | Actions | Complete Date | Related issues / Related Design / PR |
|---|---|---|---|
| xla_model.parse_xla_device() | internalize | 2024-07-18 | #7675 |
| torch_xla.launch() | introduce new API | 2024-07-12 | Improve multiprocess with torch_xla.launch() #7648 |
| xla_model.xrt_world_size() | deprecate in favor of runtime.world_size(); remove defval argument | 2024-07-24 | #7679 |
| xla_model.get_ordinal() | deprecate in favor of runtime.global_ordinal() | 2024-07-24 | #7679 |
| xla_model.get_local_ordinal() | deprecate in favor of runtime.local_ordinal() | 2024-07-24 | #7679 |
| using_pjrt() | deprecate | | #7730 |
| requires_pjrt() | deprecate | | #7730 |
| xla_real_devices | deprecate | | |
| xla_device_hw | deprecate | | |
| xla_replication_devices | replace | | |
| set_replication | replace | | |
| unlazy | delete | | |
| RateTracker | internalize | | |
| ToXlaTensorArena | internalize | | |
| check_view_sharing | delete | | |
| add_step_closure | replace | | |
| mark_step | replace | | #6751 |
| reduce_gradients | internalize | | |
| optimizer_step | replace? | | |
| save | replace/upstream | | |
| xla_rendezvous | deprecate in favor of torch.distributed | | |
| rendezvous | deprecate in favor of torch.distributed | | |
| do_on_ordinals | delete | | |
| mesh_reduce | replace and implement torch.distributed.all_gather_object | | |
| set_rng_state | delete | | |
| get_rng_state | delete | | |
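
For the rows marked "deprecate", one common pattern is to keep the old name as a thin alias that emits a `DeprecationWarning` and forwards to the replacement. The following is a minimal sketch of that pattern, not PyTorch/XLA's actual implementation: `deprecated_alias` is a hypothetical helper, and `world_size` here is a stand-in that returns a fixed value instead of querying the runtime.

```python
import functools
import warnings


def deprecated_alias(new_fn, old_name, new_name):
    """Return a wrapper that warns about `old_name`, then forwards to `new_fn`.

    Hypothetical helper for illustration only.
    """

    @functools.wraps(new_fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{old_name} is deprecated; use {new_name} instead.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        return new_fn(*args, **kwargs)

    return wrapper


def world_size():
    """Stand-in for runtime.world_size(); returns a fixed value (assumption)."""
    return 8


# The old name stays importable but nudges callers toward the new API.
xrt_world_size = deprecated_alias(
    world_size, "xla_model.xrt_world_size()", "runtime.world_size()"
)
```

Callers of the old name keep working during the deprecation window and see the replacement spelled out in the warning, so the alias can be deleted in a later release without a silent break.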

Action items for integration between PyTorch/XLA and cloud infra

Distributed Training

How to handle distributed checkpointing

  • How to resume distributed training

Integration with GCP (especially preemptible resources) makes usability worse

  • GCP persistent disk operations and storage for large models/datasets with sharding
  • Retrieve logs from all workers, including the failing worker.
  • Stability: high chance that one of the workers will stop working

Debug and logging

  • A node can fail without any notification

Tutorial support

  • Check why the FSDP/SPMD tutorial is not working.
  • Incorrect use of FSDP is hard to spot.

Freshness of documentation

  • Not only the documentation on our site, but also third-party documentation such as Hugging Face's

Debugging

  • Support notification of dynamic graphs
@zpcore zpcore added the usability Bugs/features related to improving the usability of PyTorch/XLA label Jul 24, 2024
@zpcore zpcore self-assigned this Jul 24, 2024