Update Openxla-pin to Sep13, libtpu-pin to Sep13, jax to 0.4.33_nightly #7916

Merged (46 commits) on Sep 14, 2024

Conversation

ManfeiBai
Collaborator

@ManfeiBai ManfeiBai commented Aug 27, 2024

Update the OpenXLA pin to the Sep 13 commit: 54052bcbe0cda5b0a7be7c8e22eb2bfef025786e

Also update the libtpu version to Sep 13.

Limit the TPU CI test XLA_EXPERIMENTAL=nonzero:masked_select:nms python3 test/ds/test_dynamic_shapes.py -v to TPU v4 CI only.
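One way to express a restriction like this inside a test file, rather than in the CI script, is a version-gated skip. The sketch below is only an illustration and is not the change made in this PR: the TPU_VERSION environment variable and the helper names tpu_major_version and only_on_tpu_v4 are hypothetical, and the real CI may detect the TPU generation differently.

```python
import os
import unittest

# Hypothetical environment variable; the actual CI may expose the TPU
# generation through a different mechanism.
TPU_VERSION_ENV = "TPU_VERSION"


def tpu_major_version(env=None):
    """Return the TPU generation as an int, or None if unset or unparsable."""
    env = os.environ if env is None else env
    raw = env.get(TPU_VERSION_ENV, "").strip().lower().lstrip("v")
    # Accept values such as "4", "v4", or "v4-8".
    digits = raw.split("-")[0]
    return int(digits) if digits.isdigit() else None


def only_on_tpu_v4(test_item):
    """Skip the decorated test or test class unless running on TPU v4."""
    return unittest.skipUnless(
        tpu_major_version() == 4, "dynamic shape tests run on TPU v4 only"
    )(test_item)
```

Applied as a decorator on a test class, only_on_tpu_v4 reports the test as skipped on other TPU generations instead of letting it fail.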

@ManfeiBai ManfeiBai changed the title Update openxla-pin Update Openxla-pin Aug 27, 2024
@JackCaoG
Collaborator

The failure seems real:

/github/home/.cache/bazel/_bazel_root/197a057057a49e5811107144e2d78508/external/bazel_tools/tools/build_defs/repo/http.bzl:372:31: in <toplevel>
    ERROR: /__w/xla/xla/pytorch/xla/torch_xla/csrc/runtime/BUILD:500:14: no such package '@tsl//tsl/lib/core': BUILD file not found in directory 'tsl/lib/core' of external repository @tsl. Add a BUILD file to a directory to mark it as a package. and referenced by '//torch_xla/csrc/runtime:ifrt_computation_client_test'
    ERROR: Analysis of target '//torch_xla/csrc/runtime:ifrt_computation_client_test' failed; build aborted: Analysis failed

@ManfeiBai ManfeiBai marked this pull request as ready for review August 30, 2024 08:05
@ManfeiBai ManfeiBai changed the title Update Openxla-pin Update Openxla-pin to Aug26 Aug 30, 2024
@JackCaoG JackCaoG added the tpuci label Aug 30, 2024
@JackCaoG
Collaborator

Hmm, I don't know why the TPU test is still skipped. Let me trigger it again.

@ManfeiBai
Collaborator Author

CI failed with "ERROR: Failed to query remote execution capabilities: UNAVAILABLE: Credentials failed to obtain metadata". This looks flaky, so I clicked rerun.

@JackCaoG
Collaborator

@ManfeiBai can you build this locally and check if pallas test can pass?

@lsy323
Collaborator

lsy323 commented Aug 30, 2024

You need to push a new commit after the tpuci label is applied for the TPU CI to run.

@ManfeiBai
Collaborator Author

ManfeiBai commented Aug 30, 2024

> @ManfeiBai can you build this locally and check if pallas test can pass?

I tried locally on v4-8 and it failed: https://gist.github.com/ManfeiBai/6dfc589f7d2f297c1954b675022dacee

The errors include:

RuntimeError: Numpy is not available

RuntimeError: Bad StatusOr access: INVALID_ARGUMENT: Failed to parse the Mosaic module

RuntimeError: Bad StatusOr access: INTERNAL: Mosaic failed to compile TPU kernel: Bad lhs type in tpu.matmul

AssertionError: False is not true

@ManfeiBai ManfeiBai force-pushed the ManfeiBai-patch-101 branch 2 times, most recently from ccb7255 to 29db342 Compare September 4, 2024 21:55
@ManfeiBai
Collaborator Author

I tried locally on v4-8 without updating libtpu; PJRT_DEVICE=TPU python test/test_operations.py -v failed with:

======================================================================
ERROR: test_triangular_solve (__main__.TestOpBuilder)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/pytorch/xla/test/test_operations.py", line 2531, in test_triangular_solve
    self.runOpBuilderTest(
  File "/pytorch/xla/test/test_operations.py", line 2437, in runOpBuilderTest
    results = xu.as_list(aten_fn(*tensors, **kwargs))
  File "/pytorch/xla/test/test_operations.py", line 2524, in aten_fn
    return torch.triangular_solve(
RuntimeError: Calling torch.triangular_solve on a CPU tensor requires compiling PyTorch with BLAS. Please use PyTorch built with BLAS support.

----------------------------------------------------------------------

@ManfeiBai
Collaborator Author

After skipping test/test_pallas.py and test/test_pallas_spmd.py, the other TPU CI tests pass with libtpu 0821.

The next step is to bring test/test_pallas.py and test/test_pallas_spmd.py back.
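For local runs, the temporary exclusion can be sketched as a small filter over the list of test files. This is a Python stand-in for illustration only; the real test/tpu/run_tests.sh is a shell script, and select_tests and run_tests are hypothetical helper names, not part of the repository.

```python
import subprocess
import sys

# Test files named in this thread as temporarily skipped.
SKIPPED = {"test/test_pallas.py", "test/test_pallas_spmd.py"}


def select_tests(candidates, skipped=SKIPPED):
    """Return candidates in their original order, dropping skipped paths."""
    return [path for path in candidates if path not in skipped]


def run_tests(paths):
    """Run each test file with the current interpreter; return failed paths."""
    failed = []
    for path in paths:
        result = subprocess.run([sys.executable, path, "-v"])
        if result.returncode != 0:
            failed.append(path)
    return failed
```

Keeping the skipped files in one named set makes it easy to bring them back later by emptying the set.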

@JackCaoG
Collaborator

JackCaoG commented Sep 5, 2024

sg, you can ignore test/test_pallas_spmd.py and test/test_pallas.py for now, but we likely can not use 08/21 libtpu. Can you figure out why 08/22 libtpu will fail?

@ManfeiBai
Collaborator Author

> sg, you can ignore test/test_pallas_spmd.py and test/test_pallas.py for now, but we likely can not use 08/21 libtpu. Can you figure out why 08/22 libtpu will fail?

With the 08/22 libtpu, I ran test/tpu/run_tests.sh without test/test_pallas.py and test/test_pallas_spmd.py: https://gist.github.com/ManfeiBai/f215d9497d7d63b97c29df64ff69d107

It failed with:

free(): corrupted unsorted chunks
https://symbolize.stripped_domain/r/?trace=7f2ba0762ce1,7f2ba0a6013f,7f2ba07ab6d9&map= 
*** SIGABRT received by PID 627075 (TID 629631) on cpu 82 from PID 627075; stack trace: ***
PC: @     0x7f2ba0762ce1  (unknown)  raise
    @     0x7f2366fa91a1       1888  (unknown)
    @     0x7f2ba0a60140       2320  (unknown)
    @     0x7f2ba07ab6da  (unknown)  (unknown)
    @                0x1  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f2ba0762ce1,7f2366fa91a0,7f2ba0a6013f,7f2ba07ab6d9,0&map= 
E0905 20:27:41.467463  629631 coredump_hook.cc:316] RAW: Remote crash data gathering hook invoked.
E0905 20:27:41.467482  629631 client.cc:269] RAW: Coroner client retries enabled, will retry for up to 30 sec.
E0905 20:27:41.467487  629631 coredump_hook.cc:411] RAW: Sending fingerprint to remote end.
E0905 20:27:41.467508  629631 coredump_hook.cc:420] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0905 20:27:41.467514  629631 coredump_hook.cc:472] RAW: Dumping core locally.
base/elfcore.c:2078 Failed to write mapping 2480 at 0x7f01c8000000 of size 58597376: No space left on device(28)
E0905 20:29:27.997053  629631 process_state.cc:805] RAW: Raising signal 6 with default behavior
test/tpu/run_tests.sh: line 41: 627075 Aborted                 (core dumped) python3 examples/data_parallel/train_resnet_spmd_data_parallel.py

Possibly because of the error above (the core was dumped locally until the device ran out of space), /tmp filled up, which then caused a second failure in the tests that ran afterwards:

FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/pytorch/xla']
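A quick pre-flight check can help tell this secondary failure apart from the original crash. The sketch below uses only the Python standard library; the 1 GiB threshold and the helper name temp_dir_status are assumptions for illustration, not something used in this PR.

```python
import shutil
import tempfile

# Assumed threshold for illustration; pick a value that suits the test suite.
MIN_FREE_BYTES = 1 << 30  # 1 GiB


def temp_dir_status(min_free_bytes=MIN_FREE_BYTES):
    """Return (path, free_bytes, ok) for the directory tempfile would use.

    If tempfile cannot find any usable directory (the FileNotFoundError
    shown above), return (None, 0, False) instead of raising.
    """
    try:
        path = tempfile.gettempdir()
    except FileNotFoundError:
        return None, 0, False
    free = shutil.disk_usage(path).free
    return path, free, free >= min_free_bytes


if __name__ == "__main__":
    path, free, ok = temp_dir_status()
    print(f"temp dir: {path}, free: {free} bytes, ok: {ok}")
```

Running this before and after the crashing example would show whether the local core dump is what exhausts the temporary directory.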

@ManfeiBai
Collaborator Author


This error appears with libtpu builds between Aug 22 and Aug 26. After updating libtpu to the Aug 27 build, the error disappears.
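Narrowing down which dated build first passes can be sketched as a binary search over build dates. The thread does not say how the range was found, so this is only an illustration: first_good_build is a hypothetical helper, and is_good stands in for installing the libtpu build of a given date and running test/tpu/run_tests.sh. It assumes daily builds and that every build after the first passing one also passes.

```python
from datetime import date, timedelta


def first_good_build(known_bad, known_good, is_good):
    """Return the earliest passing build date in (known_bad, known_good].

    Assumes builds are daily and that once a build passes, every later
    build in the range passes too. is_good(day) must return True or False.
    """
    lo, hi = known_bad, known_good  # invariant: lo fails, hi passes
    while (hi - lo).days > 1:
        mid = lo + timedelta(days=(hi - lo).days // 2)
        if is_good(mid):
            hi = mid
        else:
            lo = mid
    return hi


if __name__ == "__main__":
    # Stub predicate matching the dates reported in this thread.
    reported_first_good = date(2024, 8, 27)
    found = first_good_build(
        date(2024, 8, 22), date(2024, 8, 27), lambda d: d >= reported_first_good
    )
    print(found)
```

With the stub predicate, the search checks Aug 24, Aug 25, and Aug 26 before settling on Aug 27, so three test runs are enough for this range.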

@ManfeiBai ManfeiBai changed the title Update Openxla-pin to Aug26 Update Openxla-pin to Sep5 Sep 6, 2024
@ManfeiBai
Collaborator Author

Synced offline with @JackCaoG: we will wait until a libtpu update is needed and ready, then update it together with the OpenXLA commit.

@bhavya01 bhavya01 self-requested a review September 13, 2024 18:55
@bhavya01
Collaborator

bhavya01 commented Sep 13, 2024

Can we also include the changes from https://github.com/pytorch/xla/pull/8008/files? We'll need that for the pallas tests to pass. I think we can directly include those in the pin update PR

@ManfeiBai
Collaborator Author

> Can we also include the changes from https://github.com/pytorch/xla/pull/8008/files? We'll need that for the pallas tests to pass. I think we can directly include those in the pin update PR

Thanks, rebased. Let's wait for the CI test results.

@ManfeiBai ManfeiBai changed the title Update Openxla-pin to Sep5 Update Openxla-pin to Sep13 Sep 13, 2024