Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTEX interchange ManagerLost/VersionMismatch reporting is broken by PR #3463 #3495

Closed
benclifford opened this issue Jun 26, 2024 · 0 comments · Fixed by #3496 or #3497
Closed

HTEX interchange ManagerLost/VersionMismatch reporting is broken by PR #3463 #3495

benclifford opened this issue Jun 26, 2024 · 0 comments · Fixed by #3496 or #3497
Assignees
Labels

Comments

@benclifford
Copy link
Collaborator

PR #3463 rearranges how classes are defined in the interchange, as an effect of rearranging how the interchange is launched. Prior to PR #3463, the interchange code was imported as a module, parsl.executors.high_throughput.interchange and all top level definitions in that module existed as, for example, parsl.executors.high_throughput.interchange.ManagerLost.

When sending a ManagerLost exception back to the interchange, this would be serialized (by pickle) including a reference to module+name parsl.executors.high_throughput.interchange.ManagerLost. There is an implicit requirement that the interchange is running from the same environment as the submit side so that module+name will be resolvable/importable on the submit side when unpickling the exception.

PR #3463 changes how the interchange is launched - interchange.py now runs as the __main__ module of the interchange process, and top level definitions there exist as (for example) __main__,ManagerLost. References to __main__.ManagerLost cannot in general be deserialized on the submit side, because __main__ refers to the user workflow process.

When sending a ManagerLost back to the submit side, this appears on the submit side as a deserialization error, rather than a properly deserialized ManagerLost:

parsl.serialize.errors.DeserializationError: Failed to deserialize objects. Reason: Received exception, but handling also threw an exception: Can't get attribute 'ManagerLost' on <module '__main__' from '/home/benc/parsl/src/parsl/wl.py'>

Discovered by @khk-globus as part of Globus Compute work.

To recreate

Start a long running python_app in a parsl process. killall -STOP process_worker_pool.py to suspend the worker pool, leading to a heartbeat timeout and ManagerLost (don't terminate the worker pool because Parsl will sometimes notice that the process has died on the submit side). Observe the broken exception.

Proposed fix
Move ManagerLost and VersionMismatch into parsl.executors.high_throughput.errors (alongside WorkerLost).

Guiding principle: objects which will be serialized should be importable, not defined locally. (dill has some complications to make that not entirely true, to do with code movement, but that isn't relevant here, I think)

khk-globus added a commit that referenced this issue Jun 26, 2024
Per the analysis in #3495, defining the `ManagerLost` and `VersionMismatch`
errors in the `interchange.py` became a problem in #3463, where the interchange
now runs as `__main__`.  This makes it difficult for Dill to get the serde
correct.

The organizational fix is simply to move these classes to an importable
location, which follows the expectation that classes are available in both
local and remote locations, which defining in `__main__` can't easily
guarantee.

Fixes: #3495
@khk-globus khk-globus self-assigned this Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants