Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: numpy 2.0 breaks OpenData store (probable pandas bug) #995

Open
1 of 3 tasks
rkingsbury opened this issue Sep 2, 2024 · 7 comments
Open
1 of 3 tasks

[Bug]: numpy 2.0 breaks OpenData store (probable pandas bug) #995

rkingsbury opened this issue Sep 2, 2024 · 7 comments
Labels

Comments

@rkingsbury
Copy link
Collaborator

rkingsbury commented Sep 2, 2024

Since updating dependencies to allow numpy 2.0 (#986 ), we have a test failure in the OpenData store that is being triggered by pandas. See, for example, this failed test run.

The exception raised is

   def resolve(self, key: str, is_local: bool):
        """
        Resolve a variable name in a possibly local context.
    
        Parameters
        ----------
        key : str
            A variable name
        is_local : bool
            Flag indicating whether the variable is local or not (prefixed with
            the '@' symbol)
    
        Returns
        -------
        value : object
            The value of a particular variable
        """
        try:
            # only look for locals in outer scope
            if is_local:
                return self.scope[key]
    
            # not a local variable so check in resolvers if we have them
            if self.has_resolvers:
                return self.resolvers[key]
    
            # if we're here that means that we have no locals and we also have
            # no resolvers
            assert not is_local and not self.has_resolvers
            return self.scope[key]
        except KeyError:
            try:
                # last ditch effort we look in temporaries
                # these are created when parsing indexing expressions
                # e.g., df[df > 0]
                return self.temps[key]
            except KeyError as err:
>               raise UndefinedVariableError(key, is_local) from err
E               pandas.errors.UndefinedVariableError: name 'np' is not defined

/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/site-packages/pandas/core/computation/scope.py:244: UndefinedVariableError

The test triggering the failure is test_update:

def test_update(s3store):
    assert len(s3store.index_data) == 2
    s3store.update(
        pd.DataFrame(
            [
                {
                    "task_id": "mp-199999",
                    "data": "asd",
                    "group": {"level_two": 4},
                    s3store.last_updated_field: datetime.utcnow(),
                }
            ]
        )
    )

I did some debugging on the resolve function in pandas (see source code here) and determined that the number in {"level_two": 4}, is getting turned into a np.int64 and that the key and is_local args to the resolve function are np and False, respectively.

Somewhere, pandas is getting confused and using np as a variable name. I'm not sure how or why this is happening but I have a feeling it is a pandas bug. The following may be relevant:

pandas-dev/pandas#54252

https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer

Version

latest

Which OS?

  • MacOS
  • Windows
  • Linux
@rkingsbury rkingsbury added the bug label Sep 2, 2024
@rkingsbury rkingsbury assigned rkingsbury and unassigned rkingsbury Sep 2, 2024
@rkingsbury
Copy link
Collaborator Author

rkingsbury commented Sep 2, 2024

@kbuma would you (or others very familiar with this Store) be able to investigate this and, if appropriate, open a bug report in pandas?

This appears to only affect the OpenDataStore, so I'm interested in opinions from @kbuma , @jmmshn , @munrojm , and @yang-ruoxi as to whether it would be less disruptive to 1) continue to allow maggma to install with numpy 2.0 (with a broken opendatastore) or 2) revert #986 and wait on numpy 2.0 support until this is fixed.

@rkingsbury
Copy link
Collaborator Author

Tagging #848 and #991 as related

@kbuma
Copy link
Contributor

kbuma commented Sep 3, 2024

i'll look into it

@rkingsbury
Copy link
Collaborator Author

Thank you @kbuma !

@kbuma
Copy link
Contributor

kbuma commented Sep 3, 2024

@rkingsbury per comments in pandas-dev/pandas#59668 "pandas does not have a released version that officially supports numpy 2.1." as of yet.

One possible path forward is to remove the open data stores (and thus the Pandas dependency) from the maggma project - we ended up using a different approach for the new builders so are not actively using these stores in MP infra. Let me know if that is an option.

@rkingsbury
Copy link
Collaborator Author

Thanks @kbuma ! FWIW, I believe I saw the bug with numpy>2 < 2.1 as well. Regardless, it's helpful to know that no one (at least no one in MP) is using OpenDataStore.

I like the idea of having some way of interfacing maggma with pandas eventually, so I'm hesitant to remove the class wholesale. What I suggest instead is

  1. disable the failing test (mark xfail)
  2. Issue a warning on OpenDataStore.__init__ to use with caution, because the store is in a a deprecated / unfinished / untested state, and in particular may be incompatible with numpy 2.0+.

That way we can have tests passing but allow future users to build on the nice work that went into OpenDataStore. What do you think? If you agree, would you be able to open a PR with those changes?

@kbuma
Copy link
Contributor

kbuma commented Sep 16, 2024

Thanks @rkingsbury ! I agree with your suggestion and will look to find some time this week to put together a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants