null vs NaN #480

Open
m-mohr opened this issue Oct 31, 2023 · 4 comments · May be fixed by #490

@m-mohr
Member

m-mohr commented Oct 31, 2023

Historically, the openEO processes have used null to encode no-data values because JSON can't encode NaN (or +/-Infinity). This was only ever meant to be a placeholder for communication through the API. Internally, the no-data value can be anything: for some data it might be 0 or 255, for some it might be NaN or null; it depends on the underlying implementation. The actual no-data values were meant to be encoded in the collection and output metadata. Thus, the intention was always that the processes can pass around NaN and +/-Infinity internally.
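To illustrate the JSON limitation described above, here is a minimal Python sketch using only the standard `json` module. The helper name `to_json_safe` is hypothetical and not part of openEO; it only shows one way non-finite floats could be replaced by null at the API boundary, under the assumption that null is the agreed placeholder.

```python
import json
import math

def to_json_safe(values):
    """Replace non-finite floats (NaN, +/-Infinity) with None.

    Hypothetical helper for illustration; None is serialized as JSON null.
    """
    return [
        None if isinstance(v, float) and not math.isfinite(v) else v
        for v in values
    ]

data = [1.5, float("nan"), float("inf"), 255]

# Strict JSON has no literal for NaN or Infinity, so with allow_nan=False
# the standard encoder refuses to serialize them and raises ValueError.
try:
    json.dumps(data, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False

# After the replacement the payload is valid JSON.
encoded = json.dumps(to_json_safe(data), allow_nan=False)
```

Note that the information about which kind of value was replaced is lost in the encoded payload, which is why the actual no-data value has to be communicated separately, for example in metadata.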

Now that we are starting on the test suite for individual processes, it turns out that we write tests that expect null to be handled in internal interfaces. Also, the process definitions get a bit awkward if we define behavior for both null and NaN, e.g. in #479. Sometimes the behavior is even different (i.e. our null definition derives from what IEEE 754 defines for NaN). And is NaN covered by the ignore_nodata parameters?

I'd like to discuss how to handle this.

@soxofaan
Member

soxofaan commented Nov 2, 2023

Note that we already have the nan process under proposal (https://github.com/Open-EO/openeo-processes/blob/master/proposals/nan.json), which can be used to express a NaN value in a (JSON-compatible) process graph.

@m-mohr
Member Author

m-mohr commented Dec 6, 2023

Conclusion from the openEO community call:

  • Keep nodata value and NaN separate unless NaN is the nodata value
  • Clarify processes, also don’t use null so much to refer to no-data values, define this more on the schema level
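
The first point can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `mean` and its `nodata` parameter are hypothetical, and `ignore_nodata` is merely modeled on the parameter name mentioned earlier in this thread. NaN is treated as missing only when NaN is itself the configured no-data value.

```python
import math

def mean(values, nodata=None, ignore_nodata=True):
    """Mean of `values`, treating only `nodata` as missing (hypothetical sketch).

    NaN is skipped only when `nodata` itself is NaN; otherwise it is an
    ordinary float value and propagates through the arithmetic.
    """
    nodata_is_nan = isinstance(nodata, float) and math.isnan(nodata)

    def is_nodata(v):
        if nodata_is_nan:
            return isinstance(v, float) and math.isnan(v)
        return v == nodata

    valid = [v for v in values if not is_nodata(v)]
    # If no-data values are present and must not be ignored, or nothing
    # valid is left, the result is the no-data value itself.
    if (len(valid) < len(values) and not ignore_nodata) or not valid:
        return nodata
    return sum(valid) / len(valid)

skipped_255 = mean([1, 255, 3], nodata=255)                    # 255 is skipped
nan_kept = mean([1.0, float("nan"), 3.0], nodata=255)          # NaN propagates
nan_skipped = mean([1.0, float("nan"), 3.0], nodata=float("nan"))
not_ignored = mean([1, 255, 3], nodata=255, ignore_nodata=False)
```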

@m-mohr m-mohr self-assigned this Dec 6, 2023
@soxofaan
Member

soxofaan commented Dec 7, 2023

(I've been pondering a bit more about this after the openEO community call, and wanted to dump some thoughts here)

This is indeed quite confusing, but it helps to be aware of, or explicit about, which environment or representation level you are talking about. Compare that with the representation of (Unicode) characters, e.g. the German letter ß: it has Unicode code point U+00DF, in UTF-8 it's encoded with two bytes \xC3\x9F, in latin1 encoding it's just one byte \xDF, in HTML you can encode it with &szlig;, in (classic) ASCII it's impossible to represent directly (unless you coerce it to ss), etc.
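
The character analogy can be checked with a few lines of Python (standard library only). The variable names are purely illustrative.

```python
import html

char = "\u00df"  # the letter ß, Unicode code point U+00DF

utf8_bytes = char.encode("utf-8")        # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")    # a single byte in latin1
from_entity = html.unescape("&szlig;")   # the named HTML entity decodes to ß

# Classic ASCII has no representation for the character at all.
try:
    char.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same abstract character has four different outcomes depending on the representation level, which is the point of the comparison with "nodata".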

Likewise, "nodata" is a more symbolic concept that has different representations in different contexts: in pure JSON null seems to be the most sensible option, in IEEE-style floats there is a specific NaN "code point" that is commonly used to encode nodata, in Python you typically use None, in geotrellis (as used in the VITO backend) you can define a custom nodata value regardless of the datatype (int, float, ...) of the data tile you're working with, in numpy it depends and requires more DIY hacks (float arrays can use the IEEE-style NaN, but for other dtypes you have to use masked arrays, or object dtype with None), in a spreadsheet you can leave cells empty, in C/C++/Java you can have null pointers, ...

The problem at the level of the openeo process specs is that they are written in JSON (schema), so you can only use null as the representation of "nodata". In some places we try to talk about NaN/not a number, but that gets confusing without the proper context. For example, openeo processes defines the processes is_nan(x) and nan(), which have subtly different "overloaded" interpretations of "not a number": nan produces the IEEE-float NaN value, which only exists in an IEEE float context, while is_nan accepts anything in a broader JSON context: a string or array is also "not a number".
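
The two readings of "not a number" can be contrasted in a short Python sketch. Both function names are hypothetical and this is not the openEO implementation; in particular, treating bool and None as "not a number" in the broad reading is an assumption made here for illustration.

```python
import math

def ieee_is_nan(x):
    """Narrow reading: True only for the IEEE 754 NaN value of a float."""
    return isinstance(x, float) and math.isnan(x)

def broad_is_nan(x):
    """Broad JSON-level reading: True for anything that is not a usable number.

    Assumption for this sketch: bool and None count as "not a number".
    """
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        return True
    return isinstance(x, float) and math.isnan(x)

nan_value = float("nan")
narrow = [ieee_is_nan(v) for v in (nan_value, 1, "Test", [1, 2])]
broad = [broad_is_nan(v) for v in (nan_value, 1, "Test", [1, 2])]
```

The two functions agree on actual numbers and on NaN, and differ only for values that are not floats in the first place, such as strings and arrays.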

At the moment I don't see anything that needs fundamental fixing, it's probably a matter of being more explicit about some details and assumptions in the descriptions and docs.

@m-mohr
Member Author

m-mohr commented Jan 3, 2024

Agreed, see PR #490
