Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Date Math processes #247

Closed
soxofaan opened this issue May 10, 2021 · 21 comments · Fixed by #248
Closed

Date Math processes #247

soxofaan opened this issue May 10, 2021 · 21 comments · Fixed by #248
Assignees
Milestone

Comments

@soxofaan
Copy link
Member

soxofaan commented May 10, 2021

I just got this feature request:

Hi, I have a question on the usage of parameters in a UDP. Is it possible to do some date calculations on a parameter before using it later in the UDP? For example, this is a Python snippet that shows what I want to achieve:

 def pixel_inputs(vito_connection, bbox, start, end) -> DataCube:
      start_dt = pandas.to_datetime(start) - pandas.to_timedelta("90D")
      end_dt = pandas.to_datetime(end) + pandas.to_timedelta("90D")
      S2bands = vito_connection.load_collection(
           'TERRASCOPE_S2_TOC_V2',
           temporal_extent=[start_dt, end_dt]
           ....    

This works fine when I run this as a normal Python script. But when I try to include it in a UDP, the start and end are UDP parameters and the pandas.to_datetime() method doesn't know what to do with a Parameter object. Is there another way to achieve this?

I think this is an interesting use case to start introducing "date math" processes, so that they can be used inside the UDP of this use case.

A good one to start seems to be something like date_shift (or date_delta, date_offset), taking arguments:

and returning the shifted date (as string)

@m-mohr m-mohr transferred this issue from Open-EO/openeo-api May 10, 2021
@Open-EO Open-EO deleted a comment from soxofaan May 10, 2021
@m-mohr
Copy link
Member

m-mohr commented May 10, 2021

Indeed, we had date processes in the initial processes draft, because I thought that's an obvious use case, but there were also people arguing that you could do that via the client libraries better so that got removed. Anyway, I'm happy to make a draft - I think it will be based on ISO8601 though and then cli libraries hopefully make it easy enough to construct those?!

@m-mohr m-mohr self-assigned this May 10, 2021
@soxofaan
Copy link
Member Author

but there were also people arguing that you could do that via the client libraries better so that got removed.

Indeed, but those dynamics change once you start working with UDPs

@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

This seems not as trivial as it looks like. Some things that need to be considered:

  • During Summer/Winter time changes there may occur differences between +24 hours and +1 day (may lead to +23/25 hours in som libs, e.g. PHP) - probably it should only change the date component (1-31) and not anything that is of finer granularity (e.g hours).
  • Adding months (how many days? 28/29/30/31) and years (how many days? 365/365.25/366) - also deprecated in pandas, we may just want to say it as above that e.g. +1 year can only change the year component, but nothing else. Same for months, where +1 month would only change the month, but not the date part.

So we need language for that but not sure yet what most libraries actually do. Will dig into some examples.

@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

Another question is whether we support ISO 8601 timespans (e.g. date_calc(date, 'P3Y6M')) or concatenate date computations to achieve the same (e.g. date_calc(date_calc(date, 3, 'year'), 6, 'month'))

@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

Should we support strings that contain only date or time? That seems to make it much more bothersome to document and implement it seems, so I may restrict it to date AND time. Thoughts?

m-mohr added a commit that referenced this issue May 11, 2021
m-mohr added a commit that referenced this issue May 11, 2021
@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

@soxofaan PR #248 for a very simply approach

@m-mohr m-mohr linked a pull request May 11, 2021 that will close this issue
@soxofaan
Copy link
Member Author

During Summer/Winter time changes ...

We always work in UTC, right? So I don't think we should bother with summer-winter time changes.

Adding months (how many days? 28/29/30/31) and years (how many days? 365/365.25/366) - also deprecated in pandas, we may just want to say it as above that e.g. +1 year can only change the year component, but nothing else. Same for months, where +1 month would only change the month, but not the date part.

I think that can cause invalid dates, e.g.

  • 2020-02-29 (valid) + 1 year -> 2021-02-29 (invalid)
  • 2021-01-31 (valid) + 1 month -> 2021-02-31 (invalid)

Alternative solutions:

  • specify that "year" and "month" are not dynamic, but constant shortcuts for 265 days and 30 days respectively. (Like a week is constant shortcut for 7 days). Simple but will probably be annoying because for example 12 months != 1 year
  • when shifting year or month components: first round down to first of the month (save this offset for later), shift years and/or months, add initial day offset back we subtracted earlier. e.g.:
    2021-01-31 + 1 month -> 2021-01-01 (remember +30 days) + 1 month -> 2021-02-01 + 30 days -> 2021-03-03
    this will work predictable and intuitive for dates where the day component is 28 or lower and will give explainable results when the month flows over

@soxofaan
Copy link
Member Author

Another question is whether we support ISO 8601 timespans (e.g. date_calc(date, 'P3Y6M')) or concatenate date computations to achieve the same (e.g. date_calc(date_calc(date, 3, 'year'), 6, 'month'))

I'd expect that the proverbial 99% use case will be just shifting one date level, so I think that your latter proposal is just fine. Clients libraries can still jump in to hide the fact that you have to do multiple calls when you want to shift multiple date levels.

Should we support strings that contain only date or time? That seems to make it much more bothersome to document and implement it seems, so I may restrict it to date AND time. Thoughts?

I wonder how important the time level actually is in EO in general. In all VITO use cases I've seen we never bother about time granularity. That being said, I think it shouldn't be too hard to cover both the cases "date" and "date+time" simultaneously in the spec and implementation.

@jdries any thoughts?

@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

We always work in UTC, right? So I don't think we should bother with summer-winter time changes.

Unfortunately not (yet).

I think that can cause invalid dates

Indeed, okay, yet another "solution" that has disadvantages. There seems to be a good reason for panda deprecating it.

  • specify that "year" and "month" are not dynamic, but constant shortcuts for 265 days and 30 days respectively. (Like a week is constant shortcut for 7 days). Simple but will probably be annoying because for example 12 months != 1 year

Indeed, 12 months != 1 year would be an issue and can lead to quite unpredictable results. Then it's probably better to disallow month and year and let people compute the number of days on their own. At least then no unexpected surprises will happen.

  • when shifting year or month components: first round down to first of the month (save this offset for later), shift years and/or months, add initial day offset back we subtracted earlier. e.g.:
    2021-01-31 + 1 month -> 2021-01-01 (remember +30 days) + 1 month -> 2021-02-01 + 30 days -> 2021-03-03
    this will work predictable and intuitive for dates where the day component is 28 or lower and will give explainable results when the month flows over

Could be an option, but I think we should look into libraries more into detail and figure out what they are actually doing. If we find a common ground there, it's makes implementing the process easier in most cases.

@m-mohr
Copy link
Member

m-mohr commented May 11, 2021

I wonder how important the time level actually is in EO in general. In all VITO use cases I've seen we never bother about time granularity. That being said, I think it shouldn't be too hard to cover both the cases "date" and "date+time" simultaneously in the spec and implementation.

What would date_shift('2020-01-01', 1, 'hour') do? Throw an error or return '2020-01-01T01:00:00Z'? We already have the "defaults to 0" behavior for milliseconds...

@soxofaan
Copy link
Member Author

What would date_shift('2020-01-01', 1, 'hour') do?

I would return "2020-01-01": assume time 00:00:00, keep granularity for output and round down to do so

Likewise, for date_shift('2020-01-01', 25, 'hour') I would return "2020-01-02"

@soxofaan
Copy link
Member Author

soxofaan commented May 12, 2021

but I think we should look into libraries more into detail and figure out what they are actually doing. If we find a common ground there, it's makes implementing the process easier in most cases.

A couple of quick observations from the wild (Python):

  • Python builtin datetime.timedelta: largest scale to specify is day level, no support for weeks, months, years
  • pandas.to_timedelta doesn't seem to support month delta's (anymore?). Year delta's work, but it seems to be a constant offset of 365 days 05:49:12. For example:
    • "2020-01-01" + 1y -> '2020-12-31 05:49:12'
    • "2021-01-01" + 1y -> '2022-01-01 05:49:12'
      this doesn't seem like something we want to replicate
  • the dateutil package (which is used in various other date handling libs) has a relativedelta (https://dateutil.readthedocs.io/en/stable/relativedelta.html) that does support months and years, and the behavior it implements on month overflow is to stick to the last day of the month, e.g.:
    • "2020-01-31 + 1 month -> 2020-02-29
    • "2021-01-31 + 1 month -> 2021-02-28
    • "2020-02-29 + 1 year -> 2021-02-28

@soxofaan
Copy link
Member Author

we could also start with not supporting year and months, and only provide support for days, (maybe weeks), hours, minutes, ...

The discussion about years and months could be a different follow up issue

@soxofaan
Copy link
Member Author

On a more general note, what you commonly find in date math libraries like dateutil is date shifting (shifting the date some relative offset) and date replacing (setting certain date components , e.g. month, to a fixed value) functionality, provided in some symmetrical or even unified way.

Currently in #248 we have the idea to do the date shift for a single temporal level (day/hour/..) at a time because that should be enough for most use cases. But for date replacing, you probably want to set multiple components at the same time (e.g round a date down to first of month, or up to last of month).
With that in mind, and after seeing some usage patterns of some python libraries, another approach we could take is providing these two related processes:

  • date_shift(date, value)
  • date_replace(date, value)

where value is an object, with fields "day", "hour", ..., for example:

  • date_shift(date, value={"month":1, "day":-2})`: shift ahead 1 month and 2 days back
  • date_replace(date, value={"day":1, "hour":0, "minute":0, "second":0}): round down to first day of the month

@m-mohr
Copy link
Member

m-mohr commented May 12, 2021

I would return "2020-01-01": assume time 00:00:00, keep granularity for output and round down to do so

Keeping granularity sounds reasonable.

A couple of quick observations from the wild (Python):

Thanks for those observations. I'll try to add some JS, Java and R later.

we could also start with not supporting year and months, and only provide support for days, (maybe weeks), hours, minutes, ...

Indeed, although I feel like that in our domain year and month would be the core use-cases, right? I mean hours, minutes and (milli)seconds are not the granularity of usual EO processing so we'd be left with just days (and the shortcut weeks).
So if we do year and month it should support something that's not just working on a fixed number of days, but give an advantage that can't be easily computed without "looking" at a calendar.

  • date_shift(date, value={"month":1, "day":-2})`: shift ahead 1 month and 2 days back
  • date_replace(date, value={"day":1, "hour":0, "minute":0, "second":0}): round down to first day of the month

While this looks like a unified behavior on the first look, the JSON Schemas would still be different:

  • shift allows negative values and positive values without limit, excluding 0
  • replace allows the valid values for each part individually, e.g. day from 1 to 31 etc.

So an alignment doesn't seem to help a lot here and I think we can define date_shift mostly independently of a potential date_replace. Also, in process definitions I usually try to avoid objects as parameters as they are harder to grasp from a user perspective.

@m-mohr
Copy link
Member

m-mohr commented May 12, 2021

  • the dateutil package (which is used in various other date handling libs) has a relativedelta (https://dateutil.readthedocs.io/en/stable/relativedelta.html) that does support months and years, and the behavior it implements on month overflow is to stick to the last day of the month, e.g.:

    • "2020-01-31 + 1 month -> 2020-02-29
    • "2021-01-31 + 1 month -> 2021-02-28
    • "2020-02-29 + 1 year -> 2021-02-28

That behavior seems pretty logical and reasonable to me. Would that explain it well enough?

If any of the changes result in an invalid date, the corresponding part is rounded down to the next valid date. For example, adding a month to 2020-01-31 would result in 2020-02-29.

I found that much simpler to understand in contrast to:

when shifting year or month components: first round down to first of the month (save this offset for later), shift years and/or months, add initial day offset back we subtracted earlier. e.g.: 2021-01-31 + 1 month -> 2021-01-01

@soxofaan
Copy link
Member Author

soxofaan commented May 12, 2021

I feel like that in our domain year and month would be the core use-cases, right? I mean hours, minutes and (milli)seconds are not the granularity of usual EO processing so we'd be left with just days (and the shortcut weeks).
... it should support something that's not just working on a fixed number of days, but give an advantage that can't be easily computed without "looking" at a calendar.

I understand that it looks underwhelming to have a solution where a day level shift is the only practically relevant option, but in the original use case of this feature request (date manipulation in UDPs) that would already be a game changer.

the JSON Schemas would still be different ... So an alignment doesn't seem to help a lot here

Good point about the schemas, but that doesn't mean that you should drop a more loose form of uniform interface all together. Actually, in my former job I often had to do a lot of "date math" (using the arrow library) and that typically involved combinations of shift and replace (e.g. date.replace(day=1).shift(months=+1).shift(days=-1) to get last day of the month). Having a uniform interface for these methods made this a nice API to work with.

Also, in process definitions I usually try to avoid objects as parameters as they are harder to grasp from a user perspective.

True, but how would you propose to define date_replace, allowing multiple date components to be set at once?

The value of having both date_shift and date_replace (preferably in some uniform way) is that it offers the end user a lot of flexibility to implement the desired month/year overflow handling according to their use case.

dateutil ... on month overflow is to stick to the last day of the month

That behavior seems pretty logical and reasonable to me.

Yes I agree, from all options I've seen so far this one would also my preference now.

Also note that this behavior could also be used for date_replace : date_replace(day=31) would automatically snap to last day of month instead of giving some kind of value error. dateutil does this automatic snapping for example.

@m-mohr
Copy link
Member

m-mohr commented May 12, 2021

I've just pushed a new commit, please have another look. It mostly adds the date-only support.

I understand that it looks underwhelming to have a solution where a day level shift is the only practically relevant option, but in the original use case of this feature request (date manipulation in UDPs) that would already be a game changer.

The minutes/hours/seconds/milliseconds don't add much complexity so it's fine to keep them. With months and years we may seem to settle on something reasonable right now. I'll do further investigations in other languages next to verify that it's easy to implement though.

Good point about the schemas, but that doesn't mean that you should drop a more loose form of uniform interface all together. Actually, in my former job I often had to do a lot of "date math" (using the arrow library) and that typically involved combinations of shift and replace (e.g. date.replace(day=1).shift(months=+1).shift(days=-1) to get last day of the month). Having a uniform interface for these methods made this a nice API to work with.

On the other hand, we could also make date_replace work with the same API that is available in date_shift. For the usecase above that's totally fine at least.

True, but how would you propose to define date_replace, allowing multiple date components to be set at once?

In a world where there are only named parameters available, I'd go with parameters. But in JS for example that could lead to something like date_replace(date, null, null, 1) to change the day to 1 while in Python and R it's date_replace(date, day = 1). 🤨

@clausmichele
Copy link
Member

we could also start with not supporting year and months, and only provide support for days, (maybe weeks), hours, minutes, ...

The discussion about years and months could be a different follow up issue

I agree here, having the days would be already a good start and covering many use cases.

@m-mohr m-mohr added this to the 1.2.0 milestone May 18, 2021
@m-mohr
Copy link
Member

m-mohr commented May 19, 2021

So I had a look at R's lubridate (nice cheatsheet, by the way: https://rawgit.com/rstudio/cheatsheets/master/lubridate.pdf ), JS's moment.js and Java's java.time. For them, all examples work correctly, including leap month and leap second "snapping". From what you have written, I assume Pythons dateutil package will works, too. See the PR for a new version of the process, please.

@m-mohr
Copy link
Member

m-mohr commented May 20, 2021

date_shift has been merged as a new process, date_replace and/or date_get could be the next potential candidate. See #254

@m-mohr m-mohr closed this as completed May 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants