accessors for resampling/rolling/grouping? #9046

spirrobe · 2024-05-24T16:02:57Z

Hi

TL;DR: How to add accessors for resampling/groupby/rolling

I'm using xarray to handle netCDF data of a cloud radar. Many of the associated variables are usually shown in decibel (dB, i.e. log-space) while the original data is in linear space. As such, our reader (a wrapper around open_dataset with some potential preprocessing, attribute standardisation as it handles some different types/difference between netCDFs of each manufacturer etc.) by default converts the appropriate variables from linear- to log-space. So far so good.

Now, we often require some statistics, which should be calculated in linear space. For this purpose, I used accessors to replace mean/std/.. and to add skewness/kurtosis from scipy, with versions that check for an existing units attribute ("U/units") and whether its value starts with dB. (Admittedly, this isn't great and pint units would be better, but it's doing the job for now; I'm also not sure that what we would like is supported, based on https://pint.readthedocs.io/en/stable/user/log_units.html.) For illustration, this is in the __init__.py in the folder housing the reader:

import xarray as xr
import numpy as np

def _dbmean(self, **kwargs):
    # kwargs are passed to the "normal" mean
    if isinstance(self, xr.Dataset):
        # Dataset.map replaces the deprecated Dataset.apply;
        # it calls _dbmean on every data variable
        return self.map(_dbmean, keep_attrs=True, **kwargs)
    if 'units' in self.attrs and self.attrs['units'].startswith('dB'):
        # parentheses matter: convert to linear first, then average,
        # then convert back to dB
        return 10 * np.log10((10 ** (self / 10))._mean(**kwargs))
    # relies on a '_mean' accessor for DataArray as well (analogous, not shown)
    return self._mean(**kwargs)

@xr.register_dataset_accessor('mean')
def mean_dataset_accessor(dataset, **kwargs):
    def mean(**kwargs):
        return _dbmean(dataset, **kwargs)
    doc = xr.core._aggregations.DatasetAggregations.mean.__doc__
    mean.__doc__ = extenddoc(doc)
    return mean

@xr.register_dataset_accessor('_mean')
def _mean_dataset_accessor(dataset, **kwargs):
    def _mean(**kwargs):
        return xr.core._aggregations.DatasetAggregations.mean(dataset, **kwargs)
    doc = xr.core._aggregations.DatasetAggregations.mean.__doc__
    _mean.__doc__ = extenddoc(doc)
    return _mean

The above is mostly for convenience: we could always use .reduce with the appropriate function, but the accessors save us from importing _dbmean whenever we deal with those data. (extenddoc simply copies the docstring of the original function and adds some extra information about the log-space handling.)

However, the same handling of linear- and log-space would be nice for rolling/grouping/resampling.
From looking at other discussions here and in the docs, I did not find anything related but I'm probably overlooking something (or asking for something that isn't sensible in the first place).
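One possible direction, as a rough sketch rather than an established xarray pattern: a class-based accessor under its own name, so the built-in `mean` is not overridden. The accessor name `db` and the methods `in_linear`, `mean` and `rolling_mean` are hypothetical; the only assumption about the data is a `units` attribute starting with `dB`.

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("db")  # "db" is a hypothetical accessor name
class DecibelAccessor:
    """Run reductions in linear space for variables whose units start with 'dB'."""

    def __init__(self, da):
        self._da = da

    def _is_db(self):
        return str(self._da.attrs.get("units", "")).startswith("dB")

    def in_linear(self, func):
        """Apply func to the linear-space data; convert back to dB if needed."""
        if not self._is_db():
            return func(self._da)
        linear = 10 ** (self._da / 10)
        out = 10 * np.log10(func(linear))
        out.attrs.update(self._da.attrs)  # arithmetic may drop attrs, so restore them
        return out

    def mean(self, **kwargs):
        return self.in_linear(lambda x: x.mean(**kwargs))

    def rolling_mean(self, **windows):
        return self.in_linear(lambda x: x.rolling(**windows).mean())


da = xr.DataArray([30.0, 20.0, 20.0], dims="time", attrs={"units": "dBZ"})
db_mean = da.db.mean()
db_rolling = da.db.rolling_mean(time=2)
```

Since `in_linear` takes any callable, grouping and resampling could go through the same path, e.g. `da.db.in_linear(lambda x: x.groupby("time.hour").mean())`, given a datetime `time` coordinate.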

Any input is appreciated.


dcherian · 2024-05-24T16:11:29Z

Maintainer

Does pint do the right thing when calling mean on a log-unit variable? If so, I think that's the way to proceed.

If not, I'd look into writing an accessor that handled the plotting nicely for you. Or perhaps there's a way to have your data in linear-space as pint arrays, and have matplotlib render the data in log-units?

3 replies

spirrobe May 24, 2024
Author

Thanks for the input (way faster than I'd imagined :-)).

An accessor for plotting would solve those issues for visualisation yes (using norm=...) but derived stats/subsetting is the more relevant aspect.

Regarding pint and mean, I think the two MWEs below indicate that it isn't working as hoped:

import numpy as np
from pint import UnitRegistry

ureg_autoconv = UnitRegistry(autoconvert_offset_to_baseunit=True)
ureg_noautoconv = UnitRegistry(autoconvert_offset_to_baseunit=False)
basearr = np.asarray([30, 20])
a = basearr * ureg_noautoconv.dB
b = basearr * ureg_autoconv.dB
c = 10**(basearr/10)
print(a.mean(), b.mean(), 10*np.log10(c.mean()))
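For reference, the difference between the two kinds of mean for the same two values (30 dB and 20 dB) can be worked out without pint or xarray. This is only a plain-Python sketch; the helper names `db_to_linear` and `linear_to_db` are made up for illustration.

```python
import math


def db_to_linear(db):
    """Convert a decibel value to linear space."""
    return 10 ** (db / 10)


def linear_to_db(lin):
    """Convert a linear-space value to decibel."""
    return 10 * math.log10(lin)


values_db = [30.0, 20.0]

# Averaging the dB values directly
naive_mean = sum(values_db) / len(values_db)

# Averaging in linear space (1000 and 100), then converting back
linear_mean = sum(db_to_linear(v) for v in values_db) / len(values_db)
correct_mean_db = linear_to_db(linear_mean)

print(naive_mean)                 # 25.0
print(round(correct_mean_db, 2))  # 27.4
```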

import datetime
import numpy as np
import pandas as pd
import xarray as xr
from pint import UnitRegistry
ureg_autoconv = UnitRegistry(autoconvert_offset_to_baseunit=True)
ureg_noautoconv = UnitRegistry(autoconvert_offset_to_baseunit=False)

t1 = datetime.datetime.now(datetime.UTC)
dt = datetime.timedelta(days=1)
timevec =  pd.date_range(t1-dt, t1, freq=dt/1440)
rangevec = np.arange(1000)*10
# absolutely made up values
Z  = (np.arange(rangevec.size*timevec.size).reshape((timevec.size, rangevec.size)) % rangevec.size) / rangevec.size
Z = Z + 1 # to avoid issues with log /nan
data = xr.Dataset(coords={'time': timevec, 'range': rangevec})
data['Z_in_linear'] = ('time', 'range'), Z
data['Z_in_dB'] = 10*np.log10(data['Z_in_linear'])
data['Z_in_dB_with_pint_autoconv'] = data['Z_in_dB'] * ureg_autoconv.dB
data['Z_in_dB_with_pint_noautoconv'] = data['Z_in_dB'] * ureg_noautoconv.dB

print('Correct: ', 10*np.log10(data['Z_in_linear'].mean()).item())
print('Incorrect: ', np.nanmean(data['Z_in_dB']))
print('Also incorrect: ', data['Z_in_dB'].mean().item())
print('Sadly also incorrect: ', data['Z_in_dB_with_pint_autoconv'].data.mean())
print('Again incorrect with warning: ', data['Z_in_dB_with_pint_noautoconv'].data.mean().item())
# Output:
# Correct:  1.759464700955458
# Incorrect:  1.676149763312762
# Also incorrect:  1.676149763312762
# Sadly also incorrect:  1.676149763312762 decibel
# Again incorrect with warning:  1.676149763312762 decibel
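Why the dB-space mean comes out lower than the linear-space one can be seen from a short derivation (notation is mine: $Z_i$ are the linear values, $N$ their count). Averaging the dB values amounts to taking the geometric mean in linear space, which never exceeds the arithmetic mean:

```latex
\frac{1}{N}\sum_{i=1}^{N} 10\log_{10} Z_i
  = 10\log_{10}\Bigl(\prod_{i=1}^{N} Z_i\Bigr)^{1/N}
  \le 10\log_{10}\Bigl(\frac{1}{N}\sum_{i=1}^{N} Z_i\Bigr)
```

This matches the numbers above: 1.676 (mean of the dB values) is below 1.759 (mean taken in linear space).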

dcherian May 24, 2024
Maintainer

I'm not an expert here but it seems a lot easier to keep the data in linear-space, and have all your computations be correct, and then transform as necessary to get the nice plot.
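A minimal sketch of that workflow, assuming the variable is stored in linear space and only converted for display. The helper `to_db`, the variable names and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr


def to_db(da):
    """Return a dB copy of a linear-space DataArray (for display only)."""
    out = 10 * np.log10(da)
    out.attrs = {**da.attrs, "units": "dB"}
    return out


time = pd.date_range("2024-05-24", periods=4, freq="30min")
z_lin = xr.DataArray(
    [1000.0, 100.0, 10.0, 10.0], coords={"time": time}, dims="time"
)

# All statistics run on the linear data, so the built-in
# resample/groupby/rolling reductions are already correct.
hourly_lin = z_lin.resample(time="60min").mean()

# Convert only at the very end, e.g. right before plotting.
hourly_db = to_db(hourly_lin)
```

With this layout no custom accessors are needed for the statistics; only the final plotting or export step has to know about dB.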

Answer selected by spirrobe

spirrobe May 24, 2024
Author

Thanks for your time and the valuable discussion. I agree; at this stage, how mean behaves on log units looks like a topic for pint.
I'll add a note in our codebase to use linear space for groupby/resampling/rolling.
