
[BUG] _channels_to_dataframe assigns wrong timestamps if channels have different start times #262

Open
ethan-tau opened this issue Feb 1, 2022 · 1 comment


@ethan-tau

When reading a TDMS file with channels sampled at different rates, synchronization between channels is lost if _channels_to_dataframe is called with absolute_time=False. Each channel's time series is computed relative to its own first sample, rather than relative to the earliest timestamp across all of the specified channels.

A proposed fix is to read every channel with index = channel.time_track(absolute_time=True) if time_index else None, and then, if _channels_to_dataframe was called with absolute_time=False, convert the index to times relative to the earliest timestamp in the group.
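The group-relative calculation can be sketched with synthetic pandas time tracks (the channel start times and 10 µs offset below are invented for illustration, not taken from a real file):

```python
import pandas as pd

# Hypothetical absolute time tracks for two channels that start 10 us apart
track_a = pd.Timestamp("2022-02-01 01:30:33.268115") + pd.to_timedelta(range(4), unit="us")
track_b = pd.Timestamp("2022-02-01 01:30:33.268125") + pd.to_timedelta(range(4), unit="us")

# Relative-to-group time: subtract the earliest timestamp across all channels,
# so the inter-channel offset survives instead of every channel starting at zero
group_start = min(track_a[0], track_b[0])
rel_a = track_a - group_start  # starts at 0
rel_b = track_b - group_start  # starts at 10 us, offset preserved
```

Subtracting one shared start timestamp keeps the two TimedeltaIndex objects aligned with each other, which is exactly what per-channel `time_track(absolute_time=False)` loses.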

For example, consider a TDMS file that has several channels sampled at three different data rates and six different starting times:

import nptdms
import pandas as pd

tdms_file_path = "my_local_data_file.tdms"
tdms_file = nptdms.TdmsFile.open(tdms_file_path)
groups = tdms_file.groups()
channels = groups[0].channels()

# Collect the absolute time track for each channel; channels without
# waveform timing properties raise KeyError and are skipped
time_tracks = {}
for channel in channels:
    try:
        time_tracks[channel.name] = channel.time_track(absolute_time=True)
    except KeyError:
        pass

time_info_df = pd.DataFrame(
    [(k, len(v), v[0], v[-1]) for k, v in time_tracks.items()],
    columns=['name', 'samples', 't_zero', 't_final'])

Which results in

Row  name       samples  t_zero                      t_final
0    Channel A  1000000  2022-02-01 01:30:33.268115  2022-02-01 01:30:34.268113999
1    Channel B  1000000  2022-02-01 01:30:33.268115  2022-02-01 01:30:34.268113999
2    Channel C    62500  2022-02-01 01:30:33.268126  2022-02-01 01:30:34.268109999
3    Channel X    62500  2022-02-01 01:30:33.268130  2022-02-01 01:30:34.268113999
4    Channel X    62500  2022-02-01 01:30:33.268134  2022-02-01 01:30:34.268117999
5    Channel X       50  2022-02-01 01:30:33.058532  2022-02-01 01:30:34.038532000
6    Channel X       50  2022-02-01 01:30:33.062192  2022-02-01 01:30:34.042192000
7    Channel X       50  2022-02-01 01:30:33.065852  2022-02-01 01:30:34.045852000

_channels_to_dataframe with its default arguments left-justifies the data on a channel-by-channel basis. As a result, an event sampled at an absolute time of 01:30:33.058532 on one channel is reported at the same relative time (time = 0) as an event sampled at 01:30:33.268115 on another.
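The misalignment can be reproduced without a TDMS file using a small pandas sketch (the timestamps and values below are invented for illustration): indexing each channel relative to its own first sample collapses different absolute start times onto t = 0.

```python
import pandas as pd

# Two hypothetical channels whose recordings start 10 us apart in absolute time
abs_a = pd.Timestamp("2022-02-01 01:30:33.268115") + pd.to_timedelta(range(3), unit="us")
abs_b = pd.Timestamp("2022-02-01 01:30:33.268125") + pd.to_timedelta(range(3), unit="us")

# Current behaviour: each index is made relative to its own first sample,
# so both channels appear to start at t = 0 and the 10 us offset is lost
rel_a = abs_a - abs_a[0]
rel_b = abs_b - abs_b[0]
df = pd.DataFrame({"A": pd.Series([1, 2, 3], index=rel_a),
                   "B": pd.Series([4, 5, 6], index=rel_b)})
# Rows of A and B now share index values even though the underlying
# samples were taken at different absolute times
```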

Present implementation:

def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd

    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        index = channel.time_track(absolute_time) if time_index else None
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    return pd.DataFrame.from_dict(dataframe_dict)

Proposed implementation:

def _channels_to_dataframe(channels_to_export, time_index=False, absolute_time=False, scaled_data=True):
    import pandas as pd

    dataframe_dict = OrderedDict()
    for column_name, channel in channels_to_export.items():
        # This try/except block deals with a particular group of TDMS files
        # I encountered that don't play nicely. It may be better to let the
        # exception propagate rather than silently discarding the channels;
        # if so, keep only the `index = channel.time_track(...)` line.
        try:
            index = channel.time_track(absolute_time=True) if time_index else None
        except KeyError:
            if time_index:
                continue
            index = None
        if scaled_data:
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(channel[:]), index=index)
        elif channel.scaler_data_types:
            # Channel has DAQmx raw data
            raw_data = channel.read_data(scaled=False)
            for scale_id, scaler_data in raw_data.items():
                scaler_column_name = column_name + "[{0:d}]".format(scale_id)
                dataframe_dict[scaler_column_name] = pd.Series(data=scaler_data, index=index)
        else:
            # Raw data for normal TDMS file
            raw_data = channel.read_data(scaled=False)
            dataframe_dict[column_name] = pd.Series(data=_array_for_pd(raw_data), index=index)
    df = pd.DataFrame.from_dict(dataframe_dict)
    if time_index and not absolute_time:
        # Shift the absolute index so the earliest sample across all channels
        # is at time zero, preserving the relative offsets between channels
        df.index -= df.index[0]
    return df
@adamreeve (Owner)

Hi, thanks for the bug report. Yes, I think your proposed behaviour makes more sense, but it would be a breaking change, as other people may be relying on the existing behaviour. So rather than changing the default, it would be more appropriate to add a new parameter that makes the new approach opt-in. Then, for a future 2.0 release, we could consider making the new behaviour the default.
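A minimal sketch of what such an opt-in flag could look like, assuming a hypothetical keyword name (relative_to_group is illustrative only, not part of the npTDMS API), using synthetic time tracks instead of a real TDMS file:

```python
from collections import OrderedDict
import pandas as pd

def to_dataframe(time_tracks, data, relative_to_group=False):
    """Hypothetical opt-in flag: when True, relative times are computed
    against the earliest start across all channels (offsets preserved);
    when False, each channel starts at zero (existing behaviour)."""
    group_start = min(track[0] for track in time_tracks.values())
    series = OrderedDict()
    for name, track in time_tracks.items():
        start = group_start if relative_to_group else track[0]
        series[name] = pd.Series(data[name], index=track - start)
    return pd.DataFrame.from_dict(series)

# Synthetic channels starting 10 us apart
a = pd.Timestamp("2022-02-01 01:30:33.268115") + pd.to_timedelta(range(3), unit="us")
b = pd.Timestamp("2022-02-01 01:30:33.268125") + pd.to_timedelta(range(3), unit="us")
tracks = {"A": a, "B": b}
data = {"A": [1, 2, 3], "B": [4, 5, 6]}

old = to_dataframe(tracks, data, relative_to_group=False)  # offsets lost, 3 rows
new = to_dataframe(tracks, data, relative_to_group=True)   # offsets kept, indices diverge
```

With the flag off, both channels share the same index; with it on, the index becomes the union of the two offset tracks, so the DataFrame shows which samples actually coincided.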
