Data description for programmers/statisticians

This is a description of the measured values, intended for people analyzing the data. Medical knowledge is not assumed, so the medical detail is kept shallow.

Currently this document describes the fake dataset generated by fake_data_grein, which contains fewer markers and comorbidities than the actual data will, but the overall structure should stay the same. The document is based on consultations with clinicians but has not yet been checked by a clinician; all mistakes are my own.

Basics

The data include only hospitalized patients. We gather two kinds of data: patient data, which describe the state of the patient upon admission to hospital together with some summary values, and disease progression data, which are measured repeatedly over the course of the hospitalization. The dataset is represented by a list that contains one element for each type of data.

The primary patient-centric measurements are the final outcome (discharged, deceased or continued hospitalization) and the breathing support the patient requires. The breathing support can be one of:

  1. AA (Ambient air, no support required)
  2. Oxygen (supplemental oxygen by a nasal tube or a light mask)
  3. NIPPV (Non-invasive positive pressure ventilation)
  4. MV (Mechanical, invasive ventilation)
  5. ECMO (Extracorporeal membrane oxygenation - the patient’s blood is oxygenated outside of their body).

These levels are strictly ordered by severity, from AA (least severe) to ECMO (most severe).
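Because the levels are ordered, comparisons such as "at least as severe as MV" are meaningful. Below is a minimal sketch in Python of one way to encode the ordering. The `Breathing` enum and the `is_invasive` helper are illustrative names, not part of the dataset, and the numeric codes are arbitrary apart from their order.

```python
from enum import IntEnum


class Breathing(IntEnum):
    """Breathing support levels, ordered from least to most severe."""
    AA = 1
    OXYGEN = 2
    NIPPV = 3
    MV = 4
    ECMO = 5


def is_invasive(level: Breathing) -> bool:
    # MV and ECMO are the two invasive levels in the list above.
    return level >= Breathing.MV


# The worst level among several observations is simply the maximum.
worst = max([Breathing.AA, Breathing.OXYGEN, Breathing.NIPPV])
```

An integer-backed enum keeps the comparison operators available without any extra code, which is roughly what an ordered factor provides.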

Patient data

Patient data is stored in the patient_data list element.

Basic quantities:

  • patient_id a unique ID of the patient (unique across all study sites)
  • hospital_id a unique string identifying a study site. Study sites will be pseudonymized to increase patient anonymity, so the string will not be interpretable, but the IDs will stay the same when more data arrives.
  • age age in years
  • age_norm normalized age
  • sex sex, M or F
  • outcome final outcome of the patient, one of Discharged, Hospitalized (still hospitalized at the date of data collection), Transferred (when transferred to a different hospital) and Death. For most purposes Transferred can be treated the same as Hospitalized.
  • last_record day for which we have the last record for the patient, i.e. the date to which the outcome column refers (0 is the day of hospital admission)
  • days_from_symptom_onset number of days between symptom onset and hospitalization. "Symptom onset" is defined as the day the patient or their carer subjectively first noticed any symptoms associated with Covid-19. It might not be available and is not a very reliable marker, but it is relevant for determining whether the patient was treated very early in their disease.
  • admitted_for_covid whether the patient was originally admitted in relation to Covid diagnosis (for some patients the Covid diagnosis was discovered while treating something else)
  • best_supportive_care_from day when "best supportive care" was started. If the patient was determined to be too frail for some treatments (e.g. mechanical ventilation), this indicates the first day when treatment that would otherwise be chosen was avoided and best supportive care was initiated (0 is the day of hospital admission)
  • discontinued_medication whether any of the Covid medications was discontinued due to adverse events. Boolean.

Comorbidities:

  • BMI the body mass index at hospital admission
  • ischemic_heart_disease
  • n_hypertension_drugs the number of different anti-hypertensive drugs the patient uses regularly, as a rough measure of the severity of the hypertension. Integer; 0 means either not diagnosed with or not treated for hypertension.
  • has_hypertension_drugs boolean equivalent to n_hypertension_drugs > 0
  • heart_failure boolean
  • COPD boolean, Chronic obstructive pulmonary disease
  • asthma boolean
  • diabetes boolean
  • renal_disease boolean
  • liver_disease boolean
  • NYHA New York Heart Association score for heart failure, if available or deducible from documentation (“NA” otherwise). The score has 4 levels:
    • 1: No limitation of physical activity. Ordinary physical activity does not cause undue fatigue, palpitation, dyspnea
    • 2: Slight limitation of physical activity. Comfortable at rest. Ordinary physical activity results in fatigue, palpitation, dyspnea.
    • 3: Marked limitation of physical activity. Comfortable at rest. Less than ordinary activity causes fatigue, palpitation, or dyspnea.
    • 4: Unable to carry on any physical activity without discomfort. Symptoms of heart failure at rest. If any physical activity is undertaken, discomfort increases.
  • creatinin concentration of creatinine in serum (μmol/L)
  • pt_inr prothrombin time (Quick test) as International Normalized Ratio
  • albumin concentration of albumin in serum/plasma (g/L)
  • smoking does the patient smoke? Boolean.

Derived quantities

Those are quantities derived from disease progression data that might be useful in analysis:

  • high_creatinin creatinin above 115 μmol/L for males or above 97 μmol/L for females
  • high_pt_inr PT INR above 1.2
  • low_albumin albumin below 36 g/L
  • heart_problems NYHA > 1
  • obesity BMI > 30
  • worst_breathing the worst breathing level recorded or Death for deceased patients
  • first_day_invasive, last_day_invasive the first and last days the patient was recorded as having invasive breathing support (MV or ECMO). Note that if the patient is removed from invasive ventilation and then deteriorates once more, this range will include some days without invasive ventilation. NA if the patient never had invasive breathing support.
  • took_hcq/az/convalescent_plasma/antibiotics Took the given treatment at least once? Boolean
  • any_IL_6/d_dimer was IL-6/D-dimer ever measured? Boolean
  • comorbidities_sum number of all known comorbidities (NAs counted as not present)
  • comorbidities_sum_na number of all known comorbidities (each NA counted as one half)
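The threshold-based flags and the two comorbidity sums can be sketched as follows. This is an illustration in Python, not the code that produced the dataset. It assumes each patient is a plain dict keyed by the column names above, that a missing input (`None`) yields a missing flag, and that the caller decides which columns count as comorbidities. The patient values are made up.

```python
def derive_flags(p: dict) -> dict:
    """Compute the threshold-based flags for one patient record."""
    def above(value, limit):
        # Assumption: a missing input gives a missing flag.
        return None if value is None else value > limit

    def below(value, limit):
        return None if value is None else value < limit

    # The creatinine limit depends on sex (115 for males, 97 for females).
    creatinin_limit = 115 if p["sex"] == "M" else 97
    return {
        "high_creatinin": above(p.get("creatinin"), creatinin_limit),
        "high_pt_inr": above(p.get("pt_inr"), 1.2),
        "low_albumin": below(p.get("albumin"), 36),
        "heart_problems": above(p.get("NYHA"), 1),
        "obesity": above(p.get("BMI"), 30),
    }


def comorbidities_sum(flags: list, na_weight: float = 0.0) -> float:
    """Count comorbidities that are present; each None (NA) adds na_weight."""
    return sum(na_weight if f is None else float(bool(f)) for f in flags)


# Made-up example patient.
patient = {"sex": "F", "creatinin": 100, "pt_inr": 1.0,
           "albumin": None, "NYHA": 2, "BMI": 31.5}
flags = derive_flags(patient)
total = comorbidities_sum([True, None, False])                    # NA as absent
total_na = comorbidities_sum([True, None, False], na_weight=0.5)  # NA as half
```

Using a single function with an `na_weight` parameter keeps the two sums consistent: `comorbidities_sum` corresponds to a weight of 0 and `comorbidities_sum_na` to a weight of 0.5.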

Disease progression data

The most important part of the disease progression data is the breathing data, which contain the breathing support used on each day. These data should have no gaps and should cover the whole hospitalization period. Breathing data is stored in the breathing_data list element. The columns are:

  • patient ID of the patient, matching patient_data
  • day day of hospitalization (starting with 0 - first day at hospital)
  • breathing an ordered factor representing the breathing level as described above, including Death and Discharged as levels.

Note that day can in some cases be negative when some data is available from before hospitalization (this would almost certainly be only PCR test results).

Finally, we collect a number of clinical markers, the most important of which are the drugs the patient used. These are available in both long and wide formats (as marker_data and marker_data_wide). Markers are not measured every day and can be systematically missing for a whole site. The frequency of measurement can differ between markers.
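As an illustration of how the breathing data relate to the derived quantities worst_breathing, first_day_invasive and last_day_invasive, here is a rough Python sketch. It assumes the rows for a single patient are given as (day, breathing) tuples and that Death and Discharged rows are ignored when ranking support levels. The function name and the example rows are made up.

```python
# Severity rank of the five support levels (higher is worse).
SEVERITY = {"AA": 0, "Oxygen": 1, "NIPPV": 2, "MV": 3, "ECMO": 4}
INVASIVE = {"MV", "ECMO"}


def summarize_breathing(rows, outcome):
    """Summarize one patient's breathing rows, given as (day, breathing)."""
    # Keep only rows with an actual support level (drop Death/Discharged).
    support = [(day, b) for day, b in rows if b in SEVERITY]
    if outcome == "Death":
        worst = "Death"
    elif support:
        worst = max((b for _, b in support), key=SEVERITY.get)
    else:
        worst = None
    invasive_days = [day for day, b in support if b in INVASIVE]
    return {
        "worst_breathing": worst,
        "first_day_invasive": min(invasive_days, default=None),
        "last_day_invasive": max(invasive_days, default=None),
    }


# Made-up patient who leaves invasive ventilation and then returns to it.
rows = [(0, "Oxygen"), (1, "MV"), (2, "MV"), (3, "Oxygen"),
        (4, "MV"), (5, "Oxygen"), (6, "Discharged")]
summary = summarize_breathing(rows, outcome="Discharged")
```

In this example the range from the first to the last invasive day is 1 to 4, even though day 3 was spent on supplemental oxygen only.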

In the long format, the columns are:

  • patient ID of the patient
  • day day of hospitalization (starting with 0)
  • marker the name of the marker and/or drug taken
  • value double value of the marker
  • censored a character string indicating whether the marker observation was censored. One of left, right and none

In the wide format there is a column for each marker and, for those that can be censored, an additional xx_censored column.
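The relation between the two formats can be sketched as a pivot from one row per observation to one row per patient and day. The Python below is only an illustration. It assumes long rows are dicts with the columns listed above, that the censoring column is named by appending `_censored` to the marker name, and that the caller supplies the set of censorable markers. The example values are made up.

```python
def to_wide(long_rows, censorable):
    """Pivot long-format marker rows into one dict per (patient, day)."""
    wide = {}
    for r in long_rows:
        key = (r["patient"], r["day"])
        row = wide.setdefault(key, {"patient": r["patient"], "day": r["day"]})
        row[r["marker"]] = r["value"]
        if r["marker"] in censorable:
            # Assumed naming convention: <marker>_censored.
            row[r["marker"] + "_censored"] = r["censored"]
    # Markers not measured on a day are simply absent from that day's dict.
    return list(wide.values())


long_rows = [
    {"patient": "p1", "day": 0, "marker": "crp", "value": 3.0, "censored": "left"},
    {"patient": "p1", "day": 0, "marker": "hcq", "value": 400.0, "censored": "none"},
    {"patient": "p1", "day": 2, "marker": "crp", "value": 55.0, "censored": "none"},
]
wide_rows = to_wide(long_rows, censorable={"crp", "ferritin"})
```

The long format is usually easier to filter by marker, while the wide format is more convenient when several markers from the same day enter a model together.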

The markers are:

  • pcr_positive whether the patient's PCR test for virus presence was positive
  • pcr_value if available, the Ct number of the PCR test, which is a rough indication of the viral load present. The higher the Ct, the less virus was found. A Ct number > 35 is (mostly) considered a negative test. The concentration of viral RNA in the sample needs to increase roughly two-fold to make the Ct number drop by 1.
  • oxygen_flow when the patient is receiving supplemental oxygen (the Oxygen breathing support), this records how much oxygen they receive in liters/minute. Unfortunately, this can't be interpreted too strongly as a measure of severity, as the level of blood oxygenation achieved with the given flow must also be considered (this is not yet included in the simulated dataset but will be added).
  • crp the C-reactive protein concentration in blood in mg/L. This is a non-specific marker of inflammation and lags the actual inflammation by 1-2 days. < 3 is usually considered normal, > 30 indicates noticeable inflammation, viral pneumonias are associated with CRP around 50-100, bacterial pneumonia with CRP of roughly 100-200 (bacterial superinfection is possible in Covid patients), and CRP > 200 is associated with sepsis. The main advantage of CRP is that it changes by several orders of magnitude, much more than the measurement noise. Low levels can be censored, but that is unlikely to be a major issue.
  • d_dimer the concentration of D-dimer in blood, which indicates the amount of blood coagulation happening and can be a marker of complications (thrombosis, inflammation). D-dimer levels react quite quickly to changes in the patient's state. The normal level changes with age from roughly 0.5 for young healthy persons to around 0.8 for older patients. Values > 1 are generally considered pathological.
  • ferritin TODO, healthy patients have around 450. High levels can be censored.

For markers, missing values indicate the marker was not measured for the day.
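The two-fold rule for pcr_value means Ct is roughly a log2 scale of viral RNA concentration. A minimal Python sketch of that relationship follows. It is only an approximation, since real PCR efficiency varies, and the function name and the choice of Ct 35 as the reference point are assumptions made for illustration.

```python
def relative_viral_load(ct: float, ct_reference: float = 35.0) -> float:
    """Approximate viral RNA concentration relative to a reference Ct.

    Assumes an exact doubling of concentration per unit drop in Ct.
    """
    return 2.0 ** (ct_reference - ct)


# A sample with Ct 25 holds roughly 2**10 times more viral RNA than one at Ct 35.
load_at_25 = relative_viral_load(25.0)
# One Ct unit apart corresponds to a roughly two-fold difference.
ratio = relative_viral_load(30.0) / relative_viral_load(31.0)
```

When modelling, it is often simpler to use the Ct number directly as a (negated) log2 concentration rather than to exponentiate it.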

The drugs are:

  • Compounds with suspected activity against the virus itself:
    • hcq Hydroxychloroquine
    • az Azithromycin - usually administered in combination with HCQ
    • kaletra Kaletra (Lopinavir/Ritonavir)
  • tocilizumab which is suspected to alleviate the immune reaction to the virus and shorten the severe phase of the disease

For drugs, the values indicate the dose. Missing values indicate the patient didn't take the drug the given day.

It probably doesn't make a lot of sense to distinguish different dosing regimens of the drugs (there won't be enough data). Also, the effect of the drugs should last longer than the days on which they were taken. This is especially true for HCQ, which is removed from the body only very slowly and can stay at therapeutic concentrations for quite a long time after the patient stops taking it. For this reason it probably makes sense to distinguish only "days before taking the drug" and "days after taking the drug for the first time".
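The before/after distinction can be expressed as a per-day indicator that turns on at the first recorded dose and stays on. The Python below is a sketch under assumptions: marker rows are dicts in the long format described above, any non-missing value for the drug counts as a dose, and the function name and example doses are made up.

```python
def ever_treated_by_day(marker_rows, patient, drug, days):
    """Map each day to True if the patient took the drug on or before it."""
    dose_days = [
        r["day"] for r in marker_rows
        if r["patient"] == patient
        and r["marker"] == drug
        and r["value"] is not None  # assumption: any recorded dose counts
    ]
    first = min(dose_days, default=None)
    return {d: first is not None and d >= first for d in days}


marker_rows = [
    {"patient": "p1", "day": 2, "marker": "hcq", "value": 400.0},
    {"patient": "p1", "day": 3, "marker": "hcq", "value": 400.0},
    {"patient": "p1", "day": 3, "marker": "crp", "value": 80.0},
]
treated = ever_treated_by_day(marker_rows, "p1", "hcq", range(0, 6))
```

Here the indicator is False on days 0 and 1 and True from day 2 onwards, including days 4 and 5 when no dose was recorded.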