Imputation and aggregation of mapped ETD data¶

This notebook demonstrates the steps used to aggregate and impute mapped ETD data.

Currently, this process uses pandas for most steps and is very memory intensive; it will be possible to optimize this in the future.

In [ ]:
import pandas as pd

import etdmap
import etdtransform

# Set the standard etdmap options
etdmap.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdmap.options.bsv_metadata_file = "../../demodata/bsv_metadata_after_mapping.xlsx" # change to the path of the Excel file with the data processor metadata (BSV is the data processor in this case)
etdmap.options.aggregate_folder_path = "../../demodata/aggregate"

etdtransform.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdtransform.options.aggregate_folder_path = "../../demodata/aggregate"

# Aggregate all mapped data into a single file with 5-minute intervals
# Saves a parquet file
etdtransform.aggregate.aggregate_hh_data_5min()

# read our aggregated data
df = etdtransform.aggregate.read_hh_data(interval="default")
In [2]:
df
Out[2]:
          ReadingDate          ElektriciteitNetgebruikHoog  ElektriciteitNetgebruikLaag  ElektriciteitTerugleveringHoog  ElektriciteitTerugleveringLaag  ElektriciteitsgebruikWarmtepomp  ...  WarmteproductieCV
0         2023-01-01 00:00:00  822.16                       1157.82                      3397.66                         2268.15                         1234.34                          ...  NaN
1         2023-01-01 00:05:00  822.16                       1157.835                     3397.66                         2268.15                         1234.34                          ...  NaN
2         2023-01-01 00:10:00  822.16                       1157.851                     3397.66                         2268.15                         1234.35                          ...  NaN
3         2023-01-01 00:15:00  822.16                       1157.869                     3397.66                         2268.15                         1234.35                          ...  NaN
4         2023-01-01 00:20:00  822.16                       1157.965                     3397.66                         2268.15                         1234.43                          ...  NaN
...       ...                  ...                          ...                          ...                             ...                             ...                              ...  ...
17868404  2023-12-31 23:40:00  7313.202                     6236.055                     4661.911                        2500.666                        5611.84                          ...  NaN
17868405  2023-12-31 23:45:00  7313.202                     6236.111                     4661.911                        2500.666                        5611.84                          ...  NaN
17868406  2023-12-31 23:50:00  7313.202                     6236.48                      4661.911                        2500.666                        5612.15                          ...  NaN
17868407  2023-12-31 23:55:00  7313.202                     6236.531                     4661.911                        2500.666                        5612.15                          ...  NaN
17868408  2024-01-01 00:00:00  7313.202                     6236.571                     4661.911                        2500.666                        5612.16                          ...  NaN

17868409 rows × 107 columns
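
As noted above, this pipeline is memory intensive. Below is a minimal sketch for checking the footprint of the aggregated DataFrame with standard pandas calls; it is purely illustrative and not part of the etdtransform API.

In [ ]:
# Estimate the in-memory size of the aggregated DataFrame.
mem_gb = df.memory_usage(deep=True).sum() / 1e9
print(f"Shape: {df.shape}, approximate memory usage: {mem_gb:.1f} GB")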

Now let's prepare the data for imputation so that gaps can be filled as well as possible. For this, we calculate average values per project and per measurement variable for each reading date.

We also specify which of the cumulative columns in the data model should be imputed. This time, we choose all cumulative columns.
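
If only a subset of the cumulative columns were needed, one option is to filter the full list from the data model; the prefix used below is only an illustration.

In [ ]:
# Hypothetical example: keep only the electricity-related cumulative columns.
electricity_cum_cols = [
    col
    for col in etdmap.data_model.cumulative_columns
    if col.startswith("Elektriciteit")
]
electricity_cum_cols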

Generate project averages for changes in cumulative columns¶

These averages will support the imputation of missing data. The function also logs error messages for ReadingDates that have no average value because ALL households are missing data at that time.

Such values are instead imputed with linear imputation or filled with 0s rather than derived from the average, so these errors are not necessarily a problem, but they are important to track in the logs.
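
These messages are only visible if logging is configured in the notebook; a minimal sketch, assuming etdtransform emits them through Python's standard logging module:

In [ ]:
# Show log messages (including the errors described above) in the notebook,
# assuming the package uses Python's standard logging module.
import logging

logging.basicConfig(level=logging.INFO)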

In [ ]:
# sort data by project, house id, and reading date
df = etdtransform.impute.sort_for_impute(df, project_id_column="ProjectIdBSV")

# get the list of cumulative columns; we can also choose a subset
cum_cols_list = etdmap.data_model.cumulative_columns

# Per project: calculate average change in cumulative columns
diff_columns, diffs_df, max_bounds_df = etdtransform.impute.prepare_diffs_for_impute(
    df,
    project_id_column="ProjectIdBSV",
    cumulative_columns=cum_cols_list,
    sorted=True,
)

diffs_df.head()
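
The returned diffs_df can also be inspected directly to see which difference columns still contain missing averages; a quick check with plain pandas:

In [ ]:
# Count missing values per column in the project-level diffs
# to see where averages could not be calculated.
diffs_df.isna().sum().sort_values(ascending=False).head(10)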

Impute¶

After this we will impute missing values.

In [ ]:
# Impute the household data (assuming 5-minute intervals)
df_imputed = etdtransform.aggregate.impute_hh_data_5min(
    df,
    cum_cols=cum_cols_list,
    sorted=True,
    diffs_calculated=True,
)

After this, we will add the calculated variables. We impute all original variables first in order to stay as close as possible to the original values before performing further calculations. To avoid too many issues, households with a lot of missing data have already been excluded. Future analyses need to take this order of operations into account and may choose to use the _is_imputed variables to re-introduce missing values or to adjust the imputation process.
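
For example, an analysis that prefers to keep the original gaps could mask the imputed values again using the corresponding _is_imputed flags. A hedged sketch, in which the exact column names are assumptions:

In [ ]:
# Hypothetical sketch: set imputed values back to missing for one variable.
# The value/flag column names are assumptions; check df_imputed.columns
# for the actual names produced by the imputation step.
value_col = "ElektriciteitsgebruikWarmtepomp"
flag_col = f"{value_col}_is_imputed"

if flag_col in df_imputed.columns:
    df_masked = df_imputed.copy()
    mask = df_masked[flag_col].fillna(False).astype(bool)
    df_masked.loc[mask, value_col] = pd.NA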

Add calculated variables¶

In [ ]:
# Add calculated variables after imputation
df_calculated = etdtransform.aggregate.add_calculated_columns_to_hh_data(df_imputed)
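
To see which variables this step added, one can simply compare the column sets before and after; this is plain pandas and not part of the etdtransform API.

In [ ]:
# List the calculated columns that were added on top of the imputed data.
sorted(set(df_calculated.columns) - set(df_imputed.columns))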

After adding the calculated variables, we can finally resample the data to different intervals, aggregate over projects, and remove data that has not passed the validation checks.

Resampling and aggregation¶

In [ ]:
# Resample data and remove all column data that does not pass minimum validation check
# Saves parquet files
etdtransform.aggregate.resample_hh_data(intervals=("60min", "15min", "5min"))

# Aggregate per project and save as parquet files
etdtransform.aggregate.aggregate_project_data(intervals=("60min", "15min", "5min"))
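
Once the parquet files have been written, the resampled household data can be read back in the same way as before. The sketch below assumes that the interval names passed to resample_hh_data are also valid values for read_hh_data.

In [ ]:
# Sketch: read the hourly household data back from the aggregate folder.
# The interval value "60min" is an assumption based on the resampling step above.
df_60min = etdtransform.aggregate.read_hh_data(interval="60min")
df_60min.head()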