Imputation and aggregation of mapped ETD data¶

This notebook demonstrates the steps used to aggregate and impute mapped ETD data.

Currently, this process uses pandas for most steps and is very memory intensive; it will be possible to optimize this in the future.

In [ ]:
import pandas as pd

import etdmap
import etdtransform

# Set the standard etdmap options
etdmap.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdmap.options.bsv_metadata_file = "../../demodata/bsv_metadata_after_mapping.xlsx" # change to the path of the Excel file with the data processor metadata (BSV is the data processor in this case)
etdmap.options.aggregate_folder_path = "../../demodata/aggregate"

etdtransform.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdtransform.options.aggregate_folder_path = "../../demodata/aggregate"

# Aggregate all mapped data into a single file with 5-minute intervals
# Saves a parquet file
etdtransform.aggregate.aggregate_hh_data_5min()

# read our aggregated data
df = etdtransform.aggregate.read_hh_data(interval="default")
In [2]:
df
Out[2]:
          ReadingDate          ElektriciteitNetgebruikHoog  ElektriciteitNetgebruikLaag  ElektriciteitTerugleveringHoog  ElektriciteitTerugleveringLaag  ElektriciteitsgebruikWarmtepomp  ...  WarmteproductieCV
0         2023-01-01 00:00:00  822.16                       1157.82                      3397.66                         2268.15                         1234.34                          ...  NaN
1         2023-01-01 00:05:00  822.16                       1157.835                     3397.66                         2268.15                         1234.34                          ...  NaN
2         2023-01-01 00:10:00  822.16                       1157.851                     3397.66                         2268.15                         1234.35                          ...  NaN
3         2023-01-01 00:15:00  822.16                       1157.869                     3397.66                         2268.15                         1234.35                          ...  NaN
4         2023-01-01 00:20:00  822.16                       1157.965                     3397.66                         2268.15                         1234.43                          ...  NaN
...       ...                  ...                          ...                          ...                             ...                             ...                              ...  ...
17868404  2023-12-31 23:40:00  7313.202                     6236.055                     4661.911                        2500.666                        5611.84                          ...  NaN
17868405  2023-12-31 23:45:00  7313.202                     6236.111                     4661.911                        2500.666                        5611.84                          ...  NaN
17868406  2023-12-31 23:50:00  7313.202                     6236.48                      4661.911                        2500.666                        5612.15                          ...  NaN
17868407  2023-12-31 23:55:00  7313.202                     6236.531                     4661.911                        2500.666                        5612.15                          ...  NaN
17868408  2024-01-01 00:00:00  7313.202                     6236.571                     4661.911                        2500.666                        5612.16                          ...  NaN

17868409 rows × 107 columns
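
As noted above, this pipeline is memory intensive. Below is a minimal sketch for checking the footprint of the aggregated DataFrame with standard pandas calls; it is purely illustrative and not part of the etdtransform API.

In [ ]:
# Estimate the in-memory size of the aggregated DataFrame.
mem_gb = df.memory_usage(deep=True).sum() / 1e9
print(f"Shape: {df.shape}, approximate memory usage: {mem_gb:.1f} GB")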

Now let's prepare the data for imputation so that gaps can be filled as well as possible. For this, we calculate average values per project and per measurement variable for each reading date.

We also specify which of the cumulative columns in the data model should be imputed. This time, we choose all cumulative columns.
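
If only a subset of the cumulative columns were needed, one option is to filter the full list from the data model; the prefix used below is only an illustration.

In [ ]:
# Hypothetical example: keep only the electricity-related cumulative columns.
electricity_cum_cols = [
    col
    for col in etdmap.data_model.cumulative_columns
    if col.startswith("Elektriciteit")
]
electricity_cum_cols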

Generate project averages for changes in cumulative columns¶

These averages will support the imputation of missing data. The function also logs error messages for ReadingDates that have no average value because ALL households are missing data at that time.

Such values are instead imputed with linear imputation or filled with 0s rather than derived from the average, so these errors are not necessarily a problem, but they are important to track in the logs.
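
These messages are only visible if logging is configured in the notebook; a minimal sketch, assuming etdtransform emits them through Python's standard logging module:

In [ ]:
# Show log messages (including the errors described above) in the notebook,
# assuming the package uses Python's standard logging module.
import logging

logging.basicConfig(level=logging.INFO)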

In [ ]:
# sort data by project, house id, and reading date
df = etdtransform.impute.sort_for_impute(df, project_id_column="ProjectIdBSV")

# get the list of cumulative columns; we can also choose a subset
cum_cols_list = etdmap.data_model.cumulative_columns

# Per project: calculate average change in cumulative columns
diff_columns, diffs_df, max_bounds_df = etdtransform.impute.prepare_diffs_for_impute(
    df,
    project_id_column="ProjectIdBSV",
    cumulative_columns=cum_cols_list,
    sorted=True,
)

diffs_df.head()
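
The returned diffs_df can also be inspected directly to see which difference columns still contain missing averages; a quick check with plain pandas:

In [ ]:
# Count missing values per column in the project-level diffs
# to see where averages could not be calculated.
diffs_df.isna().sum().sort_values(ascending=False).head(10)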

Impute¶

After this we will impute missing values.

In [ ]:
# Impute the household data (assuming 5-minute intervals)
df_imputed = etdtransform.aggregate.impute_hh_data_5min(
    df,
    cum_cols=cum_cols_list,
    sorted=True,
    diffs_calculated=True,
)

After this, we will add the calculated variables. We impute all original variables first in order to stay as close as possible to the original values before performing further calculations. To avoid too many issues, households with a lot of missing data have already been excluded. Future analyses need to take this order of operations into account and may choose to use the _is_imputed variables to re-introduce missing values or to adjust the imputation process.
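
For example, an analysis that prefers to keep the original gaps could mask the imputed values again using the corresponding _is_imputed flags. A hedged sketch, in which the exact column names are assumptions:

In [ ]:
# Hypothetical sketch: set imputed values back to missing for one variable.
# The value/flag column names are assumptions; check df_imputed.columns
# for the actual names produced by the imputation step.
value_col = "ElektriciteitsgebruikWarmtepomp"
flag_col = f"{value_col}_is_imputed"

if flag_col in df_imputed.columns:
    df_masked = df_imputed.copy()
    mask = df_masked[flag_col].fillna(False).astype(bool)
    df_masked.loc[mask, value_col] = pd.NA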

Add calculated variables¶

In [ ]:
# Add calculated variables after imputation
df_calculated = etdtransform.aggregate.add_calculated_columns_to_hh_data(df_imputed)
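
To see which variables this step added, one can simply compare the column sets before and after; this is plain pandas and not part of the etdtransform API.

In [ ]:
# List the calculated columns that were added on top of the imputed data.
sorted(set(df_calculated.columns) - set(df_imputed.columns))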

After adding the calculated variables, we can finally resample the data to different intervals, aggregate over projects, and remove data that has not passed the validation checks.

Resampling and aggregation¶

In [ ]:
# Resample data and remove all column data that does not pass minimum validation check
# Saves parquet files
etdtransform.aggregate.resample_hh_data(intervals=("60min", "15min", "5min"))

# Aggregate per project and save as parquet files
etdtransform.aggregate.aggregate_project_data(intervals=("60min", "15min", "5min"))
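
Once the parquet files have been written, the resampled household data can be read back in the same way as before. The sketch below assumes that the interval names passed to resample_hh_data are also valid values for read_hh_data.

In [ ]:
# Sketch: read the hourly household data back from the aggregate folder.
# The interval value "60min" is an assumption based on the resampling step above.
df_60min = etdtransform.aggregate.read_hh_data(interval="60min")
df_60min.head()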