Imputation and aggregation of mapped ETD data¶
This file demonstrates the steps used to aggregate and impute mapped ETD data.
Currently this process uses pandas for most steps and is very memory intensive; there is room for optimization in the future.
import pandas as pd
import etdmap
import etdtransform
# Set the standard etdmap options
etdmap.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdmap.options.bsv_metadata_file = "../../demodata/bsv_metadata_after_mapping.xlsx" # change to the path of the Excel file with the data processor metadata (BSV is the data processor in this case)
etdmap.options.aggregate_folder_path = "../../demodata/aggregate"
etdtransform.options.mapped_folder_path = "../../demodata/mapped" # change to the path of the folder where the mapped data is stored
etdtransform.options.aggregate_folder_path = "../../demodata/aggregate"
# aggregate all mapped data into a single file with 5-minute intervals
# Saves a parquet file
etdtransform.aggregate.aggregate_hh_data_5min()
# read our aggregated data
df = etdtransform.aggregate.read_hh_data(interval="default")
df
 | ReadingDate | ElektriciteitNetgebruikHoog | ElektriciteitNetgebruikLaag | ElektriciteitTerugleveringHoog | ElektriciteitTerugleveringLaag | ElektriciteitVermogen | Gasgebruik | ElektriciteitsgebruikWTW | ElektriciteitsgebruikWarmtepomp | ElektriciteitsgebruikBooster | ... | Actuele consumptie P1 [NO unit, lbl unkown] | Actuele opbrengst P1 [NO unit, lbl unkown] | CDR-RH 1 Luchtvochtigheid [%] | CDR-RH 1 Temperatuur [grC] | CDR-RH 2 CO2 [ppm] | CDR-RH 2 Luchtvochtigheid [%] | CDR-RH 2 Temperatuur [grC] | Vermogen warmtepomp [NO unit, lbl unkown] | WarmteproductieWarmTapwater | WarmteproductieCV |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2023-01-01 00:00:00 | 822.16 | 1157.82 | 3397.66 | 2268.15 | NaN | <NA> | 74.68 | 1234.34 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2023-01-01 00:05:00 | 822.16 | 1157.835 | 3397.66 | 2268.15 | NaN | <NA> | <NA> | 1234.34 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 2023-01-01 00:10:00 | 822.16 | 1157.851 | 3397.66 | 2268.15 | NaN | <NA> | <NA> | 1234.35 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 2023-01-01 00:15:00 | 822.16 | 1157.869 | 3397.66 | 2268.15 | NaN | <NA> | <NA> | 1234.35 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 2023-01-01 00:20:00 | 822.16 | 1157.965 | 3397.66 | 2268.15 | NaN | <NA> | <NA> | 1234.43 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
17868404 | 2023-12-31 23:40:00 | 7313.202 | 6236.055 | 4661.911 | 2500.666 | NaN | <NA> | <NA> | 5611.84 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17868405 | 2023-12-31 23:45:00 | 7313.202 | 6236.111 | 4661.911 | 2500.666 | NaN | <NA> | <NA> | 5611.84 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17868406 | 2023-12-31 23:50:00 | 7313.202 | 6236.48 | 4661.911 | 2500.666 | NaN | <NA> | <NA> | 5612.15 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17868407 | 2023-12-31 23:55:00 | 7313.202 | 6236.531 | 4661.911 | 2500.666 | NaN | <NA> | <NA> | 5612.15 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17868408 | 2024-01-01 00:00:00 | 7313.202 | 6236.571 | 4661.911 | 2500.666 | NaN | <NA> | <NA> | 5612.16 | 0.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
17868409 rows × 107 columns
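The combined 5-minute dataset is large, which is one reason this pipeline is memory intensive. If needed, its in-memory footprint can be checked with standard pandas introspection (a generic pandas call, not part of etdtransform):
# Report column dtypes and the deep (actual) memory usage of the frame
df.info(memory_usage="deep")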
Now let's prepare our data for imputation so we can fill gaps in the data as best we can. For this, we calculate average values per project and per measurement by reading date.
We also specify which of the cumulative columns in the data model to impute. This time, we choose all cumulative columns.
Generate project averages for changes in cumulative columns¶
These averages will support the imputation of missing data. The function also logs error messages for ReadingDates that have no average value because ALL households are missing data at that time.
These values are instead imputed with linear interpolation or by filling with 0s, rather than with the project average, so the errors are not necessarily a problem, but they are important to track in the logs.
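As a rough illustration of what triggers those log messages, one could count, per project and ReadingDate, how many households report a value for a given cumulative column (a generic pandas check, not part of etdtransform; the column name is taken from the table above):
# Count households with a non-missing value per project and timestamp
counts = df.groupby(["ProjectIdBSV", "ReadingDate"])["ElektriciteitsgebruikWarmtepomp"].count()
# ReadingDates where ALL households in a project are missing this value
all_missing = counts[counts == 0]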
# sort data by project, house id, and reading date
df = etdtransform.impute.sort_for_impute(df, project_id_column="ProjectIdBSV")
# get the list of cumulative columns; we can also choose a subset
cum_cols_list = etdmap.data_model.cumulative_columns
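As the comment above notes, a subset can also be used. A purely illustrative example (this notebook continues with the full list):
# Example subset: only the electricity-related cumulative columns
cum_cols_subset = [col for col in cum_cols_list if col.startswith("Elektriciteit")]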
# Per project: calculate average change in cumulative columns
diff_columns, diffs_df, max_bounds_df = etdtransform.impute.prepare_diffs_for_impute(
df,
project_id_column="ProjectIdBSV",
cumulative_columns=cum_cols_list,
sorted=True,
)
diffs_df.head()
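The other returned objects can be inspected in the same way; for example max_bounds_df, which (judging by its name) holds the upper bounds applied to the calculated differences:
# Inspect the bounds returned alongside the diffs
max_bounds_df.head()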
Impute¶
After this we will impute missing values.
# Impute the household data (assuming 5 minute intervals)
df_imputed = etdtransform.aggregate.impute_hh_data_5min(
df,
cum_cols=cum_cols_list,
sorted=True,
diffs_calculated=True,
)
Next we will add the calculated variables. We imputed all original variables first in order to stay as close as possible to the original values before further calculations. To limit data quality issues, households with a lot of missing data have already been excluded. Future analyses need to take this order of operations into account and may choose to use the _is_imputed variables to re-introduce missing values or to adjust the imputation process.
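For example, a downstream analysis could mask imputed values back to missing. A minimal sketch, assuming the flags are boolean columns named with an _is_imputed suffix next to each imputed column (verify the actual names in df_imputed.columns):
# Hypothetical sketch: re-introduce gaps where values were imputed
df_with_gaps = df_imputed.copy()
for col in cum_cols_list:
    flag_col = col + "_is_imputed"  # assumed naming convention
    if flag_col in df_with_gaps.columns:
        mask = df_with_gaps[flag_col].fillna(False).astype(bool)
        df_with_gaps.loc[mask, col] = None  # None coerces to NaN/<NA> per dtype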
Add calculated variables¶
# Add calculated variables after imputation
df_calculated = etdtransform.aggregate.add_calculated_columns_to_hh_data(df_imputed)
After adding the calculated variables, we can finally resample our data to different intervals, aggregate over projects, and remove data that has not passed validation checks.
Resampling and aggregation¶
# Resample data and remove all column data that does not pass the minimum validation check
# Saves parquet files
etdtransform.aggregate.resample_hh_data(intervals=("60min", "15min", "5min"))
# Aggregate per project and save as parquet files
etdtransform.aggregate.aggregate_project_data(intervals=("60min", "15min", "5min"))
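The resampled household data can then be read back for analysis, assuming read_hh_data accepts the resampled interval names:
# Read the hourly household data (interval name assumed to match the resample)
df_hourly = etdtransform.aggregate.read_hh_data(interval="60min")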