etdmap - “Energietransitie Dataset” mapping package documentation
About etdmap
This package provides the required data model and helpers to map raw data to the Energietransitie Dataset
(ETD). The ETD is a model defining important variables for energy in the built environment, which are used to inform policy and planning decisions in the Netherlands. For an overview of the ETD and all documentation, see https://energietransitiedataset.nl/.
License
The package may only be used for open processing. You agree to publicly share the intended application, obtained insights, and applied calculation methods under the same license.
Citation
If you use this package, any code from it, or refer to its output, please use the following citation for attribution:
Witkamp, Dickinson, Izeboud (2024). etdmap: A Python package for mapping raw data to the Energietransitie Dataset (ETD) model.
The data model
The dataset includes two primary types of variables:
Metadata Variables: Descriptive information about the project or house, such as identifiers, physical attributes, or installed HVAC systems.
Performance Variables: Time-series data describing the energy performance or environmental conditions.
Examples of Variables

| Category | Variable Name | Type | Description |
|---|---|---|---|
| Metadata | ProjectIdLeverancier | String | Unique identifier for the project assigned by the data provider. |
| | Bouwjaar | Integer | Construction year of the house. |
| | WarmteopwekkerCategorie | String | Type of heating system (e.g., hybrid heat pump). |
| | Oppervlakte | Integer | Surface area in square meters. |
| Performance | ReadingDate | Date | Timestamp of the reading (Format: YYYY-MM-DD HH:MM:SS). |
| | ElektriciteitNetgebruikHoog | Number | Electricity consumption at high tariff (kWh, cumulative). |
| | | Number | Current living room temperature in degrees Celsius. |
| | Zon-opwekTotaal | Number | Total solar energy generated (kWh, cumulative). |
This categorization ensures the dataset is comprehensive for energy-related policy and planning while maintaining clarity for data contributors and users.
See the section on data processing and mapping to learn more about how to access this metadata in a Python script.
There are two levels of data ultimately used:
- data for a connected unit in the built environment, such as a household or, perhaps in the future, a charging point, and
- project-level data, an aggregated collection of network-connected units with similar characteristics.
Installation and quick start
Note: If you are only using ETD datasets for analysis, use the etdanalyze and etdtransform packages and install their requirements, which will automatically install etdmap.
To use this package on its own and map raw data in a workflow, clone the repository and install it with pip:
git clone https://github.com/Stroomversnelling/etdmap.git
cd etdmap
pip install .
Configuration
In some cases, you may be using this package on its own and may have to configure some options:
import etdmap
# Required:
etdmap.options.mapped_folder = 'mapped_folder_path' # path to folder where mapped files are stored
etdmap.options.bsv_metadata_file = 'filepath_to_excel_with_bsv_metadata' # path to Excel sheet maintained with BSV/ETD metadata (not supplier metadata)
etdmap.options.aggregate_folder = 'aggregate_folder_path' # path to folder where aggregated files are stored
Note that the first time raw files are mapped, a new index is generated or an existing index is updated. The BSV metadata file is an Excel sheet in which the auto-generated ids can then be manually mapped to additional metadata variables, such as the internal ProjectIdBSV, which assigns households to specific projects, and the Meenemen boolean, which is True for households/units that should be included in analysis and False for datasets that may need to be excluded. Some reasons to exclude a household may be:
- by request from the data supplier
- missing data that could skew analyses or is insufficient
- misbehaving devices and poor quality data
Developing and contributing
If you would like to contribute to the package code, you can create an environment and install it in editable mode:
git clone https://github.com/Stroomversnelling/etdmap.git
cd etdmap
python -m venv .venv
source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
pip install -e .
Overview
etdmap is a package that provides data mapping functionalities for energy-related datasets. It includes functions to map and transform data according to specific schemas, ensuring consistency across different sources. It also includes some utility definitions, such as cumulative_columns, which can be reused in other packages.
The process of mapping raw data files to the ETD model involves several key functions. Here’s an overview of the main functions used in this process.
Data model and column definitions and thresholds
Note: This API is relatively stable but subject to change as it is still under development. Please follow our repository or, if necessary, use a tagged release to ensure stability.
From data_model.py:
load_etdmodel()
Purpose: Loads the ETD model specification.
Description: Reads a CSV file into a Pandas dataframe that defines the structure and requirements of the ETD model.
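For example, the model specification can be loaded into a dataframe for inspection (a minimal sketch, assuming the function is importable from etdmap.data_model, as the module name above suggests):

```python
from etdmap.data_model import load_etdmodel

# Load the ETD model specification (variables, types, requirements) as a DataFrame
etdmodel_df = load_etdmodel()
print(etdmodel_df.head())
```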
load_thresholds()
Purpose: Loads predefined thresholds for data validation. These include thresholds for cumulative (annual) values as well as those for 5-minute intervals and instantaneous measurements.
Description: Reads a CSV file into a Pandas dataframe containing threshold values for various variables used in data validation.
from etdmap.record_validators import load_thresholds
thresholds_df = load_thresholds()
Additionally, there are some other dictionaries and lists that are currently used, such as a list of cumulative_columns that can be used in other packages. Other useful definitions include model_column_order, which defines the order of columns in a model, and model_column_type, which provides the Pandas series types for each column.
These are defined in data_model.py:
from etdmap.data_model import cumulative_columns, model_column_order, model_column_type
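As an illustration, these definitions can be used to inspect a mapped dataset. In this sketch the file name is hypothetical, and treating model_column_type as a mapping from column names to Pandas dtypes is an assumption based on the description above rather than a documented contract:

```python
import pandas as pd

from etdmap.data_model import cumulative_columns, model_column_order, model_column_type

df = pd.read_parquet("household_mapped.parquet")  # hypothetical mapped household file

# Which cumulative variables are present in this dataset, in model order?
present_cumulative = [
    col for col in model_column_order
    if col in cumulative_columns and col in df.columns
]
print(present_cumulative)

# Expected Pandas type for one of the model columns (assumed name -> dtype mapping)
print(model_column_type.get("ReadingDate"))
```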
Managing the mapped data files
index_helpers.py contains the functions used to manage metadata and add files to the mapped data file index. It also has functions for managing household metadata, project metadata, and BSV (aka ETD) metadata.
From index_helpers.py:
read_index()
Purpose: Reads the current index file.
Description: Loads the existing index from a parquet file or creates a new one if it doesn’t exist.
get_household_id_pairs()
Purpose: Generates pairs of HuisIdBSV and filenames in order to maintain unique ids in the dataset.
Description: Creates a list of tuples containing HuisIdBSV and corresponding filenames for new and existing entries.
update_index()
Purpose: Updates the index with new entries.
Description: Adds new entries to the index and recalculates or adds flag columns for dataset validators.
add_supplier_metadata_to_index()
Purpose: Adds metadata columns to the index.
Description: Updates the index with additional metadata from the supplier, matching on the HuisIdLeverancier column.
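The typical order in which these functions are used is sketched below. Only the module path and function names are taken from this section; the call signatures are deliberately omitted because they are not documented here:

```python
# Sketch of the indexing workflow; see the function descriptions above.
from etdmap.index_helpers import (
    read_index,                      # 1. load the existing index parquet file (or create a new one)
    get_household_id_pairs,          # 2. pair HuisIdBSV ids with raw/mapped filenames
    update_index,                    # 3. add new entries and (re)calculate dataset validator flags
    add_supplier_metadata_to_index,  # 4. merge supplier metadata on HuisIdLeverancier
)
```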
Mapping raw data to the model specification
From mapping_helpers.py:
rearrange_model_columns()
Purpose: Ensures all required columns are present and in the correct order.
Description: Adds missing columns, fills them with NA values, and arranges columns according to the model specification.
add_diff_columns()
Purpose: Calculates difference columns for cumulative variables.
Description: Computes the difference between consecutive readings for cumulative variables and adds these as new columns.
ensure_intervals()
Purpose: Ensures consistent time intervals in the data. Defaults are date_column='ReadingDate' and freq='5min'.
Description: Adds missing timestamps to ensure a consistent time series, typically with 5-minute or 15-minute intervals.
fill_down_infrequent_devices()
Purpose: Fills gaps in infrequently updated device data. At the moment, this defaults to: ElektriciteitsgebruikBoilervat, ElektriciteitsgebruikRadiator, ElektriciteitsgebruikBooster. This can be changed based on data source requirements.
Description: For certain columns that update infrequently, this function fills down the last known value.
validate_cumulative_variables()
Purpose: Validates cumulative variables in a DataFrame for various quality checks.
Description:
Time gap check: Ensures there are no gaps in readings greater than a specified time delta (default is 1 hour).
Negative difference check: Verifies that cumulative values are not decreasing over time.
Unexpected zero check: Identifies and flags unexpected zero values in cumulative readings.
Data availability check: Ensures that at least a specified percentage of values (default 90%) are not NA.
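The helpers above are typically chained into a per-dataset mapping step. The sketch below shows one plausible order; apart from the documented defaults (date_column='ReadingDate', freq='5min', and the default device columns), the call signatures are assumptions, and the input file name is hypothetical:

```python
import pandas as pd

from etdmap.mapping_helpers import (
    add_diff_columns,
    ensure_intervals,
    fill_down_infrequent_devices,
    rearrange_model_columns,
)

# Hypothetical raw household data, already renamed to ETD column names
df = pd.read_parquet("household_raw.parquet")

df = ensure_intervals(df, date_column="ReadingDate", freq="5min")  # consistent 5-minute series
df = fill_down_infrequent_devices(df)  # fill down infrequently reporting device columns
df = add_diff_columns(df)              # add diff columns for cumulative variables
df = rearrange_model_columns(df)       # add missing model columns and fix the column order
```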
Alignment of clocks from different devices and merging data (ALPHA - example code only)
There are functions in mapping_clock_helpers.py that are set up to help align clocks from multiple devices and to address situations where readings are spaced at different intervals during the mapping process.
Validators
There are various validators available. There are two major categories:
Dataset validators: These are used to validate the entire dataset, including checks on cumulative columns and differences between consecutive records.
Record validators: These are used to validate individual records, including checks on instantaneous measurements and 5-minute intervals.
Guidance for suppliers of data and for data preparation
Prepping raw data
Use the Vereist (required) column in the data model to check whether your metadata and your raw data from device readings are complete and correct. These are the currently required columns (as of 20 Feb 2025); see the sketch after the table for one way to check this programmatically:
| Entiteit | Variabele | Key | Definitie | Type variabele | Vereist | Resolutie | Wie vult? | Bron | AVG gevoelig |
|---|---|---|---|---|---|---|---|---|---|
| Metadata | ProjectIdLeverancier | nee | code toegekend door dataleverancier | string | ja | vaste waarde | Dataleverancier | Dataleverancier | ja |
| Metadata | HuisIdLeverancier | ja | code toegekend door dataleverancier | string | ja | vaste waarde | Dataleverancier | Dataleverancier | ja |
| Metadata | Weerstation | nee | dichtstbijzijnde KNMI weerstation | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | Oppervlakte | nee | Ag, oftewel m2 gebruiksoppervlak | integer | ja | vaste waarde | Dataleverancier | BAG / EP-online / Energielabel / Vabi | nee |
| Metadata | Compactheid | nee | Als/Ag, oftewel m2 verliesoppervlak / m2 gebruiksoppervlak | number | ja | vaste waarde | Dataleverancier | EP-online / Energielabel | nee |
| Metadata | Warmtebehoefte | nee | kWh_th/m2/jr, volgens NTA8800 | number | ja | vaste waarde | Dataleverancier | EP-online / Energielabel | nee |
| Metadata | PrimairFossielGebruik | nee | kWh/m2/jr, volgens NTA8800 | number | ja | vaste waarde | Dataleverancier | EP-online / Energielabel | nee |
| Metadata | Bouwjaar | nee | Oorspronkelijk bouwjaar | integer | ja | vaste waarde | Dataleverancier | BAG/Vabi | nee |
| Metadata | Renovatiejaar | nee | Jaar van renovatie | integer | ja | vaste waarde | Dataleverancier | Vabi | nee |
| Metadata | WoningType | nee | Wat voor woningbouw is het? Kiezen uit: EGW, MGW | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | WarmteopwekkerCategorie | nee | Hoe wordt verwarmd? Kiezen uit: Hybride warmtepomp, Lucht-water warmtepomp, Water-water warmtepomp, Lucht-lucht warmtepomp, Anders | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | Warmteopwekker | nee | Hoe wordt verwarmd? Kiezen uit: Collectief, Individueel | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | Ventilatiesysteem | nee | Hoe wordt geventileerd? Kiezen uit: Natuurlijke ventilatie, Type C, Type D, Type E | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | Kookinstallatie | nee | Hoe wordt gekookt? Kiezen uit: Gas, Inductie, Anders | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Metadata | EPV | nee | Wordt EPV geïnd? Kiezen uit: EPV 1.0, EPV 2.0 Basis, EPV 2.0 Hoogwaardig, Nee, Niet bekend | string | ja | vaste waarde | Dataleverancier | Dataleverancier | nee |
| Prestatiedata | HuisIdLeverancier | ja | code toegekend door dataleverancier | string | ja | vaste waarde | Dataleverancier | Dataleverancier | ja |
| Prestatiedata | ReadingDate | nee | Format: YYYY-MM-DD HH:MM:SS | date | ja | 5 minuten | Dataleverancier | Gateway / Portal | ja |
| Prestatiedata | ElektriciteitNetgebruikHoog | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Slimme meter | ja |
| Prestatiedata | ElektriciteitNetgebruikLaag | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Slimme meter | ja |
| Prestatiedata | ElektriciteitTerugleveringHoog | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Slimme meter | ja |
| Prestatiedata | ElektriciteitTerugleveringLaag | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Slimme meter | ja |
| Prestatiedata | ElektriciteitsgebruikWarmtepomp | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Gateway / Portal | ja |
| Prestatiedata | Zon-opwekTotaal | nee | kWh (cumulatief) | number | ja | 5 minuten | Dataleverancier | Gateway / Portal | ja |
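As a sketch of such a completeness check (assuming the dataframe returned by load_etdmodel() uses the column names shown in the table above, such as Entiteit, Variabele, and Vereist, and using a hypothetical supplier export file):

```python
import pandas as pd

from etdmap.data_model import load_etdmodel

raw_df = pd.read_csv("supplier_export.csv")  # hypothetical raw performance data

etdmodel = load_etdmodel()

# Variables marked as required ('ja' in the Vereist column) for performance data
required = etdmodel.loc[
    (etdmodel["Vereist"] == "ja") & (etdmodel["Entiteit"] == "Prestatiedata"),
    "Variabele",
]

missing = [col for col in required if col not in raw_df.columns]
print("Missing required columns:", missing)
```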
Validation of columns
One should check the stats for each raw data column using the collect_column_stats() function to ensure that the data meets expected statistical properties. The statistics include:
‘Identifier’: The identifier for the dataset.
‘column’: The name of the column.
‘type’: The data type of the column.
‘count’: The number of non-null values in the column.
‘missing’: The number of missing values in the column.
‘min’: The minimum value in the column, if applicable.
‘max’: The maximum value in the column, if applicable.
‘mean’: The mean value in the column, if applicable.
‘median’: The median value in the column, if applicable.
‘iqr’: The interquartile range (IQR) of the column, if applicable.
‘quantile_25’: The 25th percentile value in the column, if applicable.
‘quantile_75’: The 75th percentile value in the column, if applicable.
‘top5’: A dictionary with the top 5 most frequent values and their counts, if applicable.
This helps to validate whether or not a particular household/unit has sufficient data of sufficient quality to include in the mapped datasets.
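A usage sketch is given below; the module providing collect_column_stats() and its exact signature are not stated in this section, so both the import path and the arguments are assumptions:

```python
import pandas as pd

# Import path is an assumption; adjust it to wherever collect_column_stats()
# lives in your etdmap version.
from etdmap.index_helpers import collect_column_stats

raw_df = pd.read_parquet("household_raw.parquet")  # hypothetical raw data

# Call signature is an assumption: an identifier for the dataset plus the dataframe
stats_df = collect_column_stats("household_001", raw_df)

# Review columns with many missing values before deciding on inclusion
print(stats_df.sort_values("missing", ascending=False)[["column", "count", "missing"]])
```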
Validation of cumulative columns
The etdmap package provides helpers in mapping_helpers.py to enable the processing and validation of columns.
Cumulative columns are processed and also validated using the add_diff_columns() function. It produces a diff, a variable with the incremental increase in the cumulative variable, and expects cumulative variables to increase monotonically. In other words, they should never decrease. This assumption might be incorrect in cases where meter readings naturally decrease; in those cases, custom logic may need to be written.
By default, add_diff_columns() relies on the function validate_cumulative_variables() to ensure that cumulative variables meet specific conditions:
90% available data
Max gap of 1 hour of consecutive missing data
No “unexpected” zero values in cumulative data
No decreasing cumulative values
The validate_cumulative_variables() function checks for logical consistency in cumulative variables but may not cover all edge cases or handle complex data scenarios effectively. To customize this validation behaviour, one may write a custom wrapper function, or a new function with different checks, and pass it to add_diff_columns() using the validate_func parameter. If this validation is not correctly implemented, problems may not be reported. If drop_unvalidated is True, then the function may drop valid data or keep invalid data. By default, drop_unvalidated is False: data which does not pass these tests is kept, but problems are reported in the log.
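A sketch of passing a custom validator is shown below. The keyword arguments validate_func and drop_unvalidated are taken from the description above, but the signature a custom validator must implement (and what it should return) is an assumption, so treat this only as a starting point:

```python
import logging

import pandas as pd

from etdmap.mapping_helpers import add_diff_columns

def custom_validate(df: pd.DataFrame, **kwargs) -> bool:
    # Example of a looser check: only require that cumulative columns never decrease,
    # skipping the availability and gap checks. The expected return type is assumed.
    ok = True
    for col in [c for c in df.columns if c.startswith("Elektriciteit")]:
        series = df[col].dropna()
        if not series.is_monotonic_increasing:
            logging.warning("Column %s has decreasing cumulative values", col)
            ok = False
    return ok

df = pd.read_parquet("household_raw.parquet")  # hypothetical input
df = add_diff_columns(df, validate_func=custom_validate, drop_unvalidated=False)
```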
It is important to manually check the logs when adding new datasets. In addition, one should check the stats for each mapped column using the get_mapped_data_stats() function to ensure that the data meets the expected statistical properties.
Filling data values for devices which report infrequently
Sometimes devices are simply not reporting because they are inactive. To ensure the column is not seen as mostly ‘missing’ data but rather as an unchanging value, the fill_down_infrequent_devices() function uses forward fill followed by backward fill to impute missing values for specified columns.
This approach may not be suitable if there are underlying issues with the device data, which would be amplified or ignored, especially if the data source is misbehaving. We recommend visually inspecting these columns / data before applying the function.
Time Series Consistency
At the moment, custom logic has been used for each data source to ensure the data provided is available with a fixed interval (typically 5 minutes). Example logic has been added to the file mapping_clock_helpers.py. It is alpha quality/pseudocode and should not be used at the moment. It should be tested and made available for use in a subsequent release.
After the data has a fixed (5-minute) interval, the ensure_intervals() function attempts to ensure a consistent time series by adding missing intervals and removing excess records for the same moment. This process can lead to incorrect data if the raw data does not naturally fit into the expected frequency.
Specific function notes
add_diff_columns()
Negative differences:
Assumes that negative differences in cumulative columns are erroneous and attempts to correct them by setting values to pd.NA. This assumption might not hold true if the meter readings naturally decrease or if there are legitimate zero jumps.
Error handling:
Logs errors when it encounters negative differences after corrections but does not address underlying issues in the data collection process.
Handling of large gaps without rates of change:
Does not consider the time elapsed between readings, which can lead to incorrect assumptions about the nature of the data.
When there are larger gaps in the diff data (missing intermediate readings), the function may incorrectly assume that the next available reading is a continuation of the meter reading when, in fact, a meter reset has taken place and no cumulative values have been collected for some time (bringing the reading back to the original reading level). If these errors are suspected, custom logic will need to be added, along with some assumption about the rate of change in the cumulative variable.
Known edge cases:
Edge cases where multiple negative dips occur consecutively or where there is a significant pause in data collection are not fully addressed.
fill_down_infrequent_devices()
Imputation Strategy:
Uses forward fill followed by backward fill to impute missing values, which can lead to inaccuracies if the device reports irregularly or if there are underlying issues with data collection.
Assumption of continuity:
Assumes that missing values should be filled based on the last available reading, which might not be appropriate for devices that report infrequently or have non-continuous usage patterns.
Final filling with zeroes:
Fills the column with zeros if all values are NA. This assumption may not be correct if the device simply was not reporting.
Column specificity:
Processes these variables by default: ElektriciteitsgebruikBoilervat, ElektriciteitsgebruikRadiator, ElektriciteitsgebruikBooster. It can handle other columns if they are passed as an argument. Some datasets may require different processing.
ensure_intervals()
Frequency assumption:
Assumes a fixed frequency for the time series data (e.g., 5-minute intervals); the only check it performs is on the number of expected records. If the raw data does not naturally fit this frequency, errors may occur, as additional pre-processing is required.
Data source issues:
Does not handle cases where the data source is misbehaving, such as reporting data at irregular intervals or with incorrect timestamps.
Adding vs. removing records:
Adds missing records to ensure consistent intervals but does not verify the correctness of the records it retains or removes. It will report inconsistencies with respect to the number of records expected.
Excess records:
Attempts to reduce excess records by performing a left merge, which might lead to data loss if the excess records contain valid information.
rearrange_model_columns()
Column order:
Ensures that the DataFrame columns are in a specific order with additional columns left at the end/right of the dataset.
Test development in VSCode (configuration)
We have developed tests mostly using VSCode. These files can be added to the .vscode folder to set up the test environment.
Example of settings.json to be able to run individual tests (and generate the test data fixture) in the VSCode debug tab for test development:
{
"python.testing.pytestArgs": [
"tests"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
Example of launch.json to be able to run individual tests (and generate the test data fixture) in the VSCode debug tab for test development:
{
"version": "0.2.0",
"configurations": [
{
"name": "Debug pytest with copy-data",
"type": "debugpy",
"request": "launch",
"module": "pytest",
"args": [
"tests/test_raw_data.py::test_raw_data_fixture",
"--copy-data"
],
"console": "integratedTerminal"
}
]
}