etdmap package
Submodules
etdmap.data_model module
- etdmap.data_model.load_etdmodel()[source]
Load ETD model from the package ETD model definition CSV file.
- Returns:
A DataFrame containing the ETD model data.
- Return type:
pandas.DataFrame
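Examples
A minimal usage sketch; the model definition ships with the package, so no arguments are needed:
>>> from etdmap.data_model import load_etdmodel
>>> model_df = load_etdmodel()
>>> model_df.head()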
etdmap.dataset_validators module
- etdmap.dataset_validators.create_validate_func_col(col: str, tresholds) → callable[source]
Create a validation function for a specific column based on given thresholds.
- Parameters:
col (str) – The name of the column to be validated.
tresholds (dict) – A dictionary containing the threshold values for the specified column.
- Returns:
A function that validates the specified column in a DataFrame.
- Return type:
callable
Notes
This function creates a validation function that checks if the specified column is cumulative and falls within the given thresholds.
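Examples
A sketch of creating and applying a column validator; the structure of the thresholds dictionary is defined elsewhere in the package, so the keys below are illustrative assumptions only:
>>> import pandas as pd
>>> from etdmap.dataset_validators import create_validate_func_col
>>> tresholds = {'Min': 0, 'Max': 100}  # hypothetical structure, for illustration
>>> validate_usage = create_validate_func_col('ElektriciteitsgebruikWarmtepomp', tresholds)
>>> df = pd.DataFrame({'ElektriciteitsgebruikWarmtepomp': [0.0, 1.5, 2.5]})
>>> validate_usage(df)  # validation result for the column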
- etdmap.dataset_validators.create_validate_func_outliers_neg_cum(col: str) → callable[source]
Create a validation function to check for outliers and negative cumulative differences in a specific column.
- Parameters:
col (str) – The name of the column to be validated.
- Returns:
A function that validates the specified column for outliers and negative cumulative differences in a DataFrame.
- Return type:
callable
Notes
This function creates a validation function that checks if all non-null values in the ‘validate_<col>Diff’ column are True.
- etdmap.dataset_validators.validate_approximately_one_year_of_records(df: DataFrame) → bool[source]
Validate that a DataFrame contains approximately one year of records.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
- Returns:
True if the DataFrame contains approximately one year of records, pd.NA if validation cannot be performed.
- Return type:
bool or pd.NA
Notes
This function checks if the difference between the maximum and minimum dates in the ‘ReadingDate’ column falls within a range of approximately one year, allowing for a specified jitter.
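Examples
A minimal sketch with a synthetic year of 5-minute readings:
>>> import pandas as pd
>>> from etdmap.dataset_validators import validate_approximately_one_year_of_records
>>> df = pd.DataFrame({
...     'ReadingDate': pd.date_range('2023-01-01', '2023-12-31 23:55', freq='5min')
... })
>>> validate_approximately_one_year_of_records(df)  # expected: True, within jitter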
- etdmap.dataset_validators.validate_column_exists(df: DataFrame, column_name: str) → bool[source]
Validate that a specific column exists in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame to be checked.
column_name (str) – The name of the column to check for existence.
- Returns:
True if the column exists in the DataFrame, False otherwise.
- Return type:
bool
Notes
This function simply checks if the specified column name is present in the DataFrame’s columns.
- etdmap.dataset_validators.validate_columns(df: DataFrame, columns: list, condition_func) → bool[source]
Validate a dataset by applying a given condition function to specified columns in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame to be validated.
columns (list) – A list of column names to be checked for validity.
condition_func (callable) – A function that takes a DataFrame as an argument and returns a boolean Series indicating which rows meet the validation criteria.
- Returns:
True if all valid rows meet the specified condition, pd.NA if no valid rows are found or if any of the specified columns do not exist in the DataFrame.
- Return type:
bool or pd.NA
Notes
This function checks if all specified columns exist in the DataFrame, applies the condition function to valid rows, and returns the overall validation result.
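Examples
A sketch with an illustrative non-negativity condition; the condition function and column names are assumptions for the example:
>>> import pandas as pd
>>> from etdmap.dataset_validators import validate_columns
>>> cols = ['ElektriciteitsgebruikWarmtepomp', 'Zon-opwekTotaal']
>>> df = pd.DataFrame({
...     'ElektriciteitsgebruikWarmtepomp': [0.5, 0.7, 0.9],
...     'Zon-opwekTotaal': [0.0, 1.0, 2.0],
... })
>>> non_negative = lambda d: (d[cols] >= 0).all(axis=1)  # boolean Series per row
>>> validate_columns(df, cols, non_negative)  # expected: True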
- etdmap.dataset_validators.validate_columns_exist(df: DataFrame) → bool[source]
Validate that all required columns for data analysis exist in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame to be checked.
- Returns:
True if all required columns exist in the DataFrame, False otherwise.
- Return type:
bool
Notes
This function checks if all columns specified in the ‘data_analysis_columns’ list are present in the DataFrame.
- etdmap.dataset_validators.validate_cumm_thesholds(df: DataFrame, col: str, thresholds: dict) → bool[source]
Validate cumulative thresholds for a specific column in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
col (str) – The name of the column to be validated.
thresholds (dict) – A dictionary containing the threshold values for the specified column.
- Returns:
True if the column values meet the specified thresholds, pd.NA if validation cannot be performed.
- Return type:
bool or pd.NA
Notes
This function checks if the differences between consecutive values in the specified column fall within the given thresholds.
- etdmap.dataset_validators.validate_cumulative_variable(df: DataFrame, column: str) → bool[source]
Validate that a cumulative variable in a DataFrame is non-decreasing.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
column (str) – The name of the column to be validated.
- Returns:
True if the cumulative variable is non-decreasing, pd.NA if validation cannot be performed.
- Return type:
bool or pd.NA
Notes
This function checks if the differences between consecutive values in the specified column are non-negative.
- etdmap.dataset_validators.validate_energiegebruik_warmteopwekker(df: DataFrame) → bool[source]
Validate the energy usage of the heat generator in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
- Returns:
True if the calculated energy usage falls within the specified range, pd.NA if validation cannot be performed.
- Return type:
bool or pd.NA
Notes
This function calculates the total energy usage of the heat generator by summing the electricity usage of the heat pump, booster, and boiler tank, and then validates if this total falls within a specified range.
- etdmap.dataset_validators.validate_monitoring_data_counts(df: DataFrame) → bool[source]
Validate the number of records in a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame to be validated.
- Returns:
True if the number of records falls within the specified range, pd.NA if the DataFrame is empty.
- Return type:
bool or pd.NA
Notes
This function checks if the number of records in the DataFrame is between 100,000 and 110,000.
- etdmap.dataset_validators.validate_no_readingdate_gap(df: DataFrame) → bool[source]
Validate that there are no gaps in the reading dates of a DataFrame.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
- Returns:
True if there are no gaps in the reading dates, False otherwise.
- Return type:
bool
Notes
This function checks if the time difference between consecutive reading dates is consistently 300 seconds.
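Examples
A minimal sketch showing the effect of a gap in a 5-minute series:
>>> import pandas as pd
>>> from etdmap.dataset_validators import validate_no_readingdate_gap
>>> df = pd.DataFrame({'ReadingDate': pd.date_range('2023-01-01', periods=6, freq='5min')})
>>> validate_no_readingdate_gap(df)  # expected: True, every step is 300 seconds
>>> validate_no_readingdate_gap(df.drop(index=3))  # expected: False, a 10-minute gap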
- etdmap.dataset_validators.validate_range(df: DataFrame, column: str, min_value: float, max_value: float) → bool[source]
Validate that a column in a DataFrame falls within a specified range over approximately one year.
- Parameters:
df (DataFrame) – The input DataFrame containing the data to be validated.
column (str) – The name of the column to be validated.
min_value (float) – The minimum acceptable value for the yearly difference.
max_value (float) – The maximum acceptable value for the yearly difference.
- Returns:
True if the column values fall within the specified range over approximately one year, pd.NA if validation cannot be performed.
- Return type:
bool or pd.NA
Notes
This function checks if the difference between the first and last non-null values in the specified column, over a period of approximately one year, falls within the given range.
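Examples
A sketch with a synthetic cumulative meter; the column name and bounds are illustrative:
>>> import numpy as np
>>> import pandas as pd
>>> from etdmap.dataset_validators import validate_range
>>> dates = pd.date_range('2023-01-01', '2023-12-31 23:55', freq='5min')
>>> df = pd.DataFrame({
...     'ReadingDate': dates,
...     'ElektriciteitsgebruikWarmtepomp': np.linspace(0.0, 2500.0, len(dates)),
... })
>>> validate_range(df, 'ElektriciteitsgebruikWarmtepomp', 1000.0, 5000.0)  # expected: True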
etdmap.index_helpers module
- etdmap.index_helpers.add_supplier_metadata_to_index(index_df: DataFrame, metadata_df: DataFrame, data_leverancier=None) → DataFrame[source]
Adds metadata columns to the index by matching on the HuisIdLeverancier column.
- Parameters:
index_df (pd.DataFrame) – The index DataFrame.
metadata_df (pd.DataFrame) – The metadata DataFrame to be added to the index.
data_leverancier (str, optional) – The data supplier name. Although the parameter defaults to None, a value must be provided.
- Returns:
The updated index DataFrame.
- Return type:
pd.DataFrame
- etdmap.index_helpers.get_bsv_metadata()[source]
Reads and returns metadata from the BSV metadata file, ensuring that all required columns are present.
- Returns:
A pandas DataFrame containing the BSV metadata with the specified columns.
- Return type:
DataFrame
- Raises:
ValueError – If any of the required columns are missing in the metadata file.
Notes
The function relies on the read_metadata utility to read the file and check for required columns.
The path to the BSV metadata file is obtained from etdmap.options.bsv_metadata_file.
The required columns are defined in the bsv_metadata_columns list.
- etdmap.index_helpers.get_household_id_pairs(index_df: DataFrame, data_folder_path: str, data_provider: str, list_files_func: callable) → list[source]
Generates pairs of HuisIdBSV and filenames for new and existing entries.
- Parameters:
index_df (pd.DataFrame) – The index DataFrame.
data_folder_path (str) – The path to the folder containing data files.
data_provider (str) – The name of the data provider.
list_files_func (callable) – A function that returns a dictionary mapping ids to the files in the data folder.
- Returns:
A list of tuples containing HuisIdBSV and filenames.
- Return type:
list
- etdmap.index_helpers.get_mapped_data(huis_id_bsv: int) → DataFrame[source]
Retrieves the mapped household data for a given BSV household ID from the Parquet file.
- Parameters:
huis_id_bsv (int) – The BSV household ID.
- Returns:
The DataFrame containing the household data.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the specified file does not exist at the expected path.
- etdmap.index_helpers.get_mapped_file_path(huis_id_bsv: int) → str[source]
Generates the file path for the mapped household data based on the BSV household ID.
- Parameters:
huis_id_bsv (int) – The BSV household ID.
- Returns:
The full file path to the mapped household data in Parquet format.
- Return type:
str
- etdmap.index_helpers.read_index() → tuple[DataFrame, str][source]
Reads the index parquet file from the specified folder path.
- Returns:
- A tuple containing:
DataFrame: The DataFrame of the index.
str: The path to the index file.
- Return type:
tuple
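Examples
A minimal sketch, assuming the index folder has been configured via etdmap.options beforehand:
>>> from etdmap import read_index
>>> index_df, index_path = read_index()
>>> index_path  # location of the index parquet file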
- etdmap.index_helpers.read_metadata(metadata_file: str, required_columns=None) → DataFrame[source]
Read metadata from an Excel file and check for the presence of required columns.
- Parameters:
metadata_file (str) – The path to the Excel file containing the metadata for a data source.
required_columns (list, optional) – A list of column names that must be present in the metadata. Defaults to [‘HuisIdLeverancier’].
- Returns:
A DataFrame containing the metadata from the specified sheet.
- Return type:
pd.DataFrame
- Raises:
Exception – If not all required columns are found in the metadata file.
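Examples
A sketch with a placeholder file path; requiring ‘Meenemen’ in addition to the default column is an illustrative choice:
>>> from etdmap.index_helpers import read_metadata
>>> metadata_df = read_metadata(
...     'supplier_metadata.xlsx',
...     required_columns=['HuisIdLeverancier', 'Meenemen'],
... )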
- etdmap.index_helpers.save_index_to_parquet(index_df: DataFrame) → None[source]
Save the index DataFrame to a Parquet file.
- Parameters:
index_df (pd.DataFrame) – The DataFrame containing the index data.
- Return type:
None
Notes
This function saves the provided DataFrame to a Parquet file.
- etdmap.index_helpers.set_metadata_dtypes(metadata_df: DataFrame, strict: bool = False) → DataFrame[source]
Set the data types of columns in the index or metadata DataFrame based on metadata_dtypes.
- Parameters:
metadata_df (pandas.DataFrame) – The DataFrame containing the metadata.
strict (bool, optional) – If True, raises an error if a column specified in metadata_dtypes is not found in the DataFrame. Default is False.
- Returns:
The DataFrame with updated column data types.
- Return type:
pandas.DataFrame
- etdmap.index_helpers.update_index(index_df: DataFrame, new_entry: dict, data_provider: str) → DataFrame[source]
Update the index with new entries and recalculate or add flag columns for dataset validators.
- Parameters:
index_df (pd.DataFrame) – The index DataFrame.
new_entry (dict) – The new entry to be added or updated in the index.
data_provider (str) – The name of the data provider.
- Returns:
The updated index DataFrame.
- Return type:
pd.DataFrame
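Examples
A sketch; the exact keys expected in new_entry are not documented here, so the fields below are assumptions based on the rest of this module:
>>> from etdmap import read_index, update_index
>>> index_df, _ = read_index()
>>> new_entry = {'HuisIdBSV': 1, 'HuisIdLeverancier': 'A-001'}  # hypothetical fields
>>> index_df = update_index(index_df, new_entry, data_provider='ExampleSupplier')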
- etdmap.index_helpers.update_meenemen() → DataFrame[source]
Updates the index DataFrame to include information about which households should be included in the “Meenemen” column based on BSV metadata.
This function performs the following steps:
1. Logs an informational message indicating the start of the update process.
2. Reads the current index DataFrame and its file path using the read_index function.
3. Removes any existing ‘Meenemen’ column from the index DataFrame.
4. Retrieves the BSV metadata DataFrame using the get_bsv_metadata function.
5. Extracts the ‘HuisIdBSV’ and ‘Meenemen’ columns from the BSV metadata DataFrame.
6. Merges the extracted BSV ‘Meenemen’ data with the index DataFrame on the ‘HuisIdBSV’ column.
7. Saves the updated index DataFrame back to its original file path in Parquet format using the PyArrow engine.
8. Returns the updated index DataFrame.
- Returns:
The updated index DataFrame with the new “Meenemen” information included.
- Return type:
pd.DataFrame
- etdmap.index_helpers.update_meta_validators(index_df)[source]
Updates the index DataFrame with a new column ‘validate_cumulative_diff_ok’ that indicates whether all cumulative difference columns in the DataFrame are valid.
- Parameters:
index_df (pandas.DataFrame) – The index DataFrame to update.
- Returns:
The updated DataFrame with an additional column ‘validate_cumulative_diff_ok’.
- Return type:
pandas.DataFrame
Notes
The function constructs a list of column names based on the global variable cumulative_columns.
It checks if all these constructed column names exist in the input DataFrame.
If they do, it creates a new boolean column ‘validate_cumulative_diff_ok’ where each entry is True if all corresponding row-wise values in the cumulative difference columns are True (indicating validity).
If any of the expected columns are missing, it fills the ‘validate_cumulative_diff_ok’ column with pd.NA.
etdmap.mapping_clock_helpers module
- etdmap.mapping_clock_helpers.align_and_merge_dataframes(aligned_dfs: List[DataFrame], timestamp_col: str = 'aligned_timestamp', use_first_as_main: bool = False, freq: int = 300) → DataFrame[source]
Aligns multiple dataframes by adjusting their first timestamps to minimize overall deviation, reports the shifts, and merges them into a single dataframe.
- Parameters:
aligned_dfs (List[pd.DataFrame]) – List of aligned dataframes to be processed.
timestamp_col (str, optional) – Name of the aligned timestamp column. Default is ‘aligned_timestamp’.
use_first_as_main (bool, optional) – If True, aligns all dataframes to the first one. Default is False.
freq (int, optional) – Expected frequency in seconds between timestamps. Default is 300.
- Returns:
Merged dataframe with aligned timestamps.
- Return type:
pd.DataFrame
Notes
This function is in ALPHA status (untested code, use at your own risk!).
- Raises:
ValueError – If any dataframe has an inconsistent frequency.
- etdmap.mapping_clock_helpers.align_timestamps(df: DataFrame, timestamp_col: str, start_time: Timestamp, freq: int, tolerance: int = 10, method: str = 'nearest', cumulative_columns: List[str] = None) → DataFrame[source]
ALPHA status (untested code, use at your own risk!). Aligns timestamps in a dataframe based on the specified method.
- Parameters:
df (pd.DataFrame) – Input dataframe.
timestamp_col (str) – Name of the timestamp column.
start_time (pd.Timestamp) – Start time for alignment.
freq (int) – Frequency in seconds.
tolerance (int, optional) – Tolerance in seconds for alignment (default is 10).
method (str, optional) – Alignment method (‘nearest’ or ‘interpolation’, default is ‘nearest’).
cumulative_columns (List[str], optional) – List of cumulative column names.
- Returns:
Dataframe with aligned timestamps and adjusted values.
- Return type:
pd.DataFrame
Notes
For cumulative columns, uses custom interpolation within tolerance.
For non-cumulative columns, uses nearest value within tolerance.
Preserves pd.NA for values outside of tolerance range.
Ensures consistent frequency and handles edge cases.
Examples
>>> import pandas as pd
>>>
>>> # Create a sample dataframe
>>> df = pd.DataFrame({
...     'timestamp': pd.date_range('2023-01-01', periods=5, freq='7min'),
...     'value': [1, 2, 3, 4, 5],
...     'cumulative': [10, 20, 30, 40, 50]
... })
>>>
>>> # Set alignment parameters
>>> start_time = pd.Timestamp('2023-01-01')
>>> freq = 300  # 5 minutes in seconds
>>>
>>> # Align timestamps
>>> aligned_df = align_timestamps(
...     df,
...     'timestamp',
...     start_time,
...     freq,
...     tolerance=60,
...     method='interpolation',
...     cumulative_columns=['cumulative']
... )
>>>
>>> print(aligned_df)
This will produce an aligned dataframe with 5-minute intervals, interpolating the cumulative column and using nearest neighbor for the non-cumulative column within a 60-second tolerance.
- etdmap.mapping_clock_helpers.determine_dynamic_clocks(dataframes: List[DataFrame], timestamp_col: str, freq: int = 300) → Dict[str, Timestamp][source]
ALPHA status (untested code, use at your own risk!). Determines the ideal dynamic clock start for each device and across all devices.
- Parameters:
dataframes (list of pandas.DataFrame) – List of dataframes, one per device.
timestamp_col (str) – Name of the timestamp column.
freq (int, optional) – Frequency in seconds (default is 300 for 5-minute intervals).
- Returns:
Dictionary with ideal clock starts for each device and overall.
- Return type:
dict of str: pandas.Timestamp
- etdmap.mapping_clock_helpers.interpolate_cumulative(series: Series, target_timestamps: DatetimeIndex, tolerance: Timedelta) → Series[source]
ALPHA status (untested code, use at your own risk!). Interpolate cumulative data within tolerance, preserving pd.NA outside of tolerance ranges.
- Parameters:
series (pd.Series) – The original cumulative data series.
target_timestamps (pd.DatetimeIndex) – The target timestamps for alignment.
tolerance (pd.Timedelta) – The tolerance for considering nearby values.
- Returns:
Interpolated series aligned with target_timestamps.
- Return type:
pd.Series
Notes
Interpolates only when there are at least two values within the tolerance range.
Preserves pd.NA for timestamps without nearby values.
Raises an error if decreasing cumulative values are detected.
- etdmap.mapping_clock_helpers.report_tolerance_impact(dataframes: List[DataFrame], timestamp_col: str, ideal_starts: Dict[str, Timestamp], tolerances: List[int] = [10, 30, 150], freq: int = 300) → Dict[str, Dict[str, Dict[int, int]]][source]
ALPHA status (untested code, use at your own risk!). Reports on how different tolerances affect the number of values per column.
- Parameters:
dataframes (list of pandas.DataFrame) – List of dataframes, one per device.
timestamp_col (str) – Name of the timestamp column.
ideal_starts (dict of str: pandas.Timestamp) – Dictionary of ideal starts from determine_dynamic_clocks.
tolerances (list of int, optional) – List of tolerances in seconds to check (default is [10, 30, 150]).
freq (int, optional) – Frequency in seconds (default is 300 for 5-minute intervals).
- Returns:
Nested dictionary with counts for each device, column, and tolerance.
- Return type:
dict of str: dict of str: dict of int: int
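Examples
A sketch combining determine_dynamic_clocks with this reporter; both helpers are in ALPHA status, so treat this as illustrative only:
>>> import pandas as pd
>>> from etdmap.mapping_clock_helpers import determine_dynamic_clocks, report_tolerance_impact
>>> dfs = [
...     pd.DataFrame({'ts': pd.date_range('2023-01-01 00:00:07', periods=12, freq='5min'),
...                   'value': range(12)}),
...     pd.DataFrame({'ts': pd.date_range('2023-01-01 00:00:42', periods=12, freq='5min'),
...                   'value': range(12)}),
... ]
>>> ideal_starts = determine_dynamic_clocks(dfs, timestamp_col='ts', freq=300)
>>> impact = report_tolerance_impact(dfs, 'ts', ideal_starts, tolerances=[10, 30, 150], freq=300)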
etdmap.mapping_helpers module
- etdmap.mapping_helpers.add_diff_columns(data: DataFrame, id_column: str = None, validate_func=<function validate_cumulative_variables>, context: str = '', drop_unvalidated: bool = False) → DataFrame[source]
Add difference columns for cumulative variables and handle some data inconsistencies.
This function calculates the difference between consecutive readings for cumulative columns, validates the data, and handles various inconsistencies such as negative differences and unexpected zeros.
- Parameters:
data (pd.DataFrame or pd.core.groupby.DataFrameGroupBy) – The input data, either as a DataFrame or a GroupBy object.
id_column (str, optional) – The name of the column to use for grouping if data is a DataFrame, by default None.
validate_func (callable, optional) – A function to validate the data, by default validate_cumulative_variables.
context (str, optional) – A string to prepend to log messages for context, by default ‘’.
drop_unvalidated (bool, optional) – If True, drop groups that fail validation; if False, keep them with warnings, by default False.
- Returns:
A DataFrame with added difference columns for cumulative variables.
- Return type:
pd.DataFrame
- Raises:
TypeError – If the input data is neither a DataFrame nor a GroupBy object.
Notes
The function uses the global variable cumulative_columns to determine which columns to process.
It handles various data inconsistencies:
- Removes unexpected zeros between valid readings.
- Handles cases where the meter appears to have been reset.
- Removes data after a negative difference if no subsequent increases are found.
Extensive logging is used to document the data cleaning process.
If the meter shows a negative dip with no subsequent increases, we assume the meter is broken and ignore the remaining values by setting them to pd.NA.
If the meter has a negative dip and then simply jumps back up to at least the last value before the dip, we assume there is a single bad value to remove. This case does not consider time, so it may miss edge cases: for example, the meter may not have jumped back up, but rather so much time passed that the next reading is legitimately much higher. This may be addressed in the future but requires assumptions about the rate of growth.
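Examples
A sketch with two households; this assumes the column used is listed in the package’s cumulative_columns:
>>> import pandas as pd
>>> from etdmap.mapping_helpers import add_diff_columns
>>> df = pd.DataFrame({
...     'HuisIdBSV': [1, 1, 1, 2, 2, 2],
...     'ReadingDate': list(pd.date_range('2023-01-01', periods=3, freq='5min')) * 2,
...     'ElektriciteitsgebruikWarmtepomp': [10.0, 12.0, 15.0, 5.0, 5.5, 6.0],
... })
>>> result = add_diff_columns(df, id_column='HuisIdBSV', context='example')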
- etdmap.mapping_helpers.collect_column_stats(identifier, column_data)[source]
Collect summary statistics for a given column of data.
- Parameters:
identifier (str or int) – The identifier for the dataset.
column_data (pd.Series) – The data in the column to analyze.
- Returns:
- A dictionary containing summary statistics for the column. The keys are:
‘Identifier’: The identifier for the dataset.
‘column’: The name of the column.
‘type’: The data type of the column.
‘count’: The number of non-null values in the column.
‘missing’: The number of missing values in the column.
‘errors’: The number of errors (NA) in the column.
‘min’: The minimum value in the column, if applicable.
‘max’: The maximum value in the column, if applicable.
‘mean’: The mean value in the column, if applicable.
‘median’: The median value in the column, if applicable.
‘iqr’: The interquartile range (IQR) of the column, if applicable.
‘quantile_25’: The 25th percentile value in the column, if applicable.
‘quantile_75’: The 75th percentile value in the column, if applicable.
‘top5’: A dictionary with the top 5 most frequent values and their counts, if applicable.
- Return type:
dict
Notes
This function handles different data types (numeric, boolean, datetime, object) and computes relevant statistics accordingly.
At the moment there is no effective difference between missing and errors.
- etdmap.mapping_helpers.collect_mapped_data_stats(huis_id_bsv)[source]
Collect statistics for each column in the DataFrame corresponding to a specific HuisIdBSV.
This function retrieves data for a given huis_id_bsv, processes it, and collects summary statistics for each column. It logs errors if any issues occur during processing.
- Parameters:
huis_id_bsv (str or int) – The identifier for the household to process.
- Returns:
A list of dictionaries, where each dictionary contains summary statistics for a column in the DataFrame. Each dictionary has keys ‘column_name’, ‘mean’, ‘std’, ‘min’, and ‘max’.
- Return type:
list of dict
Notes
The function uses get_mapped_data to retrieve the data for the given huis_id_bsv.
It logs errors if there are issues retrieving or processing the data.
- etdmap.mapping_helpers.ensure_intervals(df: DataFrame, date_column: str = 'ReadingDate', freq='5min') → DataFrame[source]
Ensure that the DataFrame has a consistent number of records at the expected time intervals.
This function checks if the input DataFrame has the expected number of records based on its date range and the specified frequency. If not, it adds missing intervals or removes excess records.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing the time series data.
date_column (str, optional) – The name of the column containing the datetime information, by default ‘ReadingDate’.
freq (str, optional) – The expected frequency of the time series, by default ‘5min’.
- Returns:
A DataFrame with consistent time intervals.
- Return type:
pd.DataFrame
Notes
If the number of records matches the expected number, the function returns the input DataFrame unchanged.
If there are fewer records than expected, the function adds missing intervals.
If there are more records than expected, the function performs a left merge to reduce the number of records.
The function uses logging to inform about the actions taken.
Warning
If there are more records than expected, this might indicate issues with the data source. The function will log an error in this case.
This function assumes that an effort has already been made to prepare the data source at the expected intervals.
If the raw data is more frequent, or if records arrive at a variable or different frequency, it must first be processed to match the given interval.
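Examples
A minimal sketch where one 5-minute interval is missing and gets added back:
>>> import pandas as pd
>>> from etdmap.mapping_helpers import ensure_intervals
>>> dates = pd.date_range('2023-01-01', periods=6, freq='5min').delete(2)
>>> df = pd.DataFrame({'ReadingDate': dates, 'value': range(5)})
>>> fixed = ensure_intervals(df, date_column='ReadingDate', freq='5min')
>>> len(fixed)  # expected: 6, the missing interval is restored with NA values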
- etdmap.mapping_helpers.fill_down_infrequent_devices(df: DataFrame, columns=('ElektriciteitsgebruikBoilervat', 'ElektriciteitsgebruikRadiator', 'ElektriciteitsgebruikBooster'))[source]
Fill down (forward fill) and then up (backward fill) values for specified columns.
This function is used to impute missing values for devices that report infrequently. It first forward fills (ffill) the values, then backward fills (bfill) any remaining NAs, and finally replaces any remaining NAs with 0.0.
- Parameters:
df (pd.DataFrame) – The input DataFrame containing the device data.
columns (tuple of str, optional) – The names of the columns to fill. Default is (‘ElektriciteitsgebruikBoilervat’, ‘ElektriciteitsgebruikRadiator’, ‘ElektriciteitsgebruikBooster’).
- Returns:
The input DataFrame with the specified columns filled.
- Return type:
pd.DataFrame
Notes
This function may be problematic if the data source or devices are misbehaving, as the imputation will still be performed.
The imputation order is: forward fill, backward fill, then fill remaining NAs with 0.0.
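Examples
A minimal sketch showing the ffill, then bfill, then zero-fill imputation on one column:
>>> import pandas as pd
>>> from etdmap.mapping_helpers import fill_down_infrequent_devices
>>> df = pd.DataFrame({'ElektriciteitsgebruikBoilervat': [pd.NA, 1.0, pd.NA, pd.NA, 2.0, pd.NA]})
>>> filled = fill_down_infrequent_devices(df, columns=('ElektriciteitsgebruikBoilervat',))
>>> # expected values: [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]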
- etdmap.mapping_helpers.get_mapped_data_stats(multi=False, max_workers=2)[source]
Collect and aggregate statistics for all columns in the DataFrame corresponding to each HuisIdBSV.
- Parameters:
multi (bool, optional) – If True, use multiprocessing to collect stats. Default is False.
max_workers (int, optional) – The maximum number of workers to use for multiprocessing. Default is 2.
- Returns:
A DataFrame containing the aggregated statistics for each column in the DataFrame corresponding to each HuisIdBSV. Each row represents a column from a specific HuisIdBSV and contains summary statistics.
- Return type:
pd.DataFrame
Notes
The function uses read_index to retrieve the index of households.
It logs errors if there are issues retrieving or processing the data.
- etdmap.mapping_helpers.get_raw_data_stats(raw_data_folder_path, multi=False, max_workers=2)[source]
Collect and aggregate statistics for all columns in the DataFrame corresponding to each raw data file in a folder.
- Parameters:
raw_data_folder_path (str) – The path to the folder containing the raw data files.
multi (bool, optional) – If True, use multiprocessing to collect stats. Default is False.
max_workers (int, optional) – The maximum number of workers to use for multiprocessing. Default is 2.
- Returns:
A DataFrame containing the aggregated statistics for each column in the DataFrame corresponding to each file name. Each row represents a column from a specific file and contains summary statistics.
- Return type:
pd.DataFrame
Notes
Only Parquet files are supported.
It logs errors if there are issues retrieving or processing the data.
- etdmap.mapping_helpers.rearrange_model_columns(household_df: DataFrame, add_columns: bool = True, context: str = '') → DataFrame[source]
Rearrange and validate columns in a DataFrame according to a predefined model.
This function performs the following operations:
1. Validates and coerces column types to match expected types.
2. Rearranges columns to match the order defined in model_column_order.
3. Keeps original columns that are not included in the ETD data model at the end of the dataframe.
4. Optionally adds missing columns with NA values.
- Parameters:
household_df (pd.DataFrame) – The input DataFrame containing household data.
add_columns (bool, optional) – If True, add missing columns from model_column_order to the DataFrame. If False, only keep columns that are in both the DataFrame and model_column_order. Default is True.
context (str, optional) – A string to prepend to log messages for context. If provided, a colon and space will be appended to it. Default is an empty string.
- Returns:
A new DataFrame with rearranged and validated columns.
- Return type:
pd.DataFrame
- Raises:
ValueError – If type coercion fails for any column.
Notes
The function uses the global variables model_column_type and model_column_order.
Columns not in model_column_order are appended at the end of the DataFrame.
When coercing types, any values that fail to convert are replaced with pd.NA.
Logging is used to warn about type mismatches and missing columns.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': ['x', 'y', 'z']})
>>> rearranged_df = rearrange_model_columns(df, add_columns=True, context='Example')
- etdmap.mapping_helpers.validate_cumulative_variables(group: DataFrame, timedelta=Timedelta('0 days 01:00:00'), available=0.9, context='') → bool[source]
Validate cumulative variables in a DataFrame group.
This function performs several checks on cumulative columns:
1. Checks for gaps greater than the specified timedelta.
2. Checks for decreasing cumulative values.
3. Checks for unexpected zero values.
4. Checks if at least 90% of the values are not NA.
- Parameters:
group (pd.DataFrame) – The DataFrame group to validate.
timedelta (pd.Timedelta, optional) – The maximum allowed time gap between readings, by default 1 hour.
available (float, optional) – The minimum fraction of non-NA values required, by default 0.9 (90%).
context (str, optional) – A string to prepend to log messages for context, by default ‘’.
- Returns:
A dictionary with boolean values indicating the results of various checks:
- ‘column_found’: True if all expected columns are present.
- ‘max_delta_allowed’: True if no gaps exceed the specified timedelta.
- ‘no_negative_diff’: True if no decreasing cumulative values are found.
- ‘no_unexpected_zero’: True if no unexpected zero values are found.
- ‘enough_values’: True if at least 90% of values are non-NA.
- Return type:
dict
Notes
The function uses the global variable cumulative_columns to determine which columns to check.
Logging is used to warn about any issues found during validation.
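Examples
A sketch of checking one household’s group; which columns are checked comes from the package’s cumulative_columns list, so the column below is an assumption:
>>> import pandas as pd
>>> from etdmap.mapping_helpers import validate_cumulative_variables
>>> group = pd.DataFrame({
...     'ReadingDate': pd.date_range('2023-01-01', periods=4, freq='5min'),
...     'ElektriciteitsgebruikWarmtepomp': [1.0, 2.0, 3.0, 4.0],
... })
>>> checks = validate_cumulative_variables(group, timedelta=pd.Timedelta(hours=1), context='huis 1')
>>> [name for name, ok in checks.items() if not ok]  # names of any failed checks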
etdmap.record_validators module
- etdmap.record_validators.condition_func_threshold(col: str) → Callable[[DataFrame], bool][source]
Define a condition function for a column based on the min and max values from the thresholds CSV.
- Parameters:
col (str) – Column name for which the function holds.
- Returns:
Condition function for the specified column, or None if the column is not in the thresholds dictionary.
- Return type:
function or None
- etdmap.record_validators.create_validate_cumulative(cumulative_columns: list, record_flag_conditions: dict) → None[source]
Create validation functions for cumulative columns and add them to the record_flag_conditions dictionary.
- Parameters:
cumulative_columns (list) – List of column names for cumulative data.
record_flag_conditions (dict) – Dictionary to store the validation functions.
- Return type:
None
- etdmap.record_validators.create_validate_momentaan(columns_5min_momentaan: list, record_flag_conditions: dict) → None[source]
Create validation functions for momentary columns and add them to the record_flag_conditions dictionary.
- Parameters:
columns_5min_momentaan (list) – List of column names for momentary data.
record_flag_conditions (dict) – Dictionary to store the validation functions.
- Return type:
None
- etdmap.record_validators.get_columns_threshold_validator(cols)[source]
Get a validator function for thresholds of specified columns.
- Parameters:
cols (list) – List of column names to validate.
- Returns:
A function that validates thresholds for the specified columns.
- Return type:
function
- etdmap.record_validators.validate_300sec(df: DataFrame) → Series[source]
Validate that the time difference between consecutive ReadingDate values is 300 seconds.
- Parameters:
df (DataFrame) – The DataFrame to validate.
- Returns:
A boolean Series indicating which rows have a 300-second difference from the previous row.
- Return type:
Series
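Examples
A minimal sketch; note the first row has no predecessor to compare against:
>>> import pandas as pd
>>> from etdmap.record_validators import validate_300sec
>>> df = pd.DataFrame({'ReadingDate': pd.date_range('2023-01-01', periods=4, freq='5min')})
>>> flags = validate_300sec(df)  # boolean Series, one value per row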
- etdmap.record_validators.validate_columns(df: DataFrame, columns: list, condition_func) → Series[source]
Helper function to validate columns with a given condition function.
- Parameters:
df (DataFrame) – The DataFrame containing the columns to validate.
columns (list) – List of column names to validate.
condition_func (function) – The condition function to apply for validation.
- Returns:
A boolean Series indicating which rows meet the condition.
- Return type:
Series
- etdmap.record_validators.validate_elektriciteitgebruik(df: DataFrame) → Series[source]
Validate that ElektriciteitsgebruikHuishoudelijk is less than or equal to the sum of Zon-opwekTotaal, ElektriciteitNetgebruikHoog, and ElektriciteitNetgebruikLaag.
- Parameters:
df (DataFrame) – The DataFrame to validate.
- Returns:
A boolean Series indicating which rows meet the condition.
- Return type:
Series
- etdmap.record_validators.validate_not_outliers(x: DataFrame)[source]
Validate that values in the DataFrame are not outliers.
- Parameters:
x (pd.DataFrame) – A DataFrame with one column to check for outliers.
- Returns:
A boolean Series indicating which values are not outliers.
- Return type:
Series
- Raises:
ValueError – If the input is not a DataFrame with one column.
TypeError – If the column contains non-numeric values.
- etdmap.record_validators.validate_reading_date_uniek(df: DataFrame) → Series[source]
Validate that ReadingDate column has only unique values.
- Parameters:
df (DataFrame) – The DataFrame to validate.
- Returns:
A boolean Series indicating which rows have unique ReadingDate values.
- Return type:
Series
- etdmap.record_validators.validate_thresholds_combined(df: DataFrame) → Series[source]
Per row, determine whether at least one value falls outside of the thresholds.
- Parameters:
df (pd.DataFrame) – Dataframe with columns to be checked.
- Returns:
A boolean Series, True when at least one value is outside the bounds.
- Return type:
pd.Series
- etdmap.record_validators.validate_warmteproductie(df: DataFrame) → Series[source]
Validate that WarmteproductieWarmtepomp is greater than or equal to WarmteproductieWarmTapwater.
- Parameters:
df (DataFrame) – The DataFrame to validate.
- Returns:
A boolean Series indicating which rows meet the condition.
- Return type:
Series
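Examples
A minimal sketch based on the documented condition:
>>> import pandas as pd
>>> from etdmap.record_validators import validate_warmteproductie
>>> df = pd.DataFrame({
...     'WarmteproductieWarmtepomp': [3.0, 2.0, 5.0],
...     'WarmteproductieWarmTapwater': [1.0, 2.5, 4.0],
... })
>>> validate_warmteproductie(df)  # expected: True, False, True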
Module contents
- etdmap.read_index() → tuple[DataFrame, str][source]
Reads the index parquet file from the specified folder path.
- Returns:
- A tuple containing:
DataFrame: The DataFrame of the index.
str: The path to the index file.
- Return type:
tuple
- etdmap.read_metadata(metadata_file: str, required_columns=None) → DataFrame[source]
Read metadata from an Excel file and check for the presence of required columns.
- Parameters:
metadata_file (str) – The path to the Excel file containing the metadata for a data source.
required_columns (list, optional) – A list of column names that must be present in the metadata. Defaults to [‘HuisIdLeverancier’].
- Returns:
A DataFrame containing the metadata from the specified sheet.
- Return type:
pd.DataFrame
- Raises:
Exception – If not all required columns are found in the metadata file.
- etdmap.update_index(index_df: DataFrame, new_entry: dict, data_provider: str) → DataFrame[source]
Update the index with new entries and recalculate or add flag columns for dataset validators.
- Parameters:
index_df (pd.DataFrame) – The index DataFrame.
new_entry (dict) – The new entry to be added or updated in the index.
data_provider (str) – The name of the data provider.
- Returns:
The updated index DataFrame.
- Return type:
pd.DataFrame