Clean

This module provides tools to clean data in preparation for modeling. It contains functions to convert data types, units of measurement, and time values, reduce multicollinearity, and split outliers from a dataset.

Functions:
datawaza.clean.convert_data_values(df: DataFrame, cols: List[str], target_unit: str = 'MB', show_results: bool = False, inplace: bool = False, decimal: int = 4, conversion_dict: Dict[str, int] | None = None) → DataFrame | None

Convert mixed data values (ex: GB, MB, KB) to a common unit of measurement.

This function converts values in the specified columns of the input DataFrame to the desired target unit. If inplace is set to True, the conversion is done in place, modifying the original DataFrame. If inplace is False (default), a new DataFrame with the converted values is returned. The string suffix is dropped and the column is converted to a float. It handles inconsistent suffix strings, with or without spaces after the numbers (ex: ‘10GB’, ‘10 Gb’). A variety of spelling options are supported (ex: ‘GB’, ‘Gigabytes’), but you can pass a custom dictionary as conversion_dict if desired. To display a summary of the changes made, set show_results to True.

Use this to clean up messy data that has a variety of units of measurement appended as text strings to the numeric values. The result will be columns with a common unit of measurement as floats (with no text suffixes).

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the columns to be converted.

  • cols (List[str]) – List of column names in df to apply the conversion.

  • target_unit (str, optional) – The target unit for the conversion. Default is ‘MB’. Possible values are: ‘B’, ‘KB’, ‘MB’, ‘GB’, ‘TB’, ‘PB’, ‘EB’.

  • show_results (bool, optional) – If set to True, prints the before-and-after conversion values for each cell. Default is False.

  • inplace (bool, optional) – If set to True, the conversion is done in place, modifying the original DataFrame. If False (default), a new DataFrame with the converted values is returned.

  • decimal (int, optional) – The number of decimal places to show when show_results is set to True. Default is 4.

  • conversion_dict (Dict[str, int], optional) – A custom dictionary mapping units to their corresponding powers of 2. If provided, it will override the default unit_conversion dictionary. Default is None. See Example 3 below for a sketch of a custom dictionary.

Returns:

If inplace is False (default), returns a new DataFrame with the specified columns converted to the target unit. If inplace is True, returns None as the original DataFrame is modified in place.

Return type:

pd.DataFrame or None

Examples

Prepare some sloppy data for the examples:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': ['5 gb', '500 kb', '2tb'],
...     'B': ['3GB', '0.5 MB', '200KB'],
...     'C': ['10 Gigabytes', '250 kilo', '1 PB']
... })
>>> cols = ['A', 'B', 'C']

Example 1: Convert values in specified columns to GB and assign to a new df:

>>> df_converted = convert_data_values(df, cols, target_unit='GB')
>>> df_converted
             A         B             C
0     5.000000  3.000000  1.000000e+01
1     0.000477  0.000488  2.384186e-04
2  2048.000000  0.000191  1.048576e+06

Example 2: Convert data values to MB in place, modifying the existing df, and show a summary of the changes:

>>> convert_data_values(df, cols, target_unit='MB', inplace=True,
... show_results=True, decimal=8)
Original: 5 gb -> Converted: 5120.00000000 MB
Original: 500 kb -> Converted: 0.48828125 MB
Original: 2tb -> Converted: 2097152.00000000 MB
Original: 3GB -> Converted: 3072.00000000 MB
Original: 0.5 MB -> Converted: 0.50000000 MB
Original: 200KB -> Converted: 0.19531250 MB
Original: 10 Gigabytes -> Converted: 10240.00000000 MB
Original: 250 kilo -> Converted: 0.24414062 MB
Original: 1 PB -> Converted: 1073741824.00000000 MB
>>> df
              A            B             C
0  5.120000e+03  3072.000000  1.024000e+04
1  4.882812e-01     0.500000  2.441406e-01
2  2.097152e+06     0.195312  1.073742e+09
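
Example 3: Convert values using a custom conversion_dict. This is a sketch rather than library output: the dictionary keys shown here (each unit mapped to its power of 2, so ‘GB’ means 2**30 bytes) and the printed result are assumptions based on the parameter description above:

>>> custom_dict = {'B': 0, 'KB': 10, 'MB': 20, 'GB': 30}
>>> df_custom = pd.DataFrame({'size': ['1 GB', '512 MB']})
>>> convert_data_values(df_custom, ['size'], target_unit='MB',
... conversion_dict=custom_dict)
     size
0  1024.0
1   512.0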
datawaza.clean.convert_dtypes(df: DataFrame, cols: List[str], target_dtype: type | str, show_results: bool = False, inplace: bool = True) → DataFrame | None

Convert specified columns in a DataFrame to the desired data type.

This function converts the data type of the specified columns in the input DataFrame to the desired target data type. It supports both base Python data types (e.g., int, float, str) and Pandas-specific data types (e.g., ‘int64’, ‘float64’, ‘object’, ‘bool’, ‘datetime64’, ‘timedelta[ns]’, ‘category’). If inplace is set to True (default), the conversion is done in place, modifying the original DataFrame. If inplace is False, a new DataFrame with the converted columns is returned. If show_results is set to True, it will print the results of each successful conversion and any error messages for columns that could not be converted.

Use this function when you need to convert the data types of specific columns in a DataFrame to a consistent target data type, especially when dealing with multiple columns at once; the errors it reports also help identify columns that require further data cleaning.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the columns to be converted.

  • cols (List[str]) – List of column names in df to be converted.

  • target_dtype (type or str) – The desired data type for the conversion. Can be a base Python data type (e.g., int, float, str) or a string representation of a Pandas data type (e.g., ‘int64’, ‘float64’, ‘object’, ‘bool’, ‘datetime64’, ‘timedelta[ns]’, ‘category’).

  • show_results (bool, optional) – If True, will print the results of each successful conversion and any error messages for columns that could not be converted. Default is False.

  • inplace (bool, optional) – If True (default), the conversion is done in place, modifying the original DataFrame. If False, a new DataFrame with the converted columns is returned.

Returns:

If inplace is False, returns a new DataFrame with the specified columns converted to the target data type. If inplace is True, returns None as the original DataFrame is modified in place.

Return type:

pd.DataFrame or None

Examples

Prepare data for examples:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, 3],
...     'B': ['0', '23.4 MB', '3.71 GB'],
...     'C': ['4.5', '3.7', '12.15'],
...     'D': [True, False, True],
...     'E': ['Yes', 'No', ''],
...     'F': ['Low', 'Medium', 'High']
... })
>>> num_columns = ['A', 'B', 'C']
>>> cat_columns = ['D', 'E', 'F']
>>> df.dtypes
A     int64
B    object
C    object
D      bool
E    object
F    object
dtype: object

Example 1: Convert columns ‘A’, ‘B’, and ‘C’ to float; an error is printed for any column that cannot be converted:

>>> convert_dtypes(df, num_columns, 'float')
Error converting column: B (Current dtype: object). Error message: could not convert string to float: '23.4 MB'
>>> df.dtypes
A    float64
B     object
C    float64
D       bool
E     object
F     object
dtype: object

Example 2: Convert columns ‘D’, ‘E’, and ‘F’ to category, show the results, and return a new DataFrame:

>>> new_df = convert_dtypes(df, cat_columns, 'category', inplace=False,
... show_results=True)
Successfully converted column 'D' from bool to category.
Successfully converted column 'E' from object to category.
Successfully converted column 'F' from object to category.
>>> new_df.dtypes
A     float64
B      object
C     float64
D    category
E    category
F    category
dtype: object
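
Example 3: Convert a column of numeric strings to integers. This is a sketch, not output from the library’s test suite; df_num is hypothetical, and the resulting int64 assumes a 64-bit platform:

>>> df_num = pd.DataFrame({'G': ['1', '2', '3']})
>>> convert_dtypes(df_num, ['G'], int)
>>> df_num.dtypes
G    int64
dtype: object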
datawaza.clean.convert_time_values(df: DataFrame, cols: List[str], target_format: str = '%Y-%m-%d %H:%M:%S', target_dtype: str | type | dtype = 'datetime64[ns]', show_results: bool = False, inplace: bool = False, zero_to_nan: bool = False, pattern_list: List[str] | None = None) → DataFrame | None

Convert time values in columns of a DataFrame to a target format and data type.

This function converts time values in the specified columns of the input DataFrame to the desired target format and data type. If inplace is set to True, the conversion is done in place, modifying the original DataFrame. If inplace is False (default), a new DataFrame with the converted values is returned.

The function can handle time values in various formats, including:

1. Excel serial format (e.g., ‘45161.23458’)

2. String format (e.g., ‘YYYY-MM-DD’)

3. UNIX epoch in milliseconds (e.g., ‘1640304000000.0’)

If your format is not supported, you can define pattern_list as a list of custom datetime patterns.

If zero_to_nan is set to True, values of ‘0’, ‘0.0’, ‘0.00’, 0, 0.0, or 0.00 will be replaced with NaN. Otherwise, zero values will be interpreted as UNIX epoch values and converted to 1970-01-01 00:00:00.

You can use the default target_format of ‘%Y-%m-%d %H:%M:%S’, or specify a different format. The target_dtype parameter allows you to specify the desired data type for the converted columns. When working with formatted date strings, the applicable data types are ‘str’, ‘object’, or ‘datetime64[ns]’.

  • If target_dtype is set to ‘str’ or ‘object’, the converted data will be stored as strings in the specified target_format. Missing values will be represented as ‘NaN’.

  • If target_dtype is set to ‘datetime64[ns]’, the converted data will be stored as pandas datetime objects. The target_format has no effect in this scenario. Missing values will be represented as ‘NaT’.

To display a summary of the changes made, set show_results to True.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the columns to be converted.

  • cols (List[str]) – List of column names in df to apply the conversion.

  • target_format (str, optional) – The desired datetime format for the conversion. Uses format codes such as: %Y: 4-digit year, %m: Month as zero-padded decimal, %d: Day of the month, %H: Hour (24-hour clock), %M: Minute, %S: Second. Default format is ‘%Y-%m-%d %H:%M:%S’.

  • target_dtype (Union[str, type, np.dtype], optional) – The desired data type for the converted columns. Applicable data types are ‘str’, ‘object’, or ‘datetime64[ns]’. Default is ‘datetime64[ns]’.

  • show_results (bool, optional) – If set to True, prints the before-and-after conversion values for each cell. Default is False.

  • inplace (bool, optional) – If set to True, the conversion is done in place, modifying the original DataFrame. If False (default), a new DataFrame with the converted values is returned.

  • zero_to_nan (bool, optional) – If set to True, values of ‘0’, ‘0.0’, ‘0.00’, 0, 0.0, or 0.00 will be replaced with NaN or NaT, depending on target_dtype. Default is False.

  • pattern_list (List[str], optional) – A list of custom datetime patterns to use instead of the default patterns. Default is None. See Example 4 below for a sketch.

Returns:

If inplace is False (default), returns a new DataFrame with the specified columns converted to the target format and data type. If inplace is True, returns None as the original DataFrame is modified in place.

Return type:

pd.DataFrame or None

Examples

Prepare some sloppy data for the examples:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': ['45161.23458', '2019-02-09', '1640304000000.0'],
...     'B': ['2022-01-01', '45000.5', '1577836800000.0'],
...     'C': ['0', '45161.23458', '2019-02-09']
... })
>>> cols = ['A', 'B', 'C']

Example 1: Convert time values in specified columns to the default format and data type:

>>> df_converted = convert_time_values(df, cols)
>>> df_converted
                    A                   B                   C
0 2023-08-25 05:37:47 2022-01-01 00:00:00 1970-01-01 00:00:00
1 2019-02-09 00:00:00 2023-03-17 12:00:00 2023-08-25 05:37:47
2 2021-12-24 00:00:00 2020-01-01 00:00:00 2019-02-09 00:00:00
>>> df_converted.dtypes
A    datetime64[ns]
B    datetime64[ns]
C    datetime64[ns]
dtype: object

Example 2: Convert time values in specified columns to the default format and data type, showing a summary of changes:

>>> df_converted = convert_time_values(df, cols, show_results=True)
Original: 45161.23458 (Excel Serial) -> Converted: 2023-08-25 05:37:47
Original: 2019-02-09 (Standard Datetime String) -> Converted: 2019-02-09 00:00:00
Original: 1640304000000.0 (UNIX Epoch in milliseconds) -> Converted: 2021-12-24 00:00:00
Original: 2022-01-01 (Standard Datetime String) -> Converted: 2022-01-01 00:00:00
Original: 45000.5 (Excel Serial) -> Converted: 2023-03-17 12:00:00
Original: 1577836800000.0 (UNIX Epoch in milliseconds) -> Converted: 2020-01-01 00:00:00
Original: 0 (UNIX Epoch) -> Converted: 1970-01-01 00:00:00
Original: 45161.23458 (Excel Serial) -> Converted: 2023-08-25 05:37:47
Original: 2019-02-09 (Standard Datetime String) -> Converted: 2019-02-09 00:00:00
>>> df_converted
                    A                   B                   C
0 2023-08-25 05:37:47 2022-01-01 00:00:00 1970-01-01 00:00:00
1 2019-02-09 00:00:00 2023-03-17 12:00:00 2023-08-25 05:37:47
2 2021-12-24 00:00:00 2020-01-01 00:00:00 2019-02-09 00:00:00

Example 3: Convert time values to a custom format and data type in place, replacing zeros with NaN, and showing a summary of changes:

>>> convert_time_values(df, cols, target_format='%d/%m/%Y', target_dtype='str',
... inplace=True, show_results=True, zero_to_nan=True)
Original: 45161.23458 (Excel Serial) -> Converted: 25/08/2023
Original: 2019-02-09 (Standard Datetime String) -> Converted: 09/02/2019
Original: 1640304000000.0 (UNIX Epoch in milliseconds) -> Converted: 24/12/2021
Original: 2022-01-01 (Standard Datetime String) -> Converted: 01/01/2022
Original: 45000.5 (Excel Serial) -> Converted: 17/03/2023
Original: 1577836800000.0 (UNIX Epoch in milliseconds) -> Converted: 01/01/2020
Original: 0 (Zero) -> Converted: NaN
Original: 45161.23458 (Excel Serial) -> Converted: 25/08/2023
Original: 2019-02-09 (Standard Datetime String) -> Converted: 09/02/2019
>>> df
            A           B           C
0  25/08/2023  01/01/2022         NaN
1  09/02/2019  17/03/2023  25/08/2023
2  24/12/2021  01/01/2020  09/02/2019
>>> df.dtypes
A    object
B    object
C    object
dtype: object
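
Example 4: Convert values in a non-default format by supplying pattern_list. This is a sketch, not output from the library’s test suite; df_custom is hypothetical, and the pattern is assumed to use the same strptime-style format codes as target_format:

>>> df_custom = pd.DataFrame({'T': ['09/02/2019 13:45', '25/08/2023 05:37']})
>>> df_custom_converted = convert_time_values(df_custom, ['T'],
... pattern_list=['%d/%m/%Y %H:%M'])
>>> df_custom_converted
                    T
0 2019-02-09 13:45:00
1 2023-08-25 05:37:00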
datawaza.clean.reduce_multicollinearity(df: DataFrame, target_col: str, corr_threshold: float = 0.9, consider_nan: bool = False, consider_zero: bool = False, diff_threshold: float = 0.1, decimal: int = 2) → DataFrame

Reduce multicollinearity in a DataFrame by removing highly correlated features.

This function iteratively evaluates pairs of features in a DataFrame based on their correlation to each other and to a specified target column. If two features are highly correlated (above corr_threshold), the one with the lower correlation to the target column is removed. The number of NaN and/or zero values can also be considered (preferring removal of the feature with more) by setting consider_nan or consider_zero to True. The threshold for significant differences (diff_threshold) can also be adjusted. Two correlations may appear identical in the output even though one is reported as greater; increase decimal to display more precision.

Use this function to remove redundant features and reduce a large feature set to a smaller one containing the features most correlated with the target. This should improve the model’s ability to learn from the dataset, boost performance, and increase the interpretability of results.

Parameters:
  • df (DataFrame) – The DataFrame to evaluate. It should have features and the target column.

  • target_col (str) – The name of the target column against which feature correlations are evaluated.

  • corr_threshold (float, optional) – The correlation threshold above which a pair of features is considered highly correlated. Default is 0.9.

  • consider_nan (bool, optional) – If True, considers the number of NaN values in the decision process. Default is False.

  • consider_zero (bool, optional) – If True, considers the number of zero values in the decision process. Default is False.

  • diff_threshold (float, optional) – The threshold for considering the difference in NaN/zero counts as significant. Default is 0.1.

  • decimal (int, optional) – The number of decimals to round to when displaying output. Default is 2.

Returns:

A modified DataFrame with reduced multicollinearity, containing only the features that were kept after evaluation.

Return type:

DataFrame

Examples

Prepare the data for the examples:

>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(0)  # For reproducibility
>>> A = np.random.rand(100)
>>> B = A + np.random.normal(0, 0.01, 100)  # B is highly correlated with A
>>> C = np.random.rand(100)
>>> D = B + np.random.rand(100) # D is highly correlated with B
>>> Target = A * 0.1 + C * 0.5 + D * 0.8 + np.random.normal(0, 0.1, 100)
>>> df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D': D, 'Target': Target})

Example 1: Reduce multicollinearity in a DataFrame with correlated features and a target column:

>>> reduced_df = reduce_multicollinearity(df, 'Target', decimal=4)
Evaluating pair: 'B' and 'A' (1.0) - 5 kept features
 - Correlation with target: 0.6509, 0.6468
 - Keeping 'B' (higher correlation, lower or equal count)

Example 2: Reduce multicollinearity with a lower correlation threshold and considering NaN and Zero values:

>>> reduced_df = reduce_multicollinearity(df, 'Target', corr_threshold=0.6,
... consider_zero=True, consider_nan=True, diff_threshold=0.2)
Evaluating pair: 'D' and 'A' (0.66) - 5 kept features
 - Correlation with target: 0.86, 0.65
 - NaN/0 counts: 0, 0
 - Keeping 'D' (higher correlation, lower or equal count)

Evaluating pair: 'D' and 'B' (0.66) - 4 kept features
 - Correlation with target: 0.86, 0.65
 - NaN/0 counts: 0, 0
 - Keeping 'D' (higher correlation, lower or equal count)
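
Continuing Example 2, the surviving features can be read off the returned DataFrame. This check is a sketch: it assumes the target column is retained in the output, as the kept-feature counts in the log above suggest:

>>> sorted(reduced_df.columns)
['C', 'D', 'Target']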
datawaza.clean.split_outliers(df: DataFrame, columns: List[str] | None = None, iqr_multiplier: float = 1.5) → Tuple[DataFrame, DataFrame]

Split a DataFrame into two based on the presence of outliers.

This function identifies outliers in the specified columns of the input DataFrame using the Interquartile Range (IQR) method. It then splits the DataFrame into two: one containing rows without outliers and another containing only the rows with outliers.

Use this function when you need to separate outliers from the main data for further analysis or processing.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the data to be split.

  • columns (List[str], optional) – List of column names to consider for outlier detection. If None, all columns in the DataFrame will be considered. Default is None.

  • iqr_multiplier (float, optional) – The multiplier for the IQR range to determine outliers. Default is 1.5.

Returns:

A tuple containing two DataFrames: the first (df_no_outliers) contains rows without outliers; the second (df_outliers) contains only the rows with outliers.

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Examples

Prepare the data for the examples:

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'A': [1, 2, 3, 4, 5, 100],
...     'B': [10, 20, 30, 40, 50, 600],
...     'C': [-30, 4, 3, 2, 1, 3]
... })

Example 1: Split outliers considering all columns:

>>> df_no_outliers, df_outliers = split_outliers(df)
>>> df_no_outliers
   A   B  C
1  2  20  4
2  3  30  3
3  4  40  2
4  5  50  1
>>> df_outliers
     A    B   C
0    1   10 -30
5  100  600   3

Example 2: Split outliers considering specific columns:

>>> df_no_outliers, df_outliers = split_outliers(df, columns=['A', 'B'])
>>> df_no_outliers
   A   B   C
0  1  10 -30
1  2  20   4
2  3  30   3
3  4  40   2
4  5  50   1
>>> df_outliers
     A    B  C
5  100  600  3
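
Example 3: Loosen the outlier definition by raising iqr_multiplier. This is a sketch, not output from the library’s test suite; it assumes per-column bounds of Q1 - multiplier * IQR and Q3 + multiplier * IQR, so a multiplier this large flags no rows:

>>> df_no_outliers, df_outliers = split_outliers(df, iqr_multiplier=50.0)
>>> len(df_outliers)
0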