Explore#

This module provides tools to streamline exploratory data analysis. It contains functions to find unique values, plot distributions, detect outliers, extract the top correlations, plot correlations, plot 3D charts, and plot data on a map of California.

Functions:
  • get_corr() - Display the top n positive and negative correlations with a target variable in a DataFrame.

  • get_outliers() - Detects and summarizes outliers for the specified numeric columns in a DataFrame, based on an IQR ratio.

  • get_unique() - Print the unique values of all variables below a threshold n, including counts and percentages.

  • plot_3d() - Create a 3D scatter plot using Plotly Express.

  • plot_charts() - Display multiple bar plots and histograms for categorical and/or continuous variables in a DataFrame, with an option to dimension by the specified hue.

  • plot_corr() - Plot the top n correlations of one variable against others in a DataFrame.

  • plot_map_ca() - Plot longitude and latitude data on a geographic map of California.

  • plot_scatt() - Create a scatter plot using Seaborn’s scatterplot function.

datawaza.explore.get_corr(df: DataFrame, n: int = 5, var: str | None = None, show_results: bool = True, return_arrays: bool = False) None | Tuple[ndarray, ndarray][source]#

Display the top n positive and negative correlations with a target variable in a DataFrame.

This function computes the correlation matrix for the provided DataFrame, and identifies the top n positively and negatively correlated pairs of variables. By default, it prints a summary of these correlations. Optionally, it can return arrays of the variable names involved in these top correlations, avoiding duplicates.

Use this to quickly identify the strongest correlations with a target variable. You can also use it to reduce a DataFrame with a large number of features down to just the top n correlated features: extract the names of the top correlated features into two separate arrays (one positive, one negative), concatenate those arrays, append the target variable, and use the result to select a new DataFrame (see Examples 2 and 3 below).

Parameters:
  • df (pandas.DataFrame) – The DataFrame to analyze for correlations.

  • n (int, optional) – The number of top positive and negative correlations to list. Default is 5.

  • var (str, optional) – A specific variable of interest. If provided, the function will only consider correlations involving this variable. Default is None.

  • show_results (bool, optional) – Flag to indicate if the results should be printed. Default is True.

  • return_arrays (bool, optional) – Flag to indicate if the function should return arrays of variable names involved in the top correlations. Default is False.

Returns:

If return_arrays is set to True, the function returns a tuple containing two arrays: (1) positive_variables: An array of variable names involved in the top n positive correlations. (2) negative_variables: An array of variable names involved in the top n negative correlations. If return_arrays is False, the function returns nothing.

Return type:

tuple, optional

Examples

Prepare the data for the examples:

>>> np.random.seed(0)  # For reproducibility
>>> n_samples = 100
>>> # Create variables
>>> temp = np.linspace(10, 30, n_samples) + np.random.normal(0, 2, n_samples)
>>> sales = temp * 3 + np.random.normal(0, 10, n_samples)
>>> fuel = 100 - temp * 2 + np.random.normal(0, 5, n_samples)
>>> humidity = 70 - temp * 1.5 + np.random.normal(0, 4, n_samples)
>>> ac_units_sold = temp * 2 + np.random.normal(0, 15, n_samples)
>>> # Create DataFrame
>>> df = pd.DataFrame({'Temp': temp, 'Sales': sales, 'Fuel': fuel,
...                    'Humidity': humidity, 'AC_Units_Sold': ac_units_sold})

Example 1: Print the top ‘n’ correlations, both positive and negative:

>>> get_corr(df, n=2, var='Temp')
Top 2 positive correlations:
      Variable 1 Variable 2  Correlation
0          Sales       Temp         0.85
1  AC_Units_Sold       Temp         0.62

Top 2 negative correlations:
  Variable 1 Variable 2  Correlation
0       Fuel       Temp        -0.92
1   Humidity       Temp        -0.92

Example 2: Create arrays with the top correlated feature names:

>>> (top_pos, top_neg) = get_corr(df, n=1, var='Temp', show_results=False,
...     return_arrays=True)
>>> print(top_pos)
['Sales']
>>> print(top_neg)
['Fuel']

Example 3: Create a dataframe of top correlated features from those arrays:

>>> top_features = np.concatenate((top_pos, top_neg, ['Temp']))
>>> df_top_features = df[top_features]
>>> print(df_top_features[:2])
       Sales       Fuel       Temp
0  59.415821  71.097881  13.528105
1  19.529413  76.798435  11.002335
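
For reference, the correlation lookup shown above can be approximated with plain pandas. This is only a minimal sketch of the idea (sorting the target’s column of the correlation matrix), not necessarily how get_corr is implemented:

>>> corr_with_temp = df.corr()['Temp'].drop('Temp')  # correlations with the target
>>> top_pos_sketch = corr_with_temp.sort_values(ascending=False).head(2).index.to_numpy()
>>> top_neg_sketch = corr_with_temp.sort_values().head(2).index.to_numpy()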
datawaza.explore.get_outliers(df: DataFrame, num_columns: List[str], ratio: float = 1.5, exclude_zeros: bool = False, plot: bool = False, width: int = 15, height: int = 2) DataFrame[source]#

Detects and summarizes outliers for the specified numeric columns in a DataFrame, based on an IQR ratio.

This function identifies outliers using Tukey’s method, where outliers are considered to be those data points that fall below Q1 - ratio * IQR or above Q3 + ratio * IQR. You can exclude zeros from the calculations, as they can appear as outliers and skew your results. You can also change the default IQR ratio of 1.5. If outliers are found, they will be summarized in the returned DataFrame. In addition, the distributions of the variables with outliers can be plotted as boxplots.

Use this function to identify outliers during the early stages of exploratory data analysis. With one line, you can see: total non-null, total zero values, zero percent, outlier count, outlier percent, skewness, and kurtosis. You can also visually spot outliers outside of the whiskers in the boxplots. Then you can decide how you want to handle the outliers (ex: log transform, drop, etc.)

Parameters:
  • df (DataFrame) – The DataFrame to analyze for outliers.

  • num_columns (List[str]) – List of column names in df to check for outliers. These should be names of columns with numerical data.

  • ratio (float, optional) – The multiplier for IQR to determine the threshold for outliers. Default is 1.5.

  • exclude_zeros (bool, optional) – If set to True, zeros are excluded from the outlier calculation. Default is False.

  • plot (bool, optional) – If set to True, box plots of outlier distributions are displayed. Default is False.

  • width (int, optional) – The width of the plot figure. This parameter only has an effect if plot is True. Default is 15.

  • height (int, optional) – The height of each subplot row. This parameter only has an effect if plot is True. Default is 2.

Returns:

A DataFrame summarizing the outliers found in each specified column, including the number of non-null and zero values, percentage of zero values, count of outliers, percentage of outliers, and measures of skewness and kurtosis.

Return type:

pd.DataFrame

Examples

Prepare the data for the examples:

>>> np.random.seed(0)  # For reproducibility
>>> pd.set_option('display.max_columns', None)  # For test consistency
>>> pd.set_option('display.width', None)  # For test consistency
>>> df = pd.DataFrame({
...     'A': np.random.randn(100),
...     'B': np.random.exponential(scale=2.0, size=100),
...     'C': np.random.randn(100)
... })
>>> df.at[2, 'A'] = 0; df.at[5, 'A'] = 0  # Assign some zeros
>>> df.at[3, 'B'] = np.nan; df.at[7, 'B'] = np.nan  # Assign some NaNs
>>> num_columns = ['A', 'B', 'C']  # Store numeric columns

Example 1: Create a dataframe that lists outlier statistics:

>>> outlier_summary = get_outliers(df, num_columns)
>>> print(outlier_summary)
  Column  Total Non-Null  Total Zero  Zero Percent  Outlier Count  Outlier Percent  Skewness  Kurtosis
1      B              98           0           0.0              4             4.08      2.62     10.48
0      A             100           2           2.0              1             1.00      0.01     -0.25
2      C             100           0           0.0              1             1.00     -0.03      0.19

Example 2: Create a dataframe that lists outlier statistics, excluding zeros and plot the box plots:

>>> outlier_summary = get_outliers(df, num_columns, exclude_zeros=True,
...                                plot=True, width=14, height=3)
>>> print(outlier_summary)
  Column  Total Non-Null  Total Zero  Zero Percent  Outlier Count  Outlier Percent  Skewness  Kurtosis
0      B              98           0           0.0              4             4.08      2.62     10.48
1      C             100           0           0.0              1             1.00     -0.03      0.19
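
For reference, the Tukey bounds described above can be computed directly with pandas for a single column. This is only a minimal sketch of the rule, not the function’s actual implementation:

>>> q1, q3 = df['B'].quantile([0.25, 0.75])  # quartiles (NaN values are skipped)
>>> iqr = q3 - q1
>>> lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # default ratio of 1.5
>>> outlier_count = int(((df['B'] < lower) | (df['B'] > upper)).sum())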
datawaza.explore.get_unique(df: DataFrame, n: int = 20, sort: str = 'count', show_list: bool = True, count: bool = True, percent: bool = True, plot: bool = False, cont: bool = False, strip: bool = False, dropna: bool = False, fig_size: Tuple[int, int] = (6, 4), rotation: int = 45) None[source]#

Print the unique values of all variables below a threshold n, including counts and percentages.

This function examines the unique values of all the variables in a DataFrame. If the number of unique values is at or below a threshold n, it will list them. For each value, it prints out the count and percentage of the dataset with that value. You can change the sort, and there are options to strip single quotes from the values, or exclude NaN values. You can optionally show descriptive statistics for the continuous variables above the ‘n’ threshold, or display simple plots.

Use this to quickly examine the features of your dataset at the beginning of exploratory data analysis. Use df.nunique() to first determine how many unique values each variable has, and identify a number that likely separates the categorical from continuous numeric variables. Then run get_unique using that number as n (this avoids iterating over continuous data).

Parameters:
  • df (DataFrame) – The dataframe that contains the variables you want to analyze.

  • n (int, optional) – The maximum number of unique values to consider. This helps to avoid iterating over continuous data. Default is 20.

  • sort (str, optional) – Determines the sorting of unique values: ‘name’ sorts alphabetically/numerically, ‘count’ sorts by count of unique values (descending), and ‘percent’ sorts by percentage of each unique value (descending). Default is ‘count’.

  • show_list (bool, optional) – If True, shows the list of unique values. Default is True.

  • count (bool, optional) – If True, shows counts of each unique value. Default is True.

  • percent (bool, optional) – If True, shows the percentage of each unique value. Default is True.

  • plot (bool, optional) – If True, shows a basic chart for each variable. Default is False.

  • cont (bool, optional) – If True, analyzes variables with unique values greater than ‘n’ as continuous data. Default is False.

  • strip (bool, optional) – If True, removes single quotes from the displayed values. Default is False.

  • dropna (bool, optional) – If True, excludes NaN values from unique value lists. Default is False.

  • fig_size (tuple, optional) – Size of figure if plotting is enabled. Default is (6, 4).

  • rotation (int, optional) – Rotation angle of X axis ticks if plotting is enabled. Default is 45.

Returns:

The function prints the analysis directly.

Return type:

None

Examples

Prepare the data for the examples:

>>> df = pd.DataFrame({'Animal': ["'Cat'", "'Dog'", "'Cat'", "'Mountain Lion'",
...     "'Dog'", "'Dog'"],
...     'Sex': ['Male', 'Female', 'Male', 'Male', 'Female', np.nan],
...     'Weight': [6.5, 12.5, 7.7, 84.1, 22.3, 29.2]
... })

Example 1: Print unique values below a threshold of 3:

>>> get_unique(df, n=3)

CATEGORICAL: Variables with unique values equal to or below: 3

Animal has 3 unique values:

    'Dog'               3   50.0%
    'Cat'               2   33.33%
    'Mountain Lion'     1   16.67%

Sex has 3 unique values:

    Male         3   50.0%
    Female       2   33.33%
    nan          1   16.67%

Example 2: Sort values by name, strip single quotes, drop NaN:

>>> get_unique(df, n=3, sort='name', strip=True, dropna=True)

CATEGORICAL: Variables with unique values equal to or below: 3

Animal has 3 unique values:

    Cat                 2   33.33%
    Dog                 3   50.0%
    Mountain Lion       1   16.67%

Sex has 2 unique values:

    Female       2   40.0%
    Male         3   60.0%

Example 3: Sort values by percent, plot charts, and show the continuous statistics for those over the ‘n’ threshold:

>>> get_unique(df, n=3, sort='percent', plot=True, cont=True)

CATEGORICAL: Variables with unique values equal to or below: 3

Animal has 3 unique values:

    'Dog'               3   50.0%
    'Cat'               2   33.33%
    'Mountain Lion'     1   16.67%



Sex has 3 unique values:

    Male         3   50.0%
    Female       2   33.33%
    nan          1   16.67%



CONTINUOUS: Variables with unique values greater than: 3

Weight has 6 unique values:

Weight
count     6.000000
mean     27.050000
std      29.292712
min       6.500000
25%       8.900000
50%      17.400000
75%      27.475000
max      84.100000
Name: Weight, dtype: float64
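
For reference, the nunique() pre-check suggested above, and the count/percentage math behind each listed value, can be sketched with plain pandas (an illustration only, not the function’s internals):

>>> unique_counts = df.nunique()  # pick an n that separates categorical from continuous
>>> counts = df['Animal'].value_counts(dropna=False)
>>> percents = (counts / len(df) * 100).round(2)  # percentage of the dataset per value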

datawaza.explore.plot_3d(df: DataFrame, x: str, y: str, z: str, color: str | None = None, color_discrete_sequence: List[str] | None = None, color_discrete_map: Dict[str, str] | None = None, color_continuous_scale: List[str] | None = None, x_scale: str = 'linear', y_scale: str = 'linear', z_scale: str = 'linear', height: int = 600, width: int = 1000, font_size: int = 10) None[source]#

Create a 3D scatter plot using Plotly Express.

This function generates an interactive 3D scatter plot using the Plotly Express library. It allows for customization of the x, y, and z axes, as well as color coding of the points based on the column specified for color (similar to the hue parameter in Seaborn). A color_discrete_map dictionary can be passed to map specific values of the color column to colors. Alternatively, you can pass a color_discrete_sequence or color_continuous_scale, depending on the type of values in the color column. Only one of these three coloring methods should be used at a time. The plot can also be displayed with either a linear or logarithmic scale on each axis by setting x_scale, y_scale, or z_scale from ‘linear’ to ‘log’.

Use this function to visualize and explore relationships between three variables in a dataset, with the option to color code the points based on a fourth variable. It is a great way to visualize the top 3 principal components, dimensioned by the target variable.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the data to be plotted.

  • x (str) – The column name to be used for the x-axis.

  • y (str) – The column name to be used for the y-axis.

  • z (str) – The column name to be used for the z-axis.

  • color (str, optional) – The column name to be used for color coding the points. Default is None.

  • color_discrete_sequence (List[str], optional) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders. Various color sequences are available in plotly.express.colors.qualitative. Default is None.

  • color_discrete_map (Dict[str, str], optional) – String values should define valid CSS-colors. Used to assign specific colors to values in the color column. Default is None.

  • color_continuous_scale (List[str], optional) – Strings should define valid CSS-colors. This list is used to build a continuous color scale when the color column contains numeric data. Various color scales are available in plotly.express.colors.sequential, plotly.express.colors.diverging, and plotly.express.colors.cyclical. Default is None.

  • x_scale (str, optional) – The scale type for the X axis. Use ‘log’ for logarithmic scale. Default is ‘linear’.

  • y_scale (str, optional) – The scale type for the Y axis. Use ‘log’ for logarithmic scale. Default is ‘linear’.

  • z_scale (str, optional) – The scale type for the Z axis. Use ‘log’ for logarithmic scale. Default is ‘linear’.

  • height (int, optional) – The height of the plot in pixels. Default is 600.

  • width (int, optional) – The width of the plot in pixels. Default is 1000.

  • font_size (int, optional) – The size of the font used in the plot. Default is 10.

Returns:

The function displays the interactive 3D scatter plot using Plotly Express.

Return type:

None

Examples

Prepare the data for the examples:

>>> df = pd.DataFrame({
...     'X': [1, 2, 3, 4, 5],
...     'Y': [2, 4, 6, 8, 10],
...     'Z': [3, 6, 9, 12, 15],
...     'Category': ['A', 'B', 'A', 'B', 'A'],
...     'Continuous': [10, 20, 30, 40, 50]
... })

Example 1: Create a basic 3D scatter plot:

>>> plot_3d(df, x='X', y='Y', z='Z')

Example 2: Create a 3D scatter plot with default color coding, and log scale on the X axis:

>>> plot_3d(df, x='X', y='Y', z='Z', color='Category', x_scale='log')

Example 3: Create a 3D scatter plot with a discrete color palette:

>>> plot_3d(df, x='X', y='Y', z='Z', color='Category',
...         color_discrete_sequence=px.colors.qualitative.Prism)

Example 4: Create a 3D scatter plot with a continuous color palette:

>>> plot_3d(df, x='X', y='Y', z='Z', color='Continuous',
...         color_continuous_scale=px.colors.sequential.Viridis)

Example 5: Create a 3D scatter plot with a custom discrete color map, and adjust the height and width:

>>> category_color_map = {'A': px.colors.qualitative.D3[0],
...                       'B': px.colors.qualitative.D3[1]}
>>> plot_3d(df, x='X', y='Y', z='Z', color='Category',
...         color_discrete_map=category_color_map,
...         height=800, width=1200)
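
As mentioned above, this plot works well for exploring the top three principal components dimensioned by a target variable. A minimal sketch of that workflow, assuming scikit-learn is installed (it is not otherwise used in these examples) and reusing the toy DataFrame above:

>>> from sklearn.decomposition import PCA
>>> features = df[['X', 'Y', 'Z', 'Continuous']]  # illustrative feature columns
>>> pcs = PCA(n_components=3).fit_transform(features)
>>> df_pca = pd.DataFrame(pcs, columns=['PC1', 'PC2', 'PC3'])
>>> df_pca['Category'] = df['Category'].values  # carry over the target for coloring
>>> plot_3d(df_pca, x='PC1', y='PC2', z='PC3', color='Category')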
datawaza.explore.plot_charts(df: DataFrame, plot_type: str = 'both', n: int = 10, ncols: int = 2, fig_width: int = 15, subplot_height: int = 4, rotation: int = 0, cat_cols: List[str] | None = None, cont_cols: List[str] | None = None, dtype_check: bool = True, sample_size: int | float | None = None, random_state: int = 42, hue: str | None = None, color_discrete_map: Dict | None = None, normalize: bool = False, kde: bool = False, multiple: str = 'layer', log_scale: bool = False, ignore_zero: bool = False) None[source]#

Display multiple bar plots and histograms for categorical and/or continuous variables in a DataFrame, with an option to dimension by the specified hue.

This function allows you to plot a large number of distributions with one line of code. You choose which type of plots to create by setting plot_type to cat, cont, or both. Categorical variables are plotted with sns.barplot ordered by descending value counts for a clean appearance. Continuous variables are plotted with sns.histplot. There are two approaches to identifying categorical vs. continuous variables: (a) you can specify cat_cols and cont_cols as lists of the respective column names, or (b) you can specify n as the dividing line, and any variable with n or lower unique values will be treated as categorical. In addition, you can enable dtype_check on the continuous columns to only include columns of data type int64 or float64.

For each type of variable, it creates a subplot layout that has ncols columns, and is fig_width wide. It calculates how many rows are required to display all the plots, and each row is subplot_height high. Specify hue if you want to dimension the plots by another variable. You can set color_discrete_map to a color mapping dictionary for the values of the hue variable. You can also customize some parameters of the plots, such as rotation of the X axis tick labels. For categorical variables, you can normalize the plots to show proportions instead of counts by setting normalize to True.

For histograms, you can display KDE lines with kde, and change how the hue variable appears by setting multiple. If you have a large amount of data that is taking too long to process, you can take a random sample of your data by setting sample_size to either a count or proportion. To handle skewed data, you have two options: (a) you can enable log scale on the X axis with log_scale, and (b) you can ignore zero values with ignore_zero (these can sometimes dominate the left end of a chart).

Use this function to quickly visualize the distributions of your data during exploratory data analysis. With one line, you can produce a comprehensive series of plots that can help you spot issues that will require handling during data cleaning. By setting hue to your target y variable, you might be able to catch glimpses of potential correlations or relationships.

Parameters:
  • df (pd.DataFrame) – The dataframe containing the variables to be analyzed.

  • plot_type (str, optional) – The type of charts to plot: ‘cat’ for categorical, ‘cont’ for continuous, or ‘both’. Default is ‘both’.

  • n (int, optional) – Threshold for distinguishing between categorical (≤ n unique values) and continuous (> n unique values) variables. Default is 10.

  • ncols (int, optional) – The number of columns in the subplot grid. Default is 2.

  • fig_width (int, optional) – The width of the entire plot figure (not individual subplots). Default is 15.

  • subplot_height (int, optional) – The height of each subplot. Default is 4.

  • rotation (int, optional) – The rotation angle for x-axis labels. Default is 0.

  • cat_cols (List[str], optional) – List of column names to treat as categorical variables. Inferred from unique count if not provided.

  • cont_cols (List[str], optional) – List of column names to treat as continuous variables. Inferred from unique count if not provided.

  • dtype_check (bool, optional) – If True, considers only numeric types for continuous variables. Default is True.

  • sample_size (int or float, optional) – If provided, indicates the fraction (if < 1) or number (if ≥ 1) of samples to draw from the dataframe for histograms.

  • random_state (int, optional) – Random state for reproducibility when using sample_size to draw a random sample for histograms. Default is 42.

  • hue (str, optional) – Name of the column for hue-based dimensioning in the plots.

  • color_discrete_map (Dict, optional) – A color mapping dictionary for the values in the ‘hue’ variable.

  • normalize (bool, optional) – If True, normalizes categorical plots to show proportions instead of counts. Default is False.

  • kde (bool, optional) – If True, shows Kernel Density Estimate (KDE) line on continuous histograms. Default is False.

  • multiple (str, optional) – Method to handle the hue variable in histograms. Options are ‘layer’, ‘dodge’, ‘stack’, ‘fill’. Default is ‘layer’.

  • log_scale (bool, optional) – If True, uses log scale for continuous histograms. Default is False.

  • ignore_zero (bool, optional) – If True, ignores zero values in continuous histograms. Default is False.

Returns:

Creates and displays plots without returning any value.

Return type:

None

Examples

Prepare the data for the examples:

>>> df = pd.DataFrame({
...     'Category A': np.random.choice(['A', 'B', 'C'], size=100),
...     'Category B': np.random.choice(['D', 'E', 'F', 'G', 'H', 'I', 'J'],
...                                    size=100),
...     'Category C': np.random.choice(['K', 'L', 'M', 'N', 'O', 'P', 'Q',
...                                     'R', 'S', 'T', 'U', 'V', 'W', 'X'],
...                                    size=100),
...     'Measure 1': np.random.randn(100),
...     'Measure 2': np.random.exponential(scale=2.0, size=100),
...     'Target': np.random.choice(['Yes', 'No'], size=100)
... })
>>> cat_cols = ['Category A', 'Category B', 'Target']
>>> num_cols = ['Measure 1', 'Measure 2']

Example 1: Plot both categorical and continuous variables based on a boundary of n unique values:

>>> plot_charts(df, n=7)

Example 2: Plot only categorical variables using a column list, dimensioned by hue:

>>> plot_charts(df, plot_type='cat', cat_cols=cat_cols, hue='Target')

Example 3: Customize the subplot width, number of columns, and rotation of the X axis tick labels:

>>> plot_charts(df, plot_type='both', n=7, fig_width=20, ncols=3, rotation=90)

Example 4: Plot only histograms dimensioned by hue (stacked values), with KDE lines, X axis in log scale, and check data types:

>>> plot_charts(df, plot_type='cont', cont_cols=num_cols, hue='Target',
...             multiple='stack', kde=True, log_scale=True, dtype_check=True)

Example 5: Take a sample of the data and plot only histograms dimensioned by hue (layer values), ignore zero values:

>>> plot_charts(df, plot_type='cont', cont_cols=num_cols, hue='Target',
...             multiple='layer', sample_size=0.5, ignore_zero=True)

Example 6: Normalize the values and plot categorical values:

>>> plot_charts(df, plot_type='cat', cat_cols=cat_cols, hue='Target',
...             normalize=True)

Example 7: Plot a categorical variable with more than 10 unique values, specifying a single column name, a one-column layout, and a larger subplot size:

>>> plot_charts(df, plot_type='cat', cat_cols=['Category C'],
...             fig_width=12, subplot_height=7, ncols=1, rotation=90)
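
For reference, the split between categorical and continuous variables, and the subplot grid sizing, described above can be approximated as follows. This is only a sketch of the documented behavior, not the function’s actual implementation:

>>> import math
>>> n, ncols = 7, 2
>>> cat_like = [col for col in df.columns if df[col].nunique() <= n]
>>> cont_like = [col for col in df.columns
...              if df[col].nunique() > n and df[col].dtype in ('int64', 'float64')]
>>> nrows = math.ceil(len(cat_like) / ncols)  # each row is subplot_height tall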
datawaza.explore.plot_corr(df: DataFrame, column: str, n: int, method: str = 'pearson', size: Tuple[int, int] = (15, 8), rotation: int = 45, palette: str = 'RdYlGn', decimal: int = 2) None[source]#

Plot the top n correlations of one variable against others in a DataFrame.

This function generates a barplot that visually represents the correlations of a specified column with other numeric columns in a DataFrame. It displays both the strength (height of the bars) and the nature (color) of the correlations (positive or negative). The function computes correlations using the specified method and presents the strongest positive and negative correlations up to the number specified by n. Correlations are ordered from strongest to lowest, from the outside in.

Use this to communicate the correlations of one particular variable (ex: target y) in relation to others with a very clean design. It’s much easier to scan this correlation chart than to hunt for the variable of interest in a heatmap. The fixed Y-axis scale and Red-Yellow-Green color palette ensure the actual magnitudes of the positive and negative correlations are clear and not misinterpreted.

Parameters:
  • df (pd.DataFrame) – The DataFrame containing the variables for correlation analysis.

  • column (str) – The name of the column to evaluate correlations against.

  • n (int) – The number of correlations to display, split evenly between positive and negative correlations.

  • method (str, optional) – The method of correlation calculation, as per df.corr() method options (‘pearson’, ‘kendall’, ‘spearman’). Default is ‘pearson’.

  • size (Tuple[int, int], optional) – The size of the resulting plot. Default is (15, 8).

  • rotation (int, optional) – The rotation angle for x-axis labels. Default is 45 degrees.

  • palette (str, optional) – The colormap for representing correlation values. Default is ‘RdYlGn’.

  • decimal (int, optional) – The number of decimal places for rounding correlation values. Default is 2.

Returns:

Displays the barplot but does not return any value.

Return type:

None

Examples

>>> df = pd.DataFrame({
...     'A': np.random.rand(50),
...     'B': np.random.rand(50),
...     'C': np.random.rand(50)
... })
>>> plot_corr(df, 'A', n=4)

This will display a barplot of the top 2 positive and top 2 negative correlations of column ‘A’ with columns ‘B’ and ‘C’.
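
A slightly more realistic sketch, with illustrative features deliberately constructed to correlate with a target column (the column names here are hypothetical):

>>> rng = np.random.default_rng(0)
>>> target = rng.normal(size=200)
>>> df_demo = pd.DataFrame({
...     'target': target,
...     'strong_pos': target * 2 + rng.normal(scale=0.5, size=200),
...     'weak_pos': target + rng.normal(scale=2.0, size=200),
...     'strong_neg': -target * 2 + rng.normal(scale=0.5, size=200),
...     'weak_neg': -target + rng.normal(scale=2.0, size=200)
... })
>>> plot_corr(df_demo, 'target', n=4, method='spearman', size=(12, 6))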

datawaza.explore.plot_map_ca(df: DataFrame, lon: str = 'Longitude', lat: str = 'Latitude', hue: str | None = None, size: str | None = None, size_range: Tuple[int, int] = (50, 200), title: str = 'Geographic Chart', dot_size: int | None = None, alpha: float = 0.8, color_map: str | None = None, fig_size: Tuple[int, int] = (12, 12)) None[source]#

Plot longitude and latitude data on a geographic map of California.

This function creates a geographic map of California using Cartopy and overlays data points from a DataFrame. The map includes major cities, county boundaries, and geographic terrain features. Specify the columns in the dataframe that map to the longitude (lon) and the latitude (lat). Then specify an optional hue column to see changes in this variable by color, and/or a size column to see changes in this variable by dot size. In this way, two variables can be visualized at once.

A few parameters can be customized, such as the range of the dot sizes (size_range) if you’re using size. You can also use dot_size to specify a fixed size for all the dots on the map. The alpha transparency can be adjusted so that dots of one color are not completely hidden beneath the top-most layer. You can also customize the color_map for the hue parameter.

Use this function to visualize geospatial data related to California on a clean map.

Note: This function requires a few libraries to be installed: Cartopy, Geopandas, and Matplotlib (pyplot and patheffects). In addition, it uses the Census Bureau’s 2018 cartographic boundary county files at 1:5,000,000 (5m) resolution, which can be found here: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip

Parameters:
  • df (pd.DataFrame) – DataFrame containing the data to be plotted.

  • lon (str, optional) – Column name in df representing the longitude coordinates. Default is ‘Longitude’.

  • lat (str, optional) – Column name in df representing the latitude coordinates. Default is ‘Latitude’.

  • hue (str, optional) – Column name in df for color-coding the points. Default is None.

  • size (str, optional) – Column name in df to scale the size of points. Default is None.

  • size_range (Tuple[int, int], optional) – Range of sizes if the size parameter is used. Default is (50, 200).

  • title (str, optional) – Title of the plot. Default is ‘Geographic Chart’.

  • dot_size (int, optional) – Size of all dots if you want them to be uniform. Default is None.

  • alpha (float, optional) – Transparency of the points. Default is 0.8.

  • color_map (str, optional) – Name of the colormap to be used if hue is specified. Default is None.

  • fig_size (Tuple[int, int], optional) – Size of the figure. Default is (12, 12).

Return type:

None

Examples

>>> import pandas as pd
>>> data = {
...     'longitude': [-122.23, -122.22, -122.24, -122.25, -122.25],
...     'latitude': [37.88, 37.86, 37.85, 37.85, 37.85],
...     'housing_median_age': [41.0, 21.0, 52.0, 52.0, 52.0],
...     'total_rooms': [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
...     'total_bedrooms': [129.0, 1106.0, 190.0, 235.0, 280.0],
...     'population': [322.0, 2401.0, 496.0, 558.0, 565.0],
...     'households': [126.0, 1138.0, 177.0, 219.0, 259.0],
...     'median_income': [8.3252, 8.3014, 7.2574, 5.6431, 3.8462],
...     'median_house_value': [452600.0, 358500.0, 352100.0, 341300.0, 342200.0],
...     'ocean_proximity': ['NEAR BAY', 'NEAR BAY', 'NEAR BAY', 'NEAR BAY', 'NEAR BAY']
... }
>>> df = pd.DataFrame(data)
>>> plot_map_ca(df, lon='longitude', lat='latitude',
...             hue='ocean_proximity', size='median_house_value',
...             size_range=(100, 500), alpha=0.6,
...             title='California Housing Data')
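
Regarding the Note above: if you want to confirm the county boundary file is readable in your environment, a hedged check with Geopandas might look like the following (this assumes your Geopandas install can read a zipped shapefile directly from a URL; otherwise, download and unzip the file first; column names follow the Census shapefile schema):

>>> import geopandas as gpd
>>> counties = gpd.read_file(
...     'https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_county_5m.zip')
>>> ca_counties = counties[counties['STATEFP'] == '06']  # California FIPS code is 06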
datawaza.explore.plot_scatt(df: DataFrame, x: str, y: str, hue: str | None = None, hue_order: List[str] | None = None, size: str | int | None = None, size_range: Tuple[int, int] | None = None, title: str | None = None, title_fontsize: int = 18, title_pad: int = 15, x_label: str | None = None, x_format: str | None = None, x_scale: str | None = None, x_lim: Tuple[float, float] | None = None, y_label: str | None = None, y_format: str | None = None, y_scale: str | None = None, y_lim: Tuple[float, float] | None = None, label_fontsize: int = 14, label_pad: int = 10, grid: bool = False, legend: bool = True, legend_title: str | None = None, legend_loc: str = 'best', fig_size: Tuple[int, int] = (12, 6), decimal: int = 2, save: bool = False, **kwargs) None[source]#

Create a scatter plot using Seaborn’s scatterplot function.

This function generates a scatter plot using the Seaborn library. It allows for customization of the x and y axes, as well as the hue and size dimensions. The hue parameter is used to color the points based on a categorical column, while the size parameter is used to vary the size of the points based on a numerical column or a fixed value. You can also set the range of sizes with size_range, and the title of the plot with title. Additional styling options, such as alpha transparency or a Seaborn palette, can be passed straight through to sns.scatterplot via **kwargs (see the examples below). The fig_size parameter allows you to set the size of the figure.

Use this function to visualize relationships between two variables in a dataset, with the option to color and size the points based on additional variables. It is a great way to explore correlations between variables and identify patterns in the data.

Parameters:
  • df (pd.DataFrame) – The input DataFrame containing the data to be plotted.

  • x (str) – The column name to be used for the x-axis.

  • y (str) – The column name to be used for the y-axis.

  • hue (str, optional) – The column name to be used for color coding the points. Default is None.

  • hue_order (List[str], optional) – The order of the hue variable levels. Default is None.

  • size (str or int, optional) – The column name to be used for varying the size of the points, or a fixed size value for all points. Default is None.

  • size_range (Tuple[int, int], optional) – The range of sizes for the points. Default is None.

  • title (str, optional) – The title of the plot. Default is None.

  • title_fontsize (int, optional) – The font size of the title. Default is 18.

  • title_pad (int, optional) – The padding of the title. Default is 15.

  • x_label (str, optional) – The label for the x-axis. Default is None.

  • x_format (str, optional) – The format of the x-axis labels. Default is None.

  • x_scale (str, optional) – The scale of the x-axis. Default is None.

  • x_lim (Tuple[float, float], optional) – The limits of the x-axis. Default is None.

  • y_label (str, optional) – The label for the y-axis. Default is None.

  • y_format (str, optional) – The format of the y-axis labels. Default is None.

  • y_scale (str, optional) – The scale of the y-axis. Default is None.

  • y_lim (Tuple[float, float], optional) – The limits of the y-axis. Default is None.

  • label_fontsize (int, optional) – The font size of the axis labels. Default is 14.

  • label_pad (int, optional) – The padding of the axis labels. Default is 10.

  • grid (bool, optional) – Whether to display a grid on the plot. Default is False.

  • legend (bool, optional) – Whether to display a legend on the plot. Default is True.

  • legend_title (str, optional) – The title of the legend. Default is None.

  • legend_loc (str, optional) – The location of the legend. Default is ‘best’.

  • fig_size (Tuple[int, int], optional) – The size of the figure. Default is (12, 6).

  • decimal (int, optional) – The number of decimal places to display on the axis labels. Default is 2.

  • save (bool, optional) – Whether to save the plot as an image file. Default is False.

  • **kwargs – Additional keyword arguments to be passed to the underlying sns.scatterplot() function. This allows for more flexibility and customization of the scatter plot.

Returns:

Displays the scatter plot using Seaborn.

Return type:

None

Examples

Prepare some data for the examples:

>>> np.random.seed(42)  # For reproducibility
>>> x = np.linspace(0, 10, 50)
>>> y = 2 * x**2 + 3 * x + 1 + np.random.normal(0, 100, 50)
>>> categories = np.where(x < 5, 'A', 'B')
>>> sizes = np.where(x < 5, 30, 60)
>>> df = pd.DataFrame({
...     'X': x,
...     'Y': y,
...     'Category': categories,
...     'Size': sizes
... })
>>> color_palette = {'A': 'red', 'B': 'green'}

Example 1: Create a basic scatter plot with a fixed size for all points:

>>> plot_scatt(df, x='X', y='Y', size=50, alpha=0.7)

Example 2: Create a scatter plot with color coding based on a category and varying point sizes based on a numerical column:

>>> plot_scatt(df, x='X', y='Y', hue='Category', size='Size',
...            size_range=(20, 100))

Example 3: Create a scatter plot with a custom title, color map, axis labels, and legend:

>>> plot_scatt(df, x='X', y='Y', title='Polynomial Trend', hue='Size',
...            palette='viridis', x_label='X Axis', y_label='Y Axis',
...            legend=True)

Example 4: Create a scatter plot with custom x and y limits and axis formats:

>>> plot_scatt(df, x='X', y='Y', x_lim=(0, 8), y_lim=(0, 400),
...            x_format='small_number', y_format='{:,.2f}')

Example 5: Create a scatter plot with varying point sizes based on a numerical column and save it to a file:

>>> plot_scatt(df, x='X', y='Y', size='Size', size_range=(20, 100),
...            title='Scatter Plot', save=True)

Example 6: Create a scatter plot with varying marker styles based on a categorical column:

>>> plot_scatt(df, x='X', y='Y', hue='Category', style='Category')

Example 7: Create a scatter plot with a custom marker style and color palette:

>>> plot_scatt(df, x='X', y='Y', marker='D', hue='Category',
...            palette=color_palette)