Add Data analysis steps: data-cleaning, data-outlier-detection #30

memona008 · 2024-03-19T07:46:02Z

This pull request introduces two new library steps aimed at enhancing data preprocessing and outlier detection capabilities within our project.

Step 1: Data Cleaning

Implemented a data cleaning step capable of handling various parameters:
remove_null: Removes null values from the dataset when enabled.
null_lookup_columns: Specifies which columns are checked for null values.
remove_duplicate_rows: Removes duplicate rows when enabled.
duplicate_lookup_columns: Specifies which columns are compared when looking for duplicates.
clear_formatting: Optionally clears formatting from the dataset for consistency.
output_file_name: Sets the file name and path of the cleaned output.
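
For reviewers, here is a rough sketch of how these parameters might interact. It is not the code in this PR: `clean_data` is a hypothetical helper, pandas is assumed as the backing library, and "clear formatting" is assumed to mean trimming surrounding whitespace in text cells.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def clean_data(df, remove_null=True, null_lookup_columns=None,
               remove_duplicate_rows=True, duplicate_lookup_columns=None,
               clear_formatting=False, output_file_name=None):
    """Hypothetical helper that mirrors the parameter names listed above."""
    out = df.copy()

    if clear_formatting:
        # Assumption: clearing formatting means stripping whitespace around text.
        for col in out.columns:
            if is_object_dtype(out[col]) or is_string_dtype(out[col]):
                out[col] = out[col].map(
                    lambda v: v.strip() if isinstance(v, str) else v
                )

    if remove_null:
        # subset=None checks every column; a list restricts the check.
        out = out.dropna(subset=null_lookup_columns)

    if remove_duplicate_rows:
        # subset=None compares whole rows; a list compares only those columns.
        out = out.drop_duplicates(subset=duplicate_lookup_columns)

    out = out.reset_index(drop=True)

    if output_file_name:
        # Assumption: the cleaned output is written as CSV.
        out.to_csv(output_file_name, index=False)
    return out


raw = pd.DataFrame({
    "name": [" Ann ", "Bob", "Bob", None],
    "age": [31, 45, 45, 27],
})
cleaned = clean_data(raw, clear_formatting=True)
```

In this toy input the row with the missing name and the repeated "Bob" row are dropped, leaving two rows.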

Step 2: Outlier Detection

Developed an outlier detection step employing four methods:

Z-score:

Identifies outliers by how many standard deviations a value lies from the mean.
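
A minimal sketch of the Z-score rule, assuming a single numeric column and a cutoff of 3 standard deviations. The function name and the default threshold are illustrative and may differ from what the step actually uses.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask that is True where |z| exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    std = x.std()
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return np.zeros(x.shape, dtype=bool)
    z = (x - x.mean()) / std
    return np.abs(z) > threshold


data = [10, 12, 11, 13, 12] * 4 + [100]
mask = zscore_outliers(data)
```

Here only the final value (100) is flagged. Note that a single extreme value also inflates the mean and standard deviation, so very small samples may never exceed the threshold.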

IQR (Interquartile Range):

Detects outliers using the range between the first and third quartiles.
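
As an illustration, the usual IQR fence rule can be sketched as follows. The multiplier `k=1.5` is the conventional default and an assumption here, as is the function name.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return a boolean mask that is True outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)


data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 100]
mask = iqr_outliers(data)
```

For this sample Q1 is 11 and Q3 is 12.25, so the fences are 9.125 and 14.125, and only the value 100 falls outside them.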

Isolation Forest:

Uses an ensemble of randomized trees to detect anomalies; anomalous points tend to be isolated in fewer splits than typical ones.
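
A small sketch of this method using scikit-learn's `IsolationForest`. The synthetic data, the `contamination` value, and the choice of scikit-learn are assumptions for illustration, not a description of the step's internals.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic stand-in data: a dense cluster plus two far-away points.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
anomalies = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, anomalies])

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = model.fit_predict(X)  # 1 for inliers, -1 for outliers
is_outlier = labels == -1
```

The `contamination` parameter sets the expected share of outliers and therefore the score cutoff, so it should be tuned to the dataset.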

Autoencoder:

Utilizes deep learning techniques to reconstruct input data, flagging outliers based on reconstruction error.
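
To show the reconstruction-error idea without a deep learning framework, the sketch below trains a small bottleneck network with scikit-learn's `MLPRegressor` to reproduce its own input. This is a stand-in, not the model in this PR; the layer sizes, the synthetic data, and the 95th-percentile cutoff are all assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data: 4 features driven by 2 latent factors plus noise.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 4))

# Standardize each feature so reconstruction errors are comparable.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# The 2-unit middle layer is the bottleneck; the target is the input itself.
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                           max_iter=2000, random_state=0)
autoencoder.fit(X_std, X_std)

recon = autoencoder.predict(X_std)
errors = np.mean((X_std - recon) ** 2, axis=1)  # per-row reconstruction error

# Assumption: flag rows whose error is above the 95th percentile.
threshold = np.quantile(errors, 0.95)
is_outlier = errors > threshold
```

A percentile cutoff always flags a fixed share of rows, so a threshold derived from known-clean data may be preferable in practice.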

Additionally, the step generates the following visualizations to aid outlier analysis and interpretation:

Scatter plot
Box plot
Histogram
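
A rough sketch of how the three plots could be drawn with matplotlib. The function name, the layout, and the use of matplotlib are assumptions, and the outlier mask is hard-coded here for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_outlier_overview(values, mask, path=None):
    """Draw scatter, box, and histogram panels; optionally save to `path`."""
    x = np.asarray(values, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    idx = np.arange(x.size)

    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

    # Scatter plot: value against row index, with outliers in red.
    axes[0].scatter(idx[~mask], x[~mask], label="inlier")
    axes[0].scatter(idx[mask], x[mask], color="red", label="outlier")
    axes[0].set_title("Scatter plot")
    axes[0].legend()

    # Box plot: points beyond the whiskers correspond to the IQR rule.
    axes[1].boxplot(x)
    axes[1].set_title("Box plot")

    # Histogram: shows how far outliers sit from the bulk of the data.
    axes[2].hist(x, bins=20)
    axes[2].set_title("Histogram")

    fig.tight_layout()
    if path:
        fig.savefig(path)
    return fig


values = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 100]
mask = [False] * 11 + [True]
fig = plot_outlier_overview(values, mask)
```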

Add data-cleaning & outlier-detection steps

ef2fbde
