Harnessing ChatGPT for Automated Data Cleaning and Preprocessing




ChatGPT has become a Swiss Army knife that can be used for a multitude of applications, and there’s abundant scope to integrate it into data science workflows.


If you’ve ever trained a machine learning model on a real-world dataset, you know that the steps of data cleaning and preprocessing are important for building robust machine learning models. In this guide, we’ll see how we can use ChatGPT to perform these tasks on an example dataset.


We’ll use the bank marketing dataset from the UCI machine learning repository and prompt ChatGPT to do the following:



  • Fetch and load the dataset

  • Check for missing values

  • Encode categorical variables


You can follow along by signing in to your OpenAI account and starting a new ChatGPT session. If you prefer, you can also code along in Google Colab.


 



 


Fetching the Dataset


There are a few different versions of the dataset. We’ll use bank-full.csv.


 



Prompt: Today we’ll perform data cleaning and preprocessing on a real-world dataset. Write the code to do the following: 1. Fetch the bank marketing dataset from the UCI machine learning repository. Use libraries like requests or urllib. 2. Download the zip file, unzip the contents, and read in the bank-full.csv file into a pandas dataframe (call it data).


 


The code to fetch and read in the dataset is as follows:



import pandas as pd
import urllib.request
import zipfile

# URL of the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"

# Fetch and extract the dataset
urllib.request.urlretrieve(url, "bank.zip")

with zipfile.ZipFile("bank.zip", "r") as zip_ref:
    zip_ref.extractall()

# Load the dataset into a pandas DataFrame
data = pd.read_csv("bank-full.csv", delimiter=";", skipinitialspace=True)


 



 


Understanding the Dataset


Let's understand the dataset better.



# Display the first few rows of the dataset
print(data.head())


 


This outputs:

Truncated output of data.head()

 



Prompt: Use pandas to get the dimension of the dataframe, descriptive statistics on the columns, and data types of various columns.


 


This step doesn’t really require prompting ChatGPT as the pandas methods are pretty simple.



# Get the dimensions of the dataset (rows, columns)
print(data.shape)


 



 


We have 45,211 records and 16 features (the dataframe has 17 columns, inclusive of the output label).



# Get statistical summary of numerical columns
print(data.describe())


 


Truncated output of data.describe()

It’s also helpful to get an overview of the data types of the various columns. We’ll take a closer look at them when we encode categorical variables.



# Check data types of each column
print(data.dtypes)


 



Output >>>

age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object


 


So far we’ve gained an understanding of the bank marketing dataset. The output label denotes whether or not a customer will subscribe to a term deposit. The dataset contains several features such as age, month, education, marital status, the outcome of previous campaigns, and more.


 



 



Checking for Missing Values


Prompt: I want to know the number of missing values in each column. Please give me the code to do it. Use pandas.


 



# Check for missing values in each column
missing_values = data.isnull().sum()
print(missing_values)


 


Here’s the output:



Output >>>

age 0
job 0
marital 0
education 0
default 0
balance 0
housing 0
loan 0
contact 0
day 0
month 0
duration 0
campaign 0
pdays 0
previous 0
poutcome 0
y 0
dtype: int64


 


This version of the bank marketing dataset, with over 45,000 records, does not have any missing values. In practice, though, most real-world datasets have missing values, and you should handle them using suitable imputation techniques.


As an optional exercise, you can add a step here prompting ChatGPT to drop a small fraction of values from a subset of columns so you can practice handling missing values; a minimal sketch of that exercise follows.
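Here’s one possible sketch of that exercise, assuming we blank out about 5% of the values in the ‘age’ and ‘job’ columns (an arbitrary choice) and then impute the numeric column with the mean and the categorical column with the mode:


import numpy as np

# Work on a copy so the original dataframe stays intact
practice = data.copy()
rng = np.random.default_rng(42)

# Randomly blank out ~5% of the values in two example columns
for column in ['age', 'job']:
    mask = rng.random(len(practice)) < 0.05
    practice.loc[mask, column] = np.nan

print(practice.isnull().sum())

# Impute: mean for the numeric column, mode for the categorical column
practice['age'] = practice['age'].fillna(practice['age'].mean())
practice['job'] = practice['job'].fillna(practice['job'].mode()[0])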


 



 


Encoding Categorical Variables


The next step is to encode the categorical variables in the dataset. We’ll start by getting the list of all categorical columns.


 



Prompt: Give the code to get the list of all categorical columns in this dataset.


 



# Count the number of categorical variables
categorical_columns = data.select_dtypes(include=['object']).columns
print(categorical_columns)
num_categorical_variables = len(categorical_columns)

# Display the count
print("Number of categorical variables:", num_categorical_variables)


 


The categorical columns list also includes the output label y. But let's focus on the other categorical variables first.



Output >>>

Index(['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y'],
      dtype='object')
Number of categorical variables: 10


 


A Closer Look at the Values of Categorical Variables


 


Next, let’s see the values that each categorical variable takes.


 



Prompt: I want to understand the various values that each categorical variable takes. Give the pandas code to do it.


 



# Get the different values and their counts for each categorical column
categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

for column in categorical_columns:
    value_counts = data[column].value_counts()
    print(f"Column: {column}")
    print(value_counts, "\n")


 


The output is long and hard to parse; here’s a truncated version:



Output >>>

Column: job
blue-collar 9732
management 9458
technician 7597
admin. 5171
services 4154
retired 2264
self-employed 1579
entrepreneur 1487
unemployed 1303
housemaid 1240
student 938
unknown 288
Name: job, dtype: int64

...
...
...
Column: poutcome
unknown 36959
failure 4901
other 1840
success 1511
Name: poutcome, dtype: int64


 


Visualizing the Values of Categorical Variables


 


Let’s prompt ChatGPT to create a visualization so it’s easier to understand what values the categorical variables take.


 



Prompt: I want a plot that shows the values that each categorical variable takes. Exclude the output label y. Include all other categorical variables.


 


And here we go!



import matplotlib.pyplot as plt

# Get the different values and their counts for each categorical column
categorical_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']

# Set up the subplot grid
fig, axes = plt.subplots(3, 3, figsize=(10, 8))

# Iterate over each categorical column and plot the value counts
for i, column in enumerate(categorical_columns):
    row = i // 3
    col = i % 3
    value_counts = data[column].value_counts()
    ax = axes[row, col]
    value_counts.plot(kind='bar', ax=ax)
    ax.set_xlabel(column)
    ax.set_ylabel('Count')
    ax.set_title(f'{column.title()} Value Counts')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()


 


Values of Categorical Variables

One-Hot Encoding


 


For a subset of categorical columns, we can use one-hot encoding.


 



Prompt: Give the pandas code to perform one-hot encoding for a list of columns.


 


We use get_dummies from pandas to one-hot encode the following columns:



# Select the categorical columns (excluding education and poutcome)
categorical_columns = ['job', 'marital', 'default', 'housing', 'loan', 'contact']

# Perform one-hot encoding
encoded_data = pd.get_dummies(data, columns=categorical_columns)


 



Prompt: I want to print out the first few rows for the newly added columns. Write the code to do the same.


 



# Select the newly added columns
new_columns = encoded_data.columns.difference(data.columns)

# Print the head of the newly added columns
print(encoded_data[new_columns].head())


 


Truncated output of encoded_data[new_columns].head()

Defining a Custom Mapping


 


For variables like ‘education’ and ‘poutcome’ (previous outcome), rather than one-hot encoding, it would be better to use an ordinal mapping so that the encoded values preserve a degree of comparison between categories.


Also, one-hot encoding a categorical variable that takes k distinct values creates k new columns. For a categorical column like ‘month’, this is excessive without adding much information.


So for both of these cases, we will define a custom mapping and then transform the values. We can prompt ChatGPT to get a generic code snippet which we can then modify to define the mapping for the specific columns.


 



Prompt: I want to do a custom mapping for encoding some categorical variables. First give me a generic code snippet that defines a custom mapping from unique values in a categorical column to a different set of output values. We should then use this mapping to transform the values in the columns.


 



# Define the custom mapping dictionary
custom_mapping = {
    'category1': 'value1',
    'category2': 'value2',
    'category3': 'value3',
    # Add more categories and corresponding values as needed
}

# Specify the categorical column
categorical_column = 'column_name'  # Replace 'column_name' with the actual column name

# Perform the mapping using the custom mapping dictionary
data[categorical_column] = data[categorical_column].map(custom_mapping)


 


For the ‘month’ column, let us transform the month strings like so:



# Define the mapping dictionary for months
month_mapping = {
    'jan': 1,
    'feb': 2,
    'mar': 3,
    'apr': 4,
    'may': 5,
    'jun': 6,
    'jul': 7,
    'aug': 8,
    'sep': 9,
    'oct': 10,
    'nov': 11,
    'dec': 12
}

# Map the values in the month column
encoded_data['month'] = encoded_data['month'].map(month_mapping)


 


Let’s map the ‘poutcome’ and ‘education’ columns to numerical values as shown:



# Define the custom mapping for poutcome and education
poutcome_mapping = {
    'unknown': 0,
    'failure': 1,
    'other': 2,
    'success': 3
}

education_mapping = {
    'unknown': 0,
    'primary': 1,
    'secondary': 2,
    'tertiary': 3
}

# Perform ordinal encoding using pandas map
encoded_data['poutcome'] = encoded_data['poutcome'].map(poutcome_mapping)
encoded_data['education'] = encoded_data['education'].map(education_mapping)

# Select the newly encoded columns
new_columns = ['month', 'poutcome', 'education']

# Print the head of the newly encoded columns
print(encoded_data[new_columns].head(10))


 


Encoding the Output Label


 


Let's also map the output label y, replacing the values ‘yes’ and ‘no’ with 1 and 0, respectively:



encoded_data['y'] = encoded_data['y'].replace({'no': 0, 'yes': 1})
print(encoded_data['y'])


 



Output >>>

0 0
1 0
2 0
3 0
4 0
..
45206 1
45207 1
45208 1
45209 0
45210 0
Name: y, Length: 45211, dtype: int64


 


Recall that we used one-hot encoding for the ‘housing’, ‘default’, and ‘loan’ columns. Because these columns take only ‘yes’ and ‘no’ values, you can instead map them directly to 1 and 0, as in the sketch below.
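A minimal sketch of that alternative, assuming you apply it to data before calling get_dummies (and drop these columns from the list passed to get_dummies):


# Map binary yes/no columns directly to 1/0 instead of one-hot encoding them
binary_columns = ['default', 'housing', 'loan']
for column in binary_columns:
    data[column] = data[column].map({'no': 0, 'yes': 1})


This keeps one column per binary variable instead of the pair of dummy columns that get_dummies would create.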


 


Visualizing the Distribution of Class Labels


 


It’s also helpful to check the distribution of class labels so that we can account for class imbalance when building models if needed.


 



Prompt: I would like to understand the distribution of class labels. Please give me the code to generate a suitable plot for the same.


 


Here’s the code to generate a countplot in Seaborn:



import seaborn as sns
import matplotlib.pyplot as plt

# Plot the distribution of output labels
sns.countplot(data=data, x='y')
plt.xlabel('Output Label')
plt.ylabel('Count')
plt.title('Output Label Distribution')
plt.show()


 


Distribution of Class Labels

We see that there is class imbalance. So if you are building a model on this dataset, be sure to address the imbalance using appropriate resampling methods; a minimal sketch follows.
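As one simple illustration (not from the original prompts), here’s a pandas-only sketch that upsamples the minority class by duplicating its rows with replacement. In practice, you would do this only on the training split to avoid leaking duplicated rows into the test set:


# Separate the majority and minority classes
majority = encoded_data[encoded_data['y'] == 0]
minority = encoded_data[encoded_data['y'] == 1]

# Upsample the minority class to match the majority class size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle the balanced dataset
balanced_data = pd.concat([majority, minority_upsampled]).sample(frac=1, random_state=42)
print(balanced_data['y'].value_counts())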


 



 


A Generic Data Cleaning and Preprocessing Pipeline


Though data cleaning and preprocessing require effort from the developer, let’s try to get a generic sequence of steps that works reasonably well for a simple dataset.


 



Prompt: Can you give me a generic data cleaning and preprocessing pipeline based on what we’ve done so far? Get basic info on the dataset, check for and handle missing values, identify categorical columns, and encode categorical columns. Use only pandas.


 


And here it is:



import pandas as pd

# Load the dataset into a DataFrame
data = pd.read_csv('data.csv')

# Step 1: Basic Info
print(data.info())

# Step 2: Missing Values
missing_values = data.isnull().sum()
print(missing_values)

# Step 3: Handle Missing Values
# Identify columns with missing values
columns_with_missing_values = missing_values[missing_values > 0].index

# Impute missing values using fillna() or another suitable method
# (mean imputation applies to numeric columns only)
data[columns_with_missing_values] = data[columns_with_missing_values].fillna(data[columns_with_missing_values].mean(numeric_only=True))

# Step 4: Get Categorical Columns
categorical_columns = data.select_dtypes(include=['object']).columns

# Step 5: Encode Categorical Columns
# Perform ordinal encoding or one-hot encoding using pandas' get_dummies() function
encoded_data = pd.get_dummies(data, columns=categorical_columns)

# Print the encoded data
print(encoded_data.head())


 



 


Wrapping Up


As seen, data science expertise in synergy with ChatGPT can help make data cleaning and preprocessing simpler and faster. Now that you have the preprocessed dataset ready, you can take this further by building a simple predictive model on it; a minimal sketch follows.
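For instance, here’s one possible sketch using scikit-learn’s logistic regression (my choice of model; any classifier would do), trained on the encoded_data dataframe we built above:


from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Separate the features and the output label
X = encoded_data.drop(columns=['y'])
y = encoded_data['y']

# Stratified split to preserve the class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight='balanced' is one simple way to account for the class imbalance
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))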


If interested, you can also explore how to leverage ChatGPT for data exploration.


 



 


The bank marketing dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.


Moro, S., Rita, P., and Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306.


 
 
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.


 




