Exploratory Data Analysis (EDA) for beginners — Part I
Exploratory data analysis is the first step for an analyst to understand what the data is trying to tell us. The steps of procedure and analysis that is discussed in this introductory article includes some simple and powerful statistical methods which can give us good insights on understanding the various facts, trends and insights from the data being analyzed.
As the title suggest the aim of Exploratory Data Analysis (EDA) is to go on an exploration journey on the data based on our instincts and confirm our intuition with proven methods. In this article I will be using python to perform the exploratory analysis on a housing dataset collected online manually and loaded into an excel
I will be using 4 basic and important analytic procedures used in Exploratory Data Analysis (EDA) of a dataset, these topics will be divided into 2 articles.
- Data inspection & wrangling (Part 1) —
- Inspecting data
- Handling duplicates
- Handling missing values
2. Univariate analysis — Analysis of single column(variable) in a dataset (Part 1)
- Descriptive statistics
- Central tendency for quantitative data
- Spread for quantitative data
- Visualizing quantitative data
- Central tendency for categorical data
- Spread for categorical data
- Visualizing categorical data
3. Multivariate analysis — Analysis of two or more columns(variables) in a dataset together (Part 2)
4. Hypothesis testing (Part 2)
1. Data inspection & wrangling
A simple inspection of the dataset can give a lot of ideas and questions to an analyst, which are very important to begin the exploratory journey. The ideas help form foundation from which we will be branching out to more ideas and intuitive questions regarding the data. The python code
import pandas as pd
#Loading our dataset using read_csv function of pandas
df = pd.read_csv('Housing_Data.csv')
#Inspecting the first 10 rows of the dataset
print(df.head(10))
Looking at the above output some of the inferences that can be generated are mainly regarding the variables i.e., total variables, type of variables and to see if there are any patterns apparent. Some inferences we can immediately draw are
- There are total 17 variables present in the dataset
- Categorical variables — bhk, bathrooms, balconies, furnishing_type, location, property_type
- Numerical variables — price, area_sqft, floors
It is important to understand the variables which are being analyzed here to ensure we can apply the right statistical methods when we try to understand more about them and the relation between them.
Data wrangling and tidying is one of the most important aspects of analyzing data, performing this step ensures we don’t end up with issues going ahead while performing operations and also clean data can always be trusted to give right insights without errors. Here we consider 2 aspects of data tidying i.e., Identifying duplicate entries and handling missing values.
Handling duplicates
#Printing total duplicate rows
print(df.duplicated().sum())
#To remove the identified duplicate rows and verify the total entries before and after dropping
print('Shape of dataframe before dropping duplicates' + str(df.shape))
df = df.drop_duplicates()
print('Share of dataframe after dropping duplicate' + str(df.shape))
Handling missing values
Data is rarely found in an ideal cleaned up format, more often we usually find data having values for some of the variables missing. Hence it is essential to understand how to handle this situation. The common approach to this problem is by replacing the null value with the central tendency of the variable column i.e., with the mean, median or mode. For quantitative variables we can use either of the above mentioned central tendencies statistics, however for categorical variables mode is used to replace the missing values. There are lot of sophisticated methods which can be employed to replace the missing values such as machine learning, regression, imputation etc. The scope of this article will include only simple replacement using central tendencies, the above methods can also be used.
It is also worth noting that if the dataset is huge consisting of thousands of rows and the rows consisting of missing values are relatively less, then the rows consisting these missing values can be dropped directly, since the impact from doing this is negligible.
Below is the code snippet to summarize the variables having null values and replacing it with appropriate central tendency value.
#Checking for total number of null values in each column
print(df.isnull().sum())
As we can see with the output image on the left, there are 3 variables with null values
area_sqf(Numerical), bathrooms(Categorical) & furnishing_type(Categorical)
Which consists of missing values. We handle this as shown below
#We replace missing values of bathrooms column with mode as it is a categorical variable
df['bathrooms'].fillna(df.bathrooms.mode()[0],inplace=True)
#We replace missing values of area_sqft column with mode as it is a numerical variable
df['area_sqft'].fillna(df.area_sqft.mean(),inplace=True)
#We replace missing values of furnishing_type column with mode as it is a categorical variable
df['furnishing_type'].fillna(df.furnishing_type.mode()[0],inplace=True)
2. Univariate analysis
Summarizing variable present in our dataset separately with numbers and visualization describing the central tendency and spread using which we can draw useful insights and trends primarily forms part of the univariate analysis. I will be going through the process below with our loan applications dataset using python.
Descriptive statistics
Pandas module in python comes with a very useful function which can generate instantly the descriptive statistics of a dataset for all columns and put the output in the form of a table.
#Function to describe the dataset with basic summary statistics
print(df.describe())
#To include categorical variables in the output
print(df.describe(include = 'all'))
Central tendency for quantitative data
Taking the column — price to calculate various central tendency measurements we use
Mean is the average value of the variable, calculated as the sum of all values divided by the number of values
mean_price = df.price_lac.mean()
print(mean_price)
#Output = 56.48
Mode is the most frequent value of the variable, we can use below code in python to find its value
The middle value of the variable when sorted, we can use below code in python to find its value
mode_price = df.price_lac.mode()[0]
print(mode_price)
#Output = 55
Median is the middle value of the variable when sorted
median_price = df.price_lac.median()
print(median_price)
#Output = 48
Spread for quantitative data
I use the same column price_lac to find measure of spread using below descriptive statistics
Range is the difference between the maximum and minimum values of a variable
#RANGE - The difference between the maximum and minimum values of a variable
range_price = df.price_lac.max() - df.price_lac.min()
print(range_price)
#Output = 81000
InterQuartile Range(IQR) is the difference between the 75th and 25th percentile values.
#We import iqr function from scipy stats to calculate iqr value for the variable
from scipy.stats import iqr
iqr_price = iqr(df.price_lac)
print(iqr_price)
#Output = 32
Variance is the average of the squared distance from each data point to the mean
var_price_lac = df.price_lac.var()
print(var_price_lac)
#Output = 1508.67
Standard deviation is the square root of variance
stddev_price_lac = df.price_lac.std()
print(stddev_price_lac)
#Output = 38.84
Mean absolute deviation (MAD) is the mean absolute value of the distance between each data point and the mean.
mad_price_lac = df.price_lac.mad()
print(mad_price_lac)
#Output = 27.44
Visualizing quantitative data
The above numbers from the summary statistics can help generate a rough idea about our variable under consideration, however visualizing this can give an even more clearer picture about the variable for the interpretation of the analyst. Hence visualization forms a powerful element of our Exploratory Data Analysis.
Usually to understand the nature of a numerical variable in a dataset we use histogram or boxplot, I am using seaborn library built on top of matplotlib in python to plot these 2 graphs for “price_lac” column
#Plotting Histogramfrom matplotlib.pyplot import figure
import seaborn as sns
#Setting figure dimensions
figure(figsize=(6, 4), dpi=200)
#Plotting histogram
sns.histplot(x = 'price_lac', data = df, color = 'Blue')
plt.show()
plt.close()
Inference: Histogram shows us the density or concentration of the values for the variable being analyzed with frequency on y axis against the variable values of the variable in x axis. We can see here that majority of our values for price lies from 25 to 75 lacs INR, this shows an increased availability and can also imply the increased demand of properties in this price range.
#Plotting Boxplotfrom matplotlib.pyplot import figure
import seaborn as sns
#Setting color palette
sns.set_palette("BuPu")
#Setting figure dimensions
figure(figsize=(6, 4), dpi=200)
#Plotting boxplot
sns.boxplot(x = 'price_lac', data = df)
plt.show()
plt.close()
Inference: Boxplot gives us lot of information regarding the spread of the quantitative variable, the various features of boxplot are the central line which is the median, edges of the box being 1st and 3rd quartile, lines outside the box showing maximum(Q3 + 1.5*IQR) and minimum(Q3-1.5*IQR) and points outside the lines depicting the outliers. Based on the vast amount of data depicted in the boxplot we can again confirm that the spread of monthly income is mainly concentrated between 25 and 75 lac INR it is also worthwhile noting that there are outliers present in the data spanning from 100 to 175 lac INR.
Central tendency for categorical data
When it comes to central tendency for categorical data we have limited options from summary statistics, the most commonly used measure is mode
For our analysis I am taking the categorical variable “property_type” from our dataset
#MODEmode_prop_type = df.property_type.mode()[0]
#Output
'Apartment'
Spread for categorical data
To understand spread of categorical data we can use the frequency or the proportions of the various categories present in the data.
#VALUE COUNTS#finding frequency of the categories present in race column
count_propty_type = df.property_type.value_counts()
print(count_prop_type)
#output
Apartment 55
Independept house 37
Builder floor 2
Name: property_type, dtype: int64#VALUE PROPORTIONS#This function will print out the proportion of each categorical variable
propty_prop_type = df.property_type.value_counts(normalize = True)
print(propty_prop_type)
#Output
Apartment 0.585106
Independent house 0.393617
Builder floor 0.021277
Name: property_type, dtype: float64
Visualizing categorical data
To visualize categorical data we mainly use bar chart or pie charts(not recommended for over 4 categories) to visualize the count or proportion of the categories in our variable
#Plotting bar chartfrom matplotlib.pyplot import figure
import seaborn as sns
figure(figsize=(6, 4), dpi=200)
#Setting the color sche,e for the histplot
sns.set_palette("BuPu_r")
#Calling histplot function from seaborn to visualize frequency of race
sns.countplot(data = df, x = 'property_type')
plt.show()
plt.close()
Inference: By looking at this graph it is clear that apartment and independent houses are the most available options among the property types as this is natural since builder floors are usually created one per apartment and is usually not built at all. Out of the other 2 categories we can see higher percentage of apartments available, this shows a general trend of population moving towards apartments which may be due to reasons such as community, safety, view and amenities which are generally offered.
This ends the univariate analysis and data wrangling part of our exploratory data analysis where we went through important concepts such as
- Data inspection
- Handling missing values
- Handling duplicate rows
- Descriptive statistics
- Measure of central tendency for quantitative variables
- Measure of spread for quantitative variables
- Visualizing quantitative variables and interpretation
- Measure of central tendency for categorical variables
- Measure of spread for categorical variables
- Visualizing categorical variables and interpretation
This concludes part 1 of my article on approach exploratory data analysis for beginners, in part 2 I will be going through multivariate analysis and hypothesis testing.
References
Jupyter notebook for the EDA and the dataset