Exploratory Data Analysis (EDA) with hypothesis testing for beginners — Part II (With code and example)
The first part of the blog touched upon topics related to
1. Data Inspection and wrangling — An understanding of basic data inspection and manipulation techniques
2. Univariate analysis — How to understand data related to a single column i.e., by quantifying the central tendency, spread and visualizing the variables.
These topics can be refreshed using this link.
In this blog, the below topics shall be covered with the help of examples from datasets, code snippets, outputs and inferences. All of the material shall be made available on GitHub.
3. Multivariate analysis — Understanding the relationship between two or more variables with help of statistical measures and visualization techniques
4. Hypothesis testing — Proving our intuition and instincts about the relationship between variables using statistical tools.
Now, to continue this journey of exploring data…
1. Multivariate analysis
Multivariate analysis is the analysis of two or more variables to identify relationships, dependencies and trends. Univariate analysis covered single-column statistics and visualization; however, in real-world scenarios we are almost always faced with datasets consisting of more than one column. I will go through some techniques of multivariate analysis via the different combinations of variable types we may face:
- Relationship between a Categorical variable and another Categorical variable
- Relationship between a Categorical variable and a Numerical variable
- Relationship between a Numerical variable and another Numerical variable
Dataset — 1
For the first part of this article, a publicly available toy dataset taken from Kaggle is used. The metadata for the dataset is as below.
Income dataset
Dimension of the dataset: There are a total of 150,000 observations and 6 variables, which are described below
‘Number’: A simple index number for each row
‘City’ (Categorical variable): The location of a person (Dallas, New York City, Los Angeles, Mountain View, Boston, Washington D.C., San Diego and Austin)
‘Gender’ (Categorical variable): Gender of a person (Male or Female)
‘Age’ (Numerical variable): The age of a person (Ranging from 25 to 65 years)
‘Income’ (Numerical variable): Annual income of a person (Ranging from -674 to 177175)
‘Illness’ (Categorical variable): Is the person Ill? (Yes or No)
1. Categorical — Categorical variable relationship
The relationship between 2 categorical variables is not immediately apparent, unlike cases where a number can be read off directly against a categorical or numerical variable. Hence it is essential to summarize the columns to understand the trends. This is shown below with the help of two measures.
i. Contingency table
The contingency table shows the frequency count of a category from one variable with respect to a category from another variable. This is a powerful measure as a simple inspection of this can give the analyst a lot of ideas regarding the relationship of the variables. This measure exposes the concentration of categorical values with respect to other categorical values.
The Python code to create a contingency table is below, which considers the two categorical variables City and Gender for the analysis
import pandas as pd
#loading dataframe from csv dataset
df = pd.read_csv('toy_dataset.csv')
#creating contingency table for the variables city and gender
contingency_table = pd.crosstab(df.City, df.Gender)
The output for the above code is:
Inference: Some of the observations that can be made from the above table are
- New York has the most data points in the dataset, with the majority being male
- San Diego has the fewest data points
- In all the cities, the number of males is higher than the number of females, highlighting the gender imbalance in the dataset
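As a side note, pd.crosstab can also append the row and column totals, which makes checks like the above quicker; a minimal sketch:
#margins=True adds an 'All' row and column containing the totals
contingency_table_totals = pd.crosstab(df.City, df.Gender, margins=True)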
ii. Marginal proportions
This measure is closely related to the contingency table: each frequency count is divided by the total number of observations, expressing the relation between categorical values as a proportion. These proportions can put the numbers into better perspective while understanding the relation.
The Python code to create the marginal proportions table is below, which considers the same 2 categorical variables, City and Gender, for the analysis
#formula to convert contingency table into marginal proportions
marginal_proportions = (contingency_table/len(df))*100
The output for the above code is:
Inference: Most of the observations reaffirm the relations already derived from the contingency table. However, the same observations are more readily apparent in this view and can be arrived at quickly.
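As a shortcut, pd.crosstab can compute these proportions directly through its normalize parameter, avoiding the manual division; a minimal sketch:
#normalize='all' divides every cell by the grand total of observations
marginal_proportions = pd.crosstab(df.City, df.Gender, normalize='all') * 100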
2. Categorical — Numerical variable relationship
This measure — as the title suggests — helps understand the relationship between a categorical and a numerical variable. This relationship can be understood with statistical measures and visual representations which are shown below.
i. Mean and median difference
This measure helps in understanding the degree of association between the categories of a categorical variable with respect to a numerical variable. In general, the larger the mean/median difference (relative to the spread of the data), the stronger the association between the variables. It is found by calculating the mean of the numerical variable for each category in the categorical variable and taking the difference between them.
The Python code to find the mean difference is below, which considers the categorical variable Gender and the numerical variable Age
import numpy as np
#separating the age values for each gender into a separate series
male_age = df.Age[df.Gender == 'Male']
female_age = df.Age[df.Gender == 'Female']
#calculate means for each group:
male_age_mean = np.mean(male_age)
female_age_mean = np.mean(female_age)
#calculating the difference in mean for both genders
mean_diff = male_age_mean - female_age_mean
#print mean difference
print(f'Mean difference: {mean_diff}')
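The median difference mentioned above can be computed in the same way, and is more robust when the data is skewed; a minimal sketch:
#calculating the difference in median age for both genders
median_diff = np.median(male_age) - np.median(female_age)
print(f'Median difference: {median_diff}')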
The output is as below
Inference: The output shows a very small difference in mean Age between the 2 categories of Gender. Therefore, it can be concluded that Gender and Age are not strongly related.
ii. Side by side box plots
This visualization compares the spread of a numerical variable across the categories of a categorical variable. Combined with the mean difference, it can help conclude with more certainty whether there is a strong or weak association between the variables, since the box plots put the size of the difference in context with the spread of the data.
The Python code to plot the side-by-side box plot is below, which considers the categorical variable City and the numerical variable Income
import seaborn as sns
import matplotlib.pyplot as plt
#plotting box plot for the variables city and income
ax = sns.boxplot(data=df, x='City', y='Income')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
The output is as below
Inference: The above plot helps derive a more definitive conclusion about the relationship between the variables, for example:
- The income range varies heavily with respect to the city, hence it can be said that these two variables are closely related.
- Mountain View has the highest income range and also the highest median income among the cities considered.
- Dallas has a low income range and the lowest median income of all the cities considered.
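The medians read off the plot can be cross-checked numerically with a groupby; a minimal sketch:
#median income per city, sorted to make the extremes easy to spot
print(df.groupby('City')['Income'].median().sort_values())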
3. Numerical — Numerical variable relationship
This measure helps in understanding the relationship between two numerical variables. The relationship can be quantified with statistical measures, and trends can be seen with the help of trend lines in a visual representation, both of which are discussed in detail below.
i. Correlation coefficient
The correlation coefficient is a statistical measure of the degree to which a change in one variable can predict the change in another. The output of this measure ranges from -1 to +1, with -1 indicating a perfect negative correlation, +1 a perfect positive correlation, and 0 no correlation.
The Python code to find the correlation coefficient is below, which considers the numerical variables Age and Income
#finding correlation between the variables age and income
corr_age_income = df.Age.corr(df.Income)
print(f'The correlation coefficient value is {corr_age_income}')
#Note: pandas' corr() function uses the Pearson correlation coefficient by default
The output is as below
Inference: Since the output of this measure is almost 0, it can be said that there is no strong correlation or association between the variables Age and Income.
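If a monotonic but non-linear relationship is suspected, the Spearman rank coefficient can be requested instead of the default Pearson; a minimal sketch:
#method='spearman' computes the rank-based correlation coefficient
corr_spearman = df.Age.corr(df.Income, method='spearman')
print(f'The Spearman correlation coefficient value is {corr_spearman}')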
ii. Scatter plot
This is a visualization technique which can help reveal trend lines and patterns in the data (if present) by plotting one numerical variable along the x-axis and the other along the y-axis.
The Python code to plot the scatter plot is shown below, which considers the numerical variables Age and Income
#plotting scatterplot between the numerical variables income and age
sns.scatterplot(data=df, x='Income', y='Age')
plt.show()
The output is as below
Inference: The above plot shows no apparent association between the variables Age and Income. Hence this confirms the output obtained with the correlation coefficient statistic.
An example of a scatter plot where a trend line, and hence an association between the numerical variables, is clearly visible is shown below.
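For reference, seaborn's regplot overlays a fitted trend line on the scatter plot, which makes any linear association (or, as here, its absence) easier to judge; a minimal sketch on the same variables:
#regplot draws the scatter plot together with a fitted regression line
sns.regplot(data=df, x='Income', y='Age')
plt.show()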
2. Hypothesis testing
Hypothesis testing is a very powerful tool in statistics. With the aid of hypothesis testing, intuitions about the association of different variables in the data can be confirmed, and we can check whether a sample is representative of the population or not.
In hypothesis testing, two hypotheses are generally formed: the Null hypothesis and the Alternative hypothesis, where the Null hypothesis contradicts the assumption made in the Alternative hypothesis. We set a significance threshold and the test ends in one of two possible scenarios:
i. Reject the null hypothesis — indicates that, at the defined threshold, our sample provides sufficient evidence that the effect exists
ii. Fail to reject the null hypothesis — indicates that, at the defined threshold, our sample does not provide sufficient evidence that the effect exists
The wording of these interpretations has to be carefully noted, as we never claim to directly reject or accept the alternative hypothesis; our test was not designed to do that. The output of a hypothesis test is generally a statistical value called the p-value, which is the probability of observing data at least as extreme as our sample, assuming the null hypothesis is true. A threshold is preset (generally 5%): if the p-value is less than 0.05 or 5% we reject the null hypothesis, and if it is more than 0.05 we fail to reject it.
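In code, this decision rule is a simple comparison; a minimal sketch, assuming pval holds the output of a test:
#significance threshold (alpha) of 5%
alpha = 0.05
if pval < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')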
To give more context, this is explained below with three different hypothesis testing techniques and examples.
Dataset — 2
The dataset is publicly available and was downloaded from Kaggle; the metadata is below.
California housing dataset
Dimension of the dataset: There are 20,641 observations and 10 variables. The variables are
‘longitude’: block group longitude
‘latitude’: block group latitude
‘housing_median_age’: median house age in block group
‘total_rooms’: block group total rooms
‘total_bedrooms’: block group total bedrooms
‘population’: block group population
‘households’: block group total households
‘median_income’: median income in block group
‘median_house_value’: median house value in block group
‘ocean_proximity’ (Categorical variable): location of the block group with respect to the ocean (<1H OCEAN, INLAND, ISLAND, NEAR BAY or NEAR OCEAN)
1. One sample t-test
This test helps confirm whether the mean of a sample is comparable to a known population value. As we are dealing with census data from 1990, an obvious question regarding house prices is: does the sample still represent the current scenario? To confirm that the sample does not represent the situation as of 2021, a one sample t-test can be used. The null and alternative hypotheses for this would be
Null hypothesis: The mean house value of the 1990 sample dataset and the house value in 2021 are equal
Alternative hypothesis: The mean house value of the 1990 sample dataset and the house value in 2021 are different.
The mean value of houses in California for the year 2021 is found to be $786,700 (source: Norada Real Estate).
The test can be run with Python as shown below
from scipy.stats import ttest_1samp
import numpy as np
#applying the function for one sample t test to obtain p value
tstat, pval = ttest_1samp(df_house.median_house_value, 786700)
print(f'the resulting p-value for the one sample t-test is {pval}')
The output is as below
Inference: The result is a p-value of (effectively) 0, which is less than the significance threshold of 0.05, so we reject the null hypothesis. Hence our alternative hypothesis is supported: the mean house value of the sample dataset and the house value in 2021 are different.
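A quick look at the sample mean makes this result unsurprising, since it sits far below the 2021 figure; a minimal sketch:
#mean house value in the 1990 sample, for comparison against the 2021 value
print(f'Sample mean house value: {df_house.median_house_value.mean()}')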
2. Two sample t-test
This test helps confirm whether the means of 2 samples are equal to each other. As opposed to the one sample t-test, where one sample is compared to a population mean, here 2 different samples are compared. Based on the dataset, a question can be raised about house prices with respect to ocean proximity. Hence we split the dataset on the 2 categories ‘Near Bay’ and ‘Near Ocean’ and run the two sample t-test. The hypotheses for this are
Null hypothesis: The mean of the sample with ocean proximity ‘Near Bay’ and ‘Near Ocean’ are equal
Alternative hypothesis: The mean of the sample with ocean proximity ‘Near Bay’ and ‘Near Ocean’ are not equal
The test can be run with Python as shown below
from scipy.stats import ttest_ind
#splitting the dataset into the two ocean proximity groups
df_house_near_ocean = df_house[df_house.ocean_proximity == 'NEAR OCEAN']
df_house_near_bay = df_house[df_house.ocean_proximity == 'NEAR BAY']
#applying the two sample t-test function
tstat, pval = ttest_ind(df_house_near_ocean.median_house_value, df_house_near_bay.median_house_value)
print(f'the resulting p-value for the two sample t-test is {pval}')
The output is as below
Inference: The result is a p-value of 0.0051, which is less than the significance threshold of 0.05, so we reject the null hypothesis. Hence our alternative hypothesis is supported: the mean house values of the samples with ocean proximity ‘Near Bay’ and ‘Near Ocean’ are not equal.
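Note that ttest_ind assumes equal variances in the two groups by default; if that assumption is doubtful, passing equal_var=False runs Welch's t-test instead. A minimal sketch:
#equal_var=False switches to Welch's t-test, which does not assume equal variances
tstat, pval = ttest_ind(df_house_near_ocean.median_house_value, df_house_near_bay.median_house_value, equal_var=False)
print(f"the resulting p-value for Welch's t-test is {pval}")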
3. Chi-square test
This test can help determine whether 2 categorical variables are associated with each other. As the housing dataset does not contain 2 usable categorical variables, we go back to the income dataset, which has 2: ‘City’ and ‘Gender’. A reasonable assumption regarding these 2 variables is that they are not related to each other, as an association would not make sense in the usual case. To confirm this assumption, the chi-square test is the right candidate as it tests for association between 2 categorical variables. The hypotheses for this would be
Null hypothesis: There is no significant association between the 2 variables — City and Gender
Alternative hypothesis: There is a significant association between the 2 variables — City and Gender
The test can be run with Python as shown below
from scipy.stats import chi2_contingency
#creating contingency table for the variables city and gender
contingency_table = pd.crosstab(df.City, df.Gender)
#applying chi square function to obtain p value by passing contingency table to the function
res = chi2_contingency(contingency_table)
print(f'the resulting p-value for the chi square test is {res.pvalue}')
The output is as below
Inference: The result is a p-value of 0.5821, which is greater than the significance threshold of 0.05, so we fail to reject the null hypothesis. Hence the data is consistent with the null hypothesis: there is no significant association between the 2 variables City and Gender.
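As a sanity check, the expected cell counts under the independence assumption are also available on the result object (on recent SciPy versions, which return a result object rather than a plain tuple); a minimal sketch:
#expected frequencies if City and Gender were perfectly independent
print(res.expected_freq)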
The code and datasets discussed above are available on GitHub.