EDA for Beginners with Python: Life Expectancy and GDP

Soo Reed
Jan 29, 2022
Updated: Oct 3, 2022

I got the Life Expectancy and GDP data file from Codecademy to practice Exploratory Data Analysis (EDA). All I know about the data is that it has to do with GDP and life expectancy, so let’s use it to walk through an EDA together. You can just read through this post, but I highly recommend downloading/copying the Jupyter notebook file here at GitHub, reading the actual code I wrote, and playing with the data and the parameters. Let’s dive in!

EDA step by step

Import libraries and load the data

First, I am going to import a few libraries that I often use, then load the data with pd.read_csv().
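
The notebook code is not reproduced in this post, so here is a minimal sketch of this step. The file name all_data.csv and the column names Life_expectancy and GDP are assumptions (only Country and Year are named in the text), and the few rows below are made-up placeholders so the snippet runs without the real file.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# With the real file you would write something like:
#   df = pd.read_csv("all_data.csv")   # hypothetical file name
# Stand-in CSV with made-up placeholder rows, so this snippet runs on its own.
csv_text = """Country,Year,Life_expectancy,GDP
Country A,2000,70.0,1.0e11
Country A,2001,70.5,1.1e11
Country B,2000,50.0,5.0e9
Country B,2001,50.8,5.2e9
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (number of rows, number of columns)
```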

Data inspection

Since I am not familiar with the data, I need to inspect it using df.head(), df.info(), and df.describe().

I see that Country has 6 unique values, so I will use df.Country.unique() to see what they are. I also see that Year runs from 2000 to 2015. There are 96 data points (6 countries × 16 years), which is not big. The data types are clean: Year is an integer, life expectancy and GDP are floats, and Country is an object, which is how pandas stores strings.
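
As a rough sketch, the inspection calls look like this. The DataFrame here is a made-up placeholder (two fictional countries, four years), and the column names Life_expectancy and GDP are assumptions. Passing include="all" to describe() is one way to get the unique count for a text column; the notebook may have done it differently.

```python
import pandas as pd

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

print(df.head())                   # first five rows
df.info()                          # dtypes and non-null counts (prints directly)
print(df.describe(include="all"))  # summary stats, plus unique counts for text columns
print(df.Country.unique())         # the distinct country names
```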

Questions to explore

The beauty and challenge of Exploratory Data Analysis (EDA) is that it does not come with a set of questions. So you, as an analyst, will have to look at the data and start asking questions yourself! Looking at the data, the questions that immediately pop into my mind are:

Is there a relationship between life expectancy and GDP?
Is there a stronger or weaker relationship per country?
Is there a stronger or weaker relationship per year?
Have the life expectancy and GDP trends changed over the years?

How do we come up with these questions? They follow from the fact that we have two dimensions (categorical data: country and year) and two metrics (quantitative data: life expectancy and GDP). That means we can look at the data in the following ways: 1. Overall, 2. By each categorical dimension (country OR year), and 3. By both categorical dimensions (country AND year).
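
To make the three views concrete, here is a small sketch that computes the GDP and life expectancy correlation overall, per country, and per year, and then lays out one metric by both dimensions. This is not from the notebook: the helper gdp_life_corr, the column names Life_expectancy and GDP, and the placeholder data are all assumptions for illustration.

```python
import pandas as pd

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4 + ["Country C"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 3,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4,
                        50.0, 50.8, 52.0, 54.5,
                        75.0, 75.2, 75.9, 76.3],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11,
            5.0e9, 5.2e9, 6.1e9, 7.5e9,
            9.0e11, 9.5e11, 9.9e11, 1.05e12],
})
cols = ["GDP", "Life_expectancy"]


def gdp_life_corr(group):
    """Pearson correlation between GDP and life expectancy within one group."""
    return group["GDP"].corr(group["Life_expectancy"])


overall = gdp_life_corr(df)                                     # 1. overall
by_country = df.groupby("Country")[cols].apply(gdp_life_corr)   # 2. by country
by_year = df.groupby("Year")[cols].apply(gdp_life_corr)         # 2. by year
# 3. both dimensions at once: one life expectancy value per (year, country) cell
by_both = df.pivot(index="Year", columns="Country", values="Life_expectancy")

print(overall)
print(by_country)
print(by_year)
print(by_both)
```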

So let’s begin with overall exploration.

Plotting to explore data

Visualizing data is essential for exploratory data analysis, since it helps you spot trends quickly and points you toward further analysis. Since we have two quantitative variables, I am going to use scatter plots.

Looking at the chart, I see that there are at least four very distinctive trends. Now, I am wondering if these trends have to do with either country or year. So I will add hue to compare.
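
A minimal sketch of a scatter plot like this one, assuming the column names Life_expectancy and GDP and using made-up placeholder rows instead of the real data. In a notebook the figure renders inline; in a script you would call plt.show() at the end.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

fig, ax = plt.subplots(figsize=(8, 5))
# One point per row: GDP on the x-axis, life expectancy on the y-axis.
sns.scatterplot(data=df, x="GDP", y="Life_expectancy", ax=ax)
ax.set_title("Life expectancy vs. GDP")
```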

Adding hue="Country" confirms our suspicion that the distinctive trends belong to different countries. Here you can clearly see that each country forms its own cluster. With these country distinctions in mind, let’s now look at the years.

Adding hue="Year" shows that life expectancy rose over the years in all of these countries. Zimbabwe had the most notable increase in life expectancy, even though it is still significantly lower than in the other five countries.
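
Here is a sketch of the two hue variants side by side. The side-by-side layout is my choice for illustration (the notebook may draw them as separate charts), and the column names and placeholder rows are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: one color per country. Right: color shaded by year.
sns.scatterplot(data=df, x="GDP", y="Life_expectancy", hue="Country", ax=axes[0])
sns.scatterplot(data=df, x="GDP", y="Life_expectancy", hue="Year", ax=axes[1])
axes[0].set_title("Colored by country")
axes[1].set_title("Colored by year")
```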

For future reference, and also to demonstrate how to fit all of this information into one chart, I want to show both the country and the year distinctions in the scatter plot. I will use hue and style together to make this happen. I decided to use hue for Year because seaborn detects that the values are numeric and shades the colors progressively by year instead of assigning unrelated colors, which helps us take in the data quickly.
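
Combining the two could look roughly like this: hue carries the year and style assigns a different marker shape to each country. As before, the column names and the placeholder rows are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

fig, ax = plt.subplots(figsize=(10, 6))
sns.scatterplot(
    data=df,
    x="GDP",
    y="Life_expectancy",
    hue="Year",        # numeric column, so colors are shaded along a gradient
    style="Country",   # categorical column, so each country gets its own marker
    ax=ax,
)
ax.set_title("Life expectancy vs. GDP by year and country")
```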

Now we have a pretty good idea of the overall trend from the chart. But if you want to talk about the strength of the correlation, the chart is not very helpful, as some countries’ data points end up too clustered together. So let’s create a chart per country. We could write sns.scatterplot() once for each country, but instead I wrote a short for loop to add the scatter plots and their titles.
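
One possible shape for such a loop, as a sketch rather than the exact notebook code: it creates one subplot per country and draws a scatter plot with a title in each. The subplot layout, column names, and placeholder rows are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

countries = df["Country"].unique()
# squeeze=False always returns a 2D array of axes, even for a single country.
fig, axes = plt.subplots(1, len(countries), figsize=(6 * len(countries), 4),
                         squeeze=False)

for ax, country in zip(axes.flat, countries):
    subset = df[df["Country"] == country]   # rows for this country only
    sns.scatterplot(data=subset, x="GDP", y="Life_expectancy", hue="Year", ax=ax)
    ax.set_title(country)

fig.tight_layout()
```

Each subplot picks its own axis limits here, which is what makes small movements within one country visible.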

Immediately, we notice that Zimbabwe shows a positive relationship between GDP and life expectancy just like the other countries. This was not observable in the earlier chart because Zimbabwe’s GDP movement was minimal compared to the other countries, but looking at each country separately, we see the relationship clearly. Additionally, for Zimbabwe, we can see that most of the GDP and life expectancy growth came in recent years. Most of the light-colored points (up to about 2007) are clustered at the bottom left of the chart, while the darker points stretch away from the cluster in a roughly linear shape, showing the positive correlation between GDP and life expectancy. It’s notable that the relationship was not present in Zimbabwe until about 2008.

We also see a roughly straight-line (linear) positive relationship between GDP and life expectancy in all countries except China. For China, life expectancy grew significantly relative to GDP in the earlier years, and that growth tapered off as the years went by.

Since the relationship between GDP and life expectancy per country seems straightforward, I want to draw a regression line for each country. This gives us a rough way to estimate GDP from life expectancy and vice versa. Additionally, it shows how far the points deviate from the regression line, informing us which country has the strongest or the weakest relationship and whether there is any trend in the deviation. To do so, I am using the same for loop code, but swapping sns.scatterplot() for sns.regplot().
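
A sketch of the same kind of loop with sns.regplot(), which draws the points, a fitted linear regression line, and a shaded confidence band around that line. The subplot layout, column names, and placeholder rows are assumptions; note that regplot() has no hue parameter, so the year coloring is dropped here.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

countries = df["Country"].unique()
fig, axes = plt.subplots(1, len(countries), figsize=(6 * len(countries), 4),
                         squeeze=False)

for ax, country in zip(axes.flat, countries):
    subset = df[df["Country"] == country]
    # Scatter points plus a fitted regression line with its confidence band.
    sns.regplot(data=subset, x="GDP", y="Life_expectancy", ax=ax)
    ax.set_title(country)

fig.tight_layout()
```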

Looking at the result, you can see that the US has the strongest relationship, with the smallest shaded area (the shaded band is the confidence interval around the regression line, so a narrower band means a tighter fit). We also notice that Mexico’s shaded area gets wider as GDP grows (and as the years went by, since we determined earlier that GDP generally grew over the years). We may be tempted to say that Mexico shows the most deviation from its regression line in the latest years, but that may not actually be true. One thing to be careful about with charts like these is that they all have different x and y ticks; in other words, each chart is zoomed in on one country’s data. So even though Mexico’s variation may appear large, its y-axis spans a range of only about 2, compared to a range of 17.5+ for some other countries like Zimbabwe. So the chart I created above shows each country’s regression line and variation effectively, but it is not suited to comparing the countries with each other.
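
If you do want to compare countries, one option (not something the notebook does, just a sketch) is to share the y-axis across the subplots and to print each country’s life expectancy range as a number. The column names and placeholder rows are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up placeholder rows, NOT the real dataset.
df = pd.DataFrame({
    "Country": ["Country A"] * 4 + ["Country B"] * 4,
    "Year": [2000, 2001, 2002, 2003] * 2,
    "Life_expectancy": [70.0, 70.5, 71.1, 71.4, 50.0, 50.8, 52.0, 54.5],
    "GDP": [1.0e11, 1.1e11, 1.3e11, 1.4e11, 5.0e9, 5.2e9, 6.1e9, 7.5e9],
})

# Range (max minus min) of life expectancy per country, as a plain number.
spread = df.groupby("Country")["Life_expectancy"].agg(lambda s: s.max() - s.min())
print(spread)

countries = df["Country"].unique()
# sharey=True puts every subplot on the same life expectancy scale.
fig, axes = plt.subplots(1, len(countries), figsize=(6 * len(countries), 4),
                         sharey=True, squeeze=False)

for ax, country in zip(axes.flat, countries):
    subset = df[df["Country"] == country]
    sns.regplot(data=subset, x="GDP", y="Life_expectancy", ax=ax)
    ax.set_title(country)

fig.tight_layout()
```

The x-axis is left unshared because GDP differs by orders of magnitude between countries; sharing it would squash the smaller economies into a vertical line.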

Conclusion:

We noted a few interesting trends in the data: each country forms its own cluster of GDP and life expectancy values; Zimbabwe did not show the relationship until recent years; Zimbabwe’s GDP remains minimal despite its growth; and its life expectancy is still significantly lower than the others’ despite rapid recent growth. We noticed that GDP generally grew as the years went by, regardless of the country. We also determined that there is a positive correlation between GDP and life expectancy in every country, although the strength of the correlation differs by country.

For this project, the data was clean and small, so I highly recommend trying it out if you are a beginner. It is an especially good one for practicing seaborn charts. Download/copy the Jupyter notebook file here and play around with it yourself!

#seaborn #EDA #pandas #python #analytics