Exploratory Data Analysis on Biodiversity in National Parks Using Python
From CodeCademy, I got these files (biodiversity_observations.csv, biodiversity_species_info.csv) for Exploratory Data Analysis (EDA). All I know about the data is that it has to do with biodiversity in national parks, so let’s work through it together to demonstrate EDA. You can just read through this, but I highly recommend copying/downloading the Jupyter notebook file here (biodiversity.ipynb, biodiversity_observations.csv, biodiversity_species_info.csv), reading the actual code I wrote, and playing with the data and the parameters. Let’s dive in!
Import libraries and data
First, I am going to import a few libraries that I often use, and load the data using pd.read_csv().
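The setup cell looks roughly like the sketch below. The exact library list is my guess; pandas, seaborn, and matplotlib are what the rest of the walkthrough relies on, and the file names match the downloads above.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the two CSV files into dataframes
observations = pd.read_csv('biodiversity_observations.csv')
species = pd.read_csv('biodiversity_species_info.csv')
```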

Inspecting and Cleaning the data
1. inspecting & cleaning observations
Since I am not familiar with the datasets, I need to inspect them using df.head(), df.info(), and df.describe(). Let’s start with the observations dataframe.
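A sketch of that first pass, using the calls named above (the print() wrappers matter only outside a notebook, and include='all' is my addition so describe() also summarizes the string columns):

```python
# First pass over the observations dataframe
print(observations.head())
observations.info()                          # dtypes, row count, null counts
print(observations.describe(include='all'))  # also summarize the string columns
```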

Looking at the output, the data types look clean: object for the string columns and integer for observations, with no null values in any column. But there is an issue with the row count. There are 23,296 rows of data, yet there are only 5,541 unique scientific names across four different national parks, so there should be at most 5,541 × 4 = 22,164 rows. The extra rows tell us the observations have not been summed up per scientific name for each park. So we will clean up the data by summing the observations per scientific name for each park, using groupby(). Then I am going to print observations.info() to check the change.
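A minimal version of that aggregation might look like this; I am assuming the park column is called park_name.

```python
# Sum observations per scientific name for each park,
# collapsing duplicate (scientific_name, park) rows into one
observations = (
    observations
    .groupby(['scientific_name', 'park_name'], as_index=False)['observations']
    .sum()
)

observations.info()  # should now report 22,164 rows
```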

Now it shows 22,164 rows, as we wanted! Let’s inspect the next dataframe, species.
2. inspecting & cleaning species
Just like earlier, I am using .info(), .describe(), and .head() to inspect the species data.
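The same inspection calls, pointed at species:

```python
# Same first pass, this time on the species dataframe
print(species.head())
species.info()
print(species.describe(include='all'))
```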

Inspecting the data, we see a similar issue in the species dataframe. There are 5,541 unique scientific names, which matches observations, but there are 5,824 rows of data. Since each row should contain information on one species, there should be only 5,541 rows. Let’s see which columns are creating the extra rows for the same scientific names.

After using value_counts() per scientific_name, I examine one of the duplicated names in the dataframe. The example shows that the common_names column is creating the duplicates. So let’s remove common_names, de-duplicate, and then see how many rows remain.
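A sketch of that check and cleanup; the example row I print is simply whichever duplicated name happens to come first, not necessarily the one shown in the notebook.

```python
# How many times does each scientific name appear?
name_counts = species['scientific_name'].value_counts()
print(name_counts[name_counts > 1])

# Look at one duplicated name in full to see which column differs
example_name = name_counts.index[0]
print(species[species['scientific_name'] == example_name])

# common_names is the culprit, so drop it and de-duplicate the rest
species = species.drop(columns='common_names').drop_duplicates()
print(len(species))  # still two rows more than the 5,541 unique names
```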

Unfortunately there are still two extra rows! Let’s examine both species in the dataframe like we did earlier.
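Pulling out the names that still appear twice:

```python
# Which scientific names still appear more than once?
remaining = species['scientific_name'].value_counts()
remaining = remaining[remaining > 1].index
print(species[species['scientific_name'].isin(remaining)])
```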

We can see that conservation_status is creating the extra rows, so we have to choose which rows to keep. For Oncorhynchus mykiss, one of the conservation_status values is NaN, so the row should be “Threatened” only. For Canis lupus, it is best if you can verify which value is correct; when you cannot, you have to choose one and make a note. I would choose “In Recovery”, since the logical order between “Endangered” and “In Recovery” suggests that “In Recovery” is the more recent value. If you have a better way of choosing, follow it and make a note. So now let’s get rid of these two rows from the species dataframe.
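A sketch of dropping exactly those two rows, following the choices described above:

```python
# Oncorhynchus mykiss: drop the row whose conservation_status is NaN
drop_mykiss = (
    (species['scientific_name'] == 'Oncorhynchus mykiss')
    & (species['conservation_status'].isna())
)

# Canis lupus: keep "In Recovery", so drop the "Endangered" row
drop_lupus = (
    (species['scientific_name'] == 'Canis lupus')
    & (species['conservation_status'] == 'Endangered')
)

species = species[~(drop_mykiss | drop_lupus)].reset_index(drop=True)
print(len(species))  # 5,541
```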

Now that these two extra rows are deleted, the species dataframe is also clean and ready to use, with 5,541 rows.
3. merging: observations & species
Before we start visualizing and analyzing the data, we need to do one more thing. The species dataframe provides additional information on each scientific name, so it is more useful when joined with the observations dataframe. So let’s join the two to create the dataframe we will work with, and I am going to use head() to make sure it looks good.
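The join itself is one line. I merge observations onto species so the species columns come first and observations ends up as the last column (see the note below):

```python
# Join the species info onto the observations by scientific name
observations_species = species.merge(observations, on='scientific_name')
print(observations_species.head())
```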

***Here, I could have done a ‘left’ join, but I wanted the column order to be the way it is in the example, with observations as the last column, just for my convenience.
Questions to explore
The cleaned dataframe “observations_species” has four dimensions (categorical data) and one metric (quantitative data). The scientific_name column has 5,541 unique values, so slicing the data by that dimension would not be very helpful. Considering that, we can ask a few questions like:
Which park has the most observations of endangered species?
Which park has the most observations of mammals?
Which category of the species has the most observations in recovery status?
There are more explorable questions involving the other dimensions and the metric, but for this EDA we will stick to these three and demonstrate how to explore their answers.
Visualizing and exploring the data
Q1. Which park has the most observations of endangered species?
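Here is a sketch of that first chart. park_name is my assumed name for the park column, and I let sns.barplot() aggregate with its default estimator, which averages observations across the species in each park; pass estimator=sum if you would rather see totals.

```python
# Keep only endangered species and plot observations per park
endangered = observations_species[
    observations_species['conservation_status'] == 'Endangered'
]

plt.figure(figsize=(10, 5))
sns.barplot(data=endangered, x='park_name', y='observations')
plt.title('Observations of endangered species per park')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```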

Using sns.barplot(), we can immediately tell Yellowstone has the most observations of endangered species. What about the other conservation statuses? For a quick comparison, we can remove the conservation-status filter from the dataframe and add the column as a hue instead.
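Roughly the same call without the filter, with conservation_status as the hue; I drop the rows whose status is NaN so only the four named statuses show up.

```python
# All conservation statuses at once, split by hue
with_status = observations_species.dropna(subset=['conservation_status'])

plt.figure(figsize=(10, 5))
sns.barplot(data=with_status, x='park_name', y='observations',
            hue='conservation_status')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```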

Looking at the chart, we can now say Yellowstone has the most observations for all four conservation statuses, followed by Yosemite, Bryce, and Great Smoky in that order.
Q2. Which park has the most observations of mammals?
This is similar to Q1, but instead of conservation status we will look at categories. And rather than looking at just mammals, let’s look at all the categories at once, using hue like before.
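The same pattern, this time with category as the hue:

```python
# Observations per park, split by species category
plt.figure(figsize=(10, 5))
sns.barplot(data=observations_species, x='park_name', y='observations',
            hue='category')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```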

Nothing too interesting here – just like in Q1, we see the same order of parks for observations (Yellowstone > Yosemite > Bryce > Great Smoky). The distribution among categories is fairly even, with one small noticeable trend: mammals are the most observed in all the parks, followed by vascular plants and birds. But the differences are minimal except for mammals.
Q3. Which category of the species has the most observations in recovery status?
Now let’s get away from parks and focus on category and status. Just like before, I create a barplot using sns.barplot(), but this time x is category instead of park name.
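A sketch of that chart, with category on the x-axis and conservation status as the hue:

```python
# Observations per category, split by conservation status
with_status = observations_species.dropna(subset=['conservation_status'])

plt.figure(figsize=(12, 5))
sns.barplot(data=with_status, x='category', y='observations',
            hue='conservation_status')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```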

And here’s something finally interesting. Across the four national parks, the number of observations per conservation status varies significantly by category. Reptiles and nonvascular plants are in relatively good shape – they have species of concern but none that are threatened, endangered, or in recovery. Meanwhile, fish, amphibians, and vascular plants need extra attention – they have endangered, threatened, and concerned species but none in recovery. Birds are missing the “Threatened” status, and mammals are the only category with observations in all of the statuses.
Now I am curious whether the chart simply reflects the number of unique species per category per status (since the chart we created has observations on the y-axis). So I am going to create a countplot to see if it looks similar.
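The countplot version might look like the sketch below; note that I run it on the merged dataframe, so each species is counted once per park it appears in.

```python
# Count rows per category and conservation status in the merged dataframe
with_status = observations_species.dropna(subset=['conservation_status'])

plt.figure(figsize=(12, 5))
sns.countplot(data=with_status, x='category', hue='conservation_status')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
```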

And the answer is a resounding no. The number of observations per category per status is not a reflection of the number of unique species per category per status. For instance, you can see that the bird category has more than 250 species logged as “species of concern” and yet fewer than 150 observations. For mammals, on the other hand, there is a very small number of species in recovery and yet more than 150 of them show up in observations, which makes sense considering what “in recovery” means. Before we finish, I want to turn the last chart into a table, because “species of concern” for birds, vascular plants, and mammals makes the y-axis too broad for us to examine the rest of the data. So I am going to use pd.crosstab() to make the table.
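The crosstab is a single call (pd.crosstab leaves out the NaN statuses by default):

```python
# The same breakdown as a table: categories as rows, statuses as columns
status_table = pd.crosstab(observations_species['category'],
                           observations_species['conservation_status'])
print(status_table)
```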

Conclusion:
In this EDA example, I spent much of the time examining and cleaning the data for demonstration. Cleaning the dataset is one of the most time-consuming steps, but it is also the foundation of any analysis, and you will find that most data analysts and data scientists spend the majority of their time cleaning and refining imperfect data. This was a beginner-friendly but still great example of unclean data: disjointed, unaggregated datasets with duplicates. Copy or download the code and dataset from GitHub here. Once you have the clean dataset from this example, I highly encourage you to come up with your own questions and explore the answers.