Since I am an economist by training and love programming and data science, I wanted to combine these passions and do some fun data analysis. This post uses a variety of Python libraries to scrape and visualise economic data. I hope it is useful to some of you and that you enjoy reading it as much as I enjoyed writing it. The first thing I need to do is get some data. Since Wikipedia is the source of all internet knowledge (not!!), let's start there.
I decided to scrape a table from the following Wikipedia page: Wiki
Since I spend a lot of my time doing economics, I thought it would be a good idea to look at some of the richest and poorest countries in the world. The table in question ranks countries by their GDP per capita.
Before I go any further, it is probably a good idea to give a brief explanation of what GDP per capita is (you tend to take for granted that a lot of people don't really speak "economics"). Simply put, it is a measure of how wealthy a country is. It is essentially the value of all the goods and services produced within a country's borders in one year, divided by the population. This gives us a way of describing the average level of wealth per person in that country. It is quite an important economic variable and is often used to compare wealth levels across countries and across time.
In general, GDP per capita can increase for the following reasons.
- GDP increases
- Population decreases
- A combination of both.
This measure is thought to be a better indication of a country's wealth than GDP alone. The reason is that we could have a situation where GDP growth in a given year was positive but GDP per capita fell because the population grew at a faster rate. This is one of the reasons using only GDP could paint a misleading economic picture, and the discrepancies can be particularly large in countries with rapid annual population growth such as Nigeria or India (see the small worked example below). Now that we have the brief econ primer out of the way, let's dig into the analysis.
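To make this concrete, here is a tiny worked example with made-up numbers:

# Illustrative (made-up) numbers: GDP grows 2% while population grows 3%,
# so GDP per capita falls even though the economy expanded.
gdp_last_year, gdp_this_year = 100_000_000_000, 102_000_000_000   # +2%
pop_last_year, pop_this_year = 50_000_000, 51_500_000             # +3%

gdp_pc_last_year = gdp_last_year / pop_last_year    # 2000.0
gdp_pc_this_year = gdp_this_year / pop_this_year    # ~1980.6

print(f"GDP growth:            {gdp_this_year / gdp_last_year - 1:+.1%}")
print(f"GDP per capita growth: {gdp_pc_this_year / gdp_pc_last_year - 1:+.1%}")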
Since I will be scraping data from Wikipedia, using Beautiful Soup seems like a no-brainer. This library greatly simplifies extracting data from webpages and is the go-to library for web scraping in Python.
What is Beautiful Soup?
Beautiful Soup is a Python library for pulling data out of HTML and XML files. This makes it extremely useful for extracting information from webpages. If you want more detail on how exactly the library works and the various tasks you can perform with it, feel free to read the Documentation
In order to use Beautiful Soup it is worth knowing some simple HTML tags. A little bit of HTML knowledge will make it a great deal easier to search for the data we want. For example, Wikipedia uses a table tag for the tables it displays on its web pages. Knowing this, we can simply parse the HTML and look only for information contained within these tags.
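As a minimal illustration, here is Beautiful Soup pulling rows out of a toy HTML table (a made-up snippet, not the actual Wikipedia page):

from bs4 import BeautifulSoup

# A toy HTML table, just to show how tag-based searching works.
html = """
<table class="wikitable">
  <tr><th>Country</th><th>GDP per capita</th></tr>
  <tr><td>Exampleland</td><td>12,345</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
for row in soup.find_all('tr'):
    print([cell.get_text() for cell in row.find_all(['th', 'td'])])
# ['Country', 'GDP per capita']
# ['Exampleland', '12,345']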
First off, I need to import all the necessary libraries for the analysis. I will be using BeautifulSoup, plotly and a library called bubbly for creating nice interactive charts (more on this later).
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from bubbly.bubbly import bubbleplot
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly as py
import plotly.graph_objs as go
init_notebook_mode(connected=True) #do not miss this line
from plotly import tools
Now that the libraries are imported, I can start the analysis. The code below loads the webpage into our Jupyter notebook and passes it to the BeautifulSoup class to create a soup object.
req = requests.get("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")
page = req.text
soup = BeautifulSoup(page, 'html.parser')
soup.title
OK, so it looks like it worked and we can now call methods on the soup object to extract the data we want. I mentioned before that we are interested in the table tag. Below is the code to extract all the tables from the wiki page.
table = soup.find_all("table", "wikitable")
len(table)
from IPython.display import IFrame, HTML
HTML(str(table))
The code above returns a list where each entry contains one of the tables on the page. This page only has five tables, so it is pretty easy to grab the one we need, which happens to be the first entry in the list. We can confirm this by using the HTML command from IPython.display, which renders the table as it appears on Wikipedia.
Now that we have the table, it is just a matter of getting the country names and the GDP per capita values. To do this, we need to know a bit more about the structure of HTML tables. In particular, we should know about the th, tr and td tags, which stand for table header, table row and table cell respectively. OK, so let's try extracting some of the data.
GDP_PC = table[0]
table_rows = GDP_PC.find_all('tr')
header = table_rows[1]
table_rows[1].a.get_text()
The code above finds all the tr tags, which indicate the rows of the table. We then grab the header row and print it out, giving the results below. This is where the country name lives, and we extract it using a.get_text(). Each index in table_rows corresponds to a country, and the country name is always located in the a tag of that row, so the same approach works for every index.

Now all we need to do to get all the country names is to loop through table_rows, extract the data and append to a list.
countries = [table_rows[i].a.get_text() for i in range(1, len(table_rows))]
cols = [col.get_text() for col in header.find_all('th')]
Python has a really nice, succinct way of writing these kinds of loops using list comprehensions. Note that I skip the first entry of table_rows as it does not correspond to a country. We also use a list comprehension to extract the column headers, which will be useful later on. The code above is equivalent to the for loop below.
country = []
for i in range(1, len(table_rows)):
    country.append(table_rows[i].a.get_text())
Next up we move on to the td tag, which is where our GDP per capita data is stored in the table. The data is pretty messy, however, and there are a number of workarounds I need to implement to get it out and into the right format. Let's take a quick look at one of the data points.
temp = GDP_PC.find_all('td')
temp[5].get_text()
This gives us ‘114,430\n’. We can see that all the data comes through as strings, and there are commas and line breaks in each cell, so we will need to fix this later. First, let's concentrate on getting the data into a list.
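Just to preview the cleaning we will do later, here is what it looks like for that single cell value:

# Strip the newline and the comma, then convert to an integer.
value = '114,430\n'   # example cell text from the table
cleaned = int(value.replace('\n', '').replace(',', ''))
print(cleaned)        # 114430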
temp = GDP_PC.find_all('td')
GDP_per_capita = [temp[i].get_text() for i in range(len(temp)) if "," in temp[i].get_text()]
GDP_per_capita = [i for i in GDP_per_capita if '\xa0' not in i]
temp_list = []
for i in range(len(temp)):
    temp_list.append(temp[i].get_text())
new_list = temp_list[-11:]
numbers = [i for i in new_list if "\n" in i]
for i in numbers:
    GDP_per_capita.append(i)
rank = list(range(len(countries)))
There is a lot going on in the code above, so let's go through it step by step. The first thing I do is find all the cells in GDP_PC and store them in a temp variable. The next line loops through this variable and grabs the text if it contains a comma. I did this since most of the entries are in the thousands and therefore contain a comma. This approach does, however, miss the last four entries, as they are in the hundreds of dollars, so I have to create a workaround for that, which is what new_list and numbers are doing. Finally, I append these entries onto the GDP_per_capita list and also generate a rank column, which is just a running index with one number per country. This may not be the most efficient way of doing things and there is probably a better way (see the aside below), but hey, it worked, so I am happy with it.
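As an aside, a simpler route I did not take here is to let pandas parse the HTML table directly with pandas.read_html, which returns a list of dataframes, one per table on the page. A rough sketch (treat it as illustrative; the exact table index and column names depend on the current page layout, and read_html needs a parser such as lxml or html5lib installed):

import pandas as pd

# pandas.read_html parses every <table> on the page into a dataframe.
tables = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")
print(len(tables))       # number of tables found on the page
print(tables[0].head())  # first table, hopefully the ranking we are after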
After extracting the three columns (rank, country and GDP per capita) as lists, we need to merge them together into a pandas dataframe. This will make plotting and analysing the data much simpler. There is a handy built-in function called zip that lets us do this, and I create two separate dataframes: one for the 20 richest countries and one for the 20 poorest countries. The code below implements this.
data = zip(rank[0:21], countries[0:21], GDP_per_capita[0:21])
import pandas as pd
cols = ['Rank', 'Country', 'GDP Per Capita']
data1 = pd.DataFrame(list(data), columns = cols)
data2 = zip(rank[-21:], countries[-21:], GDP_per_capita[-21:])
data2 = pd.DataFrame(list(data2), columns = cols)
| Rank | Country | GDP Per Capita |
---|---|---|---|
0 | 0 | Qatar | 124,927\n |
1 | 1 | Macau | 114,430\n |
2 | 2 | Luxembourg | 109,192\n |
3 | 3 | Singapore | 90,531\n |
4 | 4 | Brunei | 76,743\n |
5 | 5 | Ireland | 72,632\n |
6 | 6 | Norway | 70,590\n |
7 | 7 | Kuwait | 69,669\n |
8 | 8 | United Arab Emirates | 68,245\n |
9 | 9 | Switzerland | 61,360\n |
10 | 10 | Hong Kong | 61,016\n |
11 | 11 | San Marino | 60,359\n |
12 | 12 | United States | 59,495\n |
13 | 13 | Saudi Arabia | 55,263\n |
14 | 14 | Netherlands | 53,582\n |
15 | 15 | Iceland | 52,150\n |
16 | 16 | Bahrain | 51,846\n |
17 | 17 | Sweden | 51,264\n |
18 | 18 | Germany | 50,206\n |
19 | 19 | Australia | 49,882\n |
20 | 20 | Taiwan | 49,827\n |
From the table above it looks like the code worked, and we now have our top and bottom 20 countries in pandas dataframes. Before we can plot the data we need to do a little more cleaning. The values are currently stored as strings, so we need to fix this in order to use certain pandas functions. The code below removes the trailing newlines ("\n"), strips the commas and casts the data type to int.
data1['GDP Per Capita'] = data1['GDP Per Capita'].apply(lambda x: x.replace('\n', ''))
data1['GDP Per Capita'] = data1['GDP Per Capita'].apply(lambda x: x.replace(',', '')).astype(int)
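Strictly speaking, data2 presumably needs the same treatment before we plot it; a quick sketch mirroring the two lines above:

# Apply the same cleaning to the bottom-20 dataframe so its values are integers too.
data2['GDP Per Capita'] = data2['GDP Per Capita'].apply(lambda x: x.replace('\n', ''))
data2['GDP Per Capita'] = data2['GDP Per Capita'].apply(lambda x: x.replace(',', '')).astype(int)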
We finally have our data ready to create some nice looking visualisations.
Intro to Plotly
As mentioned before, I will be using plotly to create what I think are really nice visualisations. I really like this library, and it makes it simple to produce pretty, interactive plots. If you want to know what kinds of graphs you can create, I encourage you to read the documentation on the plotly Website
Below is the code to create a simple bar chart of the richest countries in the world. First we pass the data to go.Bar to create a bar chart with the country names on the x axis and GDP per capita on the y axis. We then store this trace in a list and pass it to the go.Figure method. The same steps apply when creating any of the different plot types in plotly. Some of the results may or may not surprise you. For example, the top of the ranking is littered with countries heavily focused on producing oil, such as Qatar and Kuwait, which get approximately 70 and 94 per cent of government revenue from oil respectively. A lot of these countries have relatively small populations and large economies, so it is not really surprising that they are very rich on this measure (a lot of wealth to share out among a relatively small population).
trace1 = go.Bar(
    x = data1.Country,
    y = data1['GDP Per Capita'])
data = [trace1]
layout = go.Layout(
    title='Top 20 countries ranked by GDP per Capita')
fig = go.Figure(data = data, layout = layout)
py.offline.iplot(fig)
Pretty easy, right? Now for the poorest countries. Not surprisingly, these tend to be concentrated in Africa, where populations grow rapidly and the economies lag behind the more developed nations.
trace1 = go.Bar(
    x = data2.Country,
    y = data2['GDP Per Capita'])
data = [trace1]
layout = go.Layout(
    title='Worst 20 countries ranked by GDP per Capita')
fig = go.Figure(data = data, layout = layout)
py.offline.iplot(fig)
After getting a quick overview of the richest and poorest countries, let's try to get a broader view of the world as a whole. A good way of doing this is with a map. In plotly you can create choropleth maps, which shade the different regions based on some variable; in our case that is GDP per capita, so countries with a higher GDP per capita will have a darker shade of red. The most important things to note about this code are the country names passed into the locations argument and the locationmode argument; these must match for the plot to work. You can also use country codes, or even longitude and latitude, to achieve the same plot, but I think this is probably the easiest way. Notice that plotly allows you to zoom in on particular regions for a closer look, which is a really nice feature.
We can see that the richest countries tend to be concentrated in North America and Europe, while the poorest countries are in Africa, denoted by the lighter colour.
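One thing to note: the choropleth code below uses a data_all dataframe covering every country, which I have not shown being built above. A minimal sketch of how it could be put together, reusing the rank, countries and GDP_per_capita lists and the same cleaning steps as before:

# Sketch: build a dataframe for all countries (not just the top or bottom 20).
data_all = pd.DataFrame(list(zip(rank, countries, GDP_per_capita)), columns = cols)
data_all['GDP Per Capita'] = data_all['GDP Per Capita'].apply(lambda x: x.replace('\n', ''))
data_all['GDP Per Capita'] = data_all['GDP Per Capita'].apply(lambda x: x.replace(',', '')).astype(int)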
data = [dict(
    type = 'choropleth',
    locations = data_all['Country'],
    autocolorscale = True,
    z = data_all['GDP Per Capita'],
    locationmode = 'country names',
    marker = dict(
        line = dict(
            color = 'rgb(255,255,255)',
            width = 2
        )
    ),
    colorbar = dict(
        title = "GDP per capita (USD)"
    )
)]
layout = dict(
    title = 'Top Countries by GDP per capita')
fig = go.Figure(data = data, layout = layout)
py.offline.iplot(fig)
Do People from Rich Nations Live Longer?
OK, now that I have shown you some simple plots using plotly, I want to go a step further and create something really cool. There is a really nice library called bubbly which creates bubble charts and has some interesting features to enhance the level of interactivity of your charts. You can do this with plotly directly, but there is quite a bit of coding involved to achieve the desired effect, and bubbly makes it super easy. Credit to Aashitak for this library. There is also a nice kaggle kernel showing how the library works under the hood, which is definitely worth checking out.
What I want to do here is create a bubble chart of GDP per capita versus life expectancy. The chart also takes into account the population of each country and the continent the country is in. I obtained all of the data from the World Bank website. Below is the code to read the data in using pandas; I also create lists of the unique countries, continents and years, which will be useful for manipulating the data. As it turns out, this is a pretty famous visualisation created by the Gapminder foundation. They have a really nice tool which plots this and other charts, available Here if anyone wants to check it out.
For this analysis I use data from the World Bank. This comes in a completely different format from the gapminder_indicators dataset on kaggle, and to use the bubbly library we need the data to be in the format of the latter, so there is a bit of data manipulation required. The reason I used the World Bank data is that it has a slightly longer time series, and I wanted to get a view of more recent developments. The code below loads the datasets in, and we extract the countries used in the gapminder dataset from the World Bank data to make things easier.
gdp = pd.read_csv("gdp_per_capota.csv", engine = "python")
life = pd.read_csv("LifeExp.csv", engine = "python")
pop = pd.read_csv("population.csv", engine = "python")
gapminder_indicators = pd.read_csv("gapminder_indicators.csv", engine = "python")
countries = gapminder_indicators.country.unique()
continents = gapminder_indicators.continent.unique()
years = gapminder_indicators.year.unique()
Running the year-filtering code below leaves us with the following columns from the World Bank data:
['Country Name', '1982', '1987', '1992', '1997', '2002', '2007', '2010', '2013', '2016']
# Filter countries first
gdp_new = gdp[gdp['Country Name'].isin(countries)]
life_new = life[life['Country Name'].isin(countries)]
pop_new = pop[pop['Country Name'].isin(countries)]

# Now filter years
years = [str(year) for year in years]
years = years[6:]
for i in ['2010', '2013', '2016']:
    years.append(i)
years.insert(0, "Country Name")

gdp_new = gdp_new[years]
life_new = life_new[years]
pop_new = pop_new[years]
The gapminder_indicators dataset has the data in the correct format for plotting (long format, see below), so I essentially need to manipulate my three datasets into the same shape and merge them together before I can plot them using bubbly.
country | continent | year | lifeExp | pop | gdpPercap
---|---|---|---|---|---
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106
The World Bank dataset is formatted differently, with the values for each year stored in a separate column (wide format), as in the population data below. Next comes the code I use to manipulate the World Bank data into the correct format.
Country Name | 1960 | 1961 | 1962
---|---|---|---
Aruba | 54211.0 | 55438.0 | 56225.0
Afghanistan | 8996351.0 | 9166764.0 | 9345868.0
Angola | 5643182.0 | 5753024.0 | 5866061.0
Albania | 1608800.0 | 1659800.0 | 1711319.0
Andorra | 13411.0 | 14375.0 | 15370.0
melted_gdp = pd.melt(gdp_new, id_vars = ["Country Name"], var_name = "Year", value_name = "Data")
grouped_gdp = melted_gdp.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending = True)).reset_index(drop=True)
melted_life = pd.melt(life_new, id_vars = ["Country Name"], var_name = "Year", value_name = "Data")
grouped_life = melted_life.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending = True)).reset_index(drop=True)
melted_pop = pd.melt(pop_new, id_vars = ["Country Name"], var_name = "Year", value_name = "Data")
grouped_pop = melted_pop.groupby(["Country Name"]).apply(lambda x: x.sort_values(["Year"], ascending = True)).reset_index(drop=True)
temp = pd.merge(grouped_gdp, grouped_life, on = ['Country Name', 'Year'], how = 'inner')
temp = pd.merge(temp, grouped_pop, on = ['Country Name', 'Year'], how = 'inner')
cols= ['Country Name', 'Year', 'Data_x', 'Data_y', 'Data']
temp = temp[cols]
data = temp.copy()
Let me explain what I am doing here. The melt function collapses all the year columns into a single Year column, with the corresponding values stored in a Data column. I then group by country name and sort each group by year, so I am left with each country's rows ordered chronologically, just like gapminder_indicators. I then merge the three datasets together on country name and year, which leaves the data in the correct format; I just need to select the relevant columns, since the merge tacks on a few extra ones. You may be able to do this in fewer pandas calls (a sketch of one alternative is below), but I decided to do it in a more manual way, as it is good practice to think about how you need to manipulate your data.
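As an aside, here is a sketch of how the reshape-and-merge could be done a bit more compactly (not the code I actually used above; the value names are just the ones we rename to anyway):

from functools import reduce

# Melt each wide dataframe to long format, then merge the three on country and year.
def to_long(df, value_name):
    return df.melt(id_vars="Country Name", var_name="Year", value_name=value_name)

frames = [to_long(gdp_new, "GDP_pc"),
          to_long(life_new, "Life Expectancy"),
          to_long(pop_new, "Population")]

data_alt = reduce(lambda left, right: pd.merge(left, right, on=["Country Name", "Year"]), frames)
data_alt = data_alt.sort_values(["Country Name", "Year"]).reset_index(drop=True)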
The other thing I need to do is create a continent column which maps each country to the correct continent, as I want to use this information in my plot. To do this I build a dictionary from the gapminder dataset and then map it onto a new column in my merged dataset.
dictionary = dict(zip(gapminder_indicators['country'], gapminder_indicators['continent']))
data["continent"] = data["Country Name"].map(dictionary)
data.rename(columns = {'Data_x': 'GDP_pc', 'Data_y': 'Life Expectancy', 'Data': 'Population'}, inplace=True)
Finally we have a finished dataset and we can create our plot. We use the bubbleplot function from the bubbly library to do this. It creates a beautiful interactive plot of life expectancy versus GDP per capita, sizes each bubble according to the population of the country, colours the bubbles by continent, and lets us animate all of this over time, which is really nice. The most notable changes are China and India, indicated by the largest purple bubbles. At the start of the sample they were among the poorest countries and had relatively low life expectancy. Over time, however, they made a substantial move towards the upper right of the chart, indicating large increases in both GDP per capita and life expectancy. This pretty much mirrors what we have seen with China becoming an economic powerhouse over the last 20 or so years.
What is also clear from the chart is that there is a positive correlation between GDP per capita and life expectancy: as one increases, the other also tends to increase. Of course, this tells us nothing about any causal relationship, and it is unclear whether countries have higher life expectancy because they are rich, or are rich because they have higher life expectancy. That is perhaps a question for an economics research paper and not this particular blog post.
So that is how you can extract data from the internet using Beautiful Soup, and how to use data visualisations to interpret and uncover trends that might not be immediately obvious from the raw data.
from bubbly.bubbly import bubbleplot

figure = bubbleplot(dataset=data, x_column='GDP_pc', y_column='Life Expectancy',
                    bubble_column='Country Name', time_column='Year', size_column='Population',
                    color_column='continent', x_title="GDP per Capita", y_title="Life Expectancy",
                    title='Gapminder Global Indicators', x_logscale=True, scale_bubble=3, height=650)
iplot(figure, config={'scrollZoom': True})