sepdek March 29, 2020
Illustration of COVID-19 by the Public Health Image Library

In a recent post (Analysis of live COVID-19 data – a MATLAB/Octave approach) I described how it is possible to analyse live COVID-19 data from reliable online data sources and create intuitive graphical representations.

This task turns out to be just as simple in the Python programming language. To this end, a Colab Python notebook was created to demonstrate the process, and it is shared as an endnote to this post. Experienced Python users, or those who do not need the following description, may skip straight to the endnotes section for that link.

The first step is to select a reliable data source. For the purposes of this demonstration the data source selected was the European Centre for Disease Prevention and Control (ECDC), with the data taken as collected and served by the “Our World in Data” project @ https://ourworldindata.org/coronavirus-source-data. In addition, official population data to be used for normalisation have been collected from DataHub.io (which are taken from the World Bank dataset).

At the beginning some initialisation is needed, as usual.

[Python]
import pandas as pd
import requests
import io
from google.colab import files
from datetime import datetime

from pandas.plotting import register_matplotlib_converters
from matplotlib import dates as dts
from matplotlib import pyplot as plt
from matplotlib import style

# select the default graph style
style.use('ggplot')
[/Python]

To read the current outbreak data we may use the pandas function read_csv as follows

[Python]
url = 'https://covid.ourworldindata.org/data/ecdc/full_data.csv'
data = pd.read_csv(url, parse_dates=['date'], index_col=['date'])
[/Python]

The parse_dates=['date'] parameter tells the parser to convert the ‘date’ field of the CSV data to date-time values, and index_col=['date'] makes that field the index of the time series. From this point on the dates should therefore be addressed as data.index rather than data.date. With a date-time index the data are handled as a time series, which enables better plot visualisations.
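As a small illustration of what these two parameters do, the sketch below parses a three-row CSV held in memory and checks the type of the resulting index. The rows are made-up values used only for illustration, not real ECDC figures.

```python
import io

import pandas as pd

# Made-up sample rows (NOT real data), laid out like the columns used in this post
sample_csv = io.StringIO(
    "date,location,total_cases,total_deaths\n"
    "2020-03-27,Greece,10,1\n"
    "2020-03-28,Greece,20,2\n"
    "2020-03-29,Greece,30,3\n"
)

# same parameters as in the call above
sample = pd.read_csv(sample_csv, parse_dates=["date"], index_col=["date"])

# the 'date' column is no longer a regular column; it is the index
print(type(sample.index).__name__)
print("date" in sample.columns)

# rows can now be addressed by timestamp
print(sample.loc[pd.Timestamp("2020-03-28"), "total_cases"])
```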

Next we define the array of countries of interest (it can change to any number of countries as long as they are correctly spelled — as defined in the loaded CSV data)

[Python]
locations = ['China', 'Italy', 'Spain', 'United States', 'Turkey', 'Greece']
[/Python]
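Since a misspelled country name simply produces an empty selection later on, it may help to check the names first. The following is a minimal sketch on a toy dataframe; the helper name missing_locations is hypothetical and not part of the original notebook.

```python
import pandas as pd


def missing_locations(frame, wanted):
    """Return the names in `wanted` that never occur in frame['location']."""
    available = set(frame["location"].unique())
    return [name for name in wanted if name not in available]


# toy stand-in for the loaded COVID-19 data (made-up values)
toy = pd.DataFrame(
    {"location": ["Greece", "Italy", "Greece"], "total_cases": [1, 2, 3]}
)

# 'USA' is not spelled the way the toy data spell it, so it is reported
print(missing_locations(toy, ["Greece", "USA"]))
```

With the real data one would call missing_locations(data, locations) and expect an empty list.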

In order to create multiple plots on the same figure, a strategy would require a loop for each of the countries, as follows

[Python]
# define the figure and initialise it to a size
fig = plt.figure(figsize=(25,6))
# supposing we need to do two subplots one next to the other,
# define the subplot to the left; also define some basic parameters of the subplot
ax1 = fig.add_subplot(1, 2, 1)
ax1.set_title('Confirmed cases')
ax1.set_xlabel('Time')
ax1.set_ylabel('Confirmed cases')
# define the subplot to the right; also define some basic parameters of the subplot
ax2 = fig.add_subplot(1, 2, 2)
ax2.set_title('Confirmed deaths')
ax2.set_xlabel('Time')
ax2.set_ylabel('Confirmed deaths')

# run the loop for all countries
for loc in locations:
    # filter the data for a particular country
    df = data.query("location=='" + loc + "'")
    # get the date field (now the 'index')
    x = df.index
    # create scatter plots for the total cases (on the left) and the total deaths (on the right)
    ax1.scatter(x, df.total_cases, label=loc)
    ax2.scatter(x, df.total_deaths, label=loc)

# display a legend for both subplots
ax1.legend(loc='best')
ax2.legend(loc='best')
# maximise the space the graphs will take inside the browser window
plt.tight_layout()
# activate the minor X axis ticks to represent days
days = dts.DayLocator()
ax1.xaxis.set_minor_locator(days)
ax2.xaxis.set_minor_locator(days)
# and finally display the figure
plt.show()
[/Python]

This will create the following figure

Fig. 1. Confirmed COVID-19 data 2020_03_29_12_16_37

which may be saved and downloaded with

[Python]
# create a timestamp for the file
datetimeSignature = datetime.now()
# create the figure filename
graphFileName = 'confirmed_data_' + datetimeSignature.strftime('%Y_%m_%d_%H_%M_%S') + '.png'
# save the figure
fig.savefig(graphFileName,dpi=300)
# download the figure
files.download(graphFileName)
[/Python]
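To see what the timestamped file name looks like, the naming step can be tried on its own with a fixed date. This is only a sketch: the helper timestamped_name is a hypothetical wrapper around the same strftime pattern used above.

```python
from datetime import datetime


def timestamped_name(prefix, when, extension=".png"):
    """Build '<prefix><YYYY_MM_DD_HH_MM_SS><extension>' from a datetime."""
    return prefix + when.strftime("%Y_%m_%d_%H_%M_%S") + extension


# a fixed moment instead of datetime.now(), so the result is reproducible
print(timestamped_name("confirmed_data_", datetime(2020, 3, 29, 12, 16, 37)))
# prints confirmed_data_2020_03_29_12_16_37.png
```

Note that files.download comes from the google.colab module, so that last step applies only inside a Colab notebook; fig.savefig by itself writes the file to the working directory.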

The next step is to demonstrate how to create graphs of normalised data using each country’s population as the normalisation factor. This requires loading population data from a reliable source. In this demonstration, datahub.io (World Bank dataset) was selected as the source of the data. Some modifications were needed to load the CSV data, because pd.read_csv did not work directly on the URL as it did for the COVID-19 data above. Let’s take a look

[Python]
# Get population data from the datahub.io (pd.read_csv does not work in this case without the “requests” and “io” operations!)
url = 'https://datahub.io/JohnSnowLabs/population-figures-by-country/r/population-figures-by-country-csv.csv'
# use the requests and io libraries to download the data
response = requests.get(url)
fileObject = io.StringIO(response.content.decode('utf-8'))
# now read and parse the CSV data as usual
worldBankData = pd.read_csv(fileObject)
[/Python]
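The decode-and-wrap step can be tried without any network access by feeding it raw bytes directly. The sketch below assumes a made-up response body; the helper name frame_from_csv_bytes and the Year_… column names are hypothetical and are not the real column names of the datahub.io file.

```python
import io

import pandas as pd


def frame_from_csv_bytes(content, encoding="utf-8"):
    """Decode raw CSV bytes and parse them through an in-memory text buffer."""
    return pd.read_csv(io.StringIO(content.decode(encoding)))


# made-up bytes standing in for response.content (NOT real population data)
fake_body = b"Country,Year_2015,Year_2016\nGreece,100,110\nItaly,200,210\n"

toy_population = frame_from_csv_bytes(fake_body)
print(toy_population.shape)
print(list(toy_population.columns))
```

In the real download it may also be worth calling response.raise_for_status() before decoding, so that an HTTP error page is not parsed as if it were CSV.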

In those data the latest population figures are in the last column of the worldBankData dataframe and can be accessed for the selected countries (stored in the locations array) as follows

[Python]
# initialise the populations array
populations = [0]*len(locations)
# run a loop for all countries
for idx, loc in enumerate(locations):
    # select the row(s) of this country
    dp = worldBankData.query("Country=='" + loc + "'")
    # read the country population from the last column of the first matching row
    populations[idx] = int(dp.iloc[0, -1])
    #print("Country=='" + loc + "'", idx, populations[idx])
[/Python]
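The lookup above assumes that every name in locations appears in the population data with exactly the same spelling. A minimal sketch of a lookup that fails with a clear message otherwise is given below; the toy dataframe, its Year_… column names and the helper latest_population are hypothetical.

```python
import pandas as pd


def latest_population(frame, country):
    """Return the value in the last column for `country`, or raise KeyError."""
    rows = frame[frame["Country"] == country]
    if rows.empty:
        raise KeyError(f"no population row for {country!r}")
    # first matching row, last column (assumed to hold the latest year)
    return int(rows.iloc[0, -1])


# toy stand-in for worldBankData (made-up numbers)
toy_population = pd.DataFrame(
    {"Country": ["Greece", "Italy"], "Year_2015": [100, 200], "Year_2016": [110, 210]}
)

populations = [latest_population(toy_population, c) for c in ["Italy", "Greece"]]
print(populations)
```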

Plotting the normalised data follows the same iterative approach shown above; the only change is that the loop index is now needed to pick each country’s population, as follows

[Python]
for idx, loc in enumerate(locations):
    df = data.query("location=='" + loc + "'")
    x = df.index
    ax1.scatter(x, 100*df["total_cases"]/populations[idx], label=loc)
    ax2.scatter(x, 100*df.total_deaths/populations[idx], label=loc)
[/Python]

Of course one should change the title and X, Y axes labels accordingly. The result of this graph is shown in the following figure:

Fig. 2. Normalised confirmed COVID-19 data_2020_03_29_12_16_44

Saving of this figure uses the exact same strategy as shown above for the confirmed cases.
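For reference, the normalisation used for the second figure expresses each value as a percentage of the population (100 × count / population). A tiny worked example with made-up numbers:

```python
import pandas as pd

# made-up cumulative case counts for three consecutive days
cases = pd.Series(
    [50, 100, 200],
    index=pd.to_datetime(["2020-03-27", "2020-03-28", "2020-03-29"]),
)
# made-up population of the hypothetical country
population = 10_000

# percentage of the population with a confirmed case
percent = 100 * cases / population
print(percent.tolist())  # [0.5, 1.0, 2.0]
```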

 
 

ENDNOTES

*** The whole process is available as a Colab notebook for closer inspection and experimentation. Please note that access to the live data may need a free account.

*** The featured image in this post is an illustration of COVID-19 by the Public Health Image Library
