December 16, 2009

Book Sample: Matplotlib for Python Developers - Plotting Data


Today's post is a sample from the new book: "Matplotlib for Python Developers" by Sandro Tosi. I will review the book in an upcoming post.

-Corey



Disclosure: Packt Publishing sent me a free copy of this book to review.




Matplotlib for Python Developers


http://www.packtpub.com/matplotlib-python-development/book





Matplotlib for Python Developers


In this two-part article, by Sandro Tosi you will see several examples of Matplotlib usage in real-world situations, showing some of the common cases where we can use Matplotlib to draw a plot of some values.

There is a common workflow for this kind of job:

  1. Identify the best data source for the information we want to plot
  2. Extract the data of interest and elaborate it for plotting
  3. Plot the data

Usually, the hardest part is to extract and prepare the data for plotting. Due to this, we are going to show several examples of the first two steps.


The examples are:

  • Plotting data from a database
  • Plotting data from a web page
  • Plotting the data extracted by parsing an Apache log file
  • Plotting the data read from a comma-separated values (CSV) file
  • Plotting extrapolated data using curve fitting
  • Third-party tools using Matplotlib (NetworkX and mpmath)

Let's begin

Plotting data from a database

Databases often tend to collect much more information than we can simply extract and watch in a tabular format (let's call it the "Excel sheet" report style).

Databases not only use efficient techniques to store and retrieve data, but they are also very good at aggregating it.

One suggestion we can give is to let the database do the work. For example, if we need to sum up a column, let's make the database sum the data, and not sum it up in the code. In this way, the whole process is much more efficient because:

  • There is a smaller memory footprint for the Python code, since only the aggregate value is returned, not the whole result set to generate it
  • The database has to read all the rows in any case. However, if it's smart enough, then it can sum values up as they are read
  • The database can efficiently perform such an operation on more than one column at a time

The data source we're going to query is from an open source project: the Debian distribution. Debian has an interesting project called UDD , Ultimate Debian Database, which is a relational database where a lot of information (either historical or actual) about the distribution is collected and can be analyzed.

On the project website http://udd.debian.org/, we can fi nd a full dump of the database (quite big, honestly) that can be downloaded and imported into a local PostgreSQL instance (refer to http://wiki.debian.org/UltimateDebianDatabase/CreateLocalReplica for import instructions

Now that we have a local replica of UDD, we can start querying it:

# module to access PostgreSQL databases
import psycopg2
# matplotlib pyplot module
import matplotlib.pyplot as plt

Since UDD is stored in a PostgreSQL database, we need psycopg2 to access it. psycopg2 is a third-party module available at http://initd.org/projects/psycopg

# connect to UDD database
conn = psycopg2.connect(database="udd")
# prepare a cursor
cur = conn.cursor()

We will now connect to the database server to access the udd database instance, and then open a cursor on the connection just created.

# this is the query we'll be making
query = """
select to_char(date AT TIME ZONE 'UTC', 'HH24'), count(*)
from upload_history
where to_char(date, 'YYYY') = '2008'
group by 1
order by 1"""


We have prepared the select statement to be executed on UDD. What we wish to do here is extract the number of packages uploaded to the Debian archive (per hour) in the whole year of 2008.

  • date AT TIME ZONE 'UTC': As date field is of the type timestamp with time zone, it also contains time zone information, while we want something independent from the local time. This is the way to get a date in UTC time zone.
  • group by 1: This is what we have encouraged earlier, that is, let the database do the work. We let the query return the already aggregated data, instead of coding it into the program.
# execute the query
cur.execute(query)
# retrieve the whole result set
data = cur.fetchall()

We execute the query and fetch the whole result set from it.

# close cursor and connection
cur.close()
conn.close()

Remember to always close the resources that we've acquired in order to avoid memory or resource leakage and reduce the load on the server (removing connections that aren't needed anymore).

# unpack data in hours (first column) and
# uploads (second column)
hours, uploads = zip(*data)

The query result is a list of tuples, (in this case, hour and number of uploads), but we need two separate lists—one for the hours and another with the corresponding number of uploads. zip() solves this with *data, we unpack the list, returning the sublists as separate arguments to zip(), which in return, aggregates the elements in the same position in the parameters into separated lists. Consider the following example:

In [1]: zip(['a1', 'a2'], ['b1', 'b2'])
Out[1]: [('a1', 'b1'), ('a2', 'b2')]

To complete the code:

# graph code
plt.plot(hours, uploads)
# the the x limits to the 'hours' limit
plt.xlim(0, 23)
# set the X ticks every 2 hours
plt.xticks(range(0, 23, 2))
# draw a grid
plt.grid()
# set title, X/Y labels
plt.title("Debian packages uploads per hour in 2008")
plt.xlabel("Hour (in UTC)")
plt.ylabel("No. of uploads")

The previous code snippet is the standard plotting code, which results in the following screenshot:

From this graph we can see that in 2008, the main part of Debian packages uploads came from European contributors. In fact, uploads were made mainly in the evening hours (European time), after the working days are over (as we can expect from a voluntary project).


Plotting data from the Web

Often, the information we need is not distributed in an easy-to-use format such as XML or a database export but for example only on web sites.

More and more often we find interesting data on a web page, and in that case we have to parse it to extract that information: this is called web scraping .

In this example, we will parse a Wikipedia article to extracts some data to plot. The article is at http://it.wikipedia.org/wiki/Demografia_d'Italia and contains lots of information about Italian demography (it's in Italian because the English version lacks a lot of data); in particular, we are interested in the population evolution over the years.

Probably the best known Python module for web scraping is BeautifulSoup ( http://www.crummy.com/software/BeautifulSoup/). It's a really nice library that gets the job done quickly, but there are situations (in particular with JavaScript embedded in the web page, such as for Wikipedia) that prevent it from working.

As an alternative, we find lxml quite productive (http://codespeak.net/lxml/). It's a library mainly used to work with XML (as the name suggests), but it can also be used with HTML (given their quite similar structures), and it is powerful and easy–to-use.

Let's dig into the code now:

# to get the web pages
import urllib2
# lxml submodule for html parsing
from lxml.html import parse
# regular expression module
import re
# Matplotlib module
import matplotlib.pyplot as plt

Along with the Matplotlib module, we need the following modules:

  • urllib2: This is the module (from the standard library) that is used to access resources through URL (we will download the webpage with this).
  • lxml: This is the parsing library.
  • re: Regular expressions are needed to parse the returned data to extract the information we need. re is a module from the standard library, so we don't need to install a third-party module to use it.
# general urllib2 config
user_agent = 'Mozilla/5.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }
url = "http://it.wikipedia.org/wiki/Demografia_d'Italia"

Here, we prepare some configuration for urllib2, in particular, the user_agent header is used to access Wikipedia and the URL of the page.

# prepare the request and open the url
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)

Then we make a request for the URL and get the HTML back.

# we parse the webpage, getroot() return the document root
doc = parse(response).getroot()

We parse the HTML using the parse() function of lxml.html and then we get the root element. XML can be seen as a tree, with a root element (the node at the top of the tree from where every other node descends), and a hierarchical structure of elements.

# find the data table, using css elements
table = doc.cssselect('table.wikitable')[0]

We leverage the structure of HTML accessing the first element of type table of class wikitable because that's the table we're interested in.

# prepare data structures, will contain actual data
years = []
people = []

Preparing the lists that will contain the parsed data.

# iterate over the rows of the table, except first and last ones
for row in table.cssselect('tr')[1:-1]:

We can start parsing the table. Since there is a header and a footer in the table, we skip the first and the last line from the lines (selected by the tr tag) to loop over.

# get the row cell (we will use only the first two)
data = row.cssselect('td')

We get the element with the td tag that stands for table data: those are the cells in an HTML table.

# the first cell is the year
tmp_years = data[0].text_content()
# cleanup for cases like 'YYYY[N]' (date + footnote link)
tmp_years = re.sub('[.]', '', tmp_years)

We take the first cell that contains the year, but we need to remove the additional characters (used by Wikipedia to link to footnotes).

# the second cell is the population count
tmp_people = data[1].text_content()
# cleanup from '.', used as separator
tmp_people = tmp_people.replace('.', '')

We also take the second cell that contains the population for a given year. It's quite common in Italy to separate thousands in number with a '.' character: we have to remove them to have an appropriate value.

# append current data to data lists, converting to integers
years.append(int(tmp_years))
people.append(int(tmp_people))

We append the parsed values to the data lists, explicitly converting them to integer values.

# plot data
plt.plot(years,people)
# ticks every 10 years
plt.xticks(range(min(years), max(years), 10))
plt.grid()
# add a note for 2001 Census
plt.annotate("2001 Census", xy=(2001, people[years.index(2001)]),
xytext=(1986, 54.5*10**6),
arrowprops=dict(arrowstyle='fancy'))

Running the example results in the following screenshot that clearly shows why the annotation is needed:

In 2001, we had a national census in Italy, and that's the reason for the drop in that year: the values released from the National Institute for Statistics (and reported in the Wikipedia article) are just an estimation of the population. However, with a census, we have a precise count of the people living in Italy.




Plotting data by parsing an Apache log file

Plotting data from a log file can be seen as the art of extracting information from it.

Every service has a log format different from the others. There are some exceptions of similar or same format (for example, for services that come from the same development teams) but then they may be customized and we're back at the beginning.

The main differences in log files are:

  • Fields orders: Some have time information at the beginning, others in the middle of the line, and so on
  • Fields types: We can find several different data types such as integers, strings, and so on
  • Fields meanings: For example, log levels can have very different meanings

From all the data contained in the log file, we need to extract the information we are interested in from the surrounding data that we don't need (and hence we skip).

In our example, we're going to analyze the log file of one of the most common services: Apache. In particular, we will parse the access.log file to extract the total number of hits and amount of data transferred per day.

Apache is highly configurable, and so is the log format. Our Apache configuration, contained in the httpd.conf file, has this log format:

"%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i""

In this way, we extract the day of the request along with the response size from every line.

# prepare dictionaries to contain the data
day_hits = {}
day_txn = {}

These dictionaries will store the parsed data.

# we open the file
with open('<location of the Apache>/access.log') as f:
# and for every line in it
for line in f:

we open the file (we have to select the proper location of access.log as it differs between operating systems and/or installation), and for every line in it

# we pass the line to regular expression
m = apa_line.match(line)
# and we get the 2 values matched back
day, call_size = m.groups()

We parse the line and take the resulting values.

# if the current day is already present
if day in day_hits:
# we add the call and the size of the request
day_hits[day] += 1
day_txn[day] += int(call_size)

else if the current day is already present in the dictionaries, then add the hit and the size to the respective dictionaries.

else:
# else we initialize the dictionaries
day_hits[day] = 1
day_txn[day] = int(call_size)

If the current day is not present, then we need to initialize the dictionaries for a new day.

# prepare a list of the keys (days)
keys = sorted(day_hits.keys())

We prepare a sorted list of dictionary keys, since we need to access it several times.

# prepare a figure and an Axes in it
fig = plt.figure()
ax1 = fig.add_subplot(111)

We prepare a Figure and an Axes in it.

# bar width
width = .4

We define the bars width.

# for each key (day) and it's position
for i, k in enumerate(keys):

Then we enumerate the item in keys so that we have a progressive number associated with each key.

# we plot a bar
ax1.bar(i - width/2, day_hits[k], width=width, color='y')

Now we can plot a bar at position i (but shifted by width/2 to center the tick) with its height proportional to the number of hits, and width set to width. We set the bar color to yellow.

# for each label for the X ticks
for label in ax1.get_xticklabels():
# we hide it
label.set_visible(False)

We hide the labels on the X-axis to avoid its superimposition with the other Axes labels (for transfer size).

# add a label to the Y axis (for the first plot)
ax1.set_ylabel('Total hits')

We set a label for the Y-axis.

# create another Axes instance, twin of the previous one
ax2 = ax1.twinx()

We now create a second Axes, sharing the X-axis with the previous one.

# plot the total requests size
ax2.plot([day_txn[k] for k in keys], 'k', linewidth=2)

We plot a line for the transferred size, using the black color and with a bigger width.

# set the Y axis to start from 0
ax2.set_ylim(ymin=0)

We let the Y-axis start from 0 so that it can be congruent with the other Y-axis.

# set the X ticks for each element of keys (days)
ax2.set_xticks(range(len(keys)))
# set the label for them to keys, rotating and align to the right
ax2.set_xticklabels(keys, rotation=25, ha='right')

We set the ticks for the X-axis (that will be shared between this and the bar plot) and the labels, rotating them by 25 degrees and aligning them to the right to better fit the plot.

# set the formatter for Y ticks labels
ax2.yaxis.set_major_formatter(FuncFormatter(megabytes))
# add a label to Y axis (for the second plot)
ax2.set_ylabel('Total transferred data (in Mb)')

Then we set the formatter for the Y-axis so that the labels are shown in megabytes (instead of bytes).

# add a title to the whole plot
plt.title('Apache hits and transferred data by day

Finally, we set the plot title.

On executing the preceding code snippet, the following screenshot is displayed:

The preceding screenshot tells us that it is not a very busy server, but it still shows what's going on.


Continue Reading Plotting data using Matplotlib: Part 2

No comments: