
Plotting Colourful Graphs with R, RStudio and ggplot2

The Aesthetics of Data Science

Data visualization is a powerful tool for communicating results, and it has recently been receiving more and more attention due to the hype around data science. Integrating a meaningful graph into a paper or your thesis can improve readability and understandability more than any formula or extended textual description. There exists a variety of approaches for visualising data. Recently, a lot of new JavaScript-based frameworks have gained quite some momentum; these can be used in Web applications and apps. A more classical workhorse for data science is the R project and its plotting engine ggplot2. The reason why I decided to stick with R is its popularity and flexibility, which is still impressive. Also, with RStudio there exists a convenient IDE that provides useful features for data scientists.

Plotting Graphs

In this blog post, I demonstrate how to plot time series data and use colours to highlight a specific aspect of the data. Like almost all techniques, R and ggplot2 require practice and training, which I realised again today when I spent quite a bit of time struggling to get a simple plot right.

Currently I am evaluating two systems I developed, and I needed to visualise their storage and execution time demands in comparison. My goal was to create one plot for each non-functional property, the execution time and the storage demand, with each plot depicting both systems’ performance. Each system runs a set of operations, think of create, read, update and delete operations (CRUD). To visualise which of these operations has the greatest effect on a system, I needed to colourise each operation within one graph. This is the easy part. What was more tricky is to provide a defined set of colours for each graph, which can be mapped to each instance of the variable. Things which have the same meaning in both graphs should be visualised in the same way, which requires a little hack.

Prerequisites

Install the following packages via apt:

sudo apt-get install r-base r-recommended r-cran-ggplot2

and install RStudio by downloading the .deb file from the project homepage.
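
Alternatively, the packages can be installed from within R itself, for instance to obtain a newer ggplot2 version than the one shipped by the distribution. This is just a minimal sketch; the CRAN mirror is an assumption:

# Hedged alternative: install ggplot2 directly from CRAN instead of via apt
install.packages("ggplot2", repos = "https://cran.r-project.org")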

Evaluation Data

As an example, we plan to evaluate the storage demand of two different systems and compare the results. Consider the following sample data.

# Set seed to get the same random numbers for this example
set.seed(42);
# Generate 200 random data records
N <- 200
# Generate a random, increasing sequence of integers that we assume is the storage demand in some unit
storage1 = sort(sample(1:100000, size = N, replace = TRUE), decreasing = FALSE)
storage2 = sort(sample(1:100000, size = N, replace = TRUE), decreasing = FALSE)
# Define the operations available and draw a random sample
operationTypes = c('CREATE','READ','UPDATE','DELETE')
operations = sample(operationTypes,N,replace=TRUE)
# Create the dataframe
df <- data.frame(id = 1:N, storage1 = storage1, storage2 = storage2, operations = operations)
df
     id storage1 storage2 operations
1     1       24      238     CREATE
2     2      139     1755     UPDATE
3     3      158     1869     UPDATE
4     4      228     2146       READ
5     5      395     2967     DELETE
6     6      734     3252     CREATE
7     7      789     4049     DELETE
8     8     2909     4109       READ
9     9     3744     4835     CREATE
10   10     3894     4990       READ

....

We created a random data set simulating the characteristics of system measurement data. As you can see, we have a list of operations of the four types CREATE, READ, UPDATE and DELETE and a measurement value for the storage demand in both systems.
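
Before plotting, a quick sanity check on the generated data frame can be helpful. This is optional and not part of the original workflow:

# Optional sanity check of the generated data
str(df)              # column types and a preview of the values
table(df$operations) # how often each CRUD operation was drawn
summary(df$storage1) # distribution of the storage demand of system 1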

The Simple Plot

Plotting two graphs of the columns storage1 and storage2 is straightforward.

# Simple plot
library(ggplot2)
p1 <- ggplot(df) +
  geom_point(aes(x=id,y=storage1,color="Storage 1")) +
  geom_point(aes(x=id,y=storage2,color="Storage 2")) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  scale_color_manual(values=c("Storage 1"="forestgreen", "Storage 2"="aquamarine"), 
                     name="Measurements", labels=c("System 1", "System 2"))

print(p1)

We assign a colour to each point layer. Note that the colour name “Storage 1”, for instance, does of course not denote a colour, but assigns a level to all points of that graph. This level can be thought of as a category, which ensures that all points belonging to the same category have the same colour. As you can see in the definition of the colour scale, we assign the actual colour to this level there. This is the result:
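
If you are unsure which colour names are available for a manual scale, base R can list them. This is just a small side note, not part of the original example:

# List all built-in colour names R knows about
colors()
# Or filter for a family of colours, e.g. all shades of green, to pick from
grep("green", colors(), value = TRUE)
# Hex codes such as "#228B22" are accepted as scale values as well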

Plotting Levels

A common task is to visualise categories or levels of measurement data. In this example, there are four different levels we could observe: CREATE, READ, UPDATE and DELETE.

# Plot with levels
p1 <- ggplot(df) +
  geom_point(aes(x=id,y=storage1,color=operations)) +
  geom_point(aes(x=id,y=storage2,color=operations)) +
  ggtitle("Overview of Measurements") +
  labs(color="Measurements") +
  scale_color_manual(values=c("CREATE"="darkgreen", 
                              "READ"="darkolivegreen", 
                              "UPDATE"="forestgreen", 
                              "DELETE"="yellowgreen"))
print(p1)

Instead of assigning two colours, one for each graph, we can also assign colours to the operations. As you can see in the definition of the layers and of the colour scale, we now map the colours to the variable operations instead. As a result we get differently coloured points per operation, but of course we get them in an identical fashion for both graphs, as the categories are the same for both measurements. The result looks like this:

Now this is obviously not what we want to achieve as we cannot differentiate between the two graphs any more.
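
As a side note, one way to keep the two systems apart without giving up the per-operation colours would be to map the system to a second aesthetic such as the point shape. The following is only a hedged sketch of that alternative and not the approach used in this post; the next section solves the problem with colours alone:

# Alternative sketch: colour encodes the operation, shape encodes the system
p2 <- ggplot(df) +
  geom_point(aes(x=id,y=storage1,color=operations,shape="System 1")) +
  geom_point(aes(x=id,y=storage2,color=operations,shape="System 2")) +
  labs(color="Operations", shape="System")
print(p2)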

Plotting the same Levels for both Graphs in Different Colours

This last part is a bit tricky, as ggplot2 does not allow assigning different colour schemes within one plot. There do exist some hacks for this, but in my opinion they do not improve the readability of the code. In order to apply different colour schemes to the two graphs while still using the categories, I appended two extra columns to the data set. By appending a differentiation between the two graphs we basically double the categories from four to eight, so that each graph now uses its own four categories, and we can then assign distinct colours to them.

df$operationsStorage1 <- paste(df$operations,"-Storage1", sep = '')
df$operationsStorage2 <- paste(df$operations,"-Storage2", sep = '')

p3 <- ggplot(df) +
  geom_point(aes(x=id,y=storage1,color=operationsStorage1)) +
  geom_point(aes(x=id,y=storage2,color=operationsStorage2)) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  labs(color="Operations") +
  scale_color_manual(values=c("CREATE-Storage1"="darkgreen", 
                              "READ-Storage1"="darkolivegreen", 
                              "UPDATE-Storage1"="forestgreen", 
                              "DELETE-Storage1"="yellowgreen",
                              "CREATE-Storage2"="aquamarine", 
                              "READ-Storage2"="dodgerblue",
                              "UPDATE-Storage2"="royalblue",
                              "DELETE-Storage2"="turquoise"))
print(p3)

We then assign the new columns individually, one per system, as the colour value. This ensures that each graph only considers the categories that we assigned in this step. Thus we can use a different colour scheme for each graph and print the corresponding colours in the legend next to the chart. This is the result:

Now we can see which operation was used for every measurement and are still able to distinguish between the two systems.
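
For completeness, the same doubling of categories can also be expressed by reshaping the data into long format and combining the two factors on the fly. This is a hedged sketch assuming the tidyr package is installed; it is not the code used for the figures above, and it falls back to the default palette unless the manual colour scale is added again:

library(tidyr)
# Reshape: one row per measurement, with a 'system' column instead of two value columns
df_long <- pivot_longer(df, cols = c(storage1, storage2),
                        names_to = "system", values_to = "storage")
# interaction() builds the eight combined categories (operation x system) automatically
p4 <- ggplot(df_long, aes(x=id, y=storage,
                          color=interaction(operations, system, sep="-"))) +
  geom_point() +
  labs(color="Operations")
print(p4)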

Data Wrangling with csvkit and SQLite

As mentioned earlier, csvkit is a very convenient tool for handling comma-separated text files, especially when they are too large to be processed with conventional spreadsheet applications like Excel or LibreOffice Calc. The limits of office programs can be reached rather easily, especially when dealing with scientific data. OpenOffice Calc has the following limits:

  • maximum number of rows: 1,048,576
  • maximum number of columns: 1,024
  • maximum number of sheets: 256

Excel also offers 1,048,576 rows but provides 16,384 columns. SQLite, in contrast, allows 2,000 columns by default and, if really needed, up to 32,767 columns when compiled with a specific setting. In terms of row storage, SQLite provides a theoretical maximum of 2^64 (18,446,744,073,709,551,616) rows. This limit is unreachable, since the maximum database size of 140 terabytes will be reached first.
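
The specific compile-time setting mentioned above is SQLITE_MAX_COLUMN. As a hedged sketch, building the SQLite shell from the source amalgamation with a raised limit could look like this:

$: gcc -DSQLITE_MAX_COLUMN=32767 shell.c sqlite3.c -lpthread -ldl -o sqlite3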

The limits we discussed will not be hit by our example of air traffic data, which we obtain from ourairports.com. You can download the sample file, which currently describes 47,409 airports in CSV format, from the linked web page.
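
For reference, the file can also be fetched on the command line. The exact download URL is an assumption here and may have changed, so check the data page on ourairports.com:

$: wget https://ourairports.com/data/airports.csv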

$: csvstat airports.csv 
  1. id
	<type 'int'>
	Nulls: False
	Min: 2
	Max: 316827
	Sum: 2112054844
	Mean: 44549.6602755
	Median: 23847
	Standard Deviation: 77259.1794792
	Unique values: 47409
  2. ident
	<type 'unicode'>
	Nulls: False
	Unique values: 47409
	Max length: 7
  3. type
	<type 'unicode'>
	Nulls: False
	Unique values: 7
	5 most frequent values:
		small_airport:	30635
		heliport:	9098
		medium_airport:	4536
		closed:	1623
		seaplane_base:	927
	Max length: 14
......

This little command provides us with statistics about the columns in the file. We see that the file offers 18 columns, and we can also immediately see the column types, whether there are null values and what the 5 most frequent values are. If we are only interested in a list of the columns, we can print them with the following command.

$: csvcut -n  airports.csv 
  1: id
  2: ident
  3: type
  4: name
  5: latitude_deg
  6: longitude_deg
  7: elevation_ft
  8: continent
  9: iso_country
 10: iso_region
 11: municipality
 12: scheduled_service
 13: gps_code
 14: iata_code
 15: local_code
 16: home_link
 17: wikipedia_link
 18: keywords

As its name suggests, we can also use the csvcut command for cutting specific columns from the CSV file, in order to reduce the size of the file and only retrieve the columns that we are interested in. Imagine you would like to create a list of all airports per region. Simply cut the columns you need and redirect the output into a new file. The tool csvlook provides us with a MySQL-style preview of the data.

$: csvcut --columns=name,iso_country,iso_region  airports.csv > airports_country_region.csv
$: csvlook airports_country_region.csv | head -n 15
|--------------------------------------------------------------------------------+-------------+-------------|
|  name                                                                          | iso_country | iso_region  |
|--------------------------------------------------------------------------------+-------------+-------------|
|  Total Rf Heliport                                                             | US          | US-PA       |
|  Lowell Field                                                                  | US          | US-AK       |
|  Epps Airpark                                                                  | US          | US-AL       |
|  Newport Hospital & Clinic Heliport                                            | US          | US-AR       |
|  Cordes Airport                                                                | US          | US-AZ       |
|  Goldstone /Gts/ Airport                                                       | US          | US-CA       |
|  Cass Field                                                                    | US          | US-CO       |
|  Grass Patch Airport                                                           | US          | US-FL       |
|  Ringhaver Heliport                                                            | US          | US-FL       |
|  River Oak Airport                                                             | US          | US-FL       |
|  Lt World Airport                                                              | US          | US-GA       |
|  Caffrey Heliport                                                              | US          | US-GA       |

We could then sort the list of airports alphabetically in reverse order and write the new list into a new file. We specify the name of the column we want to sort the file by, and we measure the execution time needed by prepending the command time.

$: time csvsort -c "name" --delimiter="," --reverse airports_country_region.csv > airports_country_region_sorted.csv

real    0m1.177s
user    0m1.135s
sys    0m0.042s

A nice feature of csvkit is its option to query CSV files with SQL. You can formulate SELECT queries and it even supports joins and other tricks. Thus you can achieve the same result with just one SQL query.

$: time csvsql -d ',' --query="SELECT name,iso_country,iso_region FROM airports ORDER BY name DESC" airports.csv > sql_airports_country_region.csv

real	0m11.626s
user	0m11.532s
sys	0m0.090s

Obviously, this is not the fastest possibility and may not be suitable for larger data sets. But csvkit offers more: You can create SQL tables automatically by letting csvkit browse through your CSV files. It will try to guess the column type, the appropriate field length and even constraints.

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports airports.csv 
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports -u 0 airports.csv > airport_schema.sql

The second command in the listing above simply stores the CREATE TABLE statement in a separate file. We can import this statement by reading the file in SQLite. Change to the folder where you downloaded SQLite3 and create a new database called AirportDB. The following listing contains SQL-style comments (starting with --) in order to improve readability.

./sqlite3 AirportDB.sqlite


SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
-- change the column separator from its default | to ;
sqlite> .separator ;
-- read the SQL file we created before
sqlite> .read /home/stefan/datawrangling/airport_schema.sql
-- list all tables
sqlite> .tables
Airports
-- print table schema
sqlite> .schema Airports
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

You can also achieve the same results directly from Bash, simply by piping the SQL file to the database.

$: cat ~/datawrangling/airport_schema.sql | ./sqlite3 AirportDB.sqlite
$: ./sqlite3 AirportDB.sqlite ".tables"
Airports
$: ./sqlite3 AirportDB.sqlite ".schema Airports"
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

We created a complex SQL table by automatically parsing CSV files. This opens up a lot of opportunities, also for Excel spreadsheets and other data available as CSV. The great thing about csvkit is that it supports a large variety of database dialects. You can use the same command for the following database systems by adapting the -i parameter:

  • access
  • sybase
  • sqlite
  • informix
  • firebird
  • mysql
  • oracle
  • maxdb
  • postgresql
  • mssql
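
As a small illustration of the dialect switch (PostgreSQL is chosen arbitrarily here), only the -i parameter changes:

$: csvsql -i postgresql -d ',' --db-schema AirportDB --table Airports airports.csv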

All major systems are supported, which is a great benefit. Now that we have the schema ready, we need to import the data into the SQLite database. We can use the SQLite client to import the CSV file into the database, but suddenly we run into a problem! The 12th column contains boolean values, as correctly identified by the csvkit tool. When we inspect the file again with csvlook, we can see that the column contains ‘yes’ and ‘no’ values. Unfortunately, SQLite does not understand this particular notion of boolean values, but rather expects 0 for false and 1 for true, as described in the data types documentation. We have two options: we could replace the yes and no values with their corresponding integers, for instance with awk:

$: awk -F, 'NR>1 { $12 = ($12 == "\"no\"" ? 0 : $12) } 1' OFS=,  airports.csv > airports_no.csv
$: awk -F, 'NR>1 { $12 = ($12 == "\"yes\"" ? 1 : $12) } 1' OFS=,  airports_no.csv > airports_yes.csv

Or, much more comfortably, we could again use csvkit, which helps us out and replaces the values automatically. The following command imports the data into our database. As we already created the table in advance, we can skip that step with the appropriate flag.

$: time csvsql --db "sqlite:///home/stefan/datawrangling/AirportDB.sqlite" --table "Airports" --insert airports.csv --no-create

real    0m11.161s
user    0m10.743s
sys    0m0.169s

This takes a little while, but after a few seconds, we have the data ready. We can then open the database and query our Airport data set.

sqlite> SELECT name,iso_country,iso_region FROM airports ORDER BY name DESC;
Run Time: real 3.746 user 0.087215 sys 0.167298

You can now use the data in more advanced ways and may also utilise database features such as indices in order to speed up the data processing. If we compare the execution of the same query on the CSV file and within SQLite again, the advantage becomes much more obvious when we omit the command line output, for instance by querying the COUNT of the airport names.

$: time csvsql -d ',' --query="SELECT COUNT(name) FROM airports" airports.csv 
COUNT(name)
47409

real	0m11.068s
user	0m10.977s
sys	0m0.086s

-- SQLite
sqlite> .timer on
sqlite> SELECT COUNT(name) FROM Airports;
47409
Run Time: real 0.014 user 0.012176 sys 0.001273
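
As hinted above, an index is one way to speed up queries that filter or sort on a particular column. The following is only a sketch with a freely chosen index name; measure the run time yourself to see the effect on your machine:

-- Hedged sketch: index the name column and rerun the sorted query
sqlite> CREATE INDEX idx_airports_name ON Airports(name);
sqlite> SELECT name,iso_country,iso_region FROM Airports ORDER BY name DESC;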