Data Science

Facebook open sourced its forecasting tool [Prohpet][1] for time series data. Although forecasting is not a trivial task, the libraries are very easy to use and produce nice results quickly. In this basic blog post, I am going to forecast the visitor statistics based on the historical data I collected with Piwik.

Python Prerequisites

Install and initialize a new virtual Python environment

# Install virtual environments package
sudo pip3 install virtualenv
# Create a new folder for the project 
mkdir python-projects
cd python-projects/
# Create a new virtual environment
virtualenv -p python3 py

Install Prophet and its Dependencies

Within your new Python virtual environment, install the required dependencies first and then Prophet

# Linux Dependencies
sudo apt-get install python3-tk
# Python Dependencies
./py/bin/pip3 install cython numpy
# Prohpet
./py/bin/pip3 install fbprophet```


## Get the Data from your Piwik Database

We aggregate the data from the visitors table per day and store the result in a CSV file. In the case of this blog, I started collecting visitor traffic data from early 2013. Prophet allows displaying not only trends and seasonality, but also to forecast into the future.

SELECT DATE_FORMAT(visit_first_action_time,'%Y-%m-%d’), SUM(visitor_count_visits) FROM db_piwik.piwik_log_visit GROUP BY 1 INTO OUTFILE ‘/tmp/visits.csv’ FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY ‘\n’;“```

Usually MySQL runs with a security setting that prevents writing files to the server’s disk (for a good reason). Check the variable secure-file-priv to find the path you can use for exporting.

The data now looks similar like this:

~/python-projects $ head visits.csv 
2013-11-05,3
2014-01-11,4
2014-01-14,2
2014-01-15,10
2014-01-16,8
2014-01-17,6
2014-01-18,1
2014-01-19,1
2014-01-20,1
2014-01-21,6

This is exactly the format which Prophet expects.

Forecasting with Prophet

The short but [nice tutorial][2] basically shows it all. Here is the script, it is basically the very same as from the tutorial:

import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt

df = pd.read_csv('visits.csv')
df.columns = ['ds', 'y']
df['y'] = np.log(df['y'])
df.head()

m = Prophet()
m.fit(df);

future = m.make_future_dataframe(periods=365)
future.tail()

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

figure_forecast = m.plot(forecast);
plt.savefig('forcast.png')

m.plot_components(forecast);
plt.savefig('forcast_component.png')```


The results are the forecast graph and the components as nice graphs. Facebook Prophet incorporates seasonal variations, holidays and trends derived from historical data.

[<img class="aligncenter size-full wp-image-2925" src="./media/2017/06/forcast.png" alt="" width="720" height="432" srcset="./media/2017/06/forcast.png 720w, ./media/2017/06/forcast-300x180.png 300w" sizes="(max-width: 720px) 100vw, 720px" />][3]

[<img class="aligncenter size-full wp-image-2926" src="./media/2017/06/forcast_component.png" alt="" width="648" height="648" srcset="./media/2017/06/forcast_component.png 648w, ./media/2017/06/forcast_component-150x150.png 150w, ./media/2017/06/forcast_component-300x300.png 300w, ./media/2017/06/forcast_component-60x60.png 60w" sizes="(max-width: 648px) 100vw, 648px" />][4]As you can see, the weekend is rather low on visitors and that the beginning summer is also rather weak.





<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Predicting+Visitors+with+Facebook+Prophet" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2017/06/10/predicting-visitors-with-facebook-prophet/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>

 [1]: https://facebookincubator.github.io/prophet/
 [2]: https://facebookincubator.github.io/prophet/docs/quick_start.html#python-api
 [3]: ./media/2017/06/forcast.png
 [4]: ./media/2017/06/forcast_component.png

Jupyter Notebooks are a great way for working with Python interactively. The integration of Python code into documents is very useful for reports or for writing executable documentation of algorithms and functions. The text can be structured and exported in various formats. With the ever increasing popularity of Python based on the data science hype, more and more libraries are available. Although Python3 is considered to be the future of Python, consensus on the question Python 2.7 vs Python 3.5 is not yet reached. There are quite a few differences and Python 3 is not backwards compatible and therefore the code cannot be executed with both versions without modification. When you install Jupyter Notebooks via Anaconda, Python3 is recommended but Python 2.7 packages also exist.

As there is a large number of libraries, which have not yet been ported to Python 3, it can be useful to switch between the language version within a Jupyter Notebook. The following example assumes that you have both Python versions already installed.

Installing a new Kernel

In Jupyter Notebooks, the kernel is responsible for executing Python code. When you install the Anaconda System for Python3, this version also becomes the default for the notebooks. In order to enable Python 2.7 in your notebooks, you need to install a new kernel like this:

The Aesthetics of Data Science

Data visualization is a powerful tool for communicating results and recently receives more and more attention due to the hype of data science. Integrating a meaningful graph into a paper or your thesis could improve readability and understandability more than any formulas or extended textual descriptions can. There exists a variety of different approaches for visualising data. Recently a lot of new Javascript based frameworks have gained quite some momentum, which can be used in Web applications and apps. A more classical work horse for data science is the R project and its plotting engine ggplot2. The reason why I decide to stick with R is its popularity and flexibility, which is still impressive. Also with RStudio, there exists a convenient IDE which provides useful features for data scientists.

Plotting Graphs

In this blog post, I demonstrate how to plot time series data and use colours to highlight a specific aspect of data. As almost all techniques, R and ggplot2 require practise and training, which I realised again today when I spent quite a bit of time struggling with getting a simple plot right.

Currently I am evaluating two systems I developed and I needed to visualize their storage and execution time demands in comparison. My goal was to create a plot for each non-functional property, the execution time and the storage demand, while each plot should depict both systems’ performance. Each system runs a set of operations, think of create, read, update and delete operations (CRUD). Now for visualizing which of these operations has the most effects on the system, I needed to colourise each operation within one graph. This is the easy part. What was more tricky is to provide for each graph a defined set of colours, which can be mapped to each instance of the variable. Things which have the same meaning in both graphs should visualized in the same way, which requires a little hack.

Prerequisits

Install the following packages via apt

sudo apt-get install r-base r-recommended r-cran-ggplot2

and RStudio by downloading the deb – File from the project homepage.

Evaluation Data

As an example,we plan to evaluate the storage demand of two different systems and compare the results. Consider the following sample data.

# Set seed to get the same random numbers for this example
set.seed(42);
# Generate 200 random data records
N <- 200
# Generate a random, increasing sequence of integers that we assume is the storage demand in some unit
storage1 =sort(sample(1:100000, size = N, replace = TRUE),decreasing = FALSE)
storage2 = sort(sample(1:100000, size = N, replace = TRUE),decreasing = FALSE)
# Define the operations availabel and draw a random sample
operationTypes = c('CREATE','READ','UPDATE','DELETE')
operations = sample(operationTypes,N,replace=TRUE)
# Create the dataframe
df  df
     id storage1 storage2 operations
1     1       24      238     CREATE
2     2      139     1755     UPDATE
3     3      158     1869     UPDATE
4     4      228     2146       READ
5     5      395     2967     DELETE
6     6      734     3252     CREATE
7     7      789     4049     DELETE
8     8     2909     4109       READ
9     9     3744     4835     CREATE
10   10     3894     4990       READ

....

We created a random data set simulating the characteristics of system measurement data. As you can see, we have a list of operations of the four types CREATE, READ, UPDATE and DELETE and a measurement value for the storage demand in both systems.

The Simple Plot

Plotting two graphs of thecolumns storage1 and storage2 is straight forward.

# Simple plot
p1 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color="Storage 1")) +
  geom_point(aes(x=id,y=storage2,color="Storage 2")) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  scale_color_manual(values=c("Storage 1"="forestgreen", "Storage 2"="aquamarine"), 
                     name="Measurements", labels=c("System 1", "System 2"))

print(p1)

We assign for each point plot a color. Note that the color nme “Storage 1” for instance of course does not denote a color, but it assignes a level for all points of the graph. This level can be thought of as a category, which ensures that all the points which belong to the same category have the same color. As you can see at the definition of the color scale, we assign the actual color to this level there. This is the result:

Plotting Levels

A common task is to visualise categories or levels of measurement data. In this example, there are four different levels we could observe: CREATE, READ, UPDATE and DELETE.

# Plot with levels
p1 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color=operations)) +
  geom_point(aes(x=id,y=storage2,color=operations)) +
  ggtitle("Overview of Measurements") +
  labs(color="Measurements") +
  scale_color_manual(values=c("CREATE"="darkgreen", 
                              "READ"="darkolivegreen", 
                              "UPDATE"="forestgreen", 
                              "DELETE"="yellowgreen"))
print(p1)

Instead of assigning two colours, one for each graph, we can also assign colours to the operations. As you can see in the definition of the graphs and the colour scale, we map the colours to the variable operations instead. As a result we get differently coloured points per operation, but we get these of course for both graphs in an identical fashion as the categories are the same for both measurements. The result looks like this:

Now this is obviously not what we want to achieve as we cannot differentiate between the two graphs any more.

Plotting the same Levels for both Graphs in Different Colours

This last part is a bit tricky, as ggplot2 does not allow assigning different colour schemes within one plot. There do exist some hacks for this, but the solution does not improve the readability of the code in my opinion. In order to apply different colour schemes for the two graphs while still using the categories, I appended two extra columns to the data set. If we append some differentiation between the two graphs and basically double the categories from four to eight, where each graph now uses its own four categories, we can also assign distinct colours to them.

df$operationsStorage1 <- paste(df$operations,"-Storage1", sep = '')
df$operationsStorage2 <- paste(df$operations,"-Storage2", sep = '')

p3 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color=operationsStorage1)) +
  geom_point(aes(x=id,y=storage2,color=operationsStorage2)) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  labs(color="Operations") +
  scale_color_manual(values=c("CREATE-Storage1"="darkgreen", 
                              "READ-Storage1"="darkolivegreen", 
                              "UPDATE-Storage1"="forestgreen", 
                              "DELETE-Storage1"="yellowgreen",
                              "CREATE-Storage2"="aquamarine", 
                              "READ-Storage2"="dodgerblue",
                              "UPDATE-Storage2"="royalblue",
                              "DELETE-Storage2"="turquoise"))
print(p3)

We then assign the new column for each system individually as colour value. This ensures that each graph only considers the categories that we assigned in this step. Thus we can assign a different color scheme for wach graph and print the corresponding colours in the label (legend) next to the chart. This is the result:

Now we can see which operation was used at every measurement and still be able to distinguish between the two systems.