Programming

Lightning fast integration tests with Docker, MySQL and tmpfs

Integration tests that involve database operations requite to tear down and re-initialize a database multiple times. Although most developer machines offer SSD and enough RAM, database initialization can consume a considerable amount of time nevertheless.

Docker allows defining volumes that are mounted directly into the memory by using tmpfs. We can utilize this feature to utilize the speed up disk operations as during database imports, by moving the data from the disk into the memory.

The following example measures the time for writing 1GB worth of data to an SSD:

dd if=/dev/zero of=/tmp/output bs=1024k count=1024; 
1024+0 Datensätze ein
1024+0 Datensätze aus
1073741824 bytes (1,1 GB, 1,0 GiB) copied, 2,22033 s, 484 MB/s</code>```


For comparison, the following steps create a RAM disk and write the data to memory.

$ sudo mount -o size=1G -t tmpfs none /tmp/tmpfs

$ dd if=/dev/zero of=/tmp/tmpfs/output bs=1024k count=1024; 1024+0 Datensätze ein 1024+0 Datensätze aus 1073741824 bytes (1,1 GB, 1,0 GiB) copied, 0,309017 s, 3,5 GB/s



As you can see writing 1GB to memory is 7x faster. With the following Docker run command, you can spin-up a default MySQL container, where the data directory resides in a tmpfs.

docker run -d
–rm
–name mysql-56-in-memory
-p 3307:3306
–tmpfs /var/lib/mysql:rw,noexec,nosuid,size=1024m
-e MYSQL_ALLOW_EMPTY_PASSWORD=TRUE
-e MYSQL_DATABASE=dev
mysql:5.6



The arguments of Docker run mean the following

  * &#8211;rm: delete the container once it was stopped
  * &#8211;name: a name for the container
  * -p: map the host&#8217;s port 3307 to the port 3306 inside the container. This allows to run multiple MySQL containers in parallel and connect to them from the host via the port specified
  * &#8211;tmpfs: This line mounts the internal directory of the container to a RAM disk. It should be writeable (rm). Noexec prevents the execution of binaries, nosuid prevents changing the permission flags and the size specifies the size occupied by the tmpfs partition in memory. Adapt this to your usecase. The minimum for MySQL is around 200MB. Add the space needed for your data, indices etc.
  * MYSQL\_ALLOW\_EMPTY_PASSWORD does what it implies
  * MYSQL_DATABASE defines the name of a database to be created

If you run this command you can connect to the container like this: _mysql -u root -h 127.0.0.1 -P 3307_

The container behaves like a normal MySQL database, unless the data is not persisted on a hard disk, but only stored in the ephemeral memory. If you stop the container, it will be removed by docker and if you reboot the machine the data will be gone. for obvious reasons this is only a good idea for test data that can be re-created at any time.

You can achieve the same also with Docker Compose if you would like to orchestrate multiple containers.

version: ‘3’ services: mysql-56-integration: container_name: mysql-56-in-memory restart: unless-stopped image: mysql:5.6 environment: - MYSQL_ALLOW_EMPTY_PASSWORD='true’ - MYSQL_HOST=’’ volumes: - data:/var/cache ports: - “3307:3306”

volumes: data: driver_opts: type: tmpfs device: tmpfs```

Jupyter docker stacks with a custom user

Jupyter allows to set a custom user instead of**_jovyan_** which is the default for all containers of the [Jupyter Docker Stack][1]. You need to change this user or its UID and GID in order to get the permissions right when you mount a volume from the host into the Jupyter container. The following steps are required:

  1. Create an unprivileged user and an asociated group on the host. Here we call the user and the group docker_worker
  2. Add your host user to the group. This gives you the permissions to modify and read the files also on the host. This is useful if your working directory on the hist is under source code control (eg. git)
  3. Launch the container with the correct settings that change the user inside the container

It is important to know that during the launch the container needs root privileges in order to change the settings in the mounted host volume and inside the container. After the permissions have been changed, the user is switched back and does not run with root privileges, but your new user. Thus make sure to secure your Docker service, as the permissions inside the container also apply to the host.

Prepare an unprivileged user on the host

1. sudo groupadd -g 1011 docker_worker
2. sudo useradd -s /bin/false -u 1010 -g 1020 docker_worker
3. Add your user to the group: sudo usermod -a -G docker_worker stefan```


# Docker-compose Caveats

It is important to know that docker-compose supports either an array or a dictionary for environment variables ([docs][2]). In the case below we use arrays and we quote all variables. If you accidentally use a dictionary, then the quotes would be passed along to the Jupyter script. You would then see this error message:&nbsp;

/usr/local/bin/start-notebook.sh: ignoring /usr/local/bin/start-notebook.d/* Set username to: docker_worker Changing ownership of /home/docker_worker to 1010:1020 chown: invalid user: ‘'-R’’```

The docker-compose file

version: '2'
services:
    datascience-notebook:
        image: jupyter/base-notebook:latest
        volumes:
            - /tmp/jupyter_test_dir:/home/docker_worker/work            
        ports:
            - 8891:8888
        command: "start-notebook.sh"
        user: root
        environment:
          NB_USER: 'docker_worker'
          NB_UID: 1010
          NB_GID: 1020
          CHOWN_HOME: 'yes'
          CHOWN_HOME_OPTS: -R```


Here you can see that we set the variables that cause the container to ditch jovyan in favor of docker_worker.

> NB\_USER: &#8216;docker\_worker&#8217;  
> NB_UID: 1010  
> NB_GID: 1020  
> CHOWN_HOME: &#8216;yes&#8217;  
> CHOWN\_HOME\_OPTS: -R

This facilitates easy version control of the working directory of Jupyter. I also added the snipped to my [Github Jupyter template][3].

<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Jupyter+docker+stacks+with+a+custom+user" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2018/08/08/jupyter-docker-stacks-with-a-custom-user/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>

 [1]: https://github.com/jupyter/docker-stacks
 [2]: https://docs.docker.com/compose/compose-file/#environment
 [3]: https://github.com/stefanproell/jupyter-notebook-docker-compose/blob/master/README.md

Predicting Visitors with Facebook Prophet

Facebook open sourced its forecasting tool [Prohpet][1] for time series data. Although forecasting is not a trivial task, the libraries are very easy to use and produce nice results quickly. In this basic blog post, I am going to forecast the visitor statistics based on the historical data I collected with Piwik.

Python Prerequisites

Install and initialize a new virtual Python environment

# Install virtual environments package
sudo pip3 install virtualenv
# Create a new folder for the project 
mkdir python-projects
cd python-projects/
# Create a new virtual environment
virtualenv -p python3 py

Install Prophet and its Dependencies

Within your new Python virtual environment, install the required dependencies first and then Prophet

# Linux Dependencies
sudo apt-get install python3-tk
# Python Dependencies
./py/bin/pip3 install cython numpy
# Prohpet
./py/bin/pip3 install fbprophet```


## Get the Data from your Piwik Database

We aggregate the data from the visitors table per day and store the result in a CSV file. In the case of this blog, I started collecting visitor traffic data from early 2013. Prophet allows displaying not only trends and seasonality, but also to forecast into the future.

SELECT DATE_FORMAT(visit_first_action_time,'%Y-%m-%d’), SUM(visitor_count_visits) FROM db_piwik.piwik_log_visit GROUP BY 1 INTO OUTFILE ‘/tmp/visits.csv’ FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY ‘\n’;“```

Usually MySQL runs with a security setting that prevents writing files to the server’s disk (for a good reason). Check the variable secure-file-priv to find the path you can use for exporting.

The data now looks similar like this:

~/python-projects $ head visits.csv 
2013-11-05,3
2014-01-11,4
2014-01-14,2
2014-01-15,10
2014-01-16,8
2014-01-17,6
2014-01-18,1
2014-01-19,1
2014-01-20,1
2014-01-21,6

This is exactly the format which Prophet expects.

Forecasting with Prophet

The short but [nice tutorial][2] basically shows it all. Here is the script, it is basically the very same as from the tutorial:

import pandas as pd
import numpy as np
from fbprophet import Prophet
import matplotlib.pyplot as plt

df = pd.read_csv('visits.csv')
df.columns = ['ds', 'y']
df['y'] = np.log(df['y'])
df.head()

m = Prophet()
m.fit(df);

future = m.make_future_dataframe(periods=365)
future.tail()

forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()

figure_forecast = m.plot(forecast);
plt.savefig('forcast.png')

m.plot_components(forecast);
plt.savefig('forcast_component.png')```


The results are the forecast graph and the components as nice graphs. Facebook Prophet incorporates seasonal variations, holidays and trends derived from historical data.

[<img class="aligncenter size-full wp-image-2925" src="./media/2017/06/forcast.png" alt="" width="720" height="432" srcset="./media/2017/06/forcast.png 720w, ./media/2017/06/forcast-300x180.png 300w" sizes="(max-width: 720px) 100vw, 720px" />][3]

[<img class="aligncenter size-full wp-image-2926" src="./media/2017/06/forcast_component.png" alt="" width="648" height="648" srcset="./media/2017/06/forcast_component.png 648w, ./media/2017/06/forcast_component-150x150.png 150w, ./media/2017/06/forcast_component-300x300.png 300w, ./media/2017/06/forcast_component-60x60.png 60w" sizes="(max-width: 648px) 100vw, 648px" />][4]As you can see, the weekend is rather low on visitors and that the beginning summer is also rather weak.





<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Predicting+Visitors+with+Facebook+Prophet" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2017/06/10/predicting-visitors-with-facebook-prophet/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>

 [1]: https://facebookincubator.github.io/prophet/
 [2]: https://facebookincubator.github.io/prophet/docs/quick_start.html#python-api
 [3]: ./media/2017/06/forcast.png
 [4]: ./media/2017/06/forcast_component.png

Validate Hibernate Search Input with an Analyzer

Stop Words

Hibernate Search lets you easily assign an @Analyzer on Fields, which are used to process terms before they are written to the index. An anlyzer can be used for instance for stemming and removing of words which are so frequent that they are insignificant for the results. These are examples for stop words:

[“a”, “an”, “and”, “are”, “as”, “at”, “be”, “but”, “by”, “for”, “if”,
 “in”, “into”, “is”, “it”, “no”, “not”, “of”, “on”, “or”, “such”, “that”,
 “the”, “their”, “then”, “there”, “these”, “they”, “this”, “to”, “was”, “will”, “with”]```


It is a common technique, to split input search terms into single keywords and use these keywords for combining a complex queries over several fields. The problem with this approach is that if a user provides such stop words as input and you manually split the input string into a list of keywords, for instance with a split method, it can occur that a stop word becomes a single keyword for Hibernate Search to process.

List keywordsList = Arrays.asList(searchKeywords.split(” “));```

Hibernate does not know what to search for an replies with this error message and also suggests a solution.

The query string 'of' applied on field 'title' has no meaningfull tokens to be matched. Validate the query input against the Analyzer applied on this field.```


In orderb to validate the string with the same analyzer as hibernate does when building the index, we can retrieve the analyzer from the class we want to search on and parse the string by providing it as a stream to the analyzer. First we get the analyser like this:

Validate.notNull(entityManager, “Entity manager can’t be null”); FullTextEntityManager fullTextEntityManager = org.hibernate.search.jpa.Search .getFullTextEntityManager(entityManager); QueryBuilder builder = fullTextEntityManager.getSearchFactory() .buildQueryBuilder().forEntity(MyClass.class).get();```

And then we can write a simple method, which takes the whole string as input and chops the whole string into a list of keywords, which is sent through the analyzer. Thus if your input string contains a stop word and it would not be added to the index, it will also not be included in the list. Thus Hibernate won’t even try to search for it, as it never sees the stemmed word in the first place.

/**
     * Validate input against the tokenizer and return a list of terms.
     * @param analyzer
     * @param string
     * @return
     */
    public static List<String> tokenizeString(Analyzer analyzer, String string)
    {
        List<String> result = new ArrayList&lt;String&gt;();
        try
        {
            TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
            stream.reset();
            while (stream.incrementToken())
            {
                result.add(stream.getAttribute(CharTermAttribute.class).toString());
            }
            stream.close();
        } catch (IOException e)
        {
            // not thrown b/c we're using a string reader...
            throw new RuntimeException(e);
        }
        return result;
    }
}```




<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Validate+Hibernate+Search+Input+with+an+Analyzer" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2017/04/13/validate-hibernate-search-input-with-an-analyzer/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>

Using Hibernate Search with Spring Boot

Spring Boot is a framework, that makes it much easier to develop Spring based applications, by following a convention over configuration principle (while in contrast Spring critics claim that the framework’s principle is rather configuration over everything). In this article, I am going to provide an example how to achieve the following:

  • Create a simple Web application based on Spring Boot
  • Persist and access data with Hibernate
  • Make it searchable with Hibernate Search (Lucine)

I use Eclipse with a Gradle plugin for convenience. MySQL will be our back-end for storing the data. The full example can be obtained from my [Github Repository][1].

Bootstrapping: Create a Simple Spring Boot Webapp

The easiest way to start with Spring Boot is heading over to [start.spring.io][2] and create a new project. In this example, I will use Gradle for building the application and handling the dependencies and I add Web and JPA starters.

[][3]

Download the archive to your local drive and extract it to a folder. I called the project SearchaRoo.

Import the Project with Eclipse

Import it as an existing Gradle Project in Eclipse by using the default settings. You will end up with a nice little project structure as shown below:

[][4]

We have a central application starter class denoted SearchaRooAppication.java, package definitions, application properties and even test classes. The great thing with Spring Boot is that it is very simple to start and that you can debug it as every other local Java application. There is no need for remote debugging or complex application server setups.

Prepare the Database

We need a few permissions on our MySQL instance before we can start.

CREATE DATABASE spring_employees;
CREATE USER 'dev'@'localhost' IDENTIFIED BY 'sEcReT';
GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, FILE, REFERENCES, INDEX,
    ALTER, SHOW DATABASES, SUPER, LOCK TABLES, CREATE VIEW, SHOW VIEW 
    on spring_employees.* TO 'dev'@'localhost';
GRANT RELOAD on *.* TO 'dev'@'localhost';
FLUSH PRIVILEGES;```


We can then add the connection details into the application.properties file. We will edit this file several times when the complexity of this project increases.

===============================

= JPA / HIBERNATE

===============================

Specify the DBMS

spring.jpa.database = MYSQL

Show or not log for each sql query

spring.jpa.show-sql = true spring.datasource.url=jdbc:mysql://127.0.0.1/employees?createDatabaseIfNotExist=true spring.datasource.username=dev spring.datasource.password=sEcReT spring.datasource.driver-class-name=com.mysql.jdbc.Driver```

Now the basic database setup is done. We can then start adding model classes.

Getting some Employees on Board

MySQL offers a rather small but well documented sample database called employees, which is hosted on Github. Obtain and import the data as follows:

git clone https://github.com/datacharmer/test_db.git
cd test_db
mysql -u dev -p sEcReT < employees.sql

The script creates a new schema called employees and you will end up with a schema like this:

[][5]

In the course of this article, we are going to model this schema with Java POJOs by annotating the entities and the a appropriate fields with JPA.

Dependencies

Before we can start modelling the entities in Java, have a look at the Gradle build file. We include additional dependencies for the MySQL connector and Apache commons.

buildscript {
	ext {
		springBootVersion = '1.5.1.RELEASE'
	}
	repositories {
		mavenCentral()
	}
	dependencies {
		classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
	}
}

apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'

jar {
	baseName = 'SearchaRoo'
	version = '0.0.1-SNAPSHOT'
}

sourceCompatibility = 1.8

repositories {
	mavenCentral()
}


dependencies {
	compile('org.springframework.boot:spring-boot-starter-data-jpa')
	compile('org.springframework.boot:spring-boot-starter-web')
	testCompile('org.springframework.boot:spring-boot-starter-test')
	compile("mysql:mysql-connector-java")
	compile('org.apache.commons:commons-lang3:3.5')
}

Modelling Reality

The next step covers modelling the data which we imported with Java POJOs. Obviously this is not the most natural way, because in general you would create the model first and then add data to it, but as we already had the data we decided to go in this direction. In the application.properties file, set the database to the imported employees database and set the Hibernate create property to validate. With this setting, we can confirm that we modelled the Java classed in accordance with the database model defined by the MySQL employees database.

An example of such a class is shown below, the other classes can be found in the Github repository.

package at.stefanproell.model;

import java.util.Date;
import java.util.List;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EnumType;
import javax.persistence.Enumerated;
import javax.persistence.Id;
import javax.persistence.OneToMany;

@Entity(name="employees")
public class Employee {
&nbsp; &nbsp;@Id
&nbsp; &nbsp;@Column(name="emp_no")
&nbsp; &nbsp;private int employee_id;
&nbsp; &nbsp;
&nbsp; &nbsp;@Column(name="birth_date")
&nbsp; &nbsp;private Date birthdate;
&nbsp; &nbsp;
&nbsp; &nbsp;@Column(name="first_name")
&nbsp; &nbsp;private String firstname;
&nbsp; &nbsp;
&nbsp; &nbsp;@Column(name="last_name")
&nbsp; &nbsp;private String lastname;
&nbsp; &nbsp;
&nbsp; &nbsp;@Column(name="gender",columnDefinition = "ENUM('M', 'F', 'UNKNOWN') DEFAULT 'UNKNOWN'")
&nbsp; &nbsp;@Enumerated(EnumType.STRING)
&nbsp; &nbsp;private Gender gender;
&nbsp; &nbsp;
&nbsp; &nbsp;@Column(name="hire_date")
&nbsp; &nbsp;private Date hireDate;

&nbsp; &nbsp;@OneToMany(mappedBy="employee")
&nbsp; &nbsp;List<Title> titles; 
&nbsp; &nbsp;
&nbsp; &nbsp;@OneToMany(mappedBy="employee")
&nbsp; &nbsp;List<Department_Employee> employee_department;

&nbsp; &nbsp;// Setters and getters
&nbsp; &nbsp;
&nbsp; &nbsp;
}```


Now that we have prepared the data model, our schema is now fixed and does not change any more. We can deactivate the Hibernate based dynamic generation of the database tables and use the Spring database initialization instead.To see if we modelled the data correctly, we import MySQL employee data dump we obtained before and import it into our newly created schema, which maps the Java POJOs.

## Importing the Initial Data

In the next step, we import the data from the MySQL employee database into our schema spring_hibernate. This schema contains the tables that Hibernate created for us. The following script copies the data between the two schemata. If you see an error, then there is an issue with your model.

<pre class="theme:github lang:mysql decode:true">-- The original data is stored in the database called employees
-- Spring created the new schema called spring_employees
USE `spring_employees`;

-- Departments

INSERT INTO `spring_employees`.`departments`
(`dept_no`,
`dept_name`)
SELECT `departments`.`dept_no`,
    `departments`.`dept_name`
FROM `employees`.`departments`;

-- Employees

INSERT INTO `spring_employees`.`employees`
(`emp_no`,
`birth_date`,
`first_name`,
`gender`,
`hire_date`,
`last_name`)
SELECT `employees`.`emp_no`,
    `employees`.`birth_date`,
    `employees`.`first_name`,
    `employees`.`gender`,
    `employees`.`hire_date`,
    `employees`.`last_name`
FROM `employees`.`employees`;

-- Join table 
INSERT INTO `spring_employees`.`dept_emp`
(`emp_no`,
`dept_no`,
`from_date`,
`to_date`)
SELECT 
`dept_emp`.`emp_no`,
    `dept_emp`.`dept_no`,
    `dept_emp`.`from_date`,
    `dept_emp`.`to_date`
FROM `employees`.`dept_emp`;

-- Join table 

INSERT INTO `spring_employees`.`dept_manager`
(
`emp_no`,
`dept_no`,
`from_date`,
`to_date`)
SELECT `dept_manager`.`emp_no`,
    `dept_manager`.`dept_no`,
    `dept_manager`.`from_date`,
    `dept_manager`.`to_date`
FROM `employees`.`dept_manager`;

-- Titles

INSERT INTO `spring_employees`.`titles`
(`emp_no`,
`title`,
`from_date`,
`to_date`)
SELECT `titles`.`emp_no`,
    `titles`.`title`,
    `titles`.`from_date`,
    `titles`.`to_date`
FROM `employees`.`titles`;

-- Salaries

INSERT INTO `spring_employees`.`salaries`
(`emp_no`,
`salary`,
`from_date`,
`to_date`)
SELECT `salaries`.`emp_no`,
    `salaries`.`salary`,
    `salaries`.`from_date`,
    `salaries`.`to_date`
FROM `employees`.`salaries`;```


We now imported the data in the database schema that we defined for our project. Spring can load schema and initial data during start-up. So we provide two files, one containing the schema and the other one containing the data. To do that, we create two dumps of the database. One containing the schema only, the other one containing the data only.

mysqldump -u dev -psEcReT –no-data –databases spring_employees > src/main/resources/schema.sql mysqldump -u dev -psEcReT –no-create-info –databases employees > src/main/resources/data.sql```

By deactivating the Hibernate data generation and activating the Spring way, the database gets initialized every time the application starts. Change and edit the following lines in the application.properties

spring.jpa.hibernate.ddl-auto=none
spring.datasource.initialize=true
spring.datasource.schema=classpath:/schema.sql
spring.datasource.data=classpath:/data.sql```


Before we can import the data with the scripts, make sure to drop the schema and disable foreign key checks in the schema file and enable them again at the end. Spring ignores the actionable MySQL comments. So your schema file should contain this

<pre class="theme:github lang:default decode:true ">DROP DATABASE IF EXISTS `spring_employees`;
SET foreign_key_checks = 0;

// rest of the code

SET foreign_key_checks = 1;```


And also insert the two foreign key statements to the data file. Note that the import can take a while. If you are happy with the initialized data, you can deactivate the initialization by setting the variable to false: <span class="lang:default decode:true crayon-inline">spring.datasource.initialize=false</span>

The application.properties file meanwhile looks like this:

<pre class="theme:github lang:default decode:true "># ===============================
# = JPA / HIBERNATE
# ===============================

# Specify the DBMS
spring.jpa.database = MYSQL

# Show or not log for each sql query
spring.jpa.show-sql = true
spring.datasource.url=jdbc:mysql://127.0.0.1/spring_employees?createDatabaseIfNotExist=true
spring.datasource.username=dev
spring.datasource.password=sEcReT
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.connectionProperties=useUnicode=true;characterEncoding=utf-8;
spring.jpa.properties.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
# Do not initialize anything
spring.jpa.hibernate.ddl-auto=none
spring.datasource.initialize=false
spring.datasource.schema=classpath:/schema.sql
spring.datasource.data=classpath:/data.sql
spring.datasource.platform=mysql```


# Adding Hibernate Search

Hibernate search offers full-text search capabilities by using a dedicated index. We need to add the dependencies to the build file.

<pre class="theme:github lang:default decode:true ">dependencies {
	compile('org.springframework.boot:spring-boot-starter-data-jpa')
	compile('org.springframework.boot:spring-boot-starter-web')
	testCompile('org.springframework.boot:spring-boot-starter-test')
	compile("mysql:mysql-connector-java")
	compile('org.apache.commons:commons-lang3:3.5')
	compile("org.hibernate:hibernate-search-orm:5.5.6.Final")
	compile('org.springframework.boot:spring-boot-starter-test')
	compile('org.springframework.boot:spring-boot-starter-logging')
	compile('org.springframework.boot:spring-boot-starter-freemarker')
}```


Refresh the gradle file after including the search dependencies.

## Adding Hibernate Search Dependencies

In this step, we annotate the model POJO classes and introduce the full-text search index. Hibernate search utilises just a few basic settings to get started. Add the following variables to tne application properties file.

===============================

= HIBERNATE SEARCH

===============================

Spring Data JPA will take any properties under spring.jpa.properties.* and

pass them along (with the prefix stripped) once the EntityManagerFactory is

created.

Specify the DirectoryProvider to use (the Lucene Directory)

spring.jpa.properties.hibernate.search.default.directory_provider = filesystem

Using the filesystem DirectoryProvider you also have to specify the default

base directory for all indexes (make sure that the application have write

permissions on such directory)

spring.jpa.properties.hibernate.search.default.indexBase = /tmp/SearchRroo/```

Please not that storing the Lucene index in the tmp directory is not the best idea, but for testing we can use this rather futile location. We also use the filesystem to store the index, as this is the simplest approach.

Create a Service

In order to facilitate Hibernate Search on our data, we add a service class, which offers methods for searching. The service uses a configuration, which is injected by Spring during run time. The configuration is very simple.

package at.stefanproell.service;

import javax.persistence.EntityManager;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;



@Configuration
public class HibernateSearchConfiguration {
	private final Logger logger = LoggerFactory.getLogger(HibernateSearchConfiguration.class);

	@Autowired
	private EntityManager entityManager;

	@Bean
	HibernateSearchService hibernateSearchService() {
		HibernateSearchService hibernateSearchService = new HibernateSearchService(entityManager);
		hibernateSearchService.initializeHibernateSearch();
		return hibernateSearchService;
	}
}```


The @Configuration is loaded when Spring builds the application context. It provides a bean of our service, which can then be injected into the application. The service itself provides methods for creating and searching the index. In this example, the search method is very simple: it only searches on the first and the last name of an employee and it allows users to make one mistake (distance 1).

<pre class="theme:github lang:default decode:true">package at.stefanproell.service;

import java.util.List;

import javax.persistence.EntityManager;
import javax.persistence.NoResultException;
import javax.persistence.PersistenceContext;

import org.apache.lucene.search.Query;
import org.hibernate.search.jpa.FullTextEntityManager;
import org.hibernate.search.jpa.Search;
import org.hibernate.search.query.dsl.QueryBuilder;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;


import at.stefanproell.model.Employee;
import ch.qos.logback.classic.Logger;

@Service
public class HibernateSearchService {

&nbsp; &nbsp;private final Logger logger = (Logger) LoggerFactory.getLogger(HibernateSearchService.class);

&nbsp; &nbsp;private final EntityManager entityManager;

&nbsp; &nbsp;@Autowired
&nbsp; &nbsp;public HibernateSearchService(EntityManager entityManager) {
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;super();
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;this.entityManager = entityManager;
&nbsp; &nbsp;}


&nbsp; &nbsp;public void initializeHibernateSearch() {

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;try {
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;fullTextEntityManager.createIndexer().startAndWait();
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;} catch (InterruptedException e) {
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;// TODO Auto-generated catch block
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;e.printStackTrace();
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;}
&nbsp; &nbsp;}
&nbsp; &nbsp;
&nbsp; &nbsp;@Transactional
&nbsp; &nbsp;public List<Employee> fuzzySearch(String searchTerm){

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;FullTextEntityManager fullTextEntityManager = Search.getFullTextEntityManager(entityManager);
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;QueryBuilder qb = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity(Employee.class).get();
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;Query luceneQuery = qb.keyword().fuzzy().withEditDistanceUpTo(1).withPrefixLength(1).onFields("firstname", "lastname")
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;.matching(searchTerm).createQuery();

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;javax.persistence.Query jpaQuery = fullTextEntityManager.createFullTextQuery(luceneQuery, Employee.class);

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;// execute search

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;List<Employee> employeeList = null;
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;try {
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;employeeList&nbsp; = jpaQuery.getResultList();
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;} catch (NoResultException nre) {
&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;logger.warn("No result found");

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;}

&nbsp; &nbsp;&nbsp;&nbsp; &nbsp;return employeeList;

&nbsp; &nbsp;
&nbsp; &nbsp;}
}

The service implementation currently only contains an initialization method, which used for creating the Lucene index on the filesystem. Before we can test the index, we need to have at least one indexed entity. This can be achieved by simply adding the annotation @Indexed to the POJO.

Hibernate Search and Spring Boot: Building Bridges

Hibernate Search is a very convenient way for storing database content in a Lucine index and add fulltext search capabilities to data driven projects simply by annotating classes. It can be easily integrated into Spring Boot applications and as long as only the basic features are used, it works out of the box. The fun starts when the Autoconfiguration cannot find out how to properly configure things automatically, then it gets tricky quite quickly. Of course this is natural behaviour, but one gets spoiled quite quickly. 

Using the latest Features: Hibernate ORM, Hibernate Search and Spring Boot

The current version of Spring Boot is 1.5.2. This version uses Hibernate ORM 5.0. The latest stable Hibernate Search versions are 5.6.1.Final and 5.7.0.Final, which in in contrast require Hibernate ORM 5.1 and 5.2 respectively. Also you need Java 8 now. For this reason if you need the latest Spring Search features in combination with Spring Boot, you need to adapt the dependencies as follows:

buildscript {
	ext {
		springBootVersion = '1.5.1.RELEASE'
	}
	repositories {
		mavenCentral()
	}
	dependencies {
		classpath("org.springframework.boot:spring-boot-gradle-plugin:${springBootVersion}")
	}
}

apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'

jar {
	baseName = 'SearchaRoo'
	version = '0.0.1-SNAPSHOT'
}

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
	mavenCentral()
}


dependencies {
	compile('org.springframework.boot:spring-boot-starter-data-jpa')
	compile('org.springframework.boot:spring-boot-starter-web')
	testCompile('org.springframework.boot:spring-boot-starter-test')
	compile("mysql:mysql-connector-java")
	compile('org.apache.commons:commons-lang3:3.5')
	compile('org.springframework.boot:spring-boot-starter-test')
	compile('org.springframework.boot:spring-boot-starter-logging')
	compile('org.springframework.boot:spring-boot-starter-freemarker')

	// Hibernate Search
    compile("org.hibernate:hibernate-core:5.2.8.Final")
    compile("org.hibernate:hibernate-search-orm:5.7.0.Final")    
            configurations.all 
    {
    	exclude group: "org.hibernate:", module: "hibernate-entitymanager"
	}
}```


Note that the Hibernate Entity Manager needs to be excluded, because it has been integrated into the core in the new Hibernate version. Details are given in the [Spring Boot documentation][1].

## Enforcing the Dependencies to be Loaded in the Correct Sequence

As written earlier, Spring Boot takes care of a lot of configurations for us. Most of the time, this works perfectly and reduces the pain for configuring a new application manually. In some particular cases, Spring cannot figure out that there exists a dependency between different services, which needs to be resolved in a specified order. A typical use case is the implementation of FieldBridges for Hibernate Search. FieldBrides translate between the actual Object from the Java World and the representation of such an object in the Lucene index. Typically an [EnumBridge][2]is used for indexing Enums, which are often used for realizing internationalization (I18n).

When the Lucene Index is created, Hibernate checks if Enum fields need to be indexed and if there exist Bridge that converts between the object and the actual record in the Index. The problem here is that Hibernate JPA is loaded at a very early stage in the Spring Boot startup proces. The problem only arises if the BridgeClass utilises @Autowired&nbsp;fields which get injected. Typically, these fields would get injected when the&nbsp;AnnotationBeanConfigurerAspect bean is loaded.&nbsp;Hibernate creates the session with the session factory auto configuration before the&nbsp;spring configurer aspect bean was loaded. So the FieldBridge used by Hibernate during the initialization of the index does not have the service injected yet, causing a nasty Null Pointer Exception.&nbsp;

### Example EnumBridge

The following EnumBridge example utilises an injected Service, which needs to be available before Hibernate starts. If not taken care of, this causes a Null Pointer Exception.

@Configurable public class HibernateSearchEnumBridgeExample extends EnumBridge { private I18nMessageService i18nMessageService;

@Autowired
@Required
public void setI18nMessageService(I18nMessageService service) {
this.i18nMessageService = service;
}

@Override
public String objectToString(Object object)
{
     return  i18nMessageService.getMessageForEnum(object);
}


@Override
public Enum<? extends Enum> stringToObject(String name)
{
    return Enum.valueOf(name);

}

@Override
public void setAppliedOnType(Class<?> returnType)
{
    super.setAppliedOnType(returnType);
}

}```

Enforce Loading the Aspect Configurer Before the Session Factory

In order to enforce that theAnnotationBeanConfigurerAspect is created before the Hibernate Session Factory is created, we simply implement our own HibernateJpaAutoConfiguration by extension and add the AnnotationBeanConfigurerAspect to the constructor. Spring Boot now knows that it needs to instantiate the AnnotationBeanConfigurerAspect before it can instantiate the HibernateJpaAutoConfiguration and we then have wired Beans ready for the consumption of the bridge. I found the correct hint [here][3] and [here][4].

@Configuration
public class HibernateSearchConfig extends HibernateJpaAutoConfiguration {

	public HibernateSearchConfig(DataSource dataSource, JpaProperties jpaProperties,
				AnnotationBeanConfigurerAspect beanConfigurerAspect,
				ObjectProvider<JtaTransactionManager> jtaTransactionManager,
				ObjectProvider<TransactionManagerCustomizers> transactionManagerCustomizers) {

			super(dataSource, jpaProperties, jtaTransactionManager, transactionManagerCustomizers);
	}
}```


As it has turned out, using @DependsOn annotations did not work and also @Ordering the precedence of the Beans was not suffucient. With this little hack, we can ensure the correct sequence of initialization.

<div class="twttr_buttons">
  <div class="twttr_twitter">
    <a href="http://twitter.com/share?text=Hibernate+Search+and+Spring+Boot%3A+Building+Bridges" class="twitter-share-button" data-via="" data-hashtags=""  data-size="default" data-url="https://blog.stefanproell.at/2017/03/10/hibernate-search-and-spring-boot/"  data-related="" target="_blank">Tweet</a>
  </div>
  
  <div class="twttr_followme">
    <a href="https://twitter.com/@stefanproell" class="twitter-follow-button" data-show-count="true" data-size="default"  data-show-screen-name="false"  target="_blank">Follow me</a>
  </div>
</div>

 [1]: http://docs.spring.io/spring-boot/docs/1.5.2.RELEASE/reference/htmlsingle/
 [2]: https://docs.jboss.org/hibernate/search/5.7/api/
 [3]: http://blog.piotrturski.net/2014/05/spring-boot-dependson-is-not-enough.html
 [4]: https://hibernate.atlassian.net/browse/HSEARCH-1316

Deploying MySQL in a Local Development Environment

Installing MySQL via apt-get is a simple task, but the migration between different MySQL versions requires planning and testing. Thus installing one central instance of the database system might not be suitable, when the version of MySQL or project specific settings should be switched quickly without interfering with other applications. Using one central instance can quickly become cumbersome. In this article, I will describe how any number of MySQL instances can be stored and executed from within a user’s home directory.

Adapting MySQL Data an Log File Locations

Some scenarios might require to run several MySQL instances at once, other scenarios cover sensitive data, where we do not want MySQL to write any data on non-encrypted partitions. This is especially true for devices which can get easily stolen, for instance laptops. If you use a laptop for developing your applications from time to time, chances are good that you need to store sensitive data in a database, but need to make sure that the data is encrypted when at rest. The data stored in a database needs to be protected when at rest.

This can be solved with full disk encryption, but this technique has several disadvantages. First of all, full disk encryption only utilises one password. This entails, that several users who utilise a device need to share one password, which reduces the reliability of this approach. Also when the system needs to be rebooted, full disk encryption can become an obstacle, which increases the complexity further.

Way easier to use is the transparent home directory encryption, which can be selected during many modern Linux setup procedures out of the box. We will use this encryption type for this article, as it is reasonable secure and easy to setup. Our goal is to store all MySQL related data in the home directory and run MySQL with normal user privileges.

Creating the Directory Structure

The first step is creating a directory structure for storing the data. In this example, the user name is stefan, please adapt to your needs.

Switch the Git Clone Protocol from HTTPS to SSH

Gitlab offers several options for interacting with remote repositories: git, http, https and ssh. The first option – git – is the native transport protocol and does not encrypt the traffic. The same applies for http, rendering https and ssh the only feasible protocols if you commit and retrieve data via insecure networks. Ssh and https are also both available via the web interfaces of Github and Gitlab. In both systems you can simply copy and paste the clone URLs including the protocol. The following screenshot shows the Github version.

[][1]

HTTPS

The simplest way to fetch the repository is to just copy the default HTTPS URL and clone it to the local drive. Git will ask you for the Github credentials.

:~/Projects$ git clone https://github.com/username/test-project.git
Klone nach 'test-project' ...
Username for 'https://github.com': 

You will be asked for the credentials every time you interact with the Github remote repository. Per default, git stores credentials for 5 minutes. Instead of waiting so long, we can just drop the credentials and proceed with an empty cache again.

git credential-cache exit```


To make our live a little easier, we can store the username. In this example, we store this information only locally, valid for this cloned repository only. The same settings can also be applied globally.

<pre class="theme:github lang:default decode:true">git config user.email "user@example.org"
git config user.name "username"

Git will store that information locally (i.e. inside the repository) in the config file:

Getting Familiar with Eclipse Again: Git Integration in Comparison with IntelliJ IDEA

Eclipse and IntelliJ are both great Java IDEs, which have their own communities, advantages and disadvantages. After having spent a few years in JetBrains IntelliJ Community Edition, I got accustomed to the tight and clean Git integration into the user interface. Now I consider switching back up Eclipse, I stumbled over a few things that I try to describe in this post.

IntelliJ and Eclipse Handle Project Structures Differently

Eclipse utilises a workspace concept, which allows to work on several projects at the same time. IntelliJ in contrast allows only one open project and organizes substructures in modules. A comparison of these concepts can be found here. These two different viewpoints have effects on the way how Git is integrated into the workflow.

Sharing Projects

The different views on project structures of both IDEs imply that Git repositories are also treated differently. While IntelliJ utilises the root of a repository directly, Eclipse introduces a subfolder for the project. This leads to the interesting observation that importing a project from Git again into an Eclipse workspace requires a small adaption in order to let Eclipse recognize the structure again.

A Small Workflow

In order to get familiar again with Eclipse, I created a small test project, which I then shared by pushing it to a Git repository. I then deleted the project and tried to re-import it again.

Step 1

Create new test project. In this case a Spring Boot Project, which works with Maven. Note that the new project is stored in the Eclipse workspace.

Step 2

As the second step, we create a new repository. Login into Github or your Gitlab instance and create a new project. Initialize it so that you have a master branch ready and copy the URL of the repository. We will then add this repository by opening the Git Repository perspective in Eclipse and add the repository. You can provide a default location for your local repositories in the Eclipse -> Team -> Git properties. In the Git Repository perspective, you can then see the path of the local storage location and some information about the repository, for instance that the local and the master branch are identical (they have the same commit hash). Note that the Git path is different than your workspace project path.

Step 3

We now have a fresh Java Maven based project in our Eclipse workspace and an empty Git repository in the default location for Git repositories in a different location. What we need to do next is to share the source code, by moving it into the Git storage location and add it to the Git index. Eclipse can help us with that,, by using the Team->Share menu from the Project Explorer view, when right clicking on the project.

Step 4

In the next step, we need to specify the Git repository we want to push our code to. This is an easy step, simply select the repository we just created from the drop down menu. In the menu you can see that the content of the current project location on the left side will be moved to the right side. Eclipse created a new subfolder within the repository for our project. This is something that IntelliJ would not do.

In this step, eclipse separates the local and custom project metadata from the actual source code, which is a good thing.

Step 5

In the fifth step, we simply apply some changes and commit and push them to the remote repository using the git staging window.

After this step, the changes are visible in the remote repository as well and available to collaborators. In order to simulate someone, who is going to checkout our little project from Gitlab, we delete the project locally and also remove the git repository.

Step 6

Now we start off with a clean workspace and clone the project from Gitlab. In the git repositories window, we can select clone project and provide the same URL again.

Step 7

In the next screen, we select the local destination for the cloned project. This could be for instance the default directory for Git projects or any other location on your disk. Confirm this dialogue.

Step 8

Now comes the tricky part, which did not work as expected in Eclipse Neon 4.6.1. Usually, one would tell Eclipse, that the cloned project is a Maven project, and it should detect the pom.xml file and download the dependencies. Todo so, we would select Import-> Git -> Projects from Git and clone the repository from this dialogue. Then, as a next step, we would select the Configure -> Convert to Maven Project option, but Eclipse does not seem to recognize the Maven structure. It would only show the files and directories, but not consider the Maven dependencies specified in the pom.xml file.

What happens is that Eclipse tries to add a new pom.xml file and ignores the actual one.

Of course this is a problem and does not work.

Step 9 – Solution

Instead of using the method above, just clone the repository from the Git Repository perspective and then go back to the Project Explorer. Now instead of importing the project via the Git menu, chose the existing Maven project and select the path of the Git repository we cloned before.

And in the next dialogue, specify the path:

As you can see, now Eclipse found the correct pom.xml file and provides the correct dependencies and structure!

Conclusion

Which IDE you prefer is a matter of taste and habit. Both environments do provide a lot of features for developers and differ in the implementation of these features. With this short article, we tried to understand some basic implications of the two philosophies how Eclipse and IntelliJ handle project structures. Once we anticipate these differences, it becomes easy to work with both of them.

Parsing SQL Statements

JDBC and the Limits of ResultSet Metadata

For my work in the area of data citation, I need to analyse queries, which are used for creating subsets. I am particularly interested in query parameters, sortings and filters. One of the most commonly used query languages is SQL, which is used by many relational database management systems such as MySQL. In some cases, the interaction with databases is abstract, meaning that there is hardly any SQL statements executed directly. The SQL statements are rather built on the fly by object relational mappers such as Hibernate. Other scenarios use SQL statements as String and also prepared statements, which are executed via JDBC. However,  analysing SQL statements is tricky as the language is very flexible.

In order to understand what columns have been selected, it is sufficient to utilise the ResultSet Metadata and retrieve the column names from there. In my case I need to extract this imformation from the query in advance and potentially enforce a specific sorting by adding columns to the ORDER BY clause. In this scenario, I need to parse the SQL statement and retrieve this information from the statement itself. Probably the best way to do this would be to implement a parser for the SQL dialect with ANTLR (ANother Tool for Language Recognition). But this is quite a challenge, so I decided to take a shortcut: FoundationDB.

The FoundationDB Parser

FoundationDB was a NoSQL database which provided several layers for supporting different paradigms at once. I am using past tense here, because the project got acquired by Apple in 2015 and since then does pursue the open source project any more. However, the Maven libraries for the software are still available at Maven Central. FoundationDB uses its own SQL parser, which understands standard SQL queries. These queries can be interpreted as a tree and the parser library allows traversing SQL statements and analyse the nodes. We can use this tree to parse and interpret SQL statements and extract additional information.

The Foundations of FoundationDB

The FoundationDB parser can be included into your own project with the following Maven dependency:

<dependency>
    <groupId>com.foundationdb&lt;/groupId&gt;
    <artifactId>fdb-sql-parser&lt;/artifactId&gt;
    <version>1.6.1&lt;/version&gt;
</dependency>

The usage of the parser is straight forward. We use the following example SQL statement as input:

	FROM tableA AS a, tableB AS b 
	WHERE a.firstColumn = b.secondColumn AND 
	b.thirdColumn < 5 
	ORDER BY a.thirdColumn,a.secondColumn DESC

The following function calls the parser and prints the tree of the statement.

	/*
     * Print a SQL statement
     * @param sqlString
     */
	public void parseSQLString(String sqlString) {
                Parser parser = new Parser(); 
		StatementNode stmt;
		try {
		    stmt = this.parser.parseStatement(sqlString);
                    stmt.treePrint();

		} catch (StandardException e) {
			e.printStackTrace();
		}
	}

The resulting tree is listed below. The statement has also been normalized, which ensures a stable sequence of the parameters.

name: null
updateMode: UNSPECIFIED
statementType: SELECT
resultSet: 	
	com.foundationdb.sql.parser.SelectNode@21b8d17c
	isDistinct: false
	resultColumns: 		
		com.foundationdb.sql.parser.ResultColumnList@6433a2

		[0]:		
		com.foundationdb.sql.parser.ResultColumn@5910e440
		exposedName: firstcolumn
		name: firstcolumn
		tableName: null
		isDefaultColumn: false
		type: null
		expression: 			
			com.foundationdb.sql.parser.ColumnReference@6267c3bb
			columnName: firstcolumn
			tableName: a
			type: null
		[1]:		
		com.foundationdb.sql.parser.ResultColumn@533ddba
		exposedName: secondcolumn
		name: secondcolumn
		tableName: null
		isDefaultColumn: false
		type: null
		expression: 			
			com.foundationdb.sql.parser.ColumnReference@246b179d
			columnName: secondcolumn
			tableName: b
			type: null
		[2]:		
		com.foundationdb.sql.parser.ResultColumn@7a07c5b4
		exposedName: thirdcolumn
		name: thirdcolumn
		tableName: null
		isDefaultColumn: false
		type: null
		expression: 			
			com.foundationdb.sql.parser.ColumnReference@26a1ab54
			columnName: thirdcolumn
			tableName: b
			type: null
	fromList: 		
		com.foundationdb.sql.parser.FromList@3d646c37

		[0]:		
		com.foundationdb.sql.parser.FromBaseTable@41cf53f9
		tableName: tablea
		updateOrDelete: null
		null
		correlation Name: a
		a
		[1]:		
		com.foundationdb.sql.parser.FromBaseTable@5a10411
		tableName: tableb
		updateOrDelete: null
		null
		correlation Name: b
		b
	whereClause: 		
		com.foundationdb.sql.parser.AndNode@2ef1e4fa
		operator: and
		methodName: and
		type: null
		leftOperand: 			
			com.foundationdb.sql.parser.BinaryRelationalOperatorNode@306a30c7
			operator: =
			methodName: equals
			type: null
			leftOperand: 				
				com.foundationdb.sql.parser.ColumnReference@b81eda8
				columnName: firstcolumn
				tableName: a
				type: null
			rightOperand: 				
				com.foundationdb.sql.parser.ColumnReference@68de145
				columnName: secondcolumn
				tableName: b
				type: null
		rightOperand: 			
			com.foundationdb.sql.parser.BinaryRelationalOperatorNode@27fa135a
			operator: <
			methodName: lessThan
			type: null
			leftOperand: 				
				com.foundationdb.sql.parser.ColumnReference@46f7f36a
				columnName: thirdcolumn
				tableName: b
				type: null
			rightOperand: 				
				com.foundationdb.sql.parser.NumericConstantNode@421faab1
				value: 5
				type: INTEGER NOT NULL
orderByList: 	
	com.foundationdb.sql.parser.OrderByList@2b71fc7e
	allAscending: false
	[0]:	
	com.foundationdb.sql.parser.OrderByColumn@5ce65a89
	ascending: true
	nullsOrderedLow: false
	columnPosition: -1
	expression: 		
		com.foundationdb.sql.parser.ColumnReference@25f38edc
		columnName: thirdcolumn
		tableName: a
		type: null
	[1]:	
	com.foundationdb.sql.parser.OrderByColumn@1a86f2f1
	ascending: false
	nullsOrderedLow: false
	columnPosition: -1
	expression: 		
		com.foundationdb.sql.parser.ColumnReference@3eb07fd3
		columnName: secondcolumn
		tableName: a
		type: null

This tree offers a lot of information, which can be used programmatically as well. In the top of the output, we can see that the statement was a SELECT statement and that it was not DISTINCT. Then follows the ResultSet, which contains a list of the three ResultColumns, which have been specified in the SELECT clause. We can see the column names and the table names from which they are drawn. The next block provides the referenced tables (the FROM list) and their alias names. The WHERE – block contains the operands which have been used for filtering and last but not least, there is the list of ORDER BY clauses and their sorting directions.

The Visitor

In order to access the information shown above programmatically, we need to access the content of the node one by one. This can be achieved with the visitor pattern, which traverses all the nodes of the tree. The following listing shows how the visitor pattern can be used for accessing the list of columns from the SELECT clause.

	**
     * Return a list of columns of a SELECT clause
     * @param sql
     * @return
     */
	public ArrayList selectColumnsList(String sql){
        SQLParser parser = new SQLParser();
        BooleanNormalizer normalizer = new BooleanNormalizer(parser);
        StatementNode stmt = null;

        try {
            stmt = parser.parseStatement(sql);
            stmt = normalizer.normalize(stmt);
        } catch (StandardException e) {
            e.printStackTrace();
        }


        final ArrayList<ResultColumn> columns = new ArrayList&lt;ResultColumn&gt;();
        Visitor v = new Visitor() {

            @Override
            public boolean visitChildrenFirst(Visitable node) {
                if (node instanceof SelectNode)
                    return true;
                return false;
            }

            @Override
            public Visitable visit(Visitable node) throws StandardException {
                if (node instanceof ResultColumn) {
                    ResultColumn resultColumn = (ResultColumn) node;
                    columns.add(resultColumn);
                    System.out.println("Column " + columns.size()+ ": " + resultColumn.getName());
                }
                return null;
            }

            @Override
            public boolean stopTraversal() {
                // TODO Auto-generated method stub
                return false;
            }

            @Override
            public boolean skipChildren(Visitable node) throws StandardException {
                if (node instanceof FromList) {
                    return true;
                }
                return false;
            }
        };

        try {
            stmt.accept(v);
        } catch (StandardException e) {
            e.printStackTrace();
        }

        return columns;

    }

This code example, we define a visitor which traverses all the ResultColumn nodes. Every time the current node is an instance of ResultColumn, we add this node to our list of columns. The nodes are only visited, if they are children of a SELECT statement. This is our entry point into the tree. We leave the tree when we reach the FROM list. We then apply the visitor to the statement, which initiates the traversal. As a result, we receive a list of columns which have been used for the result set.

In order to get the list of ORDER BY columns, we can utilise a similar approach. The following functions gives an example:

/**
     * Return list of order by clauses
     * @param sqlText
     * @return
     */
	public OrderByList orderByColumns(String sqlText){

		SQLParser parser = new SQLParser();
		BooleanNormalizer normalizer = new BooleanNormalizer(parser);
		StatementNode stmt;
        OrderByList orderByList = null;
        try {
			stmt = parser.parseStatement(sqlText);
			stmt = normalizer.normalize(stmt);
            CursorNode node = (CursorNode) stmt;
            orderByList = node.getOrderByList();
            int i=0;
            for(OrderByColumn orderByColumn : orderByList){
                i++;
                String direction;
                if(orderByColumn.isAscending()){
                    direction="ASC";
                }else{
                    direction="DESC";
                }
                System.out.println("ORDER BY Column " +i+ ": " +orderByColumn.getExpression().getColumnName()+ " Direction: " + direction );

            }
        } catch (StandardException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
		return orderByList;


	}

This time, we retrieve the list of ORDER BY columns directly from the CurserNode. Similar principles can be used for manipulating SQL statements and apply a different sorting for instance.

An Interactive Map with Leaflet, GeoJSON, and jQuery Using Bootstrap

A Side Project – An Interactive Parking Map of Innsbruck

When I recently moved to Innsbruck, I noticed that there was no interactive map for the parking system available. The amount of time you can park your car depends on the zone your car is located in. There are 20 parking zones and they are defined by their bordering streets in the city.  Innsbruck is very dense and parking is always a hot topic. So I thought having an interactive map makes it easier to find zones where you can leave your car longer, also if your are not so familiar with the street names. The city of Innsbruck offers the GIS data at the open data portal of Austria, which made it quite easy to implement such a map. I used the following technologies for creating this map:

The source code is available at my Github account. The implementation is available here and also at the Austrian Open Data Portal.

Code Snippets

Explaining the whole source code would be a bit too much for this post and most of it is pretty self explanatory, but in the following section I would like to highlight a few things of the project.

Initializing

In the top of the HTML file, we load the JavaScript file which contains all the functions and variables for our implementation. The script is called parkraum.js (parkraum means parking space in German).

The initialization of the Javascript code is straight forward with jQuery. I structured the initialization into a few components, as you can see in the following example.

<script>
    $(document).ready(function() {
	initMap();
	placeZonesOnMap();
	loadScrolling();
	populateParkzoneDropdown();
    });
</script>

We first initialize the Map itself, then place the parking zones, initialize the page scrolling to make it more smoothly and then populate the drop down menu with the available parking zones.

Initialize the Map

The first step is the initialization of the map with Leaflet. Note that map is a global variable. The coordinates [47.2685694, 11.3932759] are the center of Innsbruck annd 14 is the zoom level. In order to offer the map on a public page, you need to register with mapbox, a service which provides the tiles for the map. Mapbox is free for 50k map views per month. Make sure to use your own key and show some attribution.

// Initialize map and add a legend
function initMap() {
    map = L.map('map').setView([47.2685694, 11.3932759], 14);
    tileLayer = L.tileLayer('https://api.tiles.mapbox.com/v4/{id}/{z}/{x}/{y}.png?access_token=YOUR_API_KEY, {
        maxZoom: 18,
        attribution: 'Map data &copy; <a href="http://openstreetmap.org">OpenStreetMap&lt;/a&gt; contributors, ' +
            '<a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA&lt;/a&gt;, ' +
            'Imagery © <a href="http://mapbox.com">Mapbox&lt;/a&gt;',
        id: 'mapbox.light'
    }).addTo(map);

    addLegend();
}

We also add a legend to the map, for indicating the parking area type with colors. There exist 4 types of parking areas and we just add little squares with that colors to the map. The legend improves the readability of the map.

/* Add a legend to the map. */
function addLegend(){
    var legend = L.control({position: 'bottomright'});
    legend.onAdd = function (map) {

    var div = L.DomUtil.create('div', 'info legend');
    div.innerHTML =
        '<i style="background:#72B2FF";>&lt;/i&gt;&lt;span style="font-weight: 600;"&gt;90 Minuten Kurzparkzone&lt;/span&gt;&lt;br&gt;' +
        '<i style="background:#BEE7FF";>&lt;/i&gt;&lt;span style="font-weight: 600;"&gt;180 Minuten Kurzparkzone&lt;/span&gt;&lt;br&gt;' +
        '<i style="background:#A3FF72";>&lt;/i&gt;&lt;span style="font-weight: 600;"&gt;Werktags Parkstraße&lt;/span&gt;&lt;br&gt;' +
        '<i style="background:#D8D0F4";>&lt;/i&gt;&lt;span style="font-weight: 600;"&gt;Parkstraße&lt;/span&gt;&lt;br&gt;';

    return div;
};

legend.addTo(map);
}

The final map with the legend looks like this:

Reading Data, Adding Layers

I obtained the shapefiles with the geographic information from the Austrian Open Data Portal. The data seems to be exported from ArcGIS and I manually converted it into GeoJSON, which is directly supported by Leaflet.js without any plugins. To do this, I just copied the polygon data into the GeoJSON structure. I also separated the quite large file into smaller junks, each parking zone in one file.

Below you can see an example how the JSON looks like for the parking zone C. The structure contains the short name (“C”), some additional information about the zone, attributes for the color, opacity and outline and of course the actual coordinates, which make up a polygon covering the parking area.

{
    "type": "Feature",
    "properties": {
        "parkzonenKuerzel": "C",
        "parkzoneInfo":"Kurzparkzone 90 min gebührenpflichtig, werktags Mo-Fr von 9-21 Uhr und Sa von 9-13 Uhr, ½ Stunde EUR 0.70, 1 Stunde EUR 1.40, 1½ Stunden EUR 2.10.",
        "style": {
            "weight": 2,
            "color": "#999",
            "opacity": 1,
            "fillColor": "#72B2FF",
            "fillOpacity": 0.5
        }
    },
    "geometry":{
        "type":"Polygon",
        "coordinates":[
            [
                [11.389460535000069,47.261162269000067],[11.389526044000036,47.261021078000056],
                ...
          ]
        ]
    }
}

I created a small object containing the name of a parking zone, the relative path of the JSON file and a place holder for layer information.

var parkzonen = [{
    parkzone: 'C',
    jsonFile: './data/zoneC.json',
    layer: ''
}, {
    parkzone: 'D',
    jsonFile: './data/zoneD.json',
    layer: ''
}
...
}];

This is globally available inside the JS file. The actual loading of the parking zones and the placement of the polygones on the map is happening in the following function. It uses jQuery to load the JSON files. Note that jQuery expects the files to be delivered by a Web server. So in order for this to work, you need to make sure that the files can be served by a Web server , also on your local development machine. You can try this very easy, if you execute the following python statement within the root directory of your local development directory:``` sudo http-server -p 80  .

function placeZonesOnMap() {
    for (var zone in parkzonen) {
        var parkzonenKuerzel = parkzonen[zone].parkzone;
        var jsonURL = parkzonen[zone].jsonFile;
        $.ajaxSetup({
            beforeSend: function(xhr) {
                if (xhr.overrideMimeType) {
                    xhr.overrideMimeType("application/json");
                }
            }
        });

        $.getJSON(jsonURL, function(data) {

            placeZoneOnMap(data);

        });

    }
}
// Place zone on map
function placeZoneOnMap(data) {
    var parkzonenKuerzel = data.properties.parkzonenKuerzel;


    for (var zone in parkzonen) {
        if(parkzonen[zone].parkzone===parkzonenKuerzel){
            var layer = addGeoJSONToMap(data);
            parkzonen[zone].layer= layer;
        }
    }
}

function addGeoJSONToMap(data){
    var layer = L.geoJson([data], {
        style: function(feature) {
            return feature.properties && feature.properties.style;
        },
        onEachFeature: onEachFeature,
    }).addTo(map);
    return layer;
}

function onEachFeature(feature, layer) {
    var popupContent = 'Parkzone: <span style="font-weight: 900; font-size: 150%;">' + feature.properties.parkzonenKuerzel + '&lt;/span&gt;&lt;/br&gt;' + feature.properties.parkzoneInfo;

    layer.bindPopup(popupContent);
    var label = L.marker(layer.getBounds().getCenter(), {
        icon: L.divIcon({
            className: 'label',
            html: '<span style="font-weight: 900; font-size: 200%;color:black;">' + feature.properties.parkzonenKuerzel+'&lt;/span&gt;',
            iconSize: [100, 40]
        })
    }).addTo(map);
}```


The for loop iterates over the object, where we stored all the parking zones, respectively the JSON file locations. We load the files, one by one, and place the polygons on the map. This of course works in an asynchronous fashion. After this step, the polygons become visible on the map. We also add a clickable info box on all parking zones, which then display additional information.

## Dropdown Selection for Marking and Highlighting a Parking Zone

Users are able to select one of the parking zones from a drop down list.  In the first step, we add all parking zones by iterating over the parking zones object. Once selected, the parking zone will change the color and therefore be highlighted. To do that, we remove the layer of the parking zone and add it again in a different color.

function populateParkzoneDropdown(){ $('#selectParkzone’).empty(); $('#selectParkzone’).append($(‘</option>').val(‘Bitte Auswählen’).html(‘Zonen’)); $.each(parkzonen, function(i, p) { $('#selectParkzone’).append($(‘</option>').val(p.parkzone).html(p.parkzone)); });

}

$("#selectParkzone”).change(function () { var selectedParkZone = $("#selectParkzone”).val(); changeParkzoneColor(selectedParkZone);

});

function changeParkzoneColor (selectParkzone){ resetMap();

for (var zone in parkzonen) {
    if(parkzonen[zone].parkzone==selectParkzone){
        var layer = parkzonen[zone].layer;

        map.removeLayer(layer);
        layer.setStyle({
            fillColor: 'red',
            fillOpacity: 0.7
        });
        map.addLayer(layer);
    }
}

}

// Reset all layers function resetMap(){     console.log(‘reset’);     map.eachLayer(function (layer) {         map.removeLayer(layer);     });     map.addLayer(tileLayer);     placeZonesOnMap(); }



## Show the Current Location

Showing the current location is also a nice feature. By clicking on a button, the map scrolls to the current location, which is transmitted by the browser. Note that this only works if you deliver your pages with HTTPS!

function currentPosition() { if (navigator.geolocation) { navigator.geolocation.getCurrentPosition(function(position) { latit = position.coords.latitude; longit = position.coords.longitude; //console.log(‘Current position: ' + latit + ' ' + longit); //alert(‘Current position: ' + latit + ' ' + longit); // this is just a marker placed in that position var abc = L.marker([position.coords.latitude, position.coords.longitude]).addTo(map); // move the map to have the location in its center map.panTo(new L.LatLng(latit, longit)); }); } }```

The single page app utilises bootstrap for rendering the content nicely and providing the navigation features. The following code snippet show how we can make all links scroll smoothly.

// use the first element that is "scrollable"
function scrollableElement(els) {
    for (var i = 0, argLength = arguments.length; i < argLength; i++) {
        var el = arguments[i],
            $scrollElement = $(el);
        if ($scrollElement.scrollTop() > 0) {
            return el;
        } else {
            $scrollElement.scrollTop(1);
            var isScrollable = $scrollElement.scrollTop() > 0;
            $scrollElement.scrollTop(0);
            if (isScrollable) {
                return el;
            }
        }
    }
    return [];
}

// Filter all links 
function filterPath(string) {
    return string
        .replace(/^\//, '')
        .replace(/(index|default).[a-zA-Z]{3,4}$/, '')
        .replace(/\/$/, '');
}

function loadScrolling() {
    var locationPath = filterPath(location.pathname);
    var scrollElem = scrollableElement('html', 'body');

    $('a[href*=\\#]').each(function() {
        var thisPath = filterPath(this.pathname) || locationPath;
        if (locationPath == thisPath &&
            (location.hostname == this.hostname || !this.hostname) &&
            this.hash.replace(/#/, '')) {
            var $target = $(this.hash),
                target = this.hash;
            if (target) {
                var targetOffset = $target.offset().top;
                $(this).click(function(event) {
                    event.preventDefault();
                    $(scrollElem).animate({
                        scrollTop: targetOffset
                    }, 400, function() {
                        location.hash = target;
                    });
                });
            }
        }
    });
}

Create a Category Page for one Specific Category and Exclude this Category from the Main Page in WordPress

This blog serves as my digital notebook for more than eight years and I use to to collect all sorts of things, that I think are worth storing and sharing. Mainly, I blog about tiny technical bits, but recently I also started to write about my life here in Innsbruck, where I try to discover what this small city and its surroundings has to offer. The technical articles are written in English, as naturally the majority of visitors understands this language. The local posts are in German for the same reason. My intention was to separate this two topics in the blog and lot let the posts create any clutter between languages.

Child Themes

When tinkering with the code of your WordPress blog, it is strongly recommended to deploy and use a child theme. This allows to reverse changes easily and more importantly, allows to update the theme without having to re-implement your adaptions after each update. Creating a child theme is very easy and described here. In addition I would recommend using some sort of code versioning tool, such as Git.

Excluding a Category from the Main Page

WordPress offers user defined categories out of the box and category pages for each category. This model does not fit well for my blog, where I have static pages and a time series of blog posts on the main page. In order to prevent that the posts about Innsbruck show up at the main page, the category ‘Innsbruck’ needs to be excluded. We can create or modify the file functions.php in the child theme folder and add the following code.

<?php
add_action( 'wp_enqueue_scripts', 'theme_enqueue_styles' );
function theme_enqueue_styles() {
        wp_enqueue_style( 'parent-style', get_template_directory_uri() . '/style.css' );

}

// Exclude Innsbruck Category from Main Page
function exclude_category($query) {
     if ( $query->is_home() ) {
         // Get the category ID of the category Insbruck
         $innsbruckCategory = get_cat_ID( 'Innsbruck' );
         // Add a minus in front of the string
         $query->set('cat','-' . $innsbruckCategory);

      }
     return $query;
}
add_filter('pre_get_posts', 'exclude_category');

?>

This adds a filter which gets executed before the posts are collected. We omit all posts of the category Innsbruck, by adding a minus as prefix of the category ID. Of course you could also lookup the category ID in the administration dashboard, by hovering with your mouse over the category name and save one database query.

A Custom Page Specific for one Category

In the second step, create a new page in the dashboard. This page will contain all posts of the Innsbruck category that we will publish. Create the file page.php in your child theme folder and use the following code:

<?php
/**
 * The template for displaying all pages.
 *
 * This is the template that displays all pages by default.
 * Please note that this is the WordPress construct of pages
 * and that other 'pages' on your WordPress site will use a
 * different template.
 *
 * @package dazzling
 */

    get_header();
?>
    <div id="primary" class="content-area col-sm-12 col-md-8">
        <main id="main" class="site-main" role="main">

<?php
     // Specify the arguments for the post query
     $args = array(
        'cat' => '91', // Innsbruck category id
        'post_type' => 'post',
        'posts_per_page' => 5,
        'paged' => ( get_query_var('paged') ? get_query_var('paged') : 1),
    );

    if( is_page( 'innsbruck' )) {
        query_posts($args);
    }
?>

<?php while ( have_posts() ) : the_post(); ?>
    <?php get_template_part( 'content', 'post' ); ?>
    <?php
        // If comments are open or we have at least one comment, load up the comment template
        if ( comments_open() || '0' != get_comments_number() ) :
            comments_template();
        endif;
    ?>
<?php endwhile; // end of the loop. ?>


<div class="navigation">
    <div class="alignleft">&lt;?php next_posts_link('&laquo; Ältere Beiträge') ?&gt;&lt;/div&gt;
    <div class="alignright">&lt;?php previous_posts_link('Neuere Beiträge &raquo;') ?&gt;&lt;/div&gt;
</div>


    </main>&lt;!-- #main --&gt;
</div>&lt;!-- #primary --&gt;
<?php get_sidebar(); ?>
<?php get_footer(); ?>

In this code snippet, we define a set of arguments, which are used for filtering the posts of the desired category. In this example, I used the id of the Innsbruck category (91) directly. We define that we want to display posts only, 5 per page. An important aspect is the pagination. When we only display posts of one category, we need to make sure that Worpress counts the pages correctly. Otherwise the page would always display the same posts, regardless how often the user clicks on the next page button. The reason is that this button uses the global paged variable, which is set correctly in the example above.

The if conditional makes sure that only the pages from the Innsbruck category are displayed. The while loop then iterates over all posts and displays them. At the bottom we can see the navigational buttons for older and newer posts of the Innsbruck category.

Persistent Data in a MySQL Docker Container

Running MySQL in Docker

In a recent article on Docker in this blog, we presented some basics for dealing with data in containers. This article will present another popular application for Docker: MySQL containers. Running MySQL instances in Docker allows isolating database infrastructure with ease.

Connecting to the Standard MySQL Container

The description of the MySQL docker image provides a lot of useful information how to launch and connect to a MySQL container. The first step is to create standard MySQL container from the latest available image.

sudo docker run \
   --name=mysql-instance 
   -e MYSQL_ROOT_PASSWORD=secret 
   -p 3307:3306 
   -d 
   mysql:latest

This creates a MySQL container where the root password is set to secret. As the host is already running its own MySQL instance (which has nothing to do with this docker example), the standard port 3306 is already taken. Thus we publish utilise the port 3307 on the host system and forward it to the 3306 standard port from the container.

Connect from the Host

We can then connect from the command line like this:

mysql -uroot -psecret -h 127.0.0.1 -P3307

We could also provide the hostname localhost for connecting to the container, but as the MySQL client per default assumes that a localhost connection is via a socket, this would not work. Thus when using the hostname localhost, we needed to specify the protocol TCP, wo that the client connects via the network interface.

mysql -uroot -psecret -h localhost --protocol TCP -P3307

Connect from other Containers

Connecting from a different container to the MySQL container is pretty straight forward. Docker allows to link two containers and then use the exposed ports between them. The following command creates a new ubuntu container and links to the MySQL container.

sudo docker run -it --name ubuntu-container --link mysql-instance:mysql-link ubuntu:16.10 bash

After this command, you are in the terminal of the Ubuntu container. We then need to install the MySQL client for testing:

# Fetch the package list
root@7a44b3e7b088:/# apt-get update
# Install the client
root@7a44b3e7b088:/# apt-get install mysql-client
# Show environment variables
root@7a44b3e7b088:/# env

The last command gives you a list of environment variables, among which is the IP address and port of the MySQL container.

MYSQL_LINK_NAME=/ubuntu-container/mysql-link
HOSTNAME=7a44b3e7b088
TERM=xterm
MYSQL_LINK_ENV_MYSQL_VERSION=5.7.14-1debian8
MYSQL_LINK_PORT=tcp://172.17.0.2:3306
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
MYSQL_LINK_PORT_3306_TCP_ADDR=172.17.0.2
MYSQL_LINK_PORT_3306_TCP=tcp://172.17.0.2:3306
PWD=/
MYSQL_LINK_PORT_3306_TCP_PORT=3306
SHLVL=1
HOME=/root
MYSQL_LINK_ENV_MYSQL_MAJOR=5.7
MYSQL_LINK_PORT_3306_TCP_PROTO=tcp
MYSQL_LINK_ENV_GOSU_VERSION=1.7
MYSQL_LINK_ENV_MYSQL_ROOT_PASSWORD=secret
_=/usr/bin/env

You can then connect either manually of by providing the variables

mysql -uroot -psecret -h 172.17.0.2
mysql -uroot -p$MYSQL_LINK_ENV_MYSQL_ROOT_PASSWORD -h $MYSQL_LINK_PORT_3306_TCP_ADDR -P $MYSQL_LINK_PORT_3306_TCP_PORT

If you only require a MySQL client inside a container, simply use the MySQL image from docker. Batteries included!

Plotting Colourful Graphs with R, RStudio and Ggplot2

The Aesthetics of Data Science

Data visualization is a powerful tool for communicating results and recently receives more and more attention due to the hype of data science. Integrating a meaningful graph into a paper or your thesis could improve readability and understandability more than any formulas or extended textual descriptions can. There exists a variety of different approaches for visualising data. Recently a lot of new Javascript based frameworks have gained quite some momentum, which can be used in Web applications and apps. A more classical work horse for data science is the R project and its plotting engine ggplot2. The reason why I decide to stick with R is its popularity and flexibility, which is still  impressive. Also with RStudio, there exists a convenient IDE which provides useful features for data scientists.

Plotting Graphs

In this blog post, I demonstrate how to plot time series data and use colours to highlight a specific aspect of data. As almost all techniques, R and ggplot2 require practise and training, which I realised again today when I spent quite a bit of time struggling with getting a simple plot right.

Currently I am evaluating two systems I developed and I needed to visualize their storage and execution time demands in comparison. My goal was to create a plot for each non-functional property, the execution time and the storage demand, while each plot should depict both systems’ performance. Each system runs a set of operations, think of create, read, update and delete operations (CRUD). Now for visualizing which of these operations has the most effects on the system, I needed to colourise each operation within one graph. This is the easy part. What was more tricky is to provide for each graph a defined set of colours, which can be mapped to each instance of the variable. Things which have the same meaning in both graphs should visualized in the same way, which requires a little hack.

Prerequisits

Install the following packages via apt

sudo apt-get install r-base r-recommended r-cran-ggplot2

and RStudio by downloading the deb – File from the project homepage.

Evaluation Data

As an example,we plan to evaluate the storage demand of two different systems and compare the results. Consider the following sample data.

# Set seed to get the same random numbers for this example
set.seed(42);
# Generate 200 random data records
N <- 200
# Generate a random, increasing sequence of integers that we assume is the storage demand in some unit
storage1 =sort(sample(1:100000, size = N, replace = TRUE),decreasing = FALSE)
storage2 = sort(sample(1:100000, size = N, replace = TRUE),decreasing = FALSE)
# Define the operations availabel and draw a random sample
operationTypes = c('CREATE','READ','UPDATE','DELETE')
operations = sample(operationTypes,N,replace=TRUE)
# Create the dataframe
df  df
     id storage1 storage2 operations
1     1       24      238     CREATE
2     2      139     1755     UPDATE
3     3      158     1869     UPDATE
4     4      228     2146       READ
5     5      395     2967     DELETE
6     6      734     3252     CREATE
7     7      789     4049     DELETE
8     8     2909     4109       READ
9     9     3744     4835     CREATE
10   10     3894     4990       READ

....

We created a random data set simulating the characteristics of system measurement data. As you can see, we have a list of operations of the four types CREATE, READ, UPDATE and DELETE and a measurement value for the storage demand in both systems.

The Simple Plot

Plotting two graphs of thecolumns storage1 and storage2 is straight forward.

# Simple plot
p1 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color="Storage 1")) +
  geom_point(aes(x=id,y=storage2,color="Storage 2")) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  scale_color_manual(values=c("Storage 1"="forestgreen", "Storage 2"="aquamarine"), 
                     name="Measurements", labels=c("System 1", "System 2"))

print(p1)

We assign for each point plot a color. Note that the color nme “Storage 1” for instance of course does not denote a color, but it assignes a level for all points of the graph. This level can be thought of as a category, which ensures that all the points which belong to the same category have the same color. As you can see at the definition of the color scale, we assign the actual color to this level there.  This is the result:

Plotting Levels

A common task is to visualise categories or levels of measurement data. In this example, there are four different levels we could observe: CREATE, READ, UPDATE and DELETE.

# Plot with levels
p1 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color=operations)) +
  geom_point(aes(x=id,y=storage2,color=operations)) +
  ggtitle("Overview of Measurements") +
  labs(color="Measurements") +
  scale_color_manual(values=c("CREATE"="darkgreen", 
                              "READ"="darkolivegreen", 
                              "UPDATE"="forestgreen", 
                              "DELETE"="yellowgreen"))
print(p1)

Instead of assigning two colours, one for each graph, we can also assign colours to the operations. As you can see in the definition of the graphs and the colour scale, we map the colours to the variable operations instead. As a result we get differently coloured points per operation, but we get these of course for both graphs in an identical fashion as the categories are the same for both measurements. The result looks like this:

Now this is obviously not what we want to achieve as we cannot differentiate between the two graphs any more.

Plotting the same Levels for both Graphs in Different Colours

This last part is a bit tricky, as ggplot2 does not allow assigning different colour schemes within one plot. There do exist some hacks for this, but the solution does not improve the readability of the code in my opinion. In order to apply different colour schemes for the two graphs while still using the categories, I appended two extra columns to the data set. If we append some differentiation between the two graphs and basically double the categories from four to eight, where each graph now uses its own four categories, we can also assign distinct colours to them.

df$operationsStorage1 <- paste(df$operations,"-Storage1", sep = '')
df$operationsStorage2 <- paste(df$operations,"-Storage2", sep = '')

p3 <- ggplot(df, aes(x,y)) +
  geom_point(aes(x=id,y=storage1,color=operationsStorage1)) +
  geom_point(aes(x=id,y=storage2,color=operationsStorage2)) +
  ggtitle("Overview of Measurements") +
  xlab("Number of Operations") +
  ylab("Storage Demand in MB") +
  labs(color="Operations") +
  scale_color_manual(values=c("CREATE-Storage1"="darkgreen", 
                              "READ-Storage1"="darkolivegreen", 
                              "UPDATE-Storage1"="forestgreen", 
                              "DELETE-Storage1"="yellowgreen",
                              "CREATE-Storage2"="aquamarine", 
                              "READ-Storage2"="dodgerblue",
                              "UPDATE-Storage2"="royalblue",
                              "DELETE-Storage2"="turquoise"))
print(p3)

We then assign the new column for each system individually as colour value. This ensures that each graph only considers the categories that we assigned in this step. Thus we can assign a different color scheme for wach graph and print the corresponding colours in the label (legend) next to the chart. This is the result:

Now we can see which operation was used at every measurement and still be able to distinguish between the two systems.

Timelape Photography with the Camera Module V2 and a Raspberry Pi Model B

Recently, I bought a camera module for the Raspberry Pi and experimented a little bit with the possibilities a scriptable camera provides. The new Camera Module V2 offers 8.08 MP from a Sony sensor and can be controlled with a well documented Python library. It allows to take HD videos and shoot still images. Assembly is easy, but as the camera is attached with a rather short ribbon cable, which renders the handling is a bit cumbersome. For the moment, a modified extra hand from my soldering kit acts as a makeshift.

Initial Setup

The initial setup is easy and just requires a few steps, which is not surprising because most of the documentation is targeted to kids in order to encourage their inner nerd. Still works for me as well 🙂

Attach the cable to the raspberry pi as described here. You can also buy an adapter for the Pi Zero. Once the camera is wired to the board, activate the module with the tool raspi-config.

Then you are ready to install the Python library with sudo apt-get install python3-picamera, add your user to the video group with usermod -a -G video USERNAME  and then reboot the Raspberry. After you logged in again, you can start taking still images with the simple command raspistill -o output.jpg. You can find some more documentation and usage examples here.

Timelapse Photography

What I really enjoy is making timelapse videos with the Raspberry Pi, which gives a nice effect for everyday phenomena and allows to observe processes which are usually too slow to follow. The following Gif shows a melting ice cube. I took one picture every five seconds.

A Small Python Script

The following script creates a series of pictures with a defined interval and stores all images with a filename indicating the time of shooting in a folder. It is rather self explanatory. The camera needs a little bit of time to adjust, so we set the adjustTime variable to 5 seconds. Then we take a picture every 300 seconds, each image has a resolution of 1024×768 pixels.

import os
import time
import picamera
from datetime import datetime

# Grab the current datetime which will be used to generate dynamic folder names
d = datetime.now()
initYear = "%04d" % (d.year)
initMonth = "%02d" % (d.month)
initDate = "%02d" % (d.day)
initHour = "%02d" % (d.hour)
initMins = "%02d" % (d.minute)
initSecs = "%02d" % (d.second)

folderToSave = "timelapse_" + str(initYear) + str(initMonth) + str(initDate) +"_"+ str(initHour) + str(initMins)
os.mkdir(folderToSave)

# Set the initial serial for saved images to 1
fileSerial = 1

# Create and configure the camera
adjustTime=5
pauseBetweenShots=300

# Create and configure the camera
with picamera.PiCamera() as camera:
    camera.resolution = (1024, 768)
    #camera.exposure_compensation = 5

    # Start the preview and give the camera a couple of seconds to adjust
    camera.start_preview()
    try:
        time.sleep(adjustTime)

        start = time.time()
        while True:
            d = datetime.now()
            # Set FileSerialNumber to 000X using four digits
            fileSerialNumber = "%04d" % (fileSerial)

            # Capture the CURRENT time (not start time as set above) to insert into each capture image filename
            hour = "%02d" % (d.hour)
            mins = "%02d" % (d.minute)
            secs = "%02d" % (d.second)
            camera.capture(str(folderToSave) + "/" + str(fileSerialNumber) + "_" + str(hour) + str(mins) + str(secs) + ".jpg")

            # Increment the fileSerial
            fileSerial += 1
            time.sleep(pauseBetweenShots)

    except KeyboardInterrupt:
        print ('interrupted!')
        # Stop the preview and close the camera
        camera.stop_preview()

finish = time.time()
print("Captured %d images in %d seconds" % (fileSerial,finish - start))

This script then can run unattended and it creates a batch of images on the Raspberry Pi.

Image Metadata

The file name preserves the time of the shot, so that we can see later when a picture was taken. But the tool also stores EXIF metadata, which can be used for processing. You can view the data with the exiftool.

>ExifTool Version Number         : 9.46
File Name                       : 1052.jpg
Directory                       : .
File Size                       : 483 kB
File Modification Date/Time     : 2016:07:08 08:49:52+02:00
File Access Date/Time           : 2016:07:08 09:19:14+02:00
File Inode Change Date/Time     : 2016:07:08 09:17:52+02:00
File Permissions                : rw-r--r--
File Type                       : JPEG
MIME Type                       : image/jpeg
Exif Byte Order                 : Big-endian (Motorola, MM)
Make                            : RaspberryPi
Camera Model Name               : RP_b'imx219'
X Resolution                    : 72
Y Resolution                    : 72
Resolution Unit                 : inches
Modify Date                     : 2016:07:05 08:37:33
Y Cb Cr Positioning             : Centered
Exposure Time                   : 1/772
F Number                        : 2.0
Exposure Program                : Aperture-priority AE
ISO                             : 50
Exif Version                    : 0220
Date/Time Original              : 2016:07:05 08:37:33
Create Date                     : 2016:07:05 08:37:33
Components Configuration        : Y, Cb, Cr, -
Shutter Speed Value             : 1/772
Aperture Value                  : 2.0
Brightness Value                : 2.99
Max Aperture Value              : 2.0
Metering Mode                   : Center-weighted average
Flash                           : No Flash
Focal Length                    : 3.0 mm
Maker Note Unknown Text         : (Binary data 332 bytes, use -b option to extract)
Flashpix Version                : 0100
Color Space                     : sRGB
Exif Image Width                : 1024
Exif Image Height               : 768
Interoperability Index          : R98 - DCF basic file (sRGB)
Exposure Mode                   : Auto
White Balance                   : Auto
Compression                     : JPEG (old-style)
Thumbnail Offset                : 1054
Thumbnail Length                : 24576
Image Width                     : 1024
Image Height                    : 768
Encoding Process                : Baseline DCT, Huffman coding
Bits Per Sample                 : 8
Color Components                : 3
Y Cb Cr Sub Sampling            : YCbCr4:2:0 (2 2)
Aperture                        : 2.0
Image Size                      : 1024x768
Shutter Speed                   : 1/772
Thumbnail Image                 : (Binary data 24576 bytes, use -b option to extract)
Focal Length                    : 3.0 mm
Light Value                     : 12.6

Processing Images

The Raspberry Pi would need a lot of time to create an animated Gif or a video from these images. This is why I decided to add new images automatically to a Git repository on Github and fetch the results on my Desktop PC. I created a new Git repository and adapted the script shown above to store the images within the folder of the repository. I then use the following script to add and push the images to Github using a cronjob.

>cd /home/stefan/Github/Timelapses
now=$(date +"%m_%d_%Y %H %M %S")
echo $now
git pull
git add *.jpg
git commit -am "New pictures added $now"
git push

You can add this to you user’s cron table with crontab -e and the following line, which adds the images every 5 minutes,

*/5	*	*	*	*	/home/stefan/Github/Timelapses/addToGit.sh

On a more potent machine, you can clone the repository and pull the new images like this:

cd /home/stefan-srv/Github/Timelapses
now=$(date +"%m_%d_%Y %H %M %S")
echo "$now"
git pull --rebase

The file names are convenient for being able to read the date when it was taken, but most of the Linux tools require the files to be named within a sequence. The following code snippet renames the files into a sequence with four digits and pads them with zeros if possible.

>a=1
for i in *.jpg; do
  new=$(printf "%04d.jpg" "$a") #04 pad to length of 4
  mv -- "$i" "$new"
  let a=a+1
done

Animated Gifs

Imagemagick offers a set of great tools for images processing. With its submodule convert, you can create animated Gifs from a series of images like this:

convert -delay 10 -loop 0 *.jpg Output.gif

This adds a delay after each images and loops the gif images infinitely. ImageMagick requires a lot of RAM for larger Gif images and does not handle memory allocation well, but the results are still nice. Note that the files get very large, so a smaller resolution might be more practical.

Still Images to Videos

The still images can also be converted in videos. Use the following command to create an image with 10 frames per second:

>avconv -framerate 10 -f image2 -i %04d.jpg -c:v h264 -crf 1 out.mov

Example: Nordkette at Innsbruck, Tirol

This timelapse video of the Inn Valley Range in the north of the city of Innsbruck has been created by taking a picture with a Raspberry Pi Camera Module V2 every 5 minutes. This video consists of 1066 still images.