Lightning fast integration tests with Docker, MySQL and tmpfs

Integration tests that involve database operations have to tear down and re-initialize a database many times. Although most developer machines have an SSD and plenty of RAM, database initialization can still take a considerable amount of time.

Docker allows defining volumes that are mounted directly into memory using tmpfs. We can use this feature to speed up disk-heavy operations, such as database imports, by moving the data from disk into memory.

The following example measures the time for writing 1GB worth of data to an SSD:

```
dd if=/dev/zero of=/tmp/output bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.22033 s, 484 MB/s
```

For comparison, the following steps create a RAM disk and write the data to memory.

```
# The mount point has to exist before mounting
$ sudo mkdir -p /tmp/tmpfs
$ sudo mount -o size=1G -t tmpfs none /tmp/tmpfs

$ dd if=/dev/zero of=/tmp/tmpfs/output bs=1024k count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.309017 s, 3.5 GB/s
```
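To put a number on the difference, the two elapsed times reported by dd can simply be divided. The following is a minimal sketch that assumes the timings of the two runs shown here; the function and variable names are made up for illustration.

```shell
#!/bin/sh
# Elapsed times in seconds, copied from the two dd runs
ssd_seconds=2.22033
tmpfs_seconds=0.309017

# Print how many times faster the second timing is, rounded to one decimal.
# LC_ALL=C keeps the decimal point independent of the system locale.
speedup() {
    LC_ALL=C awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup "$ssd_seconds" "$tmpfs_seconds"   # prints 7.2
```

Your own numbers will differ depending on the SSD, the memory speed and the current load of the machine.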

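As you can see, writing 1GB to memory is roughly 7x faster (3.5 GB/s compared to 484 MB/s). With the following `docker run` command, you can spin up a default MySQL container whose data directory resides in a tmpfs.

```
# Replace YOUR_DATABASE with the name of the database to create
docker run -d \
  --rm \
  --name mysql-56-in-memory \
  -p 3307:3306 \
  --tmpfs /var/lib/mysql:rw,noexec,nosuid,size=1024m \
  -e MYSQL_ALLOW_EMPTY_PASSWORD=true \
  -e MYSQL_DATABASE=YOUR_DATABASE \
  mysql:5.6
```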

The arguments of `docker run` mean the following:

  * `--rm`: delete the container once it has stopped
  * `--name`: a name for the container
  * `-p`: map the host's port 3307 to port 3306 inside the container. This allows running multiple MySQL containers in parallel and connecting to them from the host via the specified port
  * `--tmpfs`: mount the given directory inside the container on a RAM disk. It should be writable (rw). noexec prevents the execution of binaries, nosuid makes the kernel ignore set-user-ID and set-group-ID bits on files, and size limits how much memory the tmpfs may occupy. Adapt this to your use case. The minimum for MySQL is around 200MB. Add the space needed for your data, indices etc.
  * `MYSQL_ALLOW_EMPTY_PASSWORD`: allow the root user to log in with an empty password
  * `MYSQL_DATABASE`: the name of a database to be created when the container starts
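The sizing rule from the list (roughly 200MB for MySQL itself plus room for your data) can be sketched as a small helper that builds the value for the `--tmpfs` flag. The function name is hypothetical, and treating 200MB as a fixed baseline is an assumption taken from the estimate in the text, not a measured value.

```shell
#!/bin/sh
# Approximate space a fresh MySQL data directory needs, in MB (estimate from the text)
MYSQL_BASELINE_MB=200

# Build the value for docker's --tmpfs flag from the expected data size in MB
tmpfs_arg() {
    data_mb="$1"
    echo "/var/lib/mysql:rw,noexec,nosuid,size=$((MYSQL_BASELINE_MB + data_mb))m"
}

tmpfs_arg 824   # prints /var/lib/mysql:rw,noexec,nosuid,size=1024m
```

The result can then be passed on the command line, for example as `--tmpfs "$(tmpfs_arg 824)"`.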

If you run this command you can connect to the container like this: _mysql -u root -h -P 3307_

The container behaves like a normal MySQL database, except that the data is not persisted on a hard disk but only stored in ephemeral memory. If you stop the container, Docker removes it, and if you reboot the machine, the data is gone. For obvious reasons, this is only a good idea for test data that can be re-created at any time.

You can achieve the same also with Docker Compose if you would like to orchestrate multiple containers.

```
version: '3'
services:
  mysql-56-integration:
    container_name: mysql-56-in-memory
    restart: unless-stopped
    image: mysql:5.6
    environment:
      - MYSQL_ALLOW_EMPTY_PASSWORD='true'
      - MYSQL_HOST=''
    volumes:
      - data:/var/cache
    ports:
      - "3307:3306"

volumes:
  data:
    driver_opts:
      type: tmpfs
      device: tmpfs
```

Install Innotop from Source

Innotop is a great tool but not included in the current Ubuntu repositories. Here is how you install it manually:

```
# Install perl database interface
sudo apt-get install libdbi-perl
# Install MySQL and Terminal perl modules
sudo cpan Term::ReadKey DBD::mysql
# Clone innotop
git clone
# Enter directory
cd innotop
# Make
perl Makefile.PL
# Install
sudo make install
```

Then you can run innotop like this:

```
innotop --user $ADMIN_USER --password $ADMIN_PASSWORD --host $HOST
```


Grafana and InfluxDB with SSL inside a Docker Container

Self-signed SSL certificates

On the host, create a directory for storing the self-signed SSL certificates. This directory will be mounted to /var/ssl in both the Grafana and the InfluxDB container. Create the self-signed SSL certificates as follows:

```
mkdir -p /docker/ssl
cd /docker/ssl/
# Generate a private key (2048 bit, as 1024 bit RSA keys are considered too weak)
openssl genrsa -des3 -out server.key 2048
# Generate CSR
openssl req -new -key server.key -out server.csr
# Remove password
openssl rsa -in server.key -out server.key
# Generate self signed cert
openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt
# Set permissions
sudo chmod 644 server.crt
sudo chmod 600 server.key
```

Next, create a config directory with `mkdir conf` and create individual configuration files for Grafana and InfluxDB.


In the file ./conf/grafana/defaults.ini set the protocol to https and provide the paths to the mounted ssl directory in the container.

```
#################################### Server ##############################
# Protocol (http, https, socket)
protocol = https
# https certs & key file
cert_file = /var/ssl/server.crt
cert_key = /var/ssl/server.key
```

## InfluxDB

The file ./conf/influxdb/influxdb.conf is also pretty simple. Add an [http] section with the following settings:

```
[meta]
  dir = "/var/lib/influxdb/meta"

[data]
  dir = "/var/lib/influxdb/data"
  engine = "tsm1"
  wal-dir = "/var/lib/influxdb/wal"

[http]
  https-enabled = true
  https-certificate = "/var/ssl/server.crt"
  https-private-key = "/var/ssl/server.key"
```

## Environment

You can set environment variables in `env files` for the services.

### env.grafana


### env.influxdb


## Docker Compose

Now you can launch the services by running `docker-compose up` with the following file.

```
version: '2'

services:
  influxdb:
    image: influxdb:latest
    container_name: influxdb
    ports:
      - "8083:8083"
      - "8086:8086"
      - "8090:8090"
    env_file:
      - 'env.influxdb'
    volumes:
      - data-influxdb:/var/lib/influxdb
      - /docker/ssl:/var/ssl
      - /docker/conf/influxdb/influxdb.conf:/etc/influxdb/influxdb.conf

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    links:
      - influxdb
    env_file:
      - 'env.grafana'
    volumes:
      - data-grafana:/var/lib/grafana
      - /docker/ssl:/var/ssl
      - /docker/conf/grafana/defaults.ini:/usr/share/grafana/conf/defaults.ini

volumes:
  data-influxdb:
  data-grafana:
```

Let's Encrypt Setup

If you require valid certificates, you can also use certificates from Let's Encrypt.

First, create the certificates on the host:

```
certbot certonly --standalone --preferred-challenges http --renew-by-default -d
```

Then use this docker-compose file.

```
version: '2'

services:
  influxdb:
    image: influxdb:latest
    container_name: influxdb
    ports:
      - "8083:8083"
      - "8086:8086"
      - "8090:8090"
    env_file:
      - 'env.influxdb'
    volumes:
      - data-influxdb:/var/lib/influxdb
      - /etc/letsencrypt/live/
      - /etc/letsencrypt/live/
      - /docker/conf/influxdb/influxdb.conf:/etc/influxdb/influxdb.conf

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    links:
      - influxdb
    env_file:
      - 'env.grafana'
    volumes:
      - data-grafana:/var/lib/grafana
      - /etc/letsencrypt/live/
      - /etc/letsencrypt/live/
      - /docker/conf/defaults.ini:/usr/share/grafana/conf/defaults.ini

volumes:
  data-influxdb:
  data-grafana:
```

Compile Percona Query Playback

Install the prerequisites and clone the repository.

```
sudo apt-get install libtbb-dev libmysqlclient-dev libboost-program-options-dev libboost-thread-dev libboost-regex-dev libboost-system-dev libboost-chrono-dev pkg-config cmake libssl-dev
git clone
cd query-playback/
mkdir build_dir
# cmake is invoked with "..", so it has to run from inside the build directory
cd build_dir
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
```

You might see this error:

```
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
MYSQL_LIB
    linked by target "mysql_client" in directory /home/sproell/git/query-playback/percona_playback/mysql_client

-- Configuring incomplete, errors occurred!
See also "/home/sproell/git/query-playback/build_dir/CMakeFiles/CMakeOutput.log".
See also "/home/sproell/git/query-playback/build_dir/CMakeFiles/CMakeError.log".
```

I found this [issue on Github][1] and after editing the file `CMakeLists.txt` (located at `~/git/query-playback/percona_playback/mysql_client/CMakeLists.txt`) as suggested, the tool compiles. You need to replace `find_library(MYSQL_LIB "mysqlclient_r" PATH_SUFFIXES "mysql")` with `find_library(MYSQL_LIB "mysqlclient" PATH_SUFFIXES "mysql")` (remove the _r suffix).

Then you can compile the project as [documented here][2].

```
cd ~/git/query-playback/build_dir
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
make
sudo make install
```

Illegal mix of collations: IntelliJ and UTF8mb4

When using variables inside SQL scripts within IntelliJ products (e.g. DataGrip), certain queries will not work because the encodings of the IntelliJ client and the server do not match. This occurs, for instance, when you compare variables. A typical error message looks like this:

```
[HY000][1267] Illegal mix of collations (utf8mb4_unicode_520_ci,IMPLICIT) \
   and (utf8mb4_general_ci,IMPLICIT) for operation 'like'
```

IntelliJ products do not yet support MySQL's utf8mb4 character set encodings. The problem occurs when using variables in queries. By default, IntelliJ uses a UTF-8 encoding for the connection. When you use utf8mb4 as the database default character set, variables are encoded in UTF-8 while the database content remains in utf8mb4. It is not possible to set the character set encoding in the IntelliJ settings, as the client will then refuse to connect.

Check your server settings using the MySQL client:

```
MySQL [cropster_research]> show variables like '%char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8mb4                    |
| character_set_connection | utf8mb4                    |
| character_set_database   | utf8mb4                    |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | utf8mb4                    |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
```

This seems correct, but when you connect with the IntelliJ client, you will get wrong results when you use variables. Until the products support utf8mb4, you need to add the following statements to the script to force the right settings:

```
SET character_set_connection=utf8mb4;
SET collation_connection=utf8mb4_unicode_520_ci;
```

Verifying Replication Consistency with Percona’s pt-table-checksum

Replication is an important concept for improving database performance and security. In this blog post, I would like to demonstrate how the consistency between a MySQL master and a slave can be verified. We will create two Docker containers, one for the master and one for the slave.

Installing the Percona Toolkit

The Percona Toolkit is a collection of useful utilities, which can be obtained for free from the company's portal. The following commands install the prerequisites, download the package, and finally install it.

```
sudo apt-get install -y wget libdbi-perl libdbd-mysql-perl libterm-readkey-perl libio-socket-ssl-perl
sudo dpkg -i percona-toolkit_3.0.4-1.xenial_amd64.deb
```

Setting up a Test Environment with Docker

The following command creates and starts a docker container. Note that these are minimal examples and are not suitable for a serious environment.

```
docker run --name mysql_master -e MYSQL_ALLOW_EMPTY_PASSWORD=true -d mysql:5.6 --log-bin \
   --binlog-format=ROW --server-id=1
```

Get the IP address of the master container:

```
docker inspect mysql_master | grep IPAddress
            "SecondaryIPAddresses": null,
            "IPAddress": "",
```
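If the address is needed in a script, it can be cut out of the `docker inspect` output instead of being copied by hand. The helper below is a small sketch; the function name and the sample address 172.17.0.2 are hypothetical and only serve as an illustration.

```shell
#!/bin/sh
# Read `docker inspect` output on stdin and print the first non-empty IPAddress value
first_ip() {
    sed -n 's/.*"IPAddress": "\([0-9.][0-9.]*\)".*/\1/p' | head -n 1
}

# Hypothetical sample of what `docker inspect mysql_master | grep IPAddress` may print
sample='            "SecondaryIPAddresses": null,
            "IPAddress": "172.17.0.2",
                    "IPAddress": "172.17.0.2",'

printf '%s\n' "$sample" | first_ip   # prints 172.17.0.2
```

Against a running container this would be used as `MASTER_IP=$(docker inspect mysql_master | first_ip)`. Docker's own `--format` option for `docker inspect` is an alternative that avoids text parsing altogether.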

You can connect to this container like this and verify the server id:

```
stefan@Lenovo ~/Docker-Projects $ mysql -u root -h
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.6.35-log MySQL Community Server (GPL)

Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     | 1     |
+---------------+-------+
1 row in set (0.00 sec)
```

We repeat the command for the slave, but use a different server ID and name:

```
docker run --name mysql_slave -e MYSQL_ALLOW_EMPTY_PASSWORD=true -d mysql:5.6 --server-id=2
```

For simplicity, we did not use Docker links, but will rather use IP addresses assigned by Docker directly.

## Replication Setup

First, we need to set up a user with replication privileges. This user will connect from the slave to the master.

```
## On the host, interact with the master container

# Get the IP address of the slave container
$ docker inspect mysql_slave | grep IPAddress
            "SecondaryIPAddresses": null,
            "IPAddress": "",
                    "IPAddress": "",

# Login to the MySQL console of the master

# Grant permissions

# Get the current binlog position
mysql> SHOW MASTER STATUS;
+-------------------+----------+--------------+------------------+-------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------+----------+--------------+------------------+-------------------+
| mysqld-bin.000002 |      346 |              |                  |                   |
+-------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)
```

Now log into the slave container and add the connection details for the master:

```
## Connect to the MySQL Slave instance
$ mysql -u root -h

### Setup the slave

Query OK, 0 rows affected, 2 warnings (0.05 sec)

### Start and check
mysql>   start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_User: percona
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000002
          Read_Master_Log_Pos: 346
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 284
        Relay_Master_Log_File: mysqld-bin.000002
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
```

Now our simple slave setup is running.

Get some test data

Let's download the Sakila test database and import it into the master. It will be replicated immediately.

```
~/Docker-Projects $ tar xvfz sakila-db.tar.gz

mysql -u root -h < sakila-db/sakila-schema.sql
mysql -u root -h < sakila-db/sakila-data.sql
```

Verify that the data is on the slave as well:

```
mysql -u root -h -e "USE sakila;SHOW TABLES;"
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
...
| store                      |
+----------------------------+
```

Now that our setup is complete, we can proceed with Percona's pt-table-checksum.

# Percona pt-table-checksum

The Percona pt-table-checksum tool requires the connection information of the master and the slave in a specific format. This is called the DSN (data source name), which is a comma-separated string. We can store this information in a dedicated database called percona in a table called DSN_Table. We create this table on the master. Note that the data gets replicated to the slave within the blink of an eye.

```
CREATE DATABASE percona;
USE percona;

CREATE TABLE DSN_Table (
  id int(11) NOT NULL AUTO_INCREMENT,
  dsn varchar(255) NOT NULL,
  PRIMARY KEY (id)
);
```

The next step involves creating permissions on the slave and the master!


The percona user is needed to run the script. Note that this time the IP address is that of the (Docker) host, having the IP by default. In real-world scenarios, this script would be run directly on either the master or the slave.

Now we need to add the information about the slave to the table we created. The Percona tool could also read this from the process list, but it is more reliable if we add the information ourselves. To do so, we add a record to the table we just created, which describes the slave DSN:

```
INSERT INTO percona.DSN_Table VALUES (1,'h=,u=percona,p=SECRET,P=3306');
```
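The DSN itself is just a list of `key=value` pairs: `h` for the host, `u` for the user, `p` for the password and `P` for the port. As a rough sketch, such a string can be assembled with a small helper; the function name is made up, and 192.0.2.10 is a documentation-only placeholder address rather than a value from this setup.

```shell
#!/bin/sh
# Build a Percona Toolkit DSN string from host, user, password and port
make_dsn() {
    printf 'h=%s,u=%s,p=%s,P=%s\n' "$1" "$2" "$3" "$4"
}

# 192.0.2.10 is a placeholder; use the IP address of your slave container instead
make_dsn 192.0.2.10 percona SECRET 3306   # prints h=192.0.2.10,u=percona,p=SECRET,P=3306
```

The resulting string is what goes into the dsn column of the table on the master.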

The pt-table-checksum tool then connects to the master instance and to the slave. It computes checksums of all databases and tables and compares the results. You can use the tool like this:

```
pt-table-checksum --replicate=percona.checksums --create-replicate-table --empty-replicate-table \
   --recursion-method=dsn=t=percona.DSN_Table -h -P 3306 -u percona -pSECRET
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
09-10T10:13:11      0      0        0       1       0   0.020 mysql.columns_priv
09-10T10:13:11      0      0        3       1       0   0.016 mysql.db
09-10T10:13:11      0      0        0       1       0   0.024 mysql.event
09-10T10:13:11      0      0        0       1       0   0.014 mysql.func
09-10T10:13:11      0      0       40       1       0   0.026 mysql.help_category
09-10T10:13:11      0      0      614       1       0   0.023 mysql.help_keyword
09-10T10:13:11      0      0     1224       1       0   0.022 mysql.help_relation
09-10T10:13:12      0      0      585       1       0   0.266 mysql.help_topic
09-10T10:13:12      0      0        0       1       0   0.031 mysql.ndb_binlog_index
09-10T10:13:12      0      0        0       1       0   0.024 mysql.plugin
09-10T10:13:12      0      0        6       1       0   0.287 mysql.proc
09-10T10:13:12      0      0        0       1       0   0.031 mysql.procs_priv
09-10T10:13:12      0      1        2       1       0   0.020 mysql.proxies_priv
09-10T10:13:12      0      0        0       1       0   0.024 mysql.servers
09-10T10:13:12      0      0        0       1       0   0.017 mysql.tables_priv
09-10T10:13:12      0      0     1820       1       0   0.019 mysql.time_zone
09-10T10:13:12      0      0        0       1       0   0.015 mysql.time_zone_leap_second
09-10T10:13:12      0      0     1820       1       0   0.267 mysql.time_zone_name
09-10T10:13:13      0      0   122530       1       0   0.326 mysql.time_zone_transition
09-10T10:13:13      0      0     8843       1       0   0.289 mysql.time_zone_transition_type
09-10T10:13:13      0      1        4       1       0   0.031 mysql.user
09-10T10:13:13      0      0        1       1       0   0.018 percona.DSN_Table
09-10T10:13:13      0      0      200       1       0   0.028
09-10T10:13:13      0      0      603       1       0   0.023 sakila.address
09-10T10:13:13      0      0       16       1       0   0.033 sakila.category
09-10T10:13:13      0      0      600       1       0   0.023
09-10T10:13:13      0      0      109       1       0   0.029
09-10T10:13:14      0      0      599       1       0   0.279 sakila.customer
09-10T10:13:14      0      0     1000       1       0   0.287
09-10T10:13:14      0      0     5462       1       0   0.299 sakila.film_actor
09-10T10:13:14      0      0     1000       1       0   0.027 sakila.film_category
09-10T10:13:14      0      0     1000       1       0   0.032 sakila.film_text
09-10T10:13:14      0      0     4581       1       0   0.276 sakila.inventory
09-10T10:13:15      0      0        6       1       0   0.030 sakila.language
09-10T10:13:15      0      0    16049       1       0   0.303 sakila.payment
09-10T10:13:15      0      0    16044       1       0   0.310 sakila.rental
09-10T10:13:15      0      0        2       1       0   0.029 sakila.staff
09-10T10:13:15      0      0        2       1       0   0.020
```

The result shows differences in MySQL's internal permission tables. This is not what we are interested in, as permissions are individual to a host. So we exclude the MySQL internal database and also the percona database. Also, in order to test whether the tool works, we delete the last five category assignments from the table with `mysql -u root -h -e "DELETE FROM sakila.film_category WHERE film_id > 995;"` and update a row in the city table with:

```
mysql -u root -h -e "UPDATE sakila.city SET city='Innsbruck' WHERE city_id=590;"
```

Now execute the command again:

```
pt-table-checksum --replicate=percona.checksums --create-replicate-table --empty-replicate-table \
   --recursion-method=dsn=t=percona.DSN_Table --ignore-databases mysql,percona -h -P 3306 -u percona -pSECRET
09-10T10:46:33      0      0      200       1       0   0.017
09-10T10:46:34      0      0      603       1       0   0.282 sakila.address
09-10T10:46:34      0      0       16       1       0   0.034 sakila.category
09-10T10:46:34      0      1      600       1       0   0.269
09-10T10:46:34      0      0      109       1       0   0.028
09-10T10:46:34      0      0      599       1       0   0.285 sakila.customer
09-10T10:46:35      0      0     1000       1       0   0.297
09-10T10:46:35      0      0     5462       1       0   0.294 sakila.film_actor
09-10T10:46:35      0      1     1000       1       0   0.025 sakila.film_category
09-10T10:46:35      0      0     1000       1       0   0.031 sakila.film_text
09-10T10:46:35      0      0     4581       1       0   0.287 sakila.inventory
09-10T10:46:35      0      0        6       1       0   0.035 sakila.language
09-10T10:46:36      0      0    16049       1       0   0.312 sakila.payment
09-10T10:46:36      0      0    16044       1       0   0.320 sakila.rental
09-10T10:46:36      0      0        2       1       0   0.030 sakila.staff
09-10T10:46:36      0      0        2       1       0   0.027
```

You can see that there is a difference in the city table and in the table sakila.film_category. The tool does not report the actual number of differing rows, but rather the number of differing chunks. To get the actual differences, we need to use a different tool, which utilises the checksum table that the previous step created.
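With many tables, the rows with a non-zero DIFFS column are easy to overlook. As a small sketch, the report can be piped through awk to list only the affected tables. It assumes the column layout shown in the output (TS, ERRORS, DIFFS, ROWS, CHUNKS, SKIPPED, TIME, TABLE); the function name is made up for illustration.

```shell
#!/bin/sh
# Print the names of tables whose DIFFS column (third field) is greater than zero.
# Expects pt-table-checksum result rows on stdin; header rows and rows without
# a table name are skipped.
tables_with_diffs() {
    awk '$3 ~ /^[0-9]+$/ && $3 > 0 && NF >= 8 { print $8 }'
}

# Three rows taken from the report
sample='09-10T10:46:34      0      0      603       1       0   0.282 sakila.address
09-10T10:46:35      0      1     1000       1       0   0.025 sakila.film_category
09-10T10:46:35      0      0     1000       1       0   0.031 sakila.film_text'

printf '%s\n' "$sample" | tables_with_diffs   # prints sakila.film_category
```

In practice, the output of the pt-table-checksum call would be piped into `tables_with_diffs` directly.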

Show the differences with pt-table-sync

The pt-table-sync tool is the counterpart of the pt-table-checksum utility. It can print or even replay the SQL statements that bring the slave back in sync with the master. We do a dry run first, as the tool is potentially dangerous.

```
pt-table-sync --dry-run  --replicate=percona.checksums --sync-to-master h= -P 3306 \
   -u percona -pSECRET --ignore-databases mysql,percona
# NOTE: --dry-run does not show if data needs to be synced because it
#       does not access, compare or sync data.  --dry-run only shows
#       the work that would be done.
# Syncing via replication P=3306,h=,p=...,u=percona in dry-run mode, without accessing or comparing data
#      0       0      0      0 Chunk     08:57:51 08:57:51 0
#      0       0      0      0 Nibble    08:57:51 08:57:51 0    sakila.film_category
```

With `--dry-run`, you only see the affected tables, but not the actual data, because the tool does not really access the tables in question. Use `--print` in addition to or instead of `--dry-run` to get a list of statements:

```
pt-table-sync --print --replicate=percona.checksums --sync-to-master h= -P 3306 \
  -u percona -pSECRET --ignore-databases mysql,percona
REPLACE INTO `sakila`.`city`(`city_id`, `city`, `country_id`, `last_update`) VALUES \
   ('590', 'Yuncheng', '23', '2006-02-15 04:45:25')
  \ /*percona-toolkit src_db:sakila src_tbl:city  ...
REPLACE INTO `sakila`.`film_category`(`film_id`, `category_id`, `last_update`) VALUES ...
REPLACE INTO `sakila`.`film_category`(`film_id`, `category_id`, `last_update`) VALUES ('997',...
REPLACE INTO `sakila`.`film_category`(`film_id`, `category_id`, `last_update`) VALUES ('998', '11 ...
REPLACE INTO `sakila`.`film_category`(`film_id`, `category_id`, `last_update`) VALUES ('999', '3', ...
REPLACE INTO `sakila`.`film_category`(`film_id`, `category_id`, `last_update`) VALUES ('1000', '5', ...
```

The command shows how we can rename Innsbruck back to Yuncheng and also provides the statements to restore the deleted records. When we replace `--print` with `--execute`, the data gets written to the master and replicated to the slave. To allow this, we need to set the permissions on the master:

```
GRANT INSERT, UPDATE, DELETE ON sakila.* TO 'percona'@'';
pt-table-sync --execute  --replicate=percona.checksums --check-child-tables \
  --sync-to-master h= -P 3306 -u percona -pSECRET --ignore-databases mysql,percona
REPLACE statements on can adversely affect child table `sakila`.`address`
   because it has an ON UPDATE CASCADE foreign key constraint.
   See --[no]check-child-tables in the documentation for more information.
   --check-child-tables error  while doing on
```

This error indicates that updating the city table has consequences, because child tables reference it through a foreign key. In this example, we are bold and ignore this warning. This is absolutely not recommended for real-world scenarios.

```
pt-table-sync --execute  --replicate=percona.checksums --no-check-child-tables \
   --no-foreign-key-checks --sync-to-master h= -P 3306 -u percona -pSECRET \
   --ignore-databases mysql,percona
```

The option `--no-check-child-tables` ignores child tables and the option `--no-foreign-key-checks` ignores foreign keys.

Run the checksum command again to verify that the data has been restored:

```
pt-table-checksum --replicate=percona.checksums --create-replicate-table --empty-replicate-table \
   --recursion-method=dsn=t=percona.DSN_Table --ignore-databases mysql,percona \
   -h -P 3306 -u percona -pSECRET

09-10T11:24:42      0      0      200       1       0   0.268
09-10T11:24:42      0      0      603       1       0   0.033 sakila.address
09-10T11:24:42      0      0       16       1       0   0.029 sakila.category
09-10T11:24:42      0      0      600       1       0   0.275
09-10T11:24:42      0      0      109       1       0   0.023
09-10T11:24:43      0      0      599       1       0   0.282 sakila.customer
09-10T11:24:43      0      0     1000       1       0   0.046
09-10T11:24:43      0      0     5462       1       0   0.284 sakila.film_actor
09-10T11:24:43      0      0     1000       1       0   0.036 sakila.film_category
09-10T11:24:43      0      0     1000       1       0   0.279 sakila.film_text
09-10T11:24:44      0      0     4581       1       0   0.292 sakila.inventory
09-10T11:24:44      0      0        6       1       0   0.031 sakila.language
09-10T11:24:44      0      0    16049       1       0   0.309 sakila.payment
09-10T11:24:44      0      0    16044       1       0   0.325 sakila.rental
09-10T11:24:44      0      0        2       1       0   0.029 sakila.staff
09-10T11:24:44      0      0        2       1       0   0.028
```

0 DIFFS, we are done!

Hibernate Search and Spring Boot: Building Bridges

Hibernate Search is a very convenient way of storing database content in a Lucene index and adding fulltext search capabilities to data-driven projects simply by annotating classes. It can be easily integrated into Spring Boot applications and, as long as only the basic features are used, it works out of the box. The fun starts when the auto-configuration cannot work out how to configure things properly; then it gets tricky quite quickly. Of course this is natural behaviour, but one gets spoiled quite quickly.

Using the latest Features: Hibernate ORM, Hibernate Search and Spring Boot

The current version of Spring Boot is 1.5.2. This version uses Hibernate ORM 5.0. The latest stable Hibernate Search versions are 5.6.1.Final and 5.7.0.Final, which in contrast require Hibernate ORM 5.1 and 5.2 respectively. You also need Java 8 now. For this reason, if you need the latest Hibernate Search features in combination with Spring Boot, you need to adapt the dependencies as follows:

```
buildscript {
	ext {
		springBootVersion = '1.5.1.RELEASE'
	repositories {
	dependencies {

apply plugin: 'java'
apply plugin: 'eclipse'
apply plugin: 'org.springframework.boot'

jar {
	baseName = 'SearchaRoo'
	version = '0.0.1-SNAPSHOT'

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {

dependencies {

	// Hibernate Search
    	exclude group: "org.hibernate:", module: "hibernate-entitymanager"
```

Note that the Hibernate Entity Manager needs to be excluded, because it has been integrated into the core in the new Hibernate version. Details are given in the [Spring Boot documentation][1].

## Enforcing the Dependencies to be Loaded in the Correct Sequence

As written earlier, Spring Boot takes care of a lot of configuration for us. Most of the time, this works perfectly and reduces the pain of configuring a new application manually. In some particular cases, Spring cannot figure out that there is a dependency between different services that needs to be resolved in a specific order. A typical use case is the implementation of FieldBridges for Hibernate Search. FieldBridges translate between the actual object from the Java world and the representation of such an object in the Lucene index. Typically an [EnumBridge][2] is used for indexing Enums, which are often used for realizing internationalization (I18n).

When the Lucene index is created, Hibernate checks whether Enum fields need to be indexed and whether there is a bridge that converts between the object and the actual record in the index. The problem here is that Hibernate JPA is loaded at a very early stage in the Spring Boot startup process. The problem only arises if the bridge class uses @Autowired fields which get injected. Typically, these fields get injected when the AnnotationBeanConfigurerAspect bean is loaded. Hibernate creates the session with the session factory auto-configuration before the Spring configurer aspect bean has been loaded. So the FieldBridge used by Hibernate during the initialization of the index does not have the service injected yet, causing a nasty NullPointerException.

### Example EnumBridge

The following EnumBridge example utilises an injected Service, which needs to be available before Hibernate starts. If not taken care of, this causes a Null Pointer Exception.

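```
@Configurable
public class HibernateSearchEnumBridgeExample extends EnumBridge {

    private I18nMessageService i18nMessageService;

    // Injected by Spring; this is the dependency that must be available early
    @Autowired
    public void setI18nMessageService(I18nMessageService service) {
        this.i18nMessageService = service;
    }

    // Index the translated message instead of the enum constant name
    @Override
    public String objectToString(Object object) {
        return i18nMessageService.getMessageForEnum(object);
    }

    // Delegate to EnumBridge, which knows the concrete enum type
    @Override
    public Enum<? extends Enum> stringToObject(String name) {
        return super.stringToObject(name);
    }

    @Override
    public void setAppliedOnType(Class<?> returnType) {
        super.setAppliedOnType(returnType);
    }
}
```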

Enforce Loading the Aspect Configurer Before the Session Factory

In order to enforce that the AnnotationBeanConfigurerAspect is created before the Hibernate Session Factory, we simply implement our own HibernateJpaAutoConfiguration by extension and add the AnnotationBeanConfigurerAspect to the constructor. Spring Boot now knows that it needs to instantiate the AnnotationBeanConfigurerAspect before it can instantiate the HibernateJpaAutoConfiguration, and we then have wired beans ready for consumption by the bridge. I found the correct hint [here][3] and [here][4].

```
public class HibernateSearchConfig extends HibernateJpaAutoConfiguration {

	public HibernateSearchConfig(DataSource dataSource, JpaProperties jpaProperties,
				AnnotationBeanConfigurerAspect beanConfigurerAspect,
				ObjectProvider<JtaTransactionManager> jtaTransactionManager,
				ObjectProvider<TransactionManagerCustomizers> transactionManagerCustomizers) {

			super(dataSource, jpaProperties, jtaTransactionManager, transactionManagerCustomizers);
	}
}
```

As it turned out, using @DependsOn annotations did not work, and using @Order to set the precedence of the beans was not sufficient either. With this little hack, we can ensure the correct sequence of initialization.


Deploying MySQL in a Local Development Environment

Installing MySQL via apt-get is a simple task, but the migration between different MySQL versions requires planning and testing. Thus installing one central instance of the database system might not be suitable, when the version of MySQL or project specific settings should be switched quickly without interfering with other applications. Using one central instance can quickly become cumbersome. In this article, I will describe how any number of MySQL instances can be stored and executed from within a user’s home directory.

Adapting MySQL Data and Log File Locations

Some scenarios might require running several MySQL instances at once; other scenarios involve sensitive data, where we do not want MySQL to write any data to non-encrypted partitions. This is especially true for devices which can easily get stolen, for instance laptops. If you use a laptop for developing your applications from time to time, chances are good that you need to store sensitive data in a database and have to make sure that the data is encrypted when at rest.

This can be solved with full disk encryption, but this technique has several disadvantages. First of all, full disk encryption only utilises one password. This entails that several users who share a device also need to share one password, which reduces the reliability of this approach. Also, when the system needs to be rebooted, full disk encryption can become an obstacle, which increases the complexity further.

Much easier to use is the transparent home directory encryption, which many modern Linux installers offer out of the box. We will use this encryption type for this article, as it is reasonably secure and easy to set up. Our goal is to store all MySQL related data in the home directory and run MySQL with normal user privileges.

Creating the Directory Structure

The first step is creating a directory structure for storing the data. In this example, the user name is stefan, please adapt to your needs.

A MySQL 5.7 Cluster Based on Ubuntu 16.04 LTS – Part 2

In a recent article, I described how to set up a basic MySQL Cluster with two data nodes and a combined SQL and management node. In this article, I am going to highlight a few more things and we are going to adapt the cluster a little bit.

Using Hostnames

To make our lives easier, we can use hostnames, which are easier to remember than IP addresses. Hostnames can be specified for each VM in the file /etc/hosts. For each request to the hostname, the operating system will look up the corresponding IP address. We need to change this file on all three nodes to the following example:

A MySQL 5.7 Cluster Based on Ubuntu 16.04 LTS – Part 1

A Cluster Scenario

In this example we create the smallest possible MySQL cluster based on four nodes running on three machines. Node 1 will run the cluster management software, Node 2 and Node 3 will serve as data nodes and Node 4 is the MySQL API, which runs on the same VM as Node 1.

Parsing SQL Statements

JDBC and the Limits of ResultSet Metadata

For my work in the area of data citation, I need to analyse queries which are used for creating subsets. I am particularly interested in query parameters, sortings and filters. One of the most commonly used query languages is SQL, which is used by many relational database management systems such as MySQL. In some cases, the interaction with databases is abstract, meaning that there are hardly any SQL statements executed directly. The SQL statements are rather built on the fly by object relational mappers such as Hibernate. Other scenarios use SQL statements as Strings and also prepared statements, which are executed via JDBC. However, analysing SQL statements is tricky as the language is very flexible.

In order to understand what columns have been selected, it is sufficient to utilise the ResultSet Metadata and retrieve the column names from there. In my case I need to extract this information from the query in advance and potentially enforce a specific sorting by adding columns to the ORDER BY clause. In this scenario, I need to parse the SQL statement and retrieve this information from the statement itself. Probably the best way to do this would be to implement a parser for the SQL dialect with ANTLR (ANother Tool for Language Recognition). But this is quite a challenge, so I decided to take a shortcut: FoundationDB.

The FoundationDB Parser

FoundationDB was a NoSQL database which provided several layers for supporting different paradigms at once. I am using past tense here, because the project was acquired by Apple in 2015 and the company has not pursued the open source project since then. However, the Maven libraries for the software are still available at Maven Central. FoundationDB uses its own SQL parser, which understands standard SQL queries. These queries can be interpreted as a tree, and the parser library allows traversing SQL statements and analysing the nodes. We can use this tree to parse and interpret SQL statements and extract additional information.

The Foundations of FoundationDB

The FoundationDB parser can be included into your own project with the following Maven dependency:


The usage of the parser is straightforward. We use the following example SQL statement as input:

	SELECT a.firstColumn, b.secondColumn, b.thirdColumn
	FROM tableA AS a, tableB AS b 
	WHERE a.firstColumn = b.secondColumn AND 
	b.thirdColumn < 5 
	ORDER BY a.thirdColumn,a.secondColumn DESC

The following function calls the parser and prints the tree of the statement.

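	/**
	 * Print a SQL statement
	 * @param sqlString the SQL statement to parse
	 */
	public void parseSQLString(String sqlString) {
		SQLParser parser = new SQLParser();
		StatementNode stmt;
		try {
			stmt = parser.parseStatement(sqlString);
			// print the tree of the parsed statement
			stmt.treePrint();
		} catch (StandardException e) {
			e.printStackTrace();
		}
	}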
The resulting tree is listed below. The statement has also been normalized, which ensures a stable sequence of the parameters.

name: null
statementType: SELECT
	isDistinct: false

		exposedName: firstcolumn
		name: firstcolumn
		tableName: null
		isDefaultColumn: false
		type: null
			columnName: firstcolumn
			tableName: a
			type: null
		exposedName: secondcolumn
		name: secondcolumn
		tableName: null
		isDefaultColumn: false
		type: null
			columnName: secondcolumn
			tableName: b
			type: null
		exposedName: thirdcolumn
		name: thirdcolumn
		tableName: null
		isDefaultColumn: false
		type: null
			columnName: thirdcolumn
			tableName: b
			type: null

		tableName: tablea
		updateOrDelete: null
		correlation Name: a
		tableName: tableb
		updateOrDelete: null
		correlation Name: b
		operator: and
		methodName: and
		type: null
			operator: =
			methodName: equals
			type: null
				columnName: firstcolumn
				tableName: a
				type: null
				columnName: secondcolumn
				tableName: b
				type: null
			operator: <
			methodName: lessThan
			type: null
				columnName: thirdcolumn
				tableName: b
				type: null
				value: 5
	allAscending: false
	ascending: true
	nullsOrderedLow: false
	columnPosition: -1
		columnName: thirdcolumn
		tableName: a
		type: null
	ascending: false
	nullsOrderedLow: false
	columnPosition: -1
		columnName: secondcolumn
		tableName: a
		type: null

This tree offers a lot of information, which can be used programmatically as well. At the top of the output, we can see that the statement was a SELECT statement and that it was not DISTINCT. Then follows the ResultSet, which contains a list of the three ResultColumns which have been specified in the SELECT clause. We can see the column names and the table names from which they are drawn. The next block provides the referenced tables (the FROM list) and their alias names. The WHERE block contains the operands which have been used for filtering and, last but not least, there is the list of ORDER BY clauses and their sorting directions.

The Visitor

In order to access the information shown above programmatically, we need to access the content of the node one by one. This can be achieved with the visitor pattern, which traverses all the nodes of the tree. The following listing shows how the visitor pattern can be used for accessing the list of columns from the SELECT clause.

	/**
	 * Return a list of columns of a SELECT clause
	 * @param sql the SQL statement
	 * @return the result columns of the SELECT clause
	 */
	public ArrayList<ResultColumn> selectColumnsList(String sql) {
		SQLParser parser = new SQLParser();
		BooleanNormalizer normalizer = new BooleanNormalizer(parser);
		StatementNode stmt = null;

		try {
			stmt = parser.parseStatement(sql);
			stmt = normalizer.normalize(stmt);
		} catch (StandardException e) {
			e.printStackTrace();
		}

		final ArrayList<ResultColumn> columns = new ArrayList<ResultColumn>();
		Visitor v = new Visitor() {

			@Override
			public boolean visitChildrenFirst(Visitable node) {
				if (node instanceof SelectNode)
					return true;
				return false;
			}

			@Override
			public Visitable visit(Visitable node) throws StandardException {
				if (node instanceof ResultColumn) {
					ResultColumn resultColumn = (ResultColumn) node;
					// collect the column, as described in the text below
					columns.add(resultColumn);
					System.out.println("Column " + columns.size() + ": " + resultColumn.getName());
				}
				// keep the node unchanged in the tree
				return node;
			}

			@Override
			public boolean stopTraversal() {
				return false;
			}

			@Override
			public boolean skipChildren(Visitable node) throws StandardException {
				if (node instanceof FromList) {
					return true;
				}
				return false;
			}
		};

		try {
			// apply the visitor to the statement, which starts the traversal
			stmt.accept(v);
		} catch (StandardException e) {
			e.printStackTrace();
		}

		return columns;
	}

In this code example, we define a visitor which traverses all the ResultColumn nodes. Every time the current node is an instance of ResultColumn, we add this node to our list of columns. The nodes are only visited if they are children of a SELECT statement. This is our entry point into the tree. We leave the tree when we reach the FROM list. We then apply the visitor to the statement, which initiates the traversal. As a result, we receive a list of columns which have been used for the result set.
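The traversal logic itself does not depend on the parser. The following sketch rebuilds it on a toy tree with hypothetical node types (`Node`, `Visitor` and the `kind` strings are assumptions made for illustration and not part of the FoundationDB parser API): the visitor collects every ResultColumn and refuses to descend into the FROM list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A toy tree with hypothetical node types; this is not the FoundationDB parser API.
final class VisitorSketch {

    interface Visitor {
        void visit(Node node);
        boolean skipChildren(Node node);
    }

    static final class Node {
        final String kind;   // e.g. "Select", "ResultColumn", "FromList"
        final String name;
        final List<Node> children;

        Node(String kind, String name, Node... children) {
            this.kind = kind;
            this.name = name;
            this.children = Arrays.asList(children);
        }

        // Visit this node, then descend unless the visitor asks to skip the subtree.
        void accept(Visitor visitor) {
            visitor.visit(this);
            if (!visitor.skipChildren(this)) {
                for (Node child : children) {
                    child.accept(visitor);
                }
            }
        }
    }

    // Collects the names of all ResultColumn nodes and ignores the FROM list.
    static List<String> selectColumns(Node root) {
        final List<String> columns = new ArrayList<>();
        root.accept(new Visitor() {
            @Override
            public void visit(Node node) {
                if (node.kind.equals("ResultColumn")) {
                    columns.add(node.name);
                }
            }

            @Override
            public boolean skipChildren(Node node) {
                return node.kind.equals("FromList");
            }
        });
        return columns;
    }

    // A tree shaped roughly like the SELECT statement used in this article.
    static Node exampleTree() {
        return new Node("Select", null,
                new Node("ResultColumn", "firstcolumn"),
                new Node("ResultColumn", "secondcolumn"),
                new Node("ResultColumn", "thirdcolumn"),
                new Node("FromList", null,
                        new Node("Table", "tablea"),
                        new Node("Table", "tableb")));
    }

    public static void main(String[] args) {
        System.out.println(selectColumns(exampleTree()));
    }
}
```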

In order to get the list of ORDER BY columns, we can utilise a similar approach. The following function gives an example:

	/**
	 * Return list of order by clauses
	 * @param sqlText the SQL statement
	 * @return the ORDER BY columns of the statement
	 */
	public OrderByList orderByColumns(String sqlText) {

		SQLParser parser = new SQLParser();
		BooleanNormalizer normalizer = new BooleanNormalizer(parser);
		StatementNode stmt;
		OrderByList orderByList = null;
		try {
			stmt = parser.parseStatement(sqlText);
			stmt = normalizer.normalize(stmt);
			CursorNode node = (CursorNode) stmt;
			orderByList = node.getOrderByList();
			int i = 0;
			for (OrderByColumn orderByColumn : orderByList) {
				// derive the direction from the ascending flag shown in the tree
				String direction = orderByColumn.isAscending() ? "ASC" : "DESC";
				System.out.println("ORDER BY Column " + i + ": " + orderByColumn.getExpression().getColumnName() + " Direction: " + direction);
				i++;
			}
		} catch (StandardException e) {
			e.printStackTrace();
		}
		return orderByList;
	}

This time, we retrieve the list of ORDER BY columns directly from the CursorNode. Similar principles can be used for manipulating SQL statements, for instance to apply a different sorting.
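What enforcing a specific sorting could look like, once the ORDER BY columns are known, is sketched below on plain strings. This does not use the parser; the class and method names are hypothetical, and the sketch assumes that the column names have already been normalized and that enforced columns are simply appended in ascending order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; it works on strings, not on parser nodes.
final class OrderBySketch {

    // Keeps the existing sort columns (name -> ascending?) in their order and
    // appends each enforced column ascending unless it is already sorted on.
    static String enforceOrder(Map<String, Boolean> existing, List<String> enforced) {
        Map<String, Boolean> merged = new LinkedHashMap<>(existing);
        for (String column : enforced) {
            merged.putIfAbsent(column, Boolean.TRUE);
        }
        if (merged.isEmpty()) {
            return "";
        }
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, Boolean> entry : merged.entrySet()) {
            parts.add(entry.getKey() + (entry.getValue() ? " ASC" : " DESC"));
        }
        return "ORDER BY " + String.join(", ", parts);
    }

    // Uses the ORDER BY columns of the example statement from this article.
    static String example() {
        Map<String, Boolean> existing = new LinkedHashMap<>();
        existing.put("a.thirdcolumn", true);
        existing.put("a.secondcolumn", false);
        return enforceOrder(existing, Arrays.asList("a.secondcolumn", "a.firstcolumn"));
    }

    public static void main(String[] args) {
        // prints ORDER BY a.thirdcolumn ASC, a.secondcolumn DESC, a.firstcolumn ASC
        System.out.println(example());
    }
}
```

The existing direction of a column wins over the enforced default, so a statement that already sorts by a column keeps its meaning.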

Persistent Data in a MySQL Docker Container

Running MySQL in Docker

In a recent article on Docker in this blog, we presented some basics for dealing with data in containers. This article will present another popular application for Docker: MySQL containers. Running MySQL instances in Docker allows isolating database infrastructure with ease.

Connecting to the Standard MySQL Container

The description of the MySQL docker image provides a lot of useful information on how to launch and connect to a MySQL container. The first step is to create a standard MySQL container from the latest available image.

sudo docker run --name mysql-instance \
   -e MYSQL_ROOT_PASSWORD=secret \
   -p 3307:3306 \
   -d mysql:latest

This creates a MySQL container where the root password is set to secret. As the host is already running its own MySQL instance (which has nothing to do with this docker example), the standard port 3306 is already taken. Thus we publish port 3307 on the host system and forward it to the standard port 3306 of the container.

Connect from the Host

We can then connect from the command line like this:

mysql -uroot -psecret -h 127.0.0.1 -P3307

We could also provide the hostname localhost for connecting to the container, but as the MySQL client by default assumes that a localhost connection goes via a socket, this would not work. Thus, when using the hostname localhost, we need to specify the protocol TCP, so that the client connects via the network interface.

mysql -uroot -psecret -h localhost --protocol TCP -P3307

Connect from other Containers

Connecting from a different container to the MySQL container is pretty straightforward. Docker allows linking two containers and then using the exposed ports between them. The following command creates a new Ubuntu container and links it to the MySQL container.

sudo docker run -it --name ubuntu-container --link mysql-instance:mysql-link ubuntu:16.10 bash

After this command, you are in the terminal of the Ubuntu container. We then need to install the MySQL client for testing:

# Fetch the package list
root@7a44b3e7b088:/# apt-get update
# Install the client
root@7a44b3e7b088:/# apt-get install mysql-client
# Show environment variables
root@7a44b3e7b088:/# env

The last command gives you a list of environment variables, among which is the IP address and port of the MySQL container.


You can then connect either manually or by providing the variables

mysql -uroot -psecret -h

If you only require a MySQL client inside a container, simply use the MySQL image from docker. Batteries included!

Hikari Connection Pooling with a MySQL Backend, Hibernate and Maven

Connection Pooling?

JDBC connection pooling is a great concept, which improves the performance of database driven applications by reusing connections. The benefit from connection pools is that the cost of creating and closing connections is avoided, by reusing connections from a pool of available connections. Database systems such as MySQL also assign database resources by limiting simultaneous connections. This is another reason, why connection pools have benefits in contrast to opening and closing individual connections.
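The idea of reusing instead of re-creating can be shown with a toy pool. This is only a sketch under simplifying assumptions: `PoolSketch` is a hypothetical class, the pooled objects are plain `Object` instances instead of JDBC connections, and real pools such as HikariCP additionally validate, time out and evict connections.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Toy pool for illustration only; not a replacement for a real connection pool.
final class PoolSketch<T> {

    private final BlockingQueue<T> idle;

    PoolSketch(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get()); // pay the creation cost once, up front
        }
    }

    // Blocks until a pooled object is free, which also caps concurrent use.
    T borrow() {
        try {
            return idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a pooled object", e);
        }
    }

    // Hands the object back to the pool instead of closing it.
    void giveBack(T resource) {
        idle.add(resource);
    }

    // Serves the given number of requests and reports how many objects were created.
    static int creationsFor(int poolSize, int requests) {
        AtomicInteger created = new AtomicInteger();
        PoolSketch<Object> pool = new PoolSketch<>(poolSize, () -> {
            created.incrementAndGet();
            return new Object();
        });
        for (int i = 0; i < requests; i++) {
            Object connection = pool.borrow();
            pool.giveBack(connection);
        }
        return created.get();
    }

    public static void main(String[] args) {
        System.out.println(creationsFor(2, 100)); // prints 2
    }
}
```

A hundred requests are served by only two created objects, and the fixed queue size is what keeps the number of simultaneous connections below the limit of the database server.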

Dipping into Pools

There exists a selection of different JDBC compatible connection pools which can be used more or less interchangeably. The most widely used pools are:

Most of these pools work in a very similar way. In the following tutorial, we are going to take HikariCP out for a spin. It is simple to use and claims to be very fast. We are going to set up a small project using the following technologies:

  • Java 8
  • Tomcat 8
  • MySQL 5.7
  • Maven 3
  • Hibernate 5

and of course an IDE of your choice (I have become quite fond of IntelliJ IDEA Community Edition).

Project Overview

In this small demo project, we are going to write a minimalistic Web application, which simply computes a new random number for each request and stores the result in a database table. We use Java and store the data by using the Hibernate ORM framework. We also assume that you have a running Apache Tomcat Servlet Container and a running MySQL instance available.

In the first step, I created a basic Web project by selecting the Maven Webapp archetype, which then creates a basic structure we can work with.

Adding the Required Libraries

After we created the initial project, we need to add the required libraries. We can achieve this easily with Maven, by adding the dependency definitions to our pom.xml file. You can find these definitions at maven central. The build block contains the plugin for deploying the application at the Tomcat server.

<project xmlns="" xmlns:xsi=""
  <name>HibernateHikari Maven Webapp</name>






Now we have all the libraries we need available and we can begin with implementing the functionality.

The Database Table

As we want to persist random numbers, we need to have a database table, which will store the data. Create the following table in MySQL and ensure that you have a test user available:

CREATE TABLE `TestDB`.`RandomNumberTable` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `randomNumber` INT NOT NULL,
  PRIMARY KEY (`id`));

## POJO Mojo: The Java Class to be Persisted

Hibernate allows us to persist Java objects in the database, by annotating the Java source code. The following Java class is used to store the random numbers that we generate.

import javax.persistence.*;

@Entity
@Table(name="RandomNumberTable", uniqueConstraints={@UniqueConstraint(columnNames={"id"})})
public class RandomNumberPOJO {

    @Id
    @GeneratedValue(strategy=GenerationType.IDENTITY)
    @Column(name="id", nullable=false, unique=true, length=11)
    private int id;

    @Column(name="randomNumber", nullable=false)
    private int randomNumber;

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }

    public int getRandomNumber() {
        return randomNumber;
    }

    public void setRandomNumber(int randomNumber) {
        this.randomNumber = randomNumber;
    }
}

The code and also the annotations are straight forward. Now we need to define a way how we can connect to the database and let Hibernate handle the mapping between the Java class and the database schema we defined before.

## Hibernate Configuration

Hibernate looks for the configuration in a file called hibernate.cfg.xml by default. This file is used to provide the connection details for the database.

<hibernate-configuration>
  <session-factory>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.connection.provider_class">com.zaxxer.hikari.hibernate.HikariConnectionProvider</property>
    <property name="hibernate.hikari.dataSource.url">jdbc:mysql://localhost:3306/TestDB?useSSL=false</property>
    <property name="hibernate.hikari.dataSource.user">testuser</property>
    <property name="hibernate.hikari.dataSource.password">sEcRet</property>
    <property name="hibernate.hikari.dataSourceClassName">com.mysql.jdbc.jdbc2.optional.MysqlDataSource</property>
    <property name="hibernate.hikari.dataSource.cachePrepStmts">true</property>
    <property name="hibernate.hikari.dataSource.prepStmtCacheSize">250</property>
    <property name="hibernate.hikari.dataSource.prepStmtCacheSqlLimit">2048</property>
    <property name="hibernate.hikari.dataSource.useServerPrepStmts">true</property>
    <property name="hibernate.current_session_context_class">thread</property>
  </session-factory>
</hibernate-configuration>


The file above contains the most essential settings. We specify the database dialect that we speak `org.hibernate.dialect.MySQLDialect`, define the connection provider class (the Hikari CP) with `com.zaxxer.hikari.hibernate.HikariConnectionProvider` and provide the URL to our MySQL database (`jdbc:mysql://localhost:3306/TestDB?useSSL=false`) including the username and password for the database connection. Alternatively, you can also define the same information in the file.

## The Session Factory

We need to have a session factory, which initializes the database connection and the connection pool as well as handles the interaction with the database server. We can use the following class, which provides the session object for these tasks.

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import org.hibernate.SessionFactory;
import org.hibernate.boot.registry.StandardServiceRegistryBuilder;
import org.hibernate.cfg.Configuration;
import org.hibernate.service.ServiceRegistry;
import org.jboss.logging.Logger;

@WebListener
public class HibernateSessionFactoryListener implements ServletContextListener {

    public final Logger logger = Logger.getLogger(HibernateSessionFactoryListener.class);

    public void contextDestroyed(ServletContextEvent servletContextEvent) {
        SessionFactory sessionFactory = (SessionFactory) servletContextEvent.getServletContext().getAttribute("SessionFactory");
        if (sessionFactory != null && !sessionFactory.isClosed()) {
            logger.info("Closing sessionFactory");
            sessionFactory.close();
        }
        logger.info("Released Hibernate sessionFactory resource");
    }

    public void contextInitialized(ServletContextEvent servletContextEvent) {
        Configuration configuration = new Configuration();
        configuration.configure("hibernate.cfg.xml");
        // Add annotated class
        configuration.addAnnotatedClass(RandomNumberPOJO.class);

        ServiceRegistry serviceRegistry = new StandardServiceRegistryBuilder().applySettings(configuration.getProperties()).build();
        logger.info("ServiceRegistry created successfully");
        SessionFactory sessionFactory = configuration.buildSessionFactory(serviceRegistry);
        logger.info("SessionFactory created successfully");

        servletContextEvent.getServletContext().setAttribute("SessionFactory", sessionFactory);
        logger.info("Hibernate SessionFactory Configured successfully");
    }
}


This class provides two callbacks: one in which the session factory gets initialized and a second one in which it gets destroyed. The Tomcat Servlet container calls these automatically when the web application is started and stopped. You can see that the filename of the configuration file is provided (`configuration.configure("hibernate.cfg.xml");`) and that we tell Hibernate to map our RandomNumberPOJO class (`configuration.addAnnotatedClass(RandomNumberPOJO.class);`). Now all that is missing is the Web component, which is waiting for our requests.

## The Web Component

The last part is the Web component, which we kept as simple as possible.

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

import javax.persistence.TypedQuery;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Random;

public class HelloServlet extends HttpServlet {

    public void doGet(HttpServletRequest req, HttpServletResponse res) throws ServletException, IOException {
        PrintWriter out = res.getWriter();
        addRandomNumber(req);
        out.println("There are " + countNumbers(req) + " random numbers");

        List<RandomNumberPOJO> numbers = getAllRandomNumbers(req, res);

        out.println("Random Numbers:");

        for (RandomNumberPOJO record : numbers) {
            out.println("ID: " + record.getId() + "\t :\t" + record.getRandomNumber());
        }
    }

    /**
     * Create a new random number and store it in the database
     * @param request
     */
    private void addRandomNumber(HttpServletRequest request) {
        SessionFactory sessionFactory = (SessionFactory) request.getServletContext().getAttribute("SessionFactory");

        Session session = sessionFactory.getCurrentSession();
        Transaction tx = session.beginTransaction();
        RandomNumberPOJO randomNumber = new RandomNumberPOJO();
        Random rand = new Random();
        // random integer between 1 and 1000 (inclusive)
        int randomInteger = 1 + rand.nextInt((999) + 1);
        randomNumber.setRandomNumber(randomInteger);
        // persist the new object and end the transaction
        session.save(randomNumber);
        tx.commit();
    }

    /**
     * Get a list of all RandomNumberPOJO objects
     * @param request
     * @param response
     * @return
     */
    private List<RandomNumberPOJO> getAllRandomNumbers(HttpServletRequest request, HttpServletResponse response) {
        SessionFactory sessionFactory = (SessionFactory) request.getServletContext().getAttribute("SessionFactory");
        Session session = sessionFactory.getCurrentSession();
        Transaction tx = session.beginTransaction();
        TypedQuery<RandomNumberPOJO> query = session.createQuery(
                "from RandomNumberPOJO", RandomNumberPOJO.class);

        List<RandomNumberPOJO> numbers = query.getResultList();
        tx.commit();

        return numbers;
    }

    /**
     * Count records
     * @param request
     * @return
     */
    private int countNumbers(HttpServletRequest request) {
        SessionFactory sessionFactory = (SessionFactory) request.getServletContext().getAttribute("SessionFactory");
        Session session = sessionFactory.getCurrentSession();
        Transaction tx = session.beginTransaction();

        String count = session.createQuery("SELECT COUNT(id) FROM RandomNumberPOJO").uniqueResult().toString();
        tx.commit();

        int rowCount = Integer.parseInt(count);

        return rowCount;
    }
}

This class provides the actual servlet and is executed whenever a user calls the web application. First, a new RandomNumberPOJO object is instantiated and persisted. We then count how many numbers we already have and fetch a list of all existing records.

The last step before we can actually run the application is the definition of the web entry points, which we can define in the file called web.xml. This file is already generated by the Maven archetype and we only need to add a name for our small web service and provide a mapping for the entry class.

HikariCP Test App




Compile and Run

We can then compile and deploy the application with the following command:

mvn clean install org.apache.tomcat.maven:tomcat7-maven-plugin:2.0:deploy -e

This will compile and upload the application to the Tomcat server. We can then open the URL http://localhost:8080/testapp/hello in our browser to create and persist random numbers by refreshing the page. The result will look similar to this:

Data Wrangling with csvkit and SQLite

As mentioned earlier, csvkit is a very convenient tool for handling comma-separated text files, especially when they are too large to be processed with conventional spreadsheet applications like Excel or Libre Office Calc. The limits of office programs can be reached rather easily, especially when dealing with scientific data. Open Office Calc supports the following limits.

  • maximum number of rows: 1,048,576
  • maximum number of columns: 1,024
  • maximum number of sheets: 256

Excel also offers 1,048,576 rows but provides 16,384 columns. SQLite, in contrast, allows 2000 columns by default and, if really needed, up to 32767 columns if compiled with a specific setting. In terms of row storage, SQLite provides a theoretical maximum of 2^64 (18446744073709551616) rows. This limit is unreachable, since the maximum database size of 140 terabytes will be reached first.
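The size of the row limit is easier to grasp with a bit of arithmetic. The following sketch recomputes 2 to the power of 64 with `BigInteger`, since the value no longer fits into a Java `long`, and divides it by the 1,048,576 rows of a single spreadsheet. The class name is made up for this example.

```java
import java.math.BigInteger;

// Hypothetical helper class that only redoes the arithmetic from the text.
final class RowLimitSketch {

    // Theoretical maximum number of rows in SQLite: 2 to the power of 64.
    static BigInteger maxRows() {
        return BigInteger.valueOf(2).pow(64);
    }

    // Number of completely filled spreadsheets (1,048,576 rows each) needed for that many rows.
    static BigInteger sheetsNeeded() {
        return maxRows().divide(BigInteger.valueOf(1_048_576L));
    }

    public static void main(String[] args) {
        System.out.println(maxRows());      // prints 18446744073709551616
        System.out.println(sheetsNeeded()); // prints 17592186044416
    }
}
```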

The limits we discussed will not be hit by our example of air traffic data. You can download the sample file, which currently describes 47409 airports in the CSV format, from the linked web page.

$: csvstat airports.csv 
  1. id
	<type 'int'>
	Nulls: False
	Min: 2
	Max: 316827
	Sum: 2112054844
	Mean: 44549.6602755
	Median: 23847
	Standard Deviation: 77259.1794792
	Unique values: 47409
  2. ident
	<type 'unicode'>
	Nulls: False
	Unique values: 47409
	Max length: 7
  3. type
	<type 'unicode'>
	Nulls: False
	Unique values: 7
	5 most frequent values:
		small_airport:	30635
		heliport:	9098
		medium_airport:	4536
		closed:	1623
		seaplane_base:	927
	Max length: 14

This little command provides us with the statistics of the columns in the file. We see that the file we provided offers 18 columns and we also can immediately see the column types, if there are null values and what the 5 most frequent values are. If we are interested in a list of columns only, we can print them with the following command.

$: csvcut -n  airports.csv 
  1: id
  2: ident
  3: type
  4: name
  5: latitude_deg
  6: longitude_deg
  7: elevation_ft
  8: continent
  9: iso_country
 10: iso_region
 11: municipality
 12: scheduled_service
 13: gps_code
 14: iata_code
 15: local_code
 16: home_link
 17: wikipedia_link
 18: keywords

We can also use the csvcut command for, as you might expect, cutting specific columns from the CSV file, in order to reduce the size of the file and only retrieve the columns that we are interested in. Imagine you would like to create a list of all airports per region. Simply cut the columns you need and redirect the output into a new file. The tool csvlook provides us with a MySQL-style preview of the data.

$: csvcut --columns=name,iso_country,iso_region  airports.csv > airports_country_region.csv
$: csvlook airports_country_region.csv | head -n 15
|  name                                                                          | iso_country | iso_region  |
|  Total Rf Heliport                                                             | US          | US-PA       |
|  Lowell Field                                                                  | US          | US-AK       |
|  Epps Airpark                                                                  | US          | US-AL       |
|  Newport Hospital & Clinic Heliport                                            | US          | US-AR       |
|  Cordes Airport                                                                | US          | US-AZ       |
|  Goldstone /Gts/ Airport                                                       | US          | US-CA       |
|  Cass Field                                                                    | US          | US-CO       |
|  Grass Patch Airport                                                           | US          | US-FL       |
|  Ringhaver Heliport                                                            | US          | US-FL       |
|  River Oak Airport                                                             | US          | US-FL       |
|  Lt World Airport                                                              | US          | US-GA       |
|  Caffrey Heliport                                                              | US          | US-GA       |

We could then sort the list of airports alphabetically in reverse and write the new list into a new file. We specify the name of the column by which we want to sort the file, and we measure the execution time by prepending the command time.

$: time csvsort -c "name" --delimiter="," --reverse airports_country_region.csv > airports_country_region_sorted.csv

real    0m1.177s
user    0m1.135s
sys    0m0.042s

A nice feature of csvkit is its option to query CSV files with SQL. You can formulate SELECT queries and it even supports joins and other tricks. Thus you can achieve the same result with just one SQL query.

$: time csvsql -d ',' --query="SELECT name,iso_country,iso_region FROM airports ORDER BY name DESC" airports.csv > sql_airports_country_region.csv

real	0m11.626s
user	0m11.532s
sys	0m0.090s

Obviously, this is not the fastest possibility and may not be suitable for larger data sets. But csvkit offers more: You can create SQL tables automatically by letting csvkit browse through your CSV files. It will try to guess the column type, the appropriate field length and even constraints.

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports airports.csv 
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

$: csvsql -i sqlite -d ',' --db-schema AirportDB --table Airports -u 0 airports.csv > airport_schema.sql

The second command in the listing above simply stores the table definition in a separate file. We can import this CREATE TABLE statement by reading the file in SQLite. Change to the folder where you downloaded SQLite3 and create a new database called AirportDB. The following listing contains SQL style comments (starting with --) in order to improve readability.

./sqlite3 AirportDB.sqlite

SQLite version 3.9.2 2015-11-02 18:31:45
Enter ".help" for usage hints.
-- change the command separator from its default | to ;
sqlite> .separator ;
-- read the SQL file we created before
sqlite> .read /home/stefan/datawrangling/airport_schema.sql
-- list all tables
sqlite> .tables
-- print table schema

sqlite> .schema Airports
CREATE TABLE "Airports" (
	id INTEGER NOT NULL, 
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);

You can also achieve the same results directly from Bash, simply by piping the SQL file to the database.

$: cat ~/datawrangling/airport_schema.sql | ./sqlite3 AirportDB.sqlite
$: ./sqlite3 AirportDB.sqlite ".tables"
$: ./sqlite3 AirportDB.sqlite ".schema Airports"
CREATE TABLE "Airports" (
	ident VARCHAR(7) NOT NULL, 
	type VARCHAR(14) NOT NULL, 
	name VARCHAR(77) NOT NULL, 
	latitude_deg FLOAT NOT NULL, 
	longitude_deg FLOAT NOT NULL, 
	elevation_ft INTEGER, 
	continent VARCHAR(4), 
	iso_country VARCHAR(4), 
	iso_region VARCHAR(7) NOT NULL, 
	municipality VARCHAR(60), 
	scheduled_service BOOLEAN NOT NULL, 
	gps_code VARCHAR(4), 
	iata_code VARCHAR(4), 
	local_code VARCHAR(7), 
	home_link VARCHAR(128), 
	wikipedia_link VARCHAR(128), 
	keywords VARCHAR(173), 
	CHECK (scheduled_service IN (0, 1))
);
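The same step can also be scripted. The following is a minimal sketch, assuming Python 3 with its built-in sqlite3 module. The shortened schema and the temporary file locations are stand-ins for illustration and not the real airport_schema.sql.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

# Shortened, hypothetical stand-in for the generated airport_schema.sql
SCHEMA_SQL = """
CREATE TABLE "Airports" (
    ident VARCHAR(7) NOT NULL,
    name VARCHAR(77) NOT NULL,
    scheduled_service BOOLEAN NOT NULL,
    CHECK (scheduled_service IN (0, 1))
);
"""


def load_schema(db_path, schema_path):
    """Run all statements of the SQL file and return the table names."""
    sql = Path(schema_path).read_text(encoding="utf-8")
    with closing(sqlite3.connect(db_path)) as conn:
        conn.executescript(sql)
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


# Demo in a temporary directory so that no real files are touched
with tempfile.TemporaryDirectory() as workdir:
    schema_file = Path(workdir) / "airport_schema.sql"
    schema_file.write_text(SCHEMA_SQL, encoding="utf-8")
    tables = load_schema(Path(workdir) / "AirportDB.sqlite", schema_file)

print(tables)
```

Querying sqlite_master plays the role of the .tables command from the listings above.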

We created a complex SQL table by automatically parsing a CSV file. The same approach works for Excel spreadsheets and any other data that can be exported to CSV. The great thing about csvkit is that it supports a large variety of database dialects. You can use the same command and adapt the -i parameter for the following database systems:

  • access
  • sybase
  • sqlite
  • informix
  • firebird
  • mysql
  • oracle
  • maxdb
  • postgresql
  • mssql

All major systems are supported, which is a great benefit. Now that we have the schema ready, we need to import the data into the SQLite database. We can use the SQLite client to import the CSV file, but we immediately run into a problem: the 12th column contains boolean values, as correctly identified by csvkit. When we inspect the file again with csvlook, we can see that the column contains ‘yes’ and ‘no’ values. Unfortunately, SQLite does not understand this notation of boolean values. It expects 0 for false and 1 for true, as described in the data types documentation, and the CHECK constraint in our schema rejects anything else. We have two options. We could replace the values yes and no with their corresponding integers, for instance with awk:

$: awk -F, 'NR>1 { $12 = ($12 == "\"no\"" ? 0 : $12) } 1' OFS=,  airports.csv > airports_no.csv
$: awk -F, 'NR>1 { $12 = ($12 == "\"yes\"" ? 1 : $12) } 1' OFS=,  airports_no.csv > airports_yes.csv
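Note that awk with -F, splits on every comma, including commas inside quoted fields, so the column count can shift for rows whose names contain a comma. A minimal sketch of the same replacement with Python's csv module, which parses quoted fields, could look like this. The function name and the two sample rows are made up for illustration, and the column is addressed by its header name scheduled_service instead of its position.

```python
import csv
import io

# Mapping from the textual notation to the integers SQLite expects
BOOLEAN_WORDS = {"yes": "1", "no": "0"}


def convert_booleans(source, target, column="scheduled_service"):
    """Copy CSV rows from source to target, replacing yes/no in one column."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(
        target, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        value = row[column].strip().lower()
        # Values other than yes/no are passed through unchanged
        row[column] = BOOLEAN_WORDS.get(value, row[column])
        writer.writerow(row)


# Made-up sample rows; the second name contains a comma on purpose
sample = io.StringIO(
    "ident,name,scheduled_service\n"
    '"X1","Example Heliport","no"\n'
    '"X2","Example, International","yes"\n'
)
converted = io.StringIO()
convert_booleans(sample, converted)
converted_lines = converted.getvalue().splitlines()
print("\n".join(converted_lines))
```

For real files, pass file objects opened with newline="" instead of the StringIO buffers.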

Or, much more comfortably, we could again use csvkit, which replaces the values automatically. The following command imports the data into our database. As we already created the table in advance, we can skip the table creation with the --no-create flag.

$: time csvsql --db "sqlite:////home/stefan/datawrangling/AirportDB.sqlite" --table "Airports" --insert airports.csv --no-create

real    0m11.161s
user    0m10.743s
sys    0m0.169s

This takes a little while, but after a few seconds, we have the data ready. We can then open the database and query our Airport data set.

sqlite> SELECT name,iso_country,iso_region FROM airports ORDER BY name DESC;
Run Time: real 3.746 user 0.087215 sys 0.167298
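To illustrate what the import into an existing table involves, here is a rough sketch with Python's sqlite3 module. It is not how csvsql works internally. The helper insert_rows, the two-column table and the sample rows are assumptions for illustration. The sketch also shows that the CHECK constraint from our schema rejects a literal yes.

```python
import csv
import io
import sqlite3


def insert_rows(conn, table, source):
    """Insert all CSV rows from source into an already existing table."""
    reader = csv.reader(source)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    sql = f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})'
    with conn:  # one transaction: commit on success, roll back on error
        conn.executemany(sql, reader)


conn = sqlite3.connect(":memory:")
# Shortened, hypothetical version of the Airports table
conn.execute(
    'CREATE TABLE "Airports" ('
    "ident VARCHAR(7) NOT NULL, "
    "scheduled_service BOOLEAN NOT NULL, "
    "CHECK (scheduled_service IN (0, 1)))"
)

insert_rows(
    conn, "Airports", io.StringIO("ident,scheduled_service\nX1,1\nX2,0\n")
)

try:
    insert_rows(
        conn, "Airports", io.StringIO("ident,scheduled_service\nX3,yes\n")
    )
    rejected = False
except sqlite3.IntegrityError:
    # 'yes' is neither 0 nor 1, so the CHECK constraint fails
    rejected = True

row_count = conn.execute('SELECT COUNT(*) FROM "Airports"').fetchone()[0]
print(row_count, rejected)
```

Wrapping all inserts in a single transaction avoids a commit per row, which usually matters for larger files.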

With the data in SQLite, you can use database features such as indices to speed up the data processing. If we run the same query on the CSV file and within SQLite, the advantage becomes obvious once we omit the command line output, for instance by querying the COUNT of the airport names.

$: time csvsql -d ',' --query="SELECT COUNT(name) FROM airports" airports.csv

real	0m11.068s
user	0m10.977s
sys	0m0.086s

-- SQLite
sqlite> .timer on
sqlite> SELECT COUNT(name) FROM Airports;
Run Time: real 0.014 user 0.012176 sys 0.001273
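The indices mentioned above can be tried out quickly. The following sketch, assuming Python 3 and an in-memory database with generated dummy rows, prints the query plan before and after creating an index. The table layout and the index name idx_airports_country are chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Airports (ident TEXT, name TEXT, iso_country TEXT)")
# Generated dummy rows, not the real airport data
conn.executemany(
    "INSERT INTO Airports VALUES (?, ?, ?)",
    [(f"ID{i}", f"Airport {i}", "AT" if i % 2 else "DE") for i in range(1000)],
)


def query_plan(sql):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


query = "SELECT name FROM Airports WHERE iso_country = 'AT'"

plan_before = query_plan(query)
conn.execute("CREATE INDEX idx_airports_country ON Airports (iso_country)")
plan_after = query_plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

The exact wording of the plan depends on the SQLite version, but the second plan should mention the index while the first one reports a full table scan.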

Create an ER Diagram of an Existing SQLite Database (or many other RDBMS)

Visualisation helps in solving problems and is therefore an important tool in database design. Many database vendors offer product-specific tools for re-engineering existing schemata, but self-contained, serverless, embedded relational database management systems (RDBMS) such as SQLite often come without much tooling support. The extremely small footprint of SQLite makes it a very powerful tool for implementing database-driven applications without the hassle of database administration, user privilege management and other demanding tasks that come with more complex systems. SQLite does not ship with a workbench-like tool, but we can use the open source SchemaCrawler for analysing database schemata and table relationships. The tool provides a plethora of commands and options. In this post we will only cover the diagramming part, which allows creating ER diagrams of the tables.

After downloading and extracting the tool to your local drive, you will find a lot of examples included. The tool can handle SQLite, Oracle, MS SQL Server, IBM DB2, MySQL, MariaDB, Postgres and Sybase database servers and is therefore very versatile. You will need Java 8 in order to run it. Have a look at the script below, which creates a PNG image of the database schema of the Chinook test database.

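# The path of the unzipped SchemaCrawler directory (placeholder, adapt to your system)
SchemaCrawlerPATH=/path/to/schemacrawler
# The path of the SQLite database (placeholder, adapt to your system)
SQLiteDatabaseFILE=/path/to/chinook.sqlite
# The type of the database system.
RDBMS=sqlite
# Where to store the image (placeholder, adapt to your system)
OutputPATH=/path/to/output.png
# Username and password need to be empty for SQLite
USER=""
PASSWORD=""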

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -server=${RDBMS} -database=${SQLiteDatabaseFILE} -outputformat=png -outputfile=${OutputPATH} -command=graph -infolevel=maximum -user=${USER} -password=${PASSWORD}

The SchemaCrawlerPATH variable contains the path to the directory where we unzipped the SchemaCrawler files. This is needed in order to load all the required libraries into the classpath. We then specify the SQLite database file, define the RDBMS and provide an output path where we store the image. Additionally, we provide an empty user name and password. SQLite does not provide user authentication, so those two parameters need to be empty and SchemaCrawler simply ignores them. Then we can execute the command and the tool generates the PNG of the ER diagram for us.
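To get a feeling for the information such a diagram is built from, the following sketch reads tables and foreign keys from SQLite's own catalogue and writes a Graphviz DOT description. This is not how SchemaCrawler is implemented, and the result is far simpler than its diagrams. The function schema_to_dot and the two tables, loosely modelled on Chinook, are assumptions for illustration.

```python
import sqlite3


def schema_to_dot(conn):
    """Describe tables and foreign key relations as Graphviz DOT text."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    lines = ["digraph schema {", "  node [shape=box];"]
    for table in tables:
        lines.append(f'  "{table}";')
    for table in tables:
        # Result columns of the pragma: id, seq, table, from, to, ...
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            lines.append(f'  "{table}" -> "{fk[2]}" [label="{fk[3]}"];')
    lines.append("}")
    return "\n".join(lines)


conn = sqlite3.connect(":memory:")
# Two hypothetical tables, loosely modelled on the Chinook database
conn.executescript(
    """
    CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Album (
        AlbumId INTEGER PRIMARY KEY,
        Title TEXT,
        ArtistId INTEGER REFERENCES Artist (ArtistId)
    );
    """
)
dot = schema_to_dot(conn)
print(dot)
```

If Graphviz is installed, the printed text can be saved to a file and rendered, for example with dot -Tpng.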

You can also find a lot of examples online, which give you an overview of the features of this tool. One of the main purposes of SchemaCrawler is to generate diffable text outputs of database schemata. In combination with a version control tool such as Git or Subversion, you can create clean and usable reports of your databases and keep track of the changes. You can retrieve an overview of the options with the following command.

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -?

You can also generate HTML reports with the following command:

java -classpath $(echo ${SchemaCrawlerPATH}/_schemacrawler/lib/*.jar | tr ' ' ':') schemacrawler.Main -server=${RDBMS} -database=${SQLiteDatabaseFILE} -outputformat=html -outputfile=report.html -command=details -infolevel=maximum -user=${USER} -password=${PASSWORD}

Other available output formats are plain text, CSV or JSON.
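The idea behind diffable schema output can be sketched without SchemaCrawler as well: dump the schema in a stable order and compare two dumps line by line. The following example assumes Python 3 with the built-in sqlite3 and difflib modules. The function schema_dump and the tiny table are made up for illustration, and the output is much less detailed than a SchemaCrawler report.

```python
import difflib
import sqlite3


def schema_dump(conn):
    """Return the schema as text lines in a stable, sorted order."""
    rows = conn.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE sql IS NOT NULL ORDER BY type, name"
    ).fetchall()
    lines = []
    for kind, name, sql in rows:
        lines.append(f"-- {kind} {name}")
        lines.extend((sql + ";").splitlines())
    return lines


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Airports (ident TEXT, name TEXT)")
before = schema_dump(conn)

# Simulate a schema change between two versions of the database
conn.execute("ALTER TABLE Airports ADD COLUMN iata_code TEXT")
after = schema_dump(conn)

diff = list(
    difflib.unified_diff(before, after, "schema_v1", "schema_v2", lineterm="")
)
print("\n".join(diff))
```

Committing such a dump to Git after every schema change gives the same kind of history with the regular diff tooling.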