This blog post explains the motivation behind building a SQLite virtual table extension for Parquet files. Go to cldellow/sqlite-parquet-vtable if you just want the code.

I was playing around with a project to visualize data from the 2016 Canada census. The raw data from Stats Canada is a 1291 MB CSV [1]. The CSV has 12,000,000 rows, each capturing metrics for men and women in a census area. Over 5,000 census areas are covered, with 2,200 dimensions in each. I wanted to find out what percentage of people travelled by bicycle in three particular cities.

My end goal is to make a standalone web site that lets you explore the data. Ideally, it’d be compact and could run on a dirt-cheap shared web host.

For my purposes, a 1291 MB CSV was too cumbersome. It’s not just large, it’s also slow to search; grepping for a single record takes the better part of a second:

cldellow@furfle:~$ time grep Dawson.Creek.*Bicycle census.csv
2016,"5955014","3","Dawson Creek",6.3,8.3,"00101","CY","5955014","Bicycle",1935,,25,25,0

real    0m0.858s

So what are my options? I considered my three favourite database technologies: Postgres, SQLite and Parquet files.

Postgres

Postgres is a great general purpose tool. I always start with it.

Stringly typed

Let’s start by shoving the data into Postgres as-is, treating everything as a string. This is quick to implement:

CREATE TABLE strings(
  year text, geo_code_por text, geo_level text, geo_name text, gnr text,
  gnr_lf text, data_quality_flag text, csd_type_name text, alt_geo_code text,
  profile text, profile_id text, notes text, total text, male text,
  female text
);

COPY strings(year, geo_code_por, geo_level, geo_name, gnr, gnr_lf,
  data_quality_flag, csd_type_name, alt_geo_code, profile, profile_id,
  notes, total, male, female )
FROM '/home/cldellow/src/csv2parquet/census.tsv';

\d+ shows that this creates a 1438 MB table. We can query the table, casting strings to numeric datatypes as needed:

postgres=# WITH inputs AS (
  SELECT
    geo_name,
    CASE WHEN profile_id = '1930' THEN 'total' ELSE 'cyclist' END AS mode,
    female,
    male
  FROM strings
  WHERE
    profile_id IN ('1930', '1935') AND
    csd_type_name = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  (100.0 * cyclist.male::int / total.male::int)::numeric(4,2) AS pct_male,
  (100.0 * cyclist.female::int / total.female::int)::numeric(4,2) AS pct_female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' AND cyclist.mode = 'cyclist';

   geo_name   | male | female | pct_male | pct_female 
--------------+------+--------+----------+------------
 Dawson Creek | 25   | 0      |     0.86 |       0.00
 Kitchener    | 905  | 280    |     1.51 |       0.51
 Victoria     | 2650 | 2130   |    12.57 |       9.73
(3 rows)

Time: 723.383 ms

Not bad! No surprise: my rural hometown of Dawson Creek has almost no cyclists, the university town I live in has some, and the hippie enclave of Victoria has double digit percentages.

The query is a bit slow, though – almost a second. If we add an index with CREATE INDEX ON strings(geo_name), the query takes only 10ms – but the index adds an extra 400 MB!
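
For reference, the index is a one-liner; a minimal sketch, with \di+ to check its size afterwards (Postgres auto-names the index, typically strings_geo_name_idx):

CREATE INDEX ON strings(geo_name);

-- in psql, check how much disk the new index costs us
\di+ strings_geo_name_idx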

Strongly typed

The CSV is denormalized. As such, it repeats a lot of the same values. For example, the string Dawson Creek appears 2,247 times. Total - Lone-parent census families in private households - 100% data appears 5,469 times. As you can imagine, that’s pretty wasteful. A better approach would be to normalize the data so that, rather than storing thousands of copies of the same value, we store one value and just point to it.

Usually, you do this by creating separate lookup tables for each set of values. That’s a lot of work for a blog post, so I’m going to compromise by using Postgres’s ENUM types. I’ll also start storing numeric values as numeric values, rather than strings.

I’m lazy, so we’ll bootstrap the enums and strongly-typed table from the existing strings table:

DO $$
BEGIN
  EXECUTE (
        SELECT format('CREATE TYPE geo_name AS ENUM (%s)',
                      string_agg(DISTINCT quote_literal(geo_name), ', '))
        FROM strings);

  EXECUTE (
        SELECT format('CREATE TYPE profile AS ENUM (%s)',
                      string_agg(DISTINCT quote_literal(md5(profile)), ', '))
        FROM strings);
END 
$$;

CREATE TABLE typed AS SELECT
  year::smallint,
  geo_code_por::int,
  geo_level::smallint,
  geo_name::geo_name,
  (CASE WHEN gnr = '' THEN NULL ELSE gnr END)::real AS gnr,
  (CASE WHEN gnr_lf = '' THEN NULL ELSE gnr_lf END)::real AS gnr_lf,
  data_quality_flag,
  csd_type_name,
  alt_geo_code,
  md5(profile)::profile,
  profile_id::smallint,
  (CASE WHEN notes = '' THEN NULL ELSE notes END)::smallint AS notes,
  (CASE WHEN total IN ('..', '...', 'x', 'F') THEN NULL ELSE total END)::real AS total,
  (CASE WHEN male IN ('..', '...', 'x', 'F') THEN NULL ELSE male END)::real AS male,
  (CASE WHEN female IN ('..', '...', 'x', 'F') THEN NULL ELSE female END)::real AS female
FROM strings;

\d+ shows that the newly-typed table is only 1147 MB (vs the untyped table’s 1438 MB). This is an improvement, but still a bit of a disappointment. Two things are hurting us: Postgres has a relatively high minimum per-row cost of about 40 bytes [2], and representing a small number like 280 as a number rather than a string is actually more storage intensive (in exchange, query-time arithmetic will be faster).
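
If you’d rather query the sizes than read \d+ output, pg_table_size reports the same thing; a quick sketch:

SELECT pg_size_pretty(pg_table_size('strings')) AS strings_size,
       pg_size_pretty(pg_table_size('typed'))   AS typed_size;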

Now our query on the unindexed-but-typed table takes only 550 ms (vs the unindexed-and-untyped table’s 720 ms). This is because enumerated types are much cheaper to compare than strings.

Even better, creating an index on geo_name only takes 263 MB (vs 400 MB).

SQLite

Postgres was very fast to slurp the data in and fast to query once we set up indexes, but it still resulted in a 1419 MB database. Can we do better?

CSV virtual table module

SQLite can expose a CSV file as a table, in place, via its CSV virtual table extension. Virtual tables are a lot like Postgres’s foreign data wrappers.

I don’t expect this will provide satisfactory query performance, but it’s so easy to try that we may as well:

sqlite> .load ./csv.so
sqlite> CREATE VIRTUAL TABLE strings USING csv(filename='/home/cldellow/src/csv2parquet/census.csv');
sqlite> WITH inputs AS (
  SELECT
    c3 AS geo_name,
    CASE WHEN c10 = 1930 THEN 'total' ELSE 'cyclist' END AS mode,
    c14 AS female,
    c13 AS male
  FROM strings
  WHERE c10 IN ( 1930, 1935) AND
    c7 = 'CY' AND
    geo_name IN ('Victoria', 'Dawson Creek', 'Kitchener')
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM inputs AS total
JOIN inputs AS cyclist USING (geo_name)
WHERE total.mode = 'total' and cyclist.mode = 'cyclist';

Kitchener|905|280|1.51299841176962|0.514563998897363
Victoria|2650|2130|12.5741399762752|9.73047053449064
Dawson Creek|25|0|0.863557858376511|0.0
Run Time: real 24.721 user 23.944000 sys 0.776000

Yuck. The columns are identified by ordinal position (for example, c3 instead of geo_name), and it’s slow as molasses: almost half a minute.

SQLite database

The main benefit of the virtual table module was that it skipped the initial data-loading step. Let’s try a more traditional approach and import the CSV into SQLite’s own database format.

cldellow@furfle:~$ sqlite3 census.sqlite
SQLite version 3.11.0 2016-02-15 17:29:24
Enter ".help" for usage hints.
sqlite> .mode tabs
sqlite> .import census.tsv census

This creates a 1235 MB SQLite DB file. D’oh. And running our query against it takes 3.4 seconds! At a cost of 268 MB, we can add an index on geo_name that cuts query time down to 16 ms.
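
That index is a one-liner too; a sketch, assuming the column imported from the TSV header ended up named geo_name (check .schema census for the actual name):

sqlite> CREATE INDEX census_geo_name ON census(geo_name);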

This is nicer than Postgres in that it’s more self-contained and doesn’t require us to administer a Postgres cluster, but it uses about the same amount of disk space.

SQLite database, indexed + squashed

Our intuition is that the SQLite database is very compressible, since it still has all the repetitive data. Since this is for a read-only webapp, let’s try using SquashFS to transparently compress it.

cldellow@furfle:~$ mkdir census
cldellow@furfle:~$ mv census.sqlite census
cldellow@furfle:~$ mksquashfs census/ census.squashfs -comp xz
[ ... buncha spam ... ]
cldellow@furfle:~$ mkdir /mnt/census
cldellow@furfle:~$ mount census.squashfs /mnt/census -t squashfs -o loop

The compressed file is 185 MB. The first query takes about 130 ms, which is not bad! It’s a bit slower because it has to uncompress the necessary pages from the file. Subsequent queries are about 30 ms, because the pages are in the kernel buffer cache.

Parquet

While easy to use, the previous solutions still use too much disk space! What if we explored Parquet files?

Parquet files partition your data into row groups, each of which contains some number of rows. Within a row group, data is stored (and compressed!) by column rather than by row. For example, if you have a dataset with 1,000 columns but only want to query the Name and Salary columns, a Parquet reader can efficiently skip the other 998. Each row group also records the minimum and maximum value of each column. If the file is sorted by Name into 10,000 row groups and you only want to see entries for Colin Dellow, you can easily skip row groups whose statistics show they don’t contain any matching entries.
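
To make that concrete with the census data, here is an illustrative sketch. It assumes the file is exposed as a SQL table named census (as it will be shortly) and that each city’s rows sit together in the file. A query like this only needs to decode three of the fifteen columns, and a reader that consults the row group statistics can skip groups whose geo_name range excludes 'Dawson Creek':

SELECT male, female
FROM census
WHERE geo_name = 'Dawson Creek';
-- column pruning: only the geo_name, male and female column chunks are read
-- row group pruning: groups whose geo_name min/max can't match are skipped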

One downside of Parquet files is that they’re usually used in “big data” contexts: you need heavy-duty infrastructure like a Hive cluster to read them. Recent work, like the Apache Arrow and parquet-cpp projects, is changing this. I built a SQLite virtual table extension for Parquet on top of these libraries [3].

Stringly typed

Let’s create a Parquet file from our CSV! To begin, we’ll treat every column as a string. I created it with row groups of 20,000 rows, hoping that would allow us to efficiently query for a given city, since a city’s 2,247 entries should usually be entirely contained within a single row group. I used the brotli compression codec, which I found gave slightly better compression than Snappy without sacrificing query speed.

It results in a 42 MB file. That’s about 3% of the size of the CSV! But is it fast to query?

We’ll change the query a bit – the original one is optimized for Postgres’s treatment of CTEs.

sqlite> .load parquet/libparquet.so
sqlite> CREATE VIRTUAL TABLE census USING parquet('./census.parquet');
sqlite> WITH total AS (
  SELECT
    geo_name,
    female,
    male
  FROM census
  WHERE
    geo_name in ('Dawson Creek', 'Victoria', 'Kitchener') AND
    csd_type_name = 'CY' AND
    profile_id = '1930'
), cyclist as (
  SELECT
    geo_name,
    female,
    male
  FROM census
  WHERE
    geo_name in ('Dawson Creek', 'Victoria', 'Kitchener') AND
    csd_type_name = 'CY' AND 
    profile_id = '1935'
)
SELECT
  total.geo_name,
  cyclist.male,
  cyclist.female,
  100.0 * cyclist.male / total.male,
  100.0 * cyclist.female / total.female
FROM total
JOIN cyclist USING (geo_name);

Dawson Creek|25|0|0.863557858376511|0.0
Kitchener|905|280|1.51299841176962|0.514563998897363
Victoria|2650|2130|12.5741399762752|9.73047053449064
Run Time: real 0.038 user 0.040000 sys 0.000000

40ms! Not as fast as Postgres with an index, but not bad considering how much less disk space it uses.

Strongly typed

Because Parquet is so good at encoding and compressing repetitive values, creating a typed Parquet file isn’t actually much of a savings: it decreases the file size by only 600 KB.

Still, it’s useful to do, because it results in more meaningful statistics. For example, the string "2" is considered larger than the string "10", because 2 comes after 1 in the alphabets computers use.
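
You can see this directly in SQLite; a quick illustration comparing text and numeric comparison:

sqlite> SELECT '2' > '10', 2 > 10;
1|0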

Conclusion

By using SQLite and Parquet files, you can achieve great compression without sacrificing too much query time. You can also pair the extension with xerial’s excellent JDBC connector for SQLite if you’d like to expose the data in a JVM application.

Give Parquet files and SQLite some thought the next time you’re building a standalone webapp.


  [1] Dataset 98-401-X2016055, if you want to dig it up yourself.

  [2] See Erwin Brandstetter’s excellent Stack Overflow answer on this topic.

  [3] Why SQLite? Remember that this is in the context of a hobby app running in a constrained environment. I explored a custom Presto connector that would let it read Parquet files from the local file system, but didn’t like the overhead requirements. I also considered writing a custom table function for Apache Derby and a user-defined table for H2 DB. These had less overhead than Presto, but were architecturally much more primitive than both Presto and SQLite, and I was skeptical I could get good performance from them.