Monday, February 15, 2016

Databases and big data

Just some initial thoughts on the database side of things.  This is mostly from the perspective of a single person or small group doing research that needs to use these database technologies to try to get use / meaning out of their data.

There is a lot of hype around "big data" these days.  The reality is that what is big data for one person is little data for another.  It all depends on what resources are available for the problem.  I work at Google (this post is my take, not Google's) and I sometimes work with databases that are Petabytes in size.  But, size really should not be the metric for big data.  A crazy complex database that fits complete in memory (say <10GB) can be way more challenging that a well organized and rather boring multi-petabyte database.  Things like how you want to access the data mater greatly.  Think about the different between random row accesses, complete table traversal, and in-order table traversal.  How will the database change?  Will you be only reading, appending, or doing random updates?  What parts of the database will be used for search (if there is search)?  How uniform is the data?  Will you do joins inside the database and if so why types?  The load they put on the database and how you can optimize are all different for the types of data.  The more flexibility there is, the more difficult it is for the database to be efficient.

What are some of the types of databases?

  • Dumb files.  Think CSV and things you can use grep, awk, sed, perl, python etc on.
  • Serialization systems - e.g. XML, JSON, BSON, Yaml, MessagePack, Protobuf,  Avro, Thrift, etc.
  • Filesystems as databases - Content Addressable Storage (CAS) or just things like Google Cloud Storage / S3 as keys with data blobs with the filename (could be a hash) as the key
  • Spreadsheets as databases - e.g. Pandas, Google Sheets, OpenOffice/LibreOffice sheets or databases
  • Simple hash lookups: gdbm, dbm, Google LevelDB - e.g. Using user id's to check a password hash.
  • memcache - Worth a separate mention.  This is used to speed up lots of other systems and is basically a large scale version of the memoize concept of caching results of computations or lookups.
  • SQLite - The most stripped down SQL like database system based on a single file
  • Tradition SQL databases - MySQL/MariaDB, PostgreSQL, Oracle, Ingres, etc.  Working with rows
  • NoSQL databases - Think key lookups like gdbm but designed to contain all sorts of extra craziness.  BigTable, MongoDBGoogle DataStoreCouchDBCassandraRiak, etc
  • Column Oriented databases: Google Dremel / BigQueryApache HBase
  • And so many more... graph databases, databases embedded in other product, object stores, NewSQL, yada yada.
Each system has its strengths and weaknesses.

Which one(s) should you pick for a project?

This really boils down to requirements.  In all likelihood, you will end up using multiple systems for a project, sometimes for the same data, sometimes for subsets and sometimes for different data.

Even in Apache, the project directory lists 25 database entries.   For Python, PyPi Database Engines has > 250 entries.

Things like which license (many people can't use AGPL software), platform (Windows, Mac OSX, Linux, others) and program languages supported (both clients and stored procedures) will help restrict things.  Does it have spatial support and does OGC Spatial support matter?  Do you need features like row history?  How much supporting infrastructure is required and how many people do you need to have on staff for the system?  How fast do you need responses from the system?  How much money are you willing to spend to make life simpler?  Which databases does your hosting service provide or are you self hosting?  It's also possible that you pick a database and that database is implemented in terms of another database type.

It is worth while to pick a working set and get to know that.  Once you know a few, it will be easier to learn new systems.  Here is my take on a working set that might get someone starting out going with a single Ubuntu machine and little money to start.
  1. Text/CSV files with grep and python's CSV module
  2. SQLite - This is the simples "sort of SQL" database out there.  A good place to learn simple SQL.  RasterLite & Spatialite for spatial data
  3. PostgreSQL + PostGIS for a rigorous SQL environment.
  4. CouchDB + GeoCouch for a starter NoSQL.  I'm least sure about this
  5. Memcached to speed up all of the above when you hit performance problems that have repeated queries
If you were looking for cloud hosted setup, you will like want to keep everything in one companies dataset.  You can setup virtual machines to do pretty much anything you want if you have the time.  For minimal setup time, you could, for example, choose the Google stack of options:
  1. Cloud Storage for storing blobs with keys (aka paths) for things that don't change often
  2. SQL for MySQL
  3. DataStore for simple NoSQL
  4. BigTable for heavy weight NoSQL
  5. BigQuery for an SQL like column orient database for massive speed
  6. memcache to speed up common queries 
e.g. For the All the Ships in the World demo (2015 preso), we combined Cloud Storage, BigQuery, DataStore and MemCache. 

System diagram for 2013 All the Ships in the World

For a small acadeic research team, I might suggest focusing on something like IPython/Jupyter Notebooks with Matplotlib/basemap, Pandas, SciPy, scikit-learn, and other python libraries talking to MySQL/MariaDB/PostgreSQL (pick one) and Google BigQuery databases for the main data.  You would then put the results of common queries in DataStore for fast and cheap access.  Longer term storage of data would be in Cloud Storage.  Local caching of results can be done in the Numpy binary format, HDF5, or SQLite via Pandas.  You then would have a slick analysis platform that can use the big data systems to subset the data to work with locally.  You could even run the python stack in a cloud hosted VM and just use a web browser locally to interact with a remove Notebook.  Keeping as much as possible in the data center prevents having to send too much over the network back to your local machine.

The tool options and variations are endless.  I've left out so many interesting and useful technologies and concepts, but I wanted to get something out.

No comments:

Post a Comment