Monday, February 15, 2016

Databases and big data

Just some initial thoughts on the database side of things.  This is mostly from the perspective of a single person or small group doing research that needs to use these database technologies to get use and meaning out of their data.

There is a lot of hype around "big data" these days.  The reality is that what is big data for one person is little data for another.  It all depends on what resources are available for the problem.  I work at Google (this post is my take, not Google's) and I sometimes work with databases that are petabytes in size.  But size really should not be the metric for big data.  A crazy complex database that fits completely in memory (say <10GB) can be way more challenging than a well organized and rather boring multi-petabyte database.  Things like how you want to access the data matter greatly.  Think about the difference between random row accesses, complete table traversal, and in-order table traversal.  How will the database change?  Will you only be reading, appending, or doing random updates?  What parts of the database will be used for search (if there is search)?  How uniform is the data?  Will you do joins inside the database, and if so, what types?  The load these put on the database, and how you can optimize for it, are different for each kind of access.  The more flexibility there is, the more difficult it is for the database to be efficient.
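
To make the access-pattern point concrete, here is a minimal sketch using Python's standard sqlite3 module.  It times a full-table scan against a batch of random primary-key lookups on the same data; the table and column names are just placeholders for illustration.

import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO readings (id, value) VALUES (?, ?)",
    ((i, random.random()) for i in range(100000)))
conn.commit()

# Complete table traversal: one pass over every row.
start = time.time()
conn.execute("SELECT AVG(value) FROM readings").fetchone()
print("full scan:", time.time() - start)

# Random row access: many small lookups through the primary key.
start = time.time()
for _ in range(1000):
    key = random.randrange(100000)
    conn.execute("SELECT value FROM readings WHERE id = ?", (key,)).fetchone()
print("1000 random lookups:", time.time() - start)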

What are some of the types of databases?

  • Dumb files.  Think CSV and things you can use grep, awk, sed, perl, python, etc. on.
  • Serialization systems - e.g. XML, JSON, BSON, YAML, MessagePack, Protobuf, Avro, Thrift, etc.
  • Filesystems as databases - Content Addressable Storage (CAS), or things like Google Cloud Storage / S3 used as key-value stores, with the filename (which could be a hash) as the key and the data blob as the value
  • Spreadsheets as databases - e.g. Pandas, Google Sheets, OpenOffice/LibreOffice sheets or databases
  • Simple hash lookups: gdbm, dbm, Google LevelDB - e.g. using user IDs to look up a password hash (see the short sketch after this list)
  • memcache - Worth a separate mention.  This is used to speed up lots of other systems and is basically a large scale version of the memoize concept of caching the results of computations or lookups.
  • SQLite - The most stripped down SQL-like database system, based on a single file
  • Traditional SQL databases - MySQL/MariaDB, PostgreSQL, Oracle, Ingres, etc.  Working with rows
  • NoSQL databases - Think key lookups like gdbm, but designed to handle all sorts of extra craziness.  BigTable, MongoDB, Google DataStore, CouchDB, Cassandra, Riak, etc.
  • Column oriented databases: Google Dremel / BigQuery, Apache HBase
  • And so many more... graph databases, databases embedded in other products, object stores, NewSQL, yada yada.
Each system has its strengths and weaknesses.
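
As a tiny, concrete example of the "simple hash lookups" bullet, here is a sketch using Python's standard dbm module.  The file name, keys, and hashes are made up for illustration; a real setup would obviously use proper password hashing.

import dbm

db = dbm.open("users.db", "c")           # "c" creates the file if needed
db[b"user:1001"] = b"sha256$deadbeef"    # store a (fake) password hash
db[b"user:1002"] = b"sha256$cafef00d"
db.close()

db = dbm.open("users.db", "r")
print(db[b"user:1001"])                  # direct key lookup, no scan
print(list(db.keys()))                   # all stored keys
db.close()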

Which one(s) should you pick for a project?

This really boils down to requirements.  In all likelihood, you will end up using multiple systems for a project, sometimes for the same data, sometimes for subsets and sometimes for different data.

Even within the Apache Software Foundation, the project directory lists 25 database entries.  For Python, the PyPI Database Engines category has > 250 entries.

Things like which license (many people can't use AGPL software), platform (Windows, Mac OSX, Linux, others), and programming languages supported (both for clients and stored procedures) will help narrow things down.  Does it have spatial support, and does OGC spatial support matter?  Do you need features like row history?  How much supporting infrastructure is required, and how many people do you need to have on staff for the system?  How fast do you need responses from the system?  How much money are you willing to spend to make life simpler?  Which databases does your hosting service provide, or are you self hosting?  It's also possible that you pick a database and find it is implemented on top of another database type.

It is worthwhile to pick a working set and get to know it.  Once you know a few systems, it will be easier to learn new ones.  Here is my take on a working set that might get someone starting out going with a single Ubuntu machine and little money to start.
  1. Text/CSV files with grep and python's CSV module
  2. SQLite - This is the simplest "sort of SQL" database out there.  A good place to learn simple SQL.  RasterLite & SpatiaLite for spatial data
  3. PostgreSQL + PostGIS for a rigorous SQL environment.
  4. CouchDB + GeoCouch for a starter NoSQL.  I'm least sure about this one.
  5. Memcached to speed up all of the above when you hit performance problems from repeated queries (a small sketch of this pattern follows the list)
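
Since item 5 always comes up, here is a rough sketch of putting a repeated (and presumably slow) PostgreSQL query behind Memcached.  It assumes the psycopg2 and python-memcached packages, a local memcached instance, and a made-up "research" database with an "observations" table; none of those names mean anything beyond this example.

import json
import memcache
import psycopg2

mc = memcache.Client(["127.0.0.1:11211"])
conn = psycopg2.connect("dbname=research")

def monthly_counts(year):
    key = "monthly_counts:%d" % year
    cached = mc.get(key)
    if cached is not None:
        return json.loads(cached)            # answered from the cache
    cur = conn.cursor()
    cur.execute(
        "SELECT date_trunc('month', ts), count(*) FROM observations "
        "WHERE extract(year FROM ts) = %s GROUP BY 1 ORDER BY 1", (year,))
    rows = [(str(month), count) for month, count in cur.fetchall()]
    mc.set(key, json.dumps(rows), time=3600)  # keep the answer for an hour
    return rows

print(monthly_counts(2015))
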
If you are looking for a cloud hosted setup, you will likely want to keep everything within one company's offerings.  You can set up virtual machines to do pretty much anything you want if you have the time.  For minimal setup time, you could, for example, choose the Google stack of options:
  1. Cloud Storage for storing blobs with keys (aka paths) for things that don't change often
  2. Cloud SQL for MySQL
  3. DataStore for simple NoSQL
  4. BigTable for heavy weight NoSQL
  5. BigQuery for an SQL-like, column-oriented database with massive speed (see the query sketch after this list)
  6. memcache to speed up common queries 
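
For BigQuery in particular, here is a minimal sketch of pulling a subset into a local DataFrame with pandas' read_gbq.  The project id, dataset, and table names are placeholders, and it assumes you have the pandas BigQuery dependencies installed and are authenticated against the project.

import pandas as pd

query = """
SELECT mmsi, COUNT(*) AS positions
FROM [my-project:ais.positions]
GROUP BY mmsi
ORDER BY positions DESC
LIMIT 100
"""
df = pd.read_gbq(query, project_id="my-project")
print(df.head())
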
e.g. For the All the Ships in the World demo (2015 preso), we combined Cloud Storage, BigQuery, DataStore, and memcache.

System diagram for 2013 All the Ships in the World

For a small academic research team, I might suggest focusing on something like IPython/Jupyter Notebooks with Matplotlib/Basemap, Pandas, SciPy, scikit-learn, and other Python libraries talking to MySQL/MariaDB/PostgreSQL (pick one) and Google BigQuery databases for the main data.  You would then put the results of common queries in DataStore for fast and cheap access.  Longer term storage of data would be in Cloud Storage.  Local caching of results can be done in the NumPy binary format, HDF5, or SQLite via Pandas.  You would then have a slick analysis platform that can use the big data systems to subset the data down to something you can work with locally.  You could even run the Python stack in a cloud hosted VM and just use a web browser locally to interact with a remote Notebook.  Keeping as much as possible in the data center avoids sending too much over the network back to your local machine.
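
Here is a sketch of that local-caching idea: a result set (faked here as a tiny DataFrame) gets parked on disk in HDF5 and SQLite so repeated analysis runs don't have to go back to the remote database.  The file and table names are made up, and the HDF5 path needs the PyTables package installed.

import sqlite3
import pandas as pd

results = pd.DataFrame({"mmsi": [366123456, 367000123], "positions": [8421, 512]})

# Cache in HDF5 via Pandas (requires PyTables).
results.to_hdf("cache.h5", "ship_counts")
cached_hdf = pd.read_hdf("cache.h5", "ship_counts")

# Cache in SQLite via Pandas.
conn = sqlite3.connect("cache.sqlite")
results.to_sql("ship_counts", conn, if_exists="replace", index=False)
cached_sql = pd.read_sql("SELECT * FROM ship_counts", conn)

print(cached_hdf.equals(results), len(cached_sql))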

The tool options and variations are endless.  I've left out so many interesting and useful technologies and concepts, but I wanted to get something out.

Thursday, January 28, 2016

QR Codes on the roofs of buildings

QR codes on the roofs of buildings could be used for automatic calibration of imagery from aircraft or spacecraft.  If only this QR code linked to machine readable georeferencing information.  This one is at the Naval Postgraduate School in Monterey, CA.

Image from Google Maps.
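
Purely as an illustration of what "machine readable georeferencing information" might look like, here is a sketch that packs a made-up payload into JSON and renders it with the qrcode package (needs qrcode and Pillow installed).  The fields, coordinates, and target size are invented for this example; there is no standard behind them.

import json
import qrcode

payload = {
    "target_id": "roof-target-01",   # made-up identifier
    "lat": 36.5946,                  # approximate target center, WGS84
    "lon": -121.8745,
    "size_m": 10.0,                  # side length of the painted square
    "epsg": 4326,
}
img = qrcode.make(json.dumps(payload))
img.save("roof_target_qr.png")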



Update 2017-May-25:

More info here...

https://wiki.nps.edu/display/NOW/Optical+Signaling
https://www.gearthblog.com/tag/calibration-targets

Thursday, January 21, 2016

HF-AIS... what????

Or rather wat?

http://www.businesswire.com/news/home/20160113005517/en/Advanced-core-AIS-technology-innovation-global-leaders

"advanced DSP core technology which enables an AIS transceiver to reliably and accurately receive and decode every AIS transmission in real time"


So... AIS is two 9,600 baud channels.  Today that is like crazy slow, so why is special DSP technology needed?  Seriously?  Why?  No really... I am totally not understanding.  These are 25 kHz max channels, which is nothing in today's computational landscape.  Even if you decided to do beam steering and create multiple synthetic channels, this really isn't much.  Shine Micro has had the RadarPlus SM1680 for > 7 years with 8 receivers.  I think that uses GNU Radio inside.
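
The back-of-envelope numbers behind that complaint (the 4x oversampling factor is just an assumption for illustration):

channels = 2
bits_per_second = 9600
channel_bandwidth_hz = 25000
sample_rate_hz = 4 * channel_bandwidth_hz   # a generous complex sample rate

print("raw AIS payload:", channels * bits_per_second // 8, "bytes/sec")   # 2,400
print("samples to process:", channels * sample_rate_hz, "samples/sec")    # 200,000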

Thursday, December 10, 2015

exactEARTH on Satellite AIS

http://video.esri.com/watch/4884/advances-in-global-ship-tracking

As I listen to this video, I'm amused by the claim that satellite AIS has been around for 5 years.  Really?  I first got data from SpaceQuest in 2009, and I know that there were receivers up before 2009.  The US spooks had one up pre-2007.  I am still bothered by the ABSEA Class B-ish thing, as they don't publish a spec that tells me anything useful about what it really is.  And yes, I've talked to the folks doing this, and I still don't know what it really is at a technical level.

The discussion of other sorts of data is interesting.  Finally... IMO 289/290 came out back in 2010.  exactEARTH was not at the RTCM SC121 meetings in 2007-2009, despite the meetings being open to anyone in the community.  So it's great that they are talking about icebergs and fish catch, but it's not new.  Hey, there is also the Voluntary Observing Ship (VOS) program, which demonstrates how this works for sending back super important data about our world.

The preso does have some nice info about satellite AIS.  Just remember that ERMA had Orbcomm satellite AIS data integrated for the Deepwater Horizon oil spill response back in 2010, so this stuff isn't that new.  But it is getting easier and better.

Wednesday, December 9, 2015

The joys of being on the internet

The endless brute force attacks that make up the intertubes...  so kind of them to rate limit the attempts.
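
If you want to see just who is knocking and how often, a few lines of Python over the auth log will tally failed logins per source IP.  This is a quick sketch: the log path and message format assume a stock Ubuntu sshd setup like the excerpt below.

import collections
import re

pattern = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")
counts = collections.Counter()
with open("/var/log/auth.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for ip, count in counts.most_common(10):
    print(count, ip)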

Jan 25 08:08:55 tide3 sshd[8192]: Failed password for invalid user alex from 92.61.46.145 port 50959 ssh2
Jan 25 08:10:52 tide3 sshd[8195]: Invalid user arbab from 92.61.46.145
Jan 25 08:10:52 tide3 sshd[8195]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:10:52 tide3 sshd[8195]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145 
Jan 25 08:10:54 tide3 sshd[8195]: Failed password for invalid user arbab from 92.61.46.145 port 42321 ssh2
Jan 25 08:12:52 tide3 sshd[8198]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145  user=backup
Jan 25 08:12:54 tide3 sshd[8198]: Failed password for backup from 92.61.46.145 port 52016 ssh2
Jan 25 08:14:51 tide3 sshd[8201]: Invalid user bob from 92.61.46.145
Jan 25 08:14:51 tide3 sshd[8201]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:14:51 tide3 sshd[8201]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145 
Jan 25 08:14:53 tide3 sshd[8201]: Failed password for invalid user bob from 92.61.46.145 port 38871 ssh2
Jan 25 08:16:52 tide3 sshd[8204]: Invalid user christian from 92.61.46.145
Jan 25 08:16:52 tide3 sshd[8204]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:16:52 tide3 sshd[8204]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145 
Jan 25 08:16:54 tide3 sshd[8204]: Failed password for invalid user christian from 92.61.46.145 port 36075 ssh2
Jan 25 08:17:01 tide3 CRON[8207]: pam_unix(cron:session): session opened for user root by (uid=0)
Jan 25 08:17:01 tide3 CRON[8207]: pam_unix(cron:session): session closed for user root
Jan 25 08:18:51 tide3 sshd[8210]: Invalid user cisco from 92.61.46.145
Jan 25 08:18:51 tide3 sshd[8210]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:18:51 tide3 sshd[8210]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145 
Jan 25 08:18:53 tide3 sshd[8210]: Failed password for invalid user cisco from 92.61.46.145 port 58544 ssh2
Jan 25 08:20:51 tide3 sshd[8213]: Invalid user cusadmin from 92.61.46.145
Jan 25 08:20:51 tide3 sshd[8213]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:20:51 tide3 sshd[8213]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145 
Jan 25 08:20:53 tide3 sshd[8213]: Failed password for invalid user cusadmin from 92.61.46.145 port 53667 ssh2
Jan 25 08:22:51 tide3 sshd[8216]: Invalid user david from 92.61.46.145
Jan 25 08:22:51 tide3 sshd[8216]: pam_unix(sshd:auth): check pass; user unknown
Jan 25 08:22:51 tide3 sshd[8216]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=92.61.46.145