Monday, October 10, 2016

Maritime cyber security

Please someone tell me why this is not just totally stupid.

http://www.marinelink.com/news/cybersecurity-maritime416578

"Cybersecurity incursions and threats to the Marine Transportation System (MTS) and port facilities throughout the country are increasing," said Dr. Hady Salloum, Director of MSC. "This research project will support the missions of the DHS Center of Excellence and the U.S. Coast Guard to address these concerns and vulnerabilities and will identify policies and risk management strategies to bolster the cybersecurity posture of the MTS enterprise."
I don't think there is anything unique about the maritime environment when it comes to computer security.  It's just like other industrial control situations with the twist that you have people living at the "site."  That isn't really even a twist as there are lots of jobs where people live at the site.  The main problems I have seen or heard of on ships are no different than just basic computer security issues.

I take this a bit personally because of an incident back in 2002.  I was on the RVIB Palmer (NSF's ice breaker) as a scientist.  The head of shipboard computing pulled me aside one morning with the statement, "If you tell me what you did, I won't press charges."  Needless to say, I had no idea what he was referring to.  After quite the standoff, he finally revealed that most of the ship's network had gone down.  He was accusing me of hacking into the ship's systems and trashing things.  After even more of a standoff, he said the logs showed me breaking in.  I tried to explain all that I had done on the ship: with permission from the other tech first, I had logged into a machine with the navigation logs on it and written a shared memory reader program to write out a CSV file of the ship's track.  He didn't believe me.  He said I was the only person on the ship that knew how to do this kind of malicious thing.  I finally turned my back on him and walked off.  Many hours later, he found me to apologize.  He said a Windows virus had gotten in through someone's email and had run rampant on the system.  At the time, I only had a Linux laptop and was using the Solaris workstations on the ship...  he said he now understood that it wasn't me.  I very much appreciated the apology with the accompanying explanation, but not his initial reaction.

So when it comes to computer security on ships, the things I see are:


  • Get the software vendors to provide systems that are better tested and audited.  Which ECDIS has user visible unittests and integration tests?  Which use all the available static analyzers like Coverity and Clang Static Analyzer? 
  • Force all the specifications for software and hardware to be public without paywalls
  • Consider using open source code where anyone can audit the system
  • Switch away from Windows to stripped down OSes with the components needed for the task
  • Teach your users about safely using computers
  • Run system and network scanners
  • Ensure that updates are done through proper techniques with signed data and apps
  • Hardware watchdogs on machines
  • Too many alarms on the bridge for irrelevant stuff
  • Stupid expensive connectors and custom protocols blocking using industry standard tools
Common issues I've heard of:
  • Crew browsing the internet and watching videos from critical computers getting viruses
  • Updates of navigation data opening machines up for being owned
  • Corrupt data going unchecked through the network (e.g. NMEA has terrible integrity checking)
  • Windows machines being flaky and operations guides saying to continuously check that the clock changes to make sure the machine has not crashed
  • Single points of failure by banning things like hand held GPSes or mobile devices
  • Devices not functioning, that are impossible to diagnose / fix except by the manufacturer
  • Updates taking all the ships bandwidth because every computer wants to talk to home
  • Nobody on the ship knowing how systems work and the installer has left the company
  • TWIC cards... spending billions for no additional security
  • No easy two factor authentication for ships systems
  • Being unable to connect data from system A to system B because of one or more reasons
    • No way to run cables without massive cost
    • Incompatible connectors
    • Incompatible data protocols
    • Proprietary data protocols (e.g. NMEA paywalled BS)
  • Inability to do the correct thing because a standard says you can't.  e.g. IMO, IALA, ECDIS, NMEA, USCG, etc. are full of standards designed by people who did not know what they were doing w.r.t. engineering.  Just because you are an experienced mariner doesn't mean you know anything about designing hardware or software.
  • Interference between devices with no way for the crew to test
  • Systems being locked down in the incorrect configuration.  e.g. USCG's no-change rule for Class-B AIS
  • No consequences for incorrect systems and no positive feedback systems for fixing them.  e.g. why can't it be a part of the standard watch guides to check your own AIS with another ship or the port?  The yearly check of AIS settings guideline is dumb.
  • People hooking the wrong kind of electrical system to the ship
  • No plans to the ship, so people make assumptions about what is where and how it should work.  As-builts don't have xray vision.  Open ship designs would be a huge win
  • Ability of nearby devices to access the ship's systems
  • And so on....
Sounds just like cars, IOT devices, home automation, etc.
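
To make the NMEA integrity complaint concrete, here is a minimal Python sketch of the NMEA 0183 checksum, which is an 8-bit XOR of the characters between the leading `!` or `$` and the `*`. The function names are made up for illustration. Because XOR does not care about order, swapping two characters in a sentence leaves the checksum unchanged, so that kind of corruption passes unnoticed.

```python
from functools import reduce


def nmea_checksum(body: str) -> str:
    """Return the two-hex-digit XOR checksum of an NMEA sentence body.

    `body` is the text between the leading '!' or '$' and the '*'.
    """
    return format(reduce(lambda acc, ch: acc ^ ord(ch), body, 0), "02X")


def is_valid(sentence: str) -> bool:
    """Check a full sentence such as '!AIVDM,...*64'."""
    body, _, given = sentence.strip().lstrip("!$").partition("*")
    return nmea_checksum(body) == given[:2].upper()


good = "!AIVDM,2,2,4,B,=ep`wk9ks,0*64"
# Same characters with two payload characters swapped: the data is
# different, but the XOR checksum cannot tell.
swapped = "!AIVDM,2,2,4,B,=pe`wk9ks,0*64"

print(is_valid(good))     # True
print(is_valid(swapped))  # True, even though the payload changed
```

A CRC or a signed message would catch the swap; a one-byte XOR only catches some single-character errors.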


Saturday, October 1, 2016

Autonomous ocean mapping

Back when I was in grad school at SIO (somewhere in the 2001-2004 time frame), I proposed an autonomous surface vessel to map the southern oceans.  I was resoundingly told it was a terrible idea.  I was thinking diesel + solar powered, but now I know that there are a good number of other alternatives.

It's great to see that the idea is now getting notice.  It really is much more cost effective than sending manned vessels.  People are expensive and it takes a lot to keep us alive.

http://www.newsweek.com/2016/10/07/mapping-sea-ocean-floor-504061.html

Currently, the most efficient way to map involves the use of multibeam sonar, which sends pulses of sound that bounce off the seafloor and back. Autonomous underwater vehicles can also be used, though they are less efficient. At the meeting, Mayer floated the idea of an unmanned barge, equipped with multibeam sonar, that could roam the seas while continuously mapping, which would cost about one-third as much as a manned vessel.
I wouldn't say "barge", but things like Argo floats could be made larger and have single beam sonars.

Also, it would be great to have larger ships have depth sounders that were capable of deeper water, but that is tough with the shipping industry having such thin margins and massive ships lasting as little as 11 years before they are sent to the breaker.

I was sure that I had written about this someplace before, but I can't find it.  I need to snap a picture of my early thoughts on the topic and add it here.  Some of which is totally irrelevant now.  If you used a wave glider type system, there is no need for propulsion.

Saturday, September 3, 2016

Remembering and honoring Captain Ben Smith

My good friend Captain Ben Smith passed away last week :(  I don't know what to say.


Tide Tools from CCOM JHC on Vimeo.




Thursday, September 1, 2016

topcoder - NASA Fishing for Fishermen Marathon Match Part 1

I am not associated with this.  I am wondering what their AIS data will be?  Perhaps NAIS and/or Orbcomm since I see USCG / DHS (maybe?) logos on their splash.

http://crowdsourcing.topcoder.com/fishingforfishermen1




Tuesday, August 30, 2016

Course material for open source software engineering

For Earth Science, Carleton College has the Teach the Earth SERC portal for Geoscience Educators.  I used that with Margaret Boettcher when we were co-teaching geophysics and intro to earth science.

What is there for teaching materials with open source software in computer science and software engineering?  Here is what I know of:


What else is there?

Sunday, July 31, 2016

Image hash functions

I really need to spend some time with image hash functions (geometric hashing).  It would be really helpful for testing to have these in the core GDAL toolbox.  And maybe there needs to be a gdal compare commandline tool that does those plus what perceptualdiff does.

https://fullstackml.com/2016/07/02/wavelet-image-hash-in-python/

http://stackoverflow.com/questions/11336209/fast-and-simple-image-hashing-algorithm

Weird that VisionWorkbench doesn't have image hashes already in its code base.
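
For reference, the simplest version of the idea is an "average hash": threshold each pixel against the image mean and compare the resulting bit strings by Hamming distance. The sketch below is plain Python on a tiny 2D list of grayscale values. It is not the wavelet hash from the first link and not anything that exists in GDAL; the function names and sample values are invented for illustration.

```python
def average_hash(pixels):
    """Return a bit string: 1 where a pixel is above the image mean.

    `pixels` is a small 2D list of grayscale values; a real tool would
    first downsample the raster to a fixed size such as 8x8.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming(a, b):
    """Count differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))


expected = [[10, 200], [220, 30]]
slightly_noisy = [[12, 198], [221, 29]]   # small pixel differences
different = [[200, 10], [30, 220]]        # inverted pattern

print(hamming(average_hash(expected), average_hash(slightly_noisy)))  # 0
print(hamming(average_hash(expected), average_hash(different)))       # 4
```

In a test, a small Hamming distance would mean "close enough to the golden image" without requiring byte-identical output.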

Saturday, July 23, 2016

BBS door games of the 1980's

I just remembered playing "Nukem" on the Haunted House BBS in Los Altos, CA back around 1985-6 or so.  I don't remember them being called "door games".  You had to use your modem to call into the BBS and tell it that you wanted to enter the game as you logged out.  It would then drop into the game on your way out and you could specify for each round how many factories or missiles you wanted to build and how many missiles you wanted to launch at who.  This memory was triggered by hearing about Sumer/Hamurabi while listening to the Dreaming in Code audio book this morning.   I also remember "Inge's Abode" BBS run by David S? who was a few years older than me and was also in Boy Scout Troop 37.  It ran on his Apple II.  Ah the days of using an HP150 with thermal paper records of my sessions.  And before that it was an HP 2640A terminal.  The fun of Hayes modems.  Back when HP was cool and I got to meet Bill Hewlett when he would say hi to all the kids at the company parties up in the Santa Cruz mountains.

Monday, July 18, 2016

Playing with Javascript

There are lots of ways to play with Javascript, but here is one I broke out from the vaults...

sudo apt-get install rhino
sudo apt-get install rlwrap
rlwrap rhino
Rhino 1.7 release 4 2013 08 27
js> var s = "1234567890abc";
js> s.slice(-5);
90abc
js>

Wednesday, July 6, 2016

Ricin - the poison

Ricin comes from the castor-oil plant (Ricinus).  I grew a few of these plants because they are a very fun / crazy looking plant.  I've noticed a whole bunch of what I think is the same plant growing along the river in Santa Cruz.  It's hard to see in the photo, but it is at the edge of the brown grass.

How to tap an AIS transceiver

Really?

The VTS provider has said there is no way to share the raw AIS data from this unit as it is a single serial port output. 
Do not use a splitter on the antenna.  That will drop the signal strength to both the AIS transceiver and any radio you add.

This is a failure of the VTS provider to think through much of anything.

You can tap RS-232 or RS-422 and listen in on the serial.  There are plenty of other options:

  • You can probably tap the serial port inside the OS.
  • The app could easily do a localhost UDP broadcast of the NMEA.
  • The app could provide a localhost TCP service that serves the raw NMEA.
  • The app could write the raw NMEA to log files that you tail (e.g. what tail -f does).
  • Set up a service that listens to the serial port and provides a connection for the VTS software so it does not talk directly to the serial port (similar to what GPSD does).
  • I'm betting that this is Windows, where there are options I don't know because I avoid MS Windows.
  • Put a computer with two ports in between the AIS transceiver and the VTS machine and forward all messages.
  • On a Linux machine, you can modify the serial driver to send the data to a 2nd location.
  • Or maybe fiddle with the pts setup.
  • Or ...

If the VTS provider is Transis, they know better than to say this.

Just look through socat for more ideas!

Here is an example socat command line I used a while back with the USCG AISUser.jar:


socat -u TCP:localhost:31414 - | grep --line-buffered 'r003669945' |  socat -d -v -u - TCP4-LISTEN:35001,fork,reuseaddr
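
If socat is not available, the same filter-and-fan-out idea fits in a few lines of Python. This is only a sketch under assumptions: the raw NMEA is reachable on localhost TCP port 31414 as in the socat pipeline above, the copies go out as localhost UDP datagrams on an arbitrarily chosen port, and the names relay, udp_sender and main are invented for illustration.

```python
import socket


def udp_sender(host="127.0.0.1", port=35001):
    """Return a function that sends one NMEA line as a UDP datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(line):
        sock.sendto(line.encode("ascii", "replace"), (host, port))

    return send


def relay(lines, sinks, station=None):
    """Copy each NMEA line to every sink, optionally filtering by station.

    `lines` is any iterable of text lines, e.g. an open log file or a
    socket wrapped with makefile().  Returns the number of lines forwarded.
    """
    count = 0
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or (station and station not in line):
            continue
        for sink in sinks:
            sink(line + "\r\n")
        count += 1
    return count


def main():
    # Assumed setup: the feed is on localhost:31414 and we only want
    # lines tagged with station r003669945.
    with socket.create_connection(("localhost", 31414)) as conn:
        feed = conn.makefile("r", encoding="ascii", errors="replace")
        relay(feed, [udp_sender()], station="r003669945")
```

Calling main() starts the relay. Because a sink is just a callable, a log file's write method or a second UDP sender can be added to the list, and the VTS software never knows it is being tapped.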
An old AntiLog device with tap:

Sunday, July 3, 2016

Inter-VTS Exchange Format (IVEF)

The argument I heard for IVEF over AIS is that you get Radar info with IVEF. That seems a bit wonky to me. When you talk about AIS, that usually implies that the system is talking NMEA and more recently, that might mean NMEA 0183 sentences wrapped in NMEA TAG BLOCK metadata. While I am not a big fan of NMEA, it does have messages for all sorts of stuff and there are messages for Radar and ARPA targets in the spec. So why invent another protocol encoded in XML when you could either use NMEA or go with something like MessagePack, protobuf/gRPC, thrift, etc. Taking a look at the IVEF, there are things like logging in. One thing that has been well established is that security is very hard to get correct. So why not use a standard that tons of people use and beat up like OAuth2?  I have no idea what IVEF would really buy anyone. Really, this feels like reinventing basic RPC services that the internet has had for ages.

And if you really want to pass radar targets via AIS, you could make RAVDM messages and use type 27 position reports with mmsi equal to your arpa target number.  Someone could even propose a change to ITU-1371 for type 1 or 27  to specifically designate radar targets.

http://openivef.org/ https://github.com/openivef

Additionally, the reference implementation is GPL licensed, so any proprietary vendor will have to write their own implementation.  Outside of OpenCPN, there hasn't been a lot of open source software on ships and none that I know of in VTS operations centers. The big defense contractors like Lockheed are particularly against using open source from what I have seen.

Some figures from IALA Recommendation on the Inter-VTS Exchange Format (IVEF) Service, IALA V-145:









In NMEA 0183 4.0 (I bought this damn doc):

  • Sentence: RSD – Radar System Data  
  • Talker: Radar and/or Radar Plotting RA
See also:
My notes for people (not in the United States) setting up their own VTS  and/or WhaleAlert and/or Met/Hydro systems:

You can transmit standard IMO 289 AIS msg 8-1-22 using a simple computer (e.g. RaspberryPi + Linux), ssh via keys (just use passwordless), using any of a Class-A, ATON, or Basestation transceiver.  You just have to send it serial BBM messages. I originally used a Class-A transceiver, which left a "ship" at the location of the transceiver on shore, and then switched to a blue force Class A transceiver (which really is just a BIOS patch to the unit) to transmit only the area notices without sending the 1-2-3-5-27 messages a normal Class A will send.

https://github.com/schwehr/ais-area-notice will send IMO 289 standard 8-1-22 messages.

Or...

You could even do this with a software defined radio and have the entire hardware system cost < $300 USD (raspberrypi, gps board, sdr board, cell modem, power amplifier, cables, rf filters and antennas [ build your own J-Pole antennas ]) and the software would then be free after you get the basic glue scripts written.  SDR AIS transmission is specifically not allowed in the US, but would be totally awesome if allowed by the "competent regional authority" [that's the official term] in your area.

A word of warning:

As I tell everyone I talk to, I strongly encourage you to make sure you always pass the original AIS NMEA through the system.  You will likely need it at a later date as your needs expand.  And IVEF does not give you very much of what comes through on the AIS channels and there are some super important missing things in the specification.

Wednesday, June 29, 2016

How bad is comcast?

So, not that this is news to anyone, but comcast is a bunch of morons.  They called me.  They can find 3 accounts for me, but they can't find my internet account.  I have business class internet from them at the same address as cable tv.  I can't effectively do anything with their website as I have to log in and out of several accounts all of which are f-ed up and randomly log me in or out as I try to check bills and the refund of $150 they say they owe me but haven't paid me yet on one of the accounts.  So I'm stuck with their internet for a year until my contract runs out, but it's time to cancel cable TV with them.  Now I'll be just stuck with 24/7 guaranteed support that only works during mountain time business hours.  Last I called, the lady said she had accidentally kicked me off the internet and couldn't fix it until I called back during business hours.  She couldn't fix what she had just done... which turned out to be nothing.  Yay for customer service.

Tuesday, June 28, 2016

libtiff security bug

I just had a chance to work on a security bug behind the scenes that might end up having a CVE ( Update: CVE-2016-5875 ).  All the stuff I did in GDAL was so much of a torrent that it hardly seemed worth noting.  While I was just a reviewer and connecting people behind the scenes, it still feels good to help out.  The log entry by Even Rouault:

cvs log -r1.44 tif_pixarlog.c | egrep -i '^[a-z]'
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_pixarlog.c,v
Working file: tif_pixarlog.c
head: 1.45
total revisions: 51; selected revisions: 1
description:
revision 1.44
date: 2016-06-28 08:12:19 -0700;  author: erouault;  state: Exp;  lines: +9 -1;  commitid: 2SqWSFG5a8Ewffcz;
PixarLogDecode() on corrupted/unexpected images (reported by Mathias Svensson)
The patch: (Also in GDAL as r34459)

cvs diff -r1.43 -r1.44 -u tif_pixarlog.c
Index: tif_pixarlog.c
===================================================================
RCS file: /cvs/maptools/cvsroot/libtiff/libtiff/tif_pixarlog.c,v
retrieving revision 1.43
retrieving revision 1.44
diff -u -r1.43 -r1.44
--- tif_pixarlog.c 27 Dec 2015 20:14:11 -0000 1.43
+++ tif_pixarlog.c 28 Jun 2016 15:12:19 -0000 1.44
@@ -1,4 +1,4 @@
-/* $Id: tif_pixarlog.c,v 1.43 2015-12-27 20:14:11 erouault Exp $ */
+/* $Id: tif_pixarlog.c,v 1.44 2016-06-28 15:12:19 erouault Exp $ */

 /*
  * Copyright (c) 1996-1997 Sam Leffler
@@ -459,6 +459,7 @@
 typedef struct {
  TIFFPredictorState predict;
  z_stream stream;
+ tmsize_t tbuf_size; /* only set/used on reading for now */
  uint16 *tbuf;
  uint16 stride;
  int state;
@@ -694,6 +695,7 @@
  sp->tbuf = (uint16 *) _TIFFmalloc(tbuf_size);
  if (sp->tbuf == NULL)
  return (0);
+ sp->tbuf_size = tbuf_size;
  if (sp->user_datafmt == PIXARLOGDATAFMT_UNKNOWN)
  sp->user_datafmt = PixarLogGuessDataFmt(td);
  if (sp->user_datafmt == PIXARLOGDATAFMT_UNKNOWN) {
@@ -783,6 +785,12 @@
  TIFFErrorExt(tif->tif_clientdata, module, "ZLib cannot deal with buffers this size");
  return (0);
  }
+ /* Check that we will not fill more than what was allocated */
+ if (sp->stream.avail_out > sp->tbuf_size)
+ {
+ TIFFErrorExt(tif->tif_clientdata, module, "sp->stream.avail_out > sp->tbuf_size");
+ return (0);
+ }
  do {
  int state = inflate(&sp->stream, Z_PARTIAL_FLUSH);
  if (state == Z_STREAM_END) {
There is still tons of room for even beginners to find bugs.  So grab your fuzzers (e.g. AFL), static analyzers, and mark 1 eyeballs.  Then go find an open source package and get to work!

http://www.openwall.com/lists/oss-security/2016/06/29/5 Heap-based buffer overflow in LibTIFF when using the PixarLog compression format

Thursday, June 23, 2016

Public review of Handling and Analyzing Marine Traffic Data

This morning's reading: Handling and Analyzing Marine Traffic Data, Master's thesis by ERIC AHLBERG, JOAKIM DANIELSSON.  I hate to be harsh in public, but this thesis is more of a tease than anything else.  I was hoping for more and I hope that those involved follow on with more depth to the work and next time give better background to increase the value of the research.  This thesis shows that there is a start to interesting work.
Abstract
With the emergence of the Automatic Identification System (AIS), the ability to track and analyze vessel behaviour within the marine domain was introduced. Nowadays, the ubiquitous availability of huge amounts of data presents challenges for systems aimed at using AIS data for analysis purposes regarding computability and how to extract valuable information from the data. This thesis covers the process of developing a system capable of performing AIS data analytics using state of the art Big data technologies, supporting key features from a system called Marine Traffic Analyzer 3. The results show that the developed system has improved performance, supports larger files and is accessible by more users at the same time. Another problem with AIS is that since the technology was initially constructed for collision avoidance-purposes, there is no solid mechanism for data validation. This introduces several issues, among them is what is called identity fraud, that is when a vessel impersonates another vessel for various malicious purposes. This thesis explores the possibility of detecting identity fraud by using clustering techniques for extracting voyages of vessels using movement patterns and presents a prototype algorithm for doing so. The results concerning the validation show some merits, but also exposes weaknesses such as time consuming tuning of parameters.
I skimmed to the reference section and conclusion and, while they reference some key relevant papers, they are missing a lot of references that you might expect.  No reference to ITU, IEC, IMO, IALA, or other relevant specifications.  No references to papers, presentations, or blog posts by me, ESR, or SkyTruth about AIS troubles or using "Big Data" type methods for AIS.  I'm uncomfortable tooting my own horn here, but come on.

Reading through the thesis, I couldn't find any real meat to the introduction and, when I got to the evaluation section, I was disappointed by this.  No references to even what model they used.  They could have easily reached out to a number of folks with AIS data and stats about data errors.  The thesis hasn't even described how AIS messages really work or any background on what perfectly functioning AIS message traffic might look like and its error characteristics.  Their one reference to spoofing was to the annoying web hack of injecting AIS messages into a company's feed, which no other ships would even see on their bridge.
Evaluation
The problem of AIS validation has been studied before, but to the knowledge of the authors of this thesis, there is no data consisting of documented cases of invalid data openly accessible. In addition, there is no measure of how often the specific problem occurs in real situations, which means that it might be too time consuming to use real data. Therefore, the evaluation focused on constructing dummy data to realistically model interesting scenarios which could be a sign of invalid AIS messages, and thereby get an indication of how well the solution performs.
Hey guys, check out my 2012 blog post: AIS Security and Integrity:



It was nice to see them go through various computing platforms, but the analysis was rather weak.  I have to wonder what they mean that a command line interface is hard to upgrade.  That to me seems easier than updating web apps.

Later we get to 4.2.2 AIS message validation.  When they refer to "Static validation, i.e. checking that the messages conform to the syntax of an AIS message" I really have no idea what they mean.  They haven't even defined a syntax for AIS nor told the reader where it might be defined.

The clustering stuff is okay, but the figures are very difficult to read until you get to 4.10.  Just when things are starting to get interesting, the thesis ends.  There is a section on ethical concerns that appears to be an afterthought and provides no new information (and not even a reference to the IMO announcement of > 15 years ago on the topic), analysis or opinions.  There were a whole pile of thoughts submitted to the US Federal Gov for a request some years ago.  Both sides of the argument submitted opinions.



Wishing for more...

Wednesday, June 22, 2016

How many slots can an AIS message have?

Since up to 9 VDM messages can be chained together, there is some confusion as to how many bits can be in an AIS message.  Over the radio, you are allowed to have up to 5 slots.  See ITU-1371-5:


The first slot is short with just 128 bits available.  The other following 4 slots are 256 bits each.  That gives a total over the VHF radio payload of 1152 bits.  Once that is armored into NMEA VDM characters at 6 bits per character, that gives 192 characters.  Those 192 characters can be spread out across multiple NMEA lines that are grouped together.  Each NMEA sentence can be up to 80 characters long.  Those sentences are defined in the NMEA standard and IEC 61162, both of which are paywalled.  According to NMEA 4.0, there can be up to 62 characters of armored payload per sentence.


While it is possible to chain 9 sentences (1 to 9) together of VDM armored NMEA with 558 characters of armored data at 6 bits per character for a total of 3348 bits, the VDL (VHF Data Link) will only let you send those 5 slots or 1152 bits.  A full length 1152 bit message could be packed in as little as 4 NMEA lines if fully using the 62 characters per sentence limit.
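
The arithmetic above is easy to check in a few lines of Python. The constants are the figures quoted in this post, not values I have re-verified against the paywalled specs.

```python
import math

BITS_PER_CHAR = 6            # NMEA armoring packs 6 bits per character
FIRST_SLOT_BITS = 128        # figures as quoted in this post
FOLLOWING_SLOT_BITS = 256
MAX_SLOTS = 5
MAX_PAYLOAD_CHARS = 62       # armored characters per VDM sentence
MAX_SENTENCES = 9

# What the VHF Data Link allows.
vdl_bits = FIRST_SLOT_BITS + (MAX_SLOTS - 1) * FOLLOWING_SLOT_BITS
vdl_chars = vdl_bits // BITS_PER_CHAR

# What nine chained NMEA sentences could hold.
nmea_chars = MAX_SENTENCES * MAX_PAYLOAD_CHARS
nmea_bits = nmea_chars * BITS_PER_CHAR

# Fewest sentences needed for a full-length message.
sentences_needed = math.ceil(vdl_chars / MAX_PAYLOAD_CHARS)

print(vdl_bits, vdl_chars)      # 1152 192
print(nmea_chars, nmea_bits)    # 558 3348
print(sentences_needed)         # 4
```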

And to sound like a broken record, paywalled specifications are evil.  In the case of specs for maritime systems, closed specs are detrimental to safety of life at sea and reduce the quality of the tools available to mariners.  I take strong issue with NMEA, IEC, ITU and ISO for paywalling so many specifications documents.

Here is what some multi-sentence TAG BLOCK encoded NMEA looks like:


Or as actual text:

\g:1-3-238720,n:554283,s:r14RHAL1,c:1466553606*61\!AIVDM,2,1,4,B,85PH6vAKf<L=a`<;U5g8uT;Mqw6H9:fWHnoMuTEEE@K,0*29

\g:2-3-238720,n:554284*22\!AIVDM,2,2,4,B,=ep`wk9ks,0*64
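
A rough Python sketch of pulling such a line apart, assuming the usual layout of a backslash-delimited TAG BLOCK followed by the sentence, with the same XOR checksum on the tag block as on the sentence itself. The helper names are invented. As I understand the field letters, g is the grouping (sentence number, total, group id), n a line count, s the source, and c a UNIX timestamp.

```python
from functools import reduce


def xor_checksum(text):
    """Two-hex-digit XOR of all characters, as used by NMEA."""
    return format(reduce(lambda acc, ch: acc ^ ord(ch), text, 0), "02X")


def split_tag_block(line):
    """Split '\\tags*CS\\sentence' into (fields, tags_ok, sentence)."""
    if not line.startswith("\\"):
        return {}, True, line          # plain sentence, no TAG BLOCK
    _, tag_block, sentence = line.split("\\", 2)
    tags, _, given = tag_block.partition("*")
    fields = dict(item.split(":", 1) for item in tags.split(","))
    return fields, xor_checksum(tags) == given.upper(), sentence


line = r"\g:2-3-238720,n:554284*22\!AIVDM,2,2,4,B,=ep`wk9ks,0*64"
fields, tags_ok, sentence = split_tag_block(line)
sentence_num, total, group_id = fields["g"].split("-")

print(fields)     # {'g': '2-3-238720', 'n': '554284'}
print(tags_ok)    # True
print(sentence)   # !AIVDM,2,2,4,B,=ep`wk9ks,0*64
```

Lines that share a group id belong together, which is how a receiver can reassemble a multi-sentence message even when feeds are interleaved.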

Thursday, June 16, 2016

Git for Ages 4 and up

If you are interested in git (and why wouldn't you be?), and you haven't watched this video, you should watch it!



Wednesday, June 15, 2016

Using libais C++ to dump a human readable form of AIS NMEA messages

I was just asked this via email, so here is the answer for all.
I am trying to use your library to decode AIS in C++. I can’t seem to figure out how to actually decode the body into meaningful string data, or if your library has that functionality.

The libais library is designed to convert the NMEA to bits and then break down the bits into a C++ struct.  From Python you get a dict that you can print.   Otherwise, you can extend the stream operators to print more for C++, but it's really application specific what format you might want for the output: xml, json, msgpack, csv, sql for database insertion, etc. plus variations on those themes like gpsd json. And do you want to mix message 5 data into 1,2,3 position reports?  It's up to you how you want to print them.

libais just tries to nail the details of the low level packets in C++.   Everything else is considered out of scope.  Pull requests welcome if you want to add a C++ layer above that does more.  I only added the python layer at this point.
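
To show what "convert the NMEA to bits and then break down the bits" means, here is a standard-library-only Python sketch. It is not libais code and does not use the libais API; the helper names are invented. It de-armors a VDM payload (subtract 48 from each character code, then another 8 if the result is above 40, giving 6 bits per character) and reads the first few fields of a type 8 binary broadcast message: 6-bit message id, 2-bit repeat indicator, 30-bit MMSI, 2 spare bits, 10-bit DAC, 6-bit FI.

```python
def dearmor(payload, pad=0):
    """Convert an armored VDM payload into a string of '0'/'1' bits."""
    bits = []
    for ch in payload:
        value = ord(ch) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    joined = "".join(bits)
    return joined[:len(joined) - pad] if pad else joined


def uint(bits, start, length):
    """Read an unsigned integer field from the bit string."""
    return int(bits[start:start + length], 2)


# Payload of a two-sentence AIVDM message, concatenated in order.
payload = "85PH6vAKf<L=a`<;U5g8uT;Mqw6H9:fWHnoMuTEEE@K" + "=ep`wk9ks"
bits = dearmor(payload, pad=0)

message = {
    "id": uint(bits, 0, 6),
    "repeat_indicator": uint(bits, 6, 2),
    "mmsi": uint(bits, 8, 30),
    "dac": uint(bits, 40, 10),
    "fi": uint(bits, 50, 6),
}
print(message)
# {'id': 8, 'repeat_indicator': 0, 'mmsi': 369493753, 'dac': 366, 'fi': 56}
```

Everything after this point (which fields to keep, how to name them, which output format to emit) is the application-specific part.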

If you just want to get a look at a JSON form quickly, you can pass the data through gpsd.  It's missing some of the messages and field components (e.g. commstate), but it does make an assumption about output format.

Friday, March 4, 2016

Coordinate transformations

I am trying to learn about WKT projection information and how projections are actually done in GDAL and PROJ.  Needless to say, I knew very little before this week and I still do not know very much.  First some links:

http://www.geoapi.org/snapshot/javadoc/org/opengis/referencing/doc-files/WKT.html#TOWGS84
http://spatial-analyst.net/ILWIS/htm/ilwisapp/find_datum_trans_params_methodpage.htm
http://edndoc.esri.com/arcsde/9.2/concepts/geometry/coordref/coordsys/geographic/transformationmethods.htm

So what is the best resource to learn about various methods?  e.g. Molodensky-Badekas, Bursa-Wolf

Google low level C++ APIs

This is one of those questions that keeps coming up... "What core/low-level libraries has Google released as open source?"  As someone who works inside of Google, I already know a lot of them and when I work on external code (e.g. GDAL) I have to work extra hard because the tools I normally reach for are not there.  One of those is the C++ strings substitute.h.  I just went looking and found it:

https://github.com/google/lmctfy/blob/master/strings/substitute.h
https://github.com/google/protobuf/blob/master/src/google/protobuf/stubs/substitute.h
https://github.com/google/supersonic/blob/master/supersonic/utils/strings/substitute.h
https://github.com/google/sensei/blob/master/sensei/strings/substitute.h

The question then becomes, where do I find the latest open sourced version and how do I use it in a library that will be used by many other people such as GDAL or libais?  I probably have to move it into a gdal or osgeo namespace or it's going to clash at link time.  If it's just an end tool, then there is less worry.

Wednesday, February 17, 2016

Monday, February 15, 2016

Databases and big data

Just some initial thoughts on the database side of things.  This is mostly from the perspective of a single person or small group doing research that needs to use these database technologies to try to get use / meaning out of their data.

There is a lot of hype around "big data" these days.  The reality is that what is big data for one person is little data for another.  It all depends on what resources are available for the problem.  I work at Google (this post is my take, not Google's) and I sometimes work with databases that are petabytes in size.  But size really should not be the metric for big data.  A crazy complex database that fits completely in memory (say <10GB) can be way more challenging than a well organized and rather boring multi-petabyte database.  Things like how you want to access the data matter greatly.  Think about the difference between random row accesses, complete table traversal, and in-order table traversal.  How will the database change?  Will you be only reading, appending, or doing random updates?  What parts of the database will be used for search (if there is search)?  How uniform is the data?  Will you do joins inside the database and if so what types?  The load they put on the database and how you can optimize are all different for the types of data.  The more flexibility there is, the more difficult it is for the database to be efficient.

What are some of the types of databases?

  • Dumb files.  Think CSV and things you can use grep, awk, sed, perl, python etc on.
  • Serialization systems - e.g. XML, JSON, BSON, Yaml, MessagePack, Protobuf,  Avro, Thrift, etc.
  • Filesystems as databases - Content Addressable Storage (CAS) or just things like Google Cloud Storage / S3 as keys with data blobs with the filename (could be a hash) as the key
  • Spreadsheets as databases - e.g. Pandas, Google Sheets, OpenOffice/LibreOffice sheets or databases
  • Simple hash lookups: gdbm, dbm, Google LevelDB - e.g. Using user id's to check a password hash.
  • memcache - Worth a separate mention.  This is used to speed up lots of other systems and is basically a large scale version of the memoize concept of caching results of computations or lookups.
  • SQLite - The most stripped down SQL like database system based on a single file
  • Traditional SQL databases - MySQL/MariaDB, PostgreSQL, Oracle, Ingres, etc.  Working with rows
  • NoSQL databases - Think key lookups like gdbm but designed to contain all sorts of extra craziness.  BigTable, MongoDB, Google DataStore, CouchDB, Cassandra, Riak, etc.
  • Column Oriented databases: Google Dremel / BigQuery, Apache HBase
  • And so many more... graph databases, databases embedded in other products, object stores, NewSQL, yada yada.
Each system has its strengths and weaknesses.
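
The memoize concept mentioned in the memcache entry is easy to see in miniature with Python's functools.lru_cache. This is an in-process sketch only: memcached does the same kind of caching across machines, and slow_lookup is a made-up stand-in for an expensive query.

```python
from functools import lru_cache

calls = 0


@lru_cache(maxsize=1024)
def slow_lookup(user_id):
    """Stand-in for an expensive query; counts how often it really runs."""
    global calls
    calls += 1
    return f"profile-for-{user_id}"


for uid in (7, 7, 42, 7):
    slow_lookup(uid)

print(calls)  # 2: only the first request for each id did the real work
```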

Which one(s) should you pick for a project?

This really boils down to requirements.  In all likelihood, you will end up using multiple systems for a project, sometimes for the same data, sometimes for subsets and sometimes for different data.

Even within Apache, the project directory lists 25 database entries.  For Python, the PyPI Database Engines category has > 250 entries.

Things like which license (many people can't use AGPL software), platform (Windows, Mac OSX, Linux, others) and programming languages supported (both clients and stored procedures) will help restrict things.  Does it have spatial support and does OGC Spatial support matter?  Do you need features like row history?  How much supporting infrastructure is required and how many people do you need to have on staff for the system?  How fast do you need responses from the system?  How much money are you willing to spend to make life simpler?  Which databases does your hosting service provide or are you self hosting?  It's also possible that you pick a database and that database is implemented in terms of another database type.

It is worthwhile to pick a working set and get to know it.  Once you know a few, it will be easier to learn new systems.  Here is my take on a working set for someone starting out with a single Ubuntu machine and little money.
  1. Text/CSV files with grep and python's CSV module
  2. SQLite - This is the simplest "sort of SQL" database out there.  A good place to learn simple SQL.  RasterLite & Spatialite for spatial data
  3. PostgreSQL + PostGIS for a rigorous SQL environment.
  4. CouchDB + GeoCouch for a starter NoSQL.  I'm least sure about this
  5. Memcached to speed up all of the above when you hit performance problems that have repeated queries
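
As a rough sketch of the first two items working together, the snippet below reads CSV with Python's `csv` module and loads it into SQLite using only the standard library.  The column names and the three track points are made up for illustration, and the CSV is an in-memory string standing in for a file on disk.

```python
import csv
import io
import sqlite3

# Made-up ship track rows standing in for a CSV file on disk.
csv_text = """timestamp,lat,lon
2016-01-01T00:00:00,36.60,-121.89
2016-01-01T00:01:00,36.61,-121.90
2016-01-01T00:02:00,36.62,-121.91
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

conn = sqlite3.connect(":memory:")  # pass a filename instead to persist
conn.execute("CREATE TABLE track (timestamp TEXT, lat REAL, lon REAL)")
# Named placeholders pull values out of each row dict by column name.
conn.executemany("INSERT INTO track VALUES (:timestamp, :lat, :lon)", rows)
conn.commit()

# Once the data is in SQLite, questions become SQL instead of grep/awk.
count, max_lat = conn.execute(
    "SELECT COUNT(*), MAX(lat) FROM track"
).fetchone()
print(count, max_lat)
```

Moving the same table into PostgreSQL later is mostly a matter of swapping the connection, which is part of why SQLite is a nice place to start.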
If you are looking for a cloud hosted setup, you will likely want to keep everything in one company's data center.  You can set up virtual machines to do pretty much anything you want if you have the time.  For minimal setup time, you could, for example, choose the Google stack of options:
  1. Cloud Storage for storing blobs with keys (aka paths) for things that don't change often
  2. SQL for MySQL
  3. DataStore for simple NoSQL
  4. BigTable for heavy weight NoSQL
  5. BigQuery for an SQL-like column-oriented database for massive speed
  6. memcache to speed up common queries 
e.g. For the All the Ships in the World demo (2015 preso), we combined Cloud Storage, BigQuery, DataStore and MemCache. 

System diagram for 2013 All the Ships in the World

For a small academic research team, I might suggest focusing on something like IPython/Jupyter Notebooks with Matplotlib/basemap, Pandas, SciPy, scikit-learn, and other python libraries talking to MySQL/MariaDB/PostgreSQL (pick one) and Google BigQuery databases for the main data.  You would then put the results of common queries in DataStore for fast and cheap access.  Longer term storage of data would be in Cloud Storage.  Local caching of results can be done in the Numpy binary format, HDF5, or SQLite via Pandas.  You then would have a slick analysis platform that can use the big data systems to subset the data to work with locally.  You could even run the python stack in a cloud hosted VM and just use a web browser locally to interact with a remote Notebook.  Keeping as much as possible in the data center prevents having to send too much over the network back to your local machine.

The tool options and variations are endless.  I've left out so many interesting and useful technologies and concepts, but I wanted to get something out.

Thursday, January 28, 2016

QR Codes on the roofs of buildings

QR codes on the roofs of buildings could be used for automatic calibration on imagery from aircraft or spacecraft.  If only this QR code linked to machine readable georeferencing information. This is at the Naval Postgraduate School in Monterey, CA.

Image from Google Maps.




Thursday, January 21, 2016

HF-AIS... what????

Or rather wat?

http://www.businesswire.com/news/home/20160113005517/en/Advanced-core-AIS-technology-innovation-global-leaders

"advanced DSP core technology which enables an AIS transceiver to reliably and accurately receive and decode every AIS transmission in real time"


So... AIS is 2x9600 baud channels.  Today that is like crazy slow, so why is special DSP technology needed?  Seriously?  Why?  No really... I am totally not understanding.  These are 25kHz max channels which is nothing in today's computational landscape.  Even if you decided to do beam steering and create multiple synthetic channels, this really isn't much.  Shine Micro has had the RadarPlusSM1680 for > 7 years with 8 receivers.  I think that uses GNU Radio inside.
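
To put numbers on how small this is, here is some back-of-the-envelope arithmetic.  The channel count, bit rate and channel width come from the paragraph above.  The 2250 slots per minute per channel figure is from the AIS standard rather than this post, and the sample rate is a rough lower bound that assumes complex (I/Q) sampling at the channel width, so real receivers will oversample somewhat.

```python
# Numbers from the post: two channels at 9600 bits/s, each 25 kHz wide.
CHANNELS = 2
BITS_PER_SECOND = 9600
CHANNEL_WIDTH_HZ = 25_000

total_bits_per_second = CHANNELS * BITS_PER_SECOND
total_bytes_per_second = total_bits_per_second / 8

# Assumption: complex (I/Q) sampling at the channel width is a rough
# lower bound on the sample rate needed per channel.
iq_samples_per_second = CHANNELS * CHANNEL_WIDTH_HZ

# From the AIS standard (not this post): 2250 time slots per minute
# on each channel, so this is the ceiling on single-slot messages.
SLOTS_PER_MINUTE = 2250
max_messages_per_second = CHANNELS * SLOTS_PER_MINUTE / 60

print(total_bits_per_second)    # 19200
print(total_bytes_per_second)   # 2400.0
print(iq_samples_per_second)    # 50000
print(max_messages_per_second)  # 75.0
```

A few kilobytes per second of payload and tens of thousands of samples per second is a light load for pretty much any modern processor.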