By Roger Barga, Amazon Web Services
He starts with three key points: (1) streaming analytics is foundational for business-critical workflows; (2) it enables a new class of applications and services that process data continuously; and (3) think of algorithms in terms of streams of data and continuous processing.
Almost any device nowadays generates data continuously. However, the analytics we typically run are done in batches: fraud, for example, is detected only after the fact, instead of being responded to in time. So compressing the time that batch analytics take is very beneficial. This also requires a change in your software architecture. Of course, he illustrates this with Amazon's own streaming service: Amazon Kinesis. One of the slides he shows is about a game: they collect all kinds of performance data and can create hotspots of the gameplay.
Another example is Amazon Go: you just log in with your phone, walk around, and pick up items. Once you walk out, you pay automatically, without passing a cashier. This is all processed using Kinesis and machine learning on video streams.
One of the big problems with streams is detecting what normal behavior is. In batch processing this is quite doable, but how do you do it on streams? Anomaly detection on streams! The idea is to build a bounding box (in a high-dimensional space) and make a random cut, weighted by the ranges of the dimensions; then build a new bounding box on each side and repeat (a random cut tree). Do this on a random sample of the stream, maintained via reservoir sampling for each new item. To insert a new item, start at the root and add it to the corresponding box; if there is no such box (which you already know at the root), create a new root. When is an item an anomaly? When inserting it forces you to greatly expand the bounding box! (This can be scored by the sum of path lengths from root to leaves, or by description length.) They demonstrated this on publicly available taxi data and published the work. They are now working (in the process of publishing) on a version for directed graphs as well!
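To make the two ingredients above concrete, here is a minimal sketch in Python: reservoir sampling to keep a uniform sample of an unbounded stream, and a crude anomaly score based on how much a new point would expand the sample's bounding box. This is a simplification for illustration, not Amazon's actual random-cut-tree algorithm (which recursively cuts the box and uses path lengths); all function names are my own.

```python
import random

def reservoir_sample(stream, k, rng=random.Random(42)):
    """Keep a uniform random sample of k items from a stream of unknown length
    (classic Algorithm R: item i replaces a sample slot with probability k/(i+1))."""
    sample = []
    for i, item in enumerate(stream):
        if len(sample) < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # uniform in [0, i]
            if j < k:
                sample[j] = item
    return sample

def bounding_box(points):
    """Axis-aligned bounding box of a list of d-dimensional points."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return lo, hi

def expansion_score(sample, point):
    """How much the new point would expand the sample's bounding box,
    summed over dimensions. A large score suggests an anomaly."""
    lo, hi = bounding_box(sample)
    return sum(max(0.0, lo[d] - point[d]) + max(0.0, point[d] - hi[d])
               for d in range(len(point)))
```

For example, against the sample `[(0, 0), (1, 1), (2, 2)]`, the point `(1, 1)` scores 0 (inside the box), while `(5, 2)` scores 3 (it stretches the first dimension from 2 out to 5).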