Wednesday, 17 April 2013

Big data in finance


great video - http://www.infoq.com/presentations/finance-sql-nosql-newsql

All technologies mentioned were open source


Google releases papers on its technologies, and the open source world copies the implementations
i.e. everyone else is playing "catch-up"


GFS(2003)  -> HDFS
MapReduce(2004) -> Hadoop
BigTable(2006)  -> HBase / Cassandra / any column family database

Stream processing frameworks


Percolator(2010) ->
  • Twitter Storm - Twitter has taken 3 years to create this since the paper was released
  • Yahoo S4 - Apache incubator project
the presenter spoke quite enthusiastically about Storm



Realtime ("response in seconds") SQL-like query functionality

Dremel(2010) ->

  • Cloudera Impala - the presenter comments that this is the "most mature" option
  • Hortonworks Tez/Stinger  



Scalable graph computation framework

Pregel(2010) ->

  • Apache Giraph 
  • Titan 
  • Microsoft Research Trinity

Search engine
business example: finding a client's CDS ISDA (it can take months for the ISDA to be located; the client gets tired of waiting for it and the deal is lost)
business example: ability to search research documents - pretty certain most banks use:
  • Lucene / Solr
  • Lily (on top of HBase)
  • DataStax (on top of Cassandra)
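
To make the search use case concrete, here is a minimal inverted-index sketch in Python. It only illustrates the core idea behind engines such as Lucene; the document ids and texts are made up, and real engines add text analysis, ranking and persistence on top.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical documents, for illustration only.
docs = {
    "doc1": "ISDA master agreement for client Acme covering CDS trades",
    "doc2": "Research note on commodity prices",
    "doc3": "CDS confirmation for client Acme",
}
index = build_index(docs)
hits = search(index, "acme cds")  # ids of documents mentioning both terms
```

With an index like this, locating a client's document becomes a lookup rather than a manual hunt.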

Future for Hadoop is "realtime"
concept of using Twitter Storm + ZooKeeper + Hadoop to achieve a realtime CEP (complex event processing) engine

CEP is useful for "event driven investment banking use cases"
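
A rough sketch of the kind of rule a CEP engine evaluates, written in plain Python rather than Storm: fire an alert when a given number of events for the same key arrive within a time window. The class name, the symbols and the thresholds are all made up for illustration, and the sketch assumes events arrive in timestamp order.

```python
from collections import defaultdict, deque

class BurstDetector:
    """Fire when `threshold` events for one key arrive within `window` seconds."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)  # key -> timestamps still inside the window

    def on_event(self, key, timestamp):
        """Record an event; return True if the rule fires for this key."""
        times = self.events[key]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold

# Hypothetical stream of (symbol, timestamp in seconds) trade events.
detector = BurstDetector(threshold=3, window=10)
stream = [("VOD", 0), ("BP", 1), ("VOD", 4), ("VOD", 9), ("VOD", 30)]
alerts = [(key, ts) for key, ts in stream if detector.on_event(key, ts)]
```

In a Storm-style topology the same logic would presumably live inside a processing node fed by a partitioned stream, so each key is always handled by the same worker.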

Lambda architecture (Nathan Marz)
mentioned upcoming book - http://manning.com/marz/

<insert diagram here>

batch layer -
use Hadoop to preprocess data into QFDs (question-focused datasets)

speed layer - process deltas

both share the same data store (HDFS), so data can be shared
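
A minimal sketch of how the two layers combine at query time, assuming a simple "count of events per symbol" question. The function and class names are hypothetical; in practice the batch view would be produced by a Hadoop job and the speed layer by a stream processor.

```python
from collections import Counter

def batch_view(master_events):
    """Batch layer: recompute the view from the full master dataset."""
    return Counter(event["symbol"] for event in master_events)

class SpeedLayer:
    """Speed layer: incrementally count only the events the batch has not seen yet."""

    def __init__(self):
        self.delta = Counter()

    def on_event(self, event):
        self.delta[event["symbol"]] += 1

def query(symbol, batch, speed):
    """Answer a query by merging the precomputed view with the realtime delta."""
    return batch[symbol] + speed.delta[symbol]

# Hypothetical data: three events already in the master dataset, one still in flight.
master = [{"symbol": "VOD"}, {"symbol": "VOD"}, {"symbol": "BP"}]
batch = batch_view(master)
speed = SpeedLayer()
speed.on_event({"symbol": "VOD"})
total = query("VOD", batch, speed)
```

When the next batch run absorbs the in-flight events, the delta for them can be discarded, which keeps the speed layer small.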

CAP theorem
12 years later - the creator of the theorem is disputing how it has been applied
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

Four Vs
reference to Cloudera's four Vs - "to help you sell big data in your firm"
last V is "value"

Monte Carlo simulation using Hadoop
"map" part of map-reduce can run a single simulation
if the nodes that are running the "map" part have a GPU, this could be used
or you can just use raw CPU power

you just need to implement the standard map-reduce interface
Mathematica has a new release that abstracts away the GPU
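
A sketch of the map/reduce split for a Monte Carlo run, in plain Python rather than the Hadoop API. Each "map" call runs one simulation from its own seed and the "reduce" step averages the results. The pricing model (a European call under geometric Brownian motion) and all parameter values are assumptions chosen only to make the example complete.

```python
import math
import random

def mapper(seed, spot=100.0, strike=105.0, rate=0.01, vol=0.2, years=1.0):
    """Map step: run one simulation and emit (key, discounted payoff)."""
    rng = random.Random(seed)  # one independent random stream per task
    z = rng.gauss(0.0, 1.0)
    terminal = spot * math.exp((rate - 0.5 * vol ** 2) * years
                               + vol * math.sqrt(years) * z)
    payoff = math.exp(-rate * years) * max(terminal - strike, 0.0)
    return ("call_price", payoff)

def reducer(key, values):
    """Reduce step: average all payoffs emitted for one key."""
    values = list(values)
    return key, sum(values) / len(values)

# Local stand-in for the cluster: each seed plays the role of one map task.
emitted = [mapper(seed) for seed in range(20000)]
key, estimate = reducer("call_price", (value for _, value in emitted))
```

Because the simulations are independent, the map step parallelises trivially; whether each one runs on a CPU or a GPU is a detail hidden inside the mapper.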

Time series data 
this data is tricky
kdb - can be hard to scale
OpenTSDB - open source time series database
I think Cassandra is good for time series - effectively unlimited number of columns per row (Nicolas covered this in Acunu training)
the presenter referenced the concept of processing different scenarios, including time series data, on a graph database as a possible future idea
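
A toy model of the wide-row layout that makes a column family store attractive for time series: one row per series, one column per tick, with columns kept sorted by timestamp so range reads are cheap. This is plain Python, not the Cassandra API, and bucketing rows by day is an assumption added here to keep individual rows bounded.

```python
import bisect
from collections import defaultdict

SECONDS_PER_DAY = 86400

class WideRowStore:
    """One row per (series, day); columns are (timestamp, value) pairs kept sorted."""

    def __init__(self):
        self.rows = defaultdict(list)

    @staticmethod
    def row_key(series, timestamp):
        return (series, timestamp // SECONDS_PER_DAY)

    def insert(self, series, timestamp, value):
        bisect.insort(self.rows[self.row_key(series, timestamp)], (timestamp, value))

    def slice(self, series, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        out = []
        for day in range(start // SECONDS_PER_DAY, (end - 1) // SECONDS_PER_DAY + 1):
            columns = self.rows.get((series, day), [])
            lo = bisect.bisect_left(columns, (start,))
            hi = bisect.bisect_left(columns, (end,))
            out.extend(columns[lo:hi])
        return out

# Hypothetical ticks, with timestamps in seconds.
store = WideRowStore()
store.insert("EURUSD", 10, 1.30)
store.insert("EURUSD", 5, 1.31)
store.insert("EURUSD", SECONDS_PER_DAY + 1, 1.32)
store.insert("GBPUSD", 7, 1.50)
ticks = store.slice("EURUSD", 0, 2 * SECONDS_PER_DAY)
```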

Clustering / classification
could classify a stream of research documents and route each one to the correct trader
can use Mahout, e.g. to extract sentiment

Concept of data hub
all data can be stored in one place, so that many different teams have access to it
e.g. correlation of weather and commodity prices

it's not difficult to get 1000 x 3 TB drives (3 PB raw) and have over a petabyte of space
you can run a 32-core cluster at home

provide R for statistics

Hortonworks is porting Hadoop to Microsoft technologies... - eeuw










Video on how to implement an ESB in finance

http://www.infoq.com/presentations/Large-Scale-Integration-in-Financial-Services

Great video with ideas on how to improve an ESB (enterprise service bus)


  • FIX - used for front office
  • FpML - middle office
  • SWIFT - interbank transfers (done at end of day)


concepts:

  • store the message in its raw format as a "CLOB"
  • it could be transformed into a POJO
  • create an XPath-like language that can refer to fields within the message
  • XPath 2.0 can also check the message type
  • message routing etc. can be based on XPath
  • provided the message format and version are known, we can always go through the messages and extract additional fields, i.e. make those fields available as "index" fields
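
The concepts above can be sketched in a few lines of Python using the standard library's `xml.etree.ElementTree`. Note that it supports only a limited XPath subset, not XPath 2.0. The message layout, field names, routing rules and destination names are all hypothetical; real FpML is namespaced and far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical message, kept as raw text exactly as received.
RAW = """<trade version="1.0">
  <header><tradeId>T-42</tradeId></header>
  <product><type>CDS</type><notional currency="USD">1000000</notional></product>
</trade>"""

# Index definitions per (message type, version): field name -> path expression.
INDEX_PATHS = {
    ("trade", "1.0"): {
        "trade_id": "./header/tradeId",
        "product_type": "./product/type",
    },
}

# Routing rules: (path, expected text, destination), checked in order.
ROUTES = [
    ("./product/type", "CDS", "credit-desk"),
    ("./product/type", "IRS", "rates-desk"),
]

def extract_index(raw):
    """Parse the stored raw message and pull out its configured index fields."""
    root = ET.fromstring(raw)
    paths = INDEX_PATHS.get((root.tag, root.get("version")), {})
    return {name: root.findtext(path) for name, path in paths.items()}

def route(raw, default="dead-letter"):
    """Return the destination of the first rule whose path matches."""
    root = ET.fromstring(raw)
    for path, expected, destination in ROUTES:
        if root.findtext(path) == expected:
            return destination
    return default

store = {"msg-1": RAW}  # the raw text is stored untouched, like a CLOB column
fields = extract_index(store["msg-1"])
destination = route(store["msg-1"])
```

Because the raw message is never altered, adding a new entry to `INDEX_PATHS` later and re-running `extract_index` over the stored messages is enough to expose an additional index field.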

ESB could just be an in-memory cache (which may persist the messages)