Tech in Investment Banking: Big data in finance

great video - http://www.infoq.com/presentations/finance-sql-nosql-newsql

All technologies mentioned were open source

Google releases papers on their technologies, and the open source world copies their implementations
i.e. everyone else is playing "catch-up"

GFS(2003) -> HDFS
MapReduce(2004) -> Hadoop
BigTable(2006) -> HBase / Cassandra / any column family database

Stream processing frameworks

Percolator(2010) ->

Twitter Storm - Twitter took 3 years to create this after the paper was released
Yahoo S4 - Apache Incubator project

presenter spoke quite enthusiastically about Storm

Realtime ("response in seconds") SQL-like query functionality

Dremel(2010) ->

Cloudera Impala - presenter comments that this is the "most mature" option
Hortonworks Tez/Stinger

Scalable graph computation framework

Pregel(2010) ->

Apache Giraph
Titan
Microsoft Research Trinity

Search engine
business example: finding a client's CDS ISDA (it can take months for the ISDA to be located; the client gets tired of waiting for it and the deal is lost)
business example: ability to search research documents - pretty certain most banks use this

Lucene / Solr
Lily (on top of HBase)
DataStax (on top of Cassandra)
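
To make the search idea concrete, here is a minimal sketch of an inverted index, the core data structure behind engines such as Lucene. It is plain Python, not the Lucene or Solr API, and the document ids and text are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents, for illustration only.
docs = {
    "isda-001": "ISDA master agreement for client Acme covering CDS trades",
    "isda-002": "ISDA master agreement for client Globex covering swaps",
    "research-17": "Research note on CDS spreads",
}
index = build_index(docs)
print(search(index, "acme cds"))
```

With an index like this, locating a client's agreement is a set lookup rather than a manual hunt through files.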

Future for Hadoop is "realtime"
concept of using Twitter Storm + ZooKeeper + Hadoop to achieve a realtime CEP engine

CEP useful for "event driven investment banking use cases"
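
As a rough idea of what a CEP rule looks like, the sketch below flags a symbol whose price moves by more than a threshold inside a sliding time window. It is plain single-process Python, not Storm code; the class name, window and threshold are assumptions for illustration, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class PriceMoveDetector:
    """Flags a symbol when its price has moved by more than `threshold`
    (as a fraction) relative to the oldest tick still inside the last
    `window` seconds."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.ticks = defaultdict(deque)  # symbol -> deque of (timestamp, price)

    def on_event(self, symbol, timestamp, price):
        ticks = self.ticks[symbol]
        ticks.append((timestamp, price))
        # Drop ticks that have fallen out of the window.
        while ticks[0][0] < timestamp - self.window:
            ticks.popleft()
        oldest_price = ticks[0][1]
        move = (price - oldest_price) / oldest_price
        return abs(move) > self.threshold


detector = PriceMoveDetector(window=10, threshold=0.02)
for ts, price in [(0, 100.0), (5, 101.0), (8, 103.0), (30, 103.5)]:
    print(ts, price, detector.on_event("XYZ", ts, price))
```

In a Storm-style deployment the same rule would live inside a processing node, with the stream partitioned by symbol so that each node only holds the windows for its own symbols.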

Lambda architecture (Nathan Marz)
mentioned upcoming book - http://manning.com/marz/

<insert diagram here>

batch layer -
use Hadoop to preprocess data into QFDs (question-focused datasets)

speed layer - process deltas

both share the same data store (HDFS) so data can be shared
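
The batch / speed split can be sketched in a few lines. This is an in-memory toy, not Hadoop or Storm code: the position-keeping use case and all names are assumptions for illustration. The batch layer recomputes a view over all history, the speed layer keeps only the deltas since the last batch run, and a query merges the two.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all historical trades."""
    view = Counter()
    for symbol, quantity in master_dataset:
        view[symbol] += quantity
    return view


class SpeedLayer:
    """Speed layer: holds only the deltas seen since the last batch run."""

    def __init__(self):
        self.deltas = Counter()

    def on_trade(self, symbol, quantity):
        self.deltas[symbol] += quantity

    def reset(self):
        """Call once a new batch view covering these deltas is available."""
        self.deltas.clear()


def query(symbol, batch, speed):
    """Serve a query by merging the precomputed view with the recent deltas."""
    return batch[symbol] + speed.deltas[symbol]


# Hypothetical trades: (symbol, signed quantity).
batch = batch_view([("VOD", 100), ("BP", 50), ("VOD", -30)])
speed = SpeedLayer()
speed.on_trade("VOD", 5)
print(query("VOD", batch, speed))
```

The point of the design is that the speed layer stays small and disposable: once the next batch run has absorbed the recent trades, the deltas are thrown away.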

CAP theorem
12 years later - the creator of the theorem is disputing how it has been applied
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

Four Vs
reference to Cloudera's four Vs - "to help you sell big data in your firm"
last V - is "value"

Monte Carlo simulation using Hadoop
"map" part of map-reduce can run a single simulation
if the nodes that are running the "map" part have a GPU - this could be used
or you can just use raw CPU power

you just need to implement the standard map reduce interface
Mathematica has a new release that abstracts away the GPU
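
A minimal sketch of the "one simulation per map call" idea, using Python's built-in map and reduce locally rather than the Hadoop API. The European call option and its parameters are assumptions chosen for illustration; on a cluster the map calls would be spread across nodes.

```python
import math
import random
from functools import reduce

# Hypothetical option parameters: spot, strike, rate, volatility, maturity in years.
S0, K, R, SIGMA, T = 100.0, 100.0, 0.05, 0.2, 1.0


def map_simulation(seed):
    """'Map' step: run one simulation and emit (count, discounted payoff)."""
    rng = random.Random(seed)  # one independent generator per simulation
    z = rng.gauss(0.0, 1.0)
    terminal = S0 * math.exp((R - 0.5 * SIGMA ** 2) * T + SIGMA * math.sqrt(T) * z)
    payoff = max(terminal - K, 0.0)
    return (1, math.exp(-R * T) * payoff)


def reduce_results(a, b):
    """'Reduce' step: add up counts and payoffs so the mean can be taken at the end."""
    return (a[0] + b[0], a[1] + b[1])


def price_option(n_simulations):
    """Average the discounted payoffs over n_simulations (must be at least 1)."""
    count, total = reduce(reduce_results, map(map_simulation, range(n_simulations)))
    return total / count


print(price_option(20000))
```

Because each map call depends only on its seed, the simulations are independent of each other, which is what makes the job easy to spread over many nodes (with or without GPUs).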

Time series data
this data is tricky
kdb - can be hard to scale
OpenTSDB - open source time series database
I think Cassandra is good for time series - a row can hold a very large number of columns (Nicolas covered this in Acunu training)
presenter referenced the concept of processing different scenarios, including time series data, on a graph database as a possible future idea
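
To show why a wide-row layout suits time series, here is an in-memory sketch of the data model only. It is not the Cassandra API: the row key of (symbol, day) and the method names are assumptions for illustration. Each row keeps its columns sorted by timestamp, so a time range query becomes a contiguous slice.

```python
import bisect


class WideRowStore:
    """Toy model of a wide row: one row per (symbol, day), columns sorted by timestamp."""

    def __init__(self):
        self.rows = {}  # (symbol, day) -> (sorted timestamps, values in the same order)

    def insert(self, symbol, day, timestamp, value):
        timestamps, values = self.rows.setdefault((symbol, day), ([], []))
        pos = bisect.bisect_left(timestamps, timestamp)
        timestamps.insert(pos, timestamp)
        values.insert(pos, value)

    def slice(self, symbol, day, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        if (symbol, day) not in self.rows:
            return []
        timestamps, values = self.rows[(symbol, day)]
        lo = bisect.bisect_left(timestamps, start)
        hi = bisect.bisect_left(timestamps, end)
        return list(zip(timestamps[lo:hi], values[lo:hi]))


store = WideRowStore()
store.insert("VOD", "2013-04-17", 930, 1.0)
store.insert("VOD", "2013-04-17", 1000, 1.2)
store.insert("VOD", "2013-04-17", 945, 1.1)  # out-of-order arrival still lands in order
print(store.slice("VOD", "2013-04-17", 930, 1000))
```

Bucketing the row key by day is one common way to stop a single row growing without bound.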

Clustering / classification
could classify stream of research documents and route to the correct trader
can use Mahout - extract sentiment
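
The routing idea can be sketched with a trivial keyword scorer standing in for a trained classifier. This is not Mahout code, and the desk names and keyword lists are invented for illustration; a real setup would train a model on labelled documents.

```python
import re
from collections import Counter

# Hypothetical desks and keywords, for illustration only.
DESK_KEYWORDS = {
    "rates": {"bond", "yield", "gilt", "treasury"},
    "commodities": {"oil", "gold", "wheat", "copper"},
    "credit": {"cds", "default", "spread", "issuer"},
}


def route(document):
    """Return the desk whose keywords occur most often, or None if none occur."""
    words = Counter(re.findall(r"[a-z]+", document.lower()))
    scores = {
        desk: sum(words[w] for w in keywords)
        for desk, keywords in DESK_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(route("Oil and copper rally as gold slips"))
```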

Concept of data hub
all data can be stored in one place
eg. correlation of weather, commodity prices - so that many different teams have access to this data

it's not difficult to get 1000 x 3 TB drives and have a petabyte of space (3 PB raw, which is about 1 PB usable with HDFS's default 3x replication)
you can run a 32-core cluster at home

provide R for statistics

Hortonworks is porting Hadoop to Microsoft technologies... - eeuw

Wednesday, 17 April 2013