great video - http://www.infoq.com/presentations/finance-sql-nosql-newsql
All technologies mentioned were open source
Google releases papers on their technologies, open source world copies their implementations
i.e. everyone else is playing "catch-up"
GFS(2003) -> HDFS
MapReduce(2004) -> Hadoop
BigTable(2006) -> HBase / Cassandra / any column family database
Stream processing frameworks
Percolator(2010) ->
- Twitter Storm - Twitter took 3 years to create this after the paper was released
- Yahoo S4 - Apache incubator project
presenter spoke quite enthusiastically about Storm
Realtime("response in seconds") sql-like query functionality
Dremel(2010) ->
- Cloudera Impala - presenter comments that this is the "most mature" option
- Hortonworks Tez/Stinger
Scalable graph computation framework
Pregel(2010) ->
- Apache Giraph
- Titan
- Microsoft Research Trinity
Search engine
business example: finding a client's CDS ISDA (it can take months for the ISDA to be located, the client gets tired of waiting for it and the deal is lost)
business example: ability to search research documents - pretty certain most banks use
- Lucene / Solr
- Lily (on top of HBase)
- DataStax (on top of Cassandra)
Future for hadoop is "realtime"
concept of using twitter storm + zookeeper + hadoop to achieve a realtime CEP engine
CEP useful for "event driven investment banking use cases"
Lambda architecture (Nathan Marz)
mentioned upcoming book - http://manning.com/marz/
<insert diagram here>
batch layer -
use hadoop to preprocess data into QFDs - (question focused datasets)
speed layer - process deltas
both share the same data store (HDFS) so data can be shared
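The batch / speed split above can be sketched in a few lines of plain Python. This is only a rough illustration, not code from the talk or the book: the event format, the per-symbol count used as the "question", and the function names build_batch_view, build_speed_view and query are all made up here. In a real setup the batch view would come out of a Hadoop job and the speed view out of something like Storm.

```python
from collections import Counter

def build_batch_view(master_events):
    """Batch layer: recompute the question-focused dataset from all stored events."""
    return Counter(event["symbol"] for event in master_events)

def build_speed_view(recent_events):
    """Speed layer: count only the deltas that arrived after the last batch run."""
    return Counter(event["symbol"] for event in recent_events)

def query(batch_view, speed_view, symbol):
    """Answer a query by merging the precomputed batch view with the realtime deltas."""
    return batch_view[symbol] + speed_view[symbol]

# Hypothetical trade events: three already covered by the batch run, one new delta.
master_events = [{"symbol": "IBM"}, {"symbol": "IBM"}, {"symbol": "MSFT"}]
recent_events = [{"symbol": "IBM"}]

batch_view = build_batch_view(master_events)
speed_view = build_speed_view(recent_events)
total_ibm = query(batch_view, speed_view, "IBM")
```

The point of the design is that the speed view stays small: once the next batch run has absorbed the deltas, the speed view can be thrown away and rebuilt.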
CAP theorem
12 years later - the creator of the theorem is disputing how it has been applied
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
Four Vs
reference to Cloudera's four Vs - "to help you sell big data in your firm"
last V - is "value"
Monte Carlo simulation using Hadoop
"map" part of map-reduce can run a single simulation
if the nodes that are running the "map" part have a GPU - this could be used
or you can just use raw CPU power
you just need to implement the standard map reduce interface
Mathematica has a new release that abstracts away the GPU
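To make the "one simulation per map call" idea concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use the real Mapper/Reducer interfaces; it only mimics their shape. The example problem (pricing a European call with a single geometric Brownian motion step), the parameter values and the names map_simulation and reduce_mean are all assumptions picked for illustration.

```python
import math
import random

def map_simulation(seed, spot=100.0, strike=105.0, rate=0.01, vol=0.2, years=1.0):
    """Map step: run one simulation and emit (key, discounted call payoff)."""
    rng = random.Random(seed)  # one separately seeded random stream per map call
    z = rng.gauss(0.0, 1.0)
    drift = (rate - 0.5 * vol ** 2) * years
    terminal = spot * math.exp(drift + vol * math.sqrt(years) * z)
    payoff = max(terminal - strike, 0.0) * math.exp(-rate * years)
    return ("call_price", payoff)

def reduce_mean(pairs):
    """Reduce step: average all payoffs emitted under the same key."""
    totals = {}
    for key, value in pairs:
        running_sum, count = totals.get(key, (0.0, 0))
        totals[key] = (running_sum + value, count + 1)
    return {key: s / n for key, (s, n) in totals.items()}

# On a cluster each seed would be an input record handled by some mapper node.
mapped = [map_simulation(seed) for seed in range(20000)]
estimate = reduce_mean(mapped)["call_price"]
```

Because every map call depends only on its seed, the simulations can be spread over as many nodes (CPU or GPU) as are available, and the reduce step only has to combine sums and counts.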
Time series data
this data is tricky
kdb - can be hard to scale
OpenTSDB - open source time series database
I think cassandra is good for time series - effectively unlimited number of columns per row (Nicolas covered this in Acunu training)
presenter referenced concept of processing different scenarios including time series data on a graph database as a possible future idea
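The "lots of columns per row" idea for time series can be pictured with a small in-memory model. This is not the Cassandra driver API, just a plain Python sketch of the layout: one row per series per day, with timestamps acting as sorted column names. The day-sized bucket, the series name and the helpers row_key, write and read_range are assumptions made up for this example.

```python
from bisect import insort
from collections import defaultdict
from datetime import datetime, timezone

rows = defaultdict(list)  # row key -> sorted list of (timestamp, value) "columns"

def row_key(series, ts):
    """Bucket by series and UTC day so no single row grows without bound."""
    return (series, ts.strftime("%Y-%m-%d"))

def write(series, ts, value):
    """Insert a point, keeping the row's columns ordered by timestamp."""
    insort(rows[row_key(series, ts)], (ts, value))

def read_range(series, start, end):
    """Return points with start <= ts < end from the row holding start's day."""
    return [(ts, v) for ts, v in rows[row_key(series, start)] if start <= ts < end]

def at(hour, minute):
    """Helper for the example: a UTC timestamp on one fixed, arbitrary day."""
    return datetime(2013, 1, 2, hour, minute, tzinfo=timezone.utc)

# Writes arrive out of order but are stored sorted within the row.
write("EURUSD", at(9, 5), 1.3210)
write("EURUSD", at(9, 0), 1.3201)
write("EURUSD", at(9, 10), 1.3195)
morning = read_range("EURUSD", at(9, 0), at(9, 10))
```

A time-range query then becomes a slice of one row rather than a scan over many, which is why this layout is often suggested for time series.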
Clustering / classification
could classify stream of research documents and route to the correct trader
can use mahout - extract sentiment
Concept of data hub
all data can be stored in one place
eg. correlation of weather, commodity prices - so that many different teams have access to this data
it's not difficult to get 1000 x 3TB drives and have a petabyte of space
you can run a 32 core cluster at home
provide R for statistics
Hortonworks porting hadoop to Microsoft technologies... - eeuw