Professional Documents
Culture Documents
ONeill,
CTO
boneill@healthmarketscience.com
@boneill42
Data
Pipelines
:
Improving
on
the
Lambda
Architecture
Talk
Breakdown
Topics
20%
29%
(1)
MoIvaIon
(2)
Polyglot
Persistence
(3)
AnalyIcs
(4)
Lambda
Architecture
31%
20%
What we were.
M
a
t
Da
Data Pipelines
I/O
The Input
From
government,
state
boards,
etc.
From
the
internet,
social
data,
networks
/
graphs
From
third-parIes,
medical
claims
From
customers,
expenses,
sales
data,
beneciary
informaIon,
quality
scores
Data
Pipeline
The
Output
Provider MasterFileTM
Expense ManagerTM
Division
Contact
RepresentaIve
Address
Expense
PracIIoner
OrganizaIon
CredenIals
SancIon
Script
Customer
Feed(s)
Customer
Master
Referrals
Drug
Provider Verification
Claims
MarketViewTM
Agile MDM
1
billion
claims
per
year
Sounds
easy
Except...
Incomplete
Capture
No
foreign
keys
Diering
schemas
Changing
schemas
ConicIng
informaIon
Ad-hoc
Analysis
(is
hard)
Point-In-Time
Retrieval
faddress F@t0
Government
Private
flicense F@t5
fsanction F@t1
fsanction F@t4
Schema Change!
Golden
Record
Why?
?s
Mark
eIng
O
p Im
izaIo
Sales
Ope
raIo
n
Feeds
(mulIple
formats,
changing
over
Ime)
API
/
FTP
IngesIon
-
SemanIc
Tagging
-
StandardizaIon
-
Data
Mapping
Rules
Data
Stewardship
Data
ScienIsts
Business
Analysts
Web Interface
IncorporaIon
-
ConsolidaIon
-
EnumeraIon
-
AssociaIon
Logic
Insight
-
Search
-
Reports
-
AnalyIcs
Dimensions
Sweet!
Dirt
Simple
Lightning
Fast
Highly
Available
Scalable
MulI-Datacenter
(DR)
Not
Sweet.
How
do
we
query
the
data?
NoSQL
Indexes?
Do
such
things
exist?
Data
model
to
support
your
queries.
ONC
:
PA
:
19460
9
32
74
99
12
42
$3.50
$7.00
$8.75
$1.00
$4.20
$3.17
$8.88
DOh!
What
if
ES
fails?
What
about
schema
/
type
informaBon?
Polyglot
Persistence
The
Right
Tool
for
the
Job
Kaoa
Storm
C*
ES
Titan
SQL
Design
Principles
What
we
got:
At-least-once
processing
Simple
data
ows
Idempotent
OperaBons!
Immutable
Data!
Storm
Cassandra
Storm
ElasIc Search
All good!
But...
What
was
the
average
amount
for
a
medical
claim
associated
with
procedure
X
by
zip
code
over
the
last
ve
years?
hzp://www.slideshare.net/prash1784/introducIon-to-hadoop-and-pig-15036186
Yuck. Nu Said.
Lesson
Learned
If
possible,
avoid
re-parIIoning
operaIons!
(e.g.
LOG.error!)
hzps://github.com/nathanmarz/storm/wiki/Trident-API-Overview
Why
so
hard?
DOh!
What
we
dont
want:
LOCKS!
Whats
the
alternaIve?
CONSENSUS!
19
!=
9
Cassandra 2.0!
hzp://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
hzp://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
CondiIonal
Updates
The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. After one has been
accepted, it will be returned to future
leaders in the promise, and the new leader
will have to re-propose it again.
hzp://www.datastax.com/dev/blog/lightweight-transacIons-in-cassandra-2-0
UPDATE value=9 WHERE word=fox IF value=6
Love
CQL
CondiIonal
Updates
+
Batch
Statements
+
CollecIons
=
BADASS
DATA
MODELS
CassandraCqlState
public void commit(Long txid) {
BatchStatement batch = new BatchStatement(Type.LOGGED);
batch.addAll(this.statements);
clientFactory.getSession().execute(batch);
}
public void addStatement(Statement statement) {
this.statements.add(statement);
}
public ResultSet execute(Statement statement){
return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
List<TridentTuple> tuples,
TridentCollector collector) {
for (TridentTuple tuple : tuples) {
Statement statement = this.mapper.map(tuple);
state.addStatement(statement);
}
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
Insert statement =
QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
statement.value(KEY_NAME, keys.get(0));
statement.value(VALUE_NAME, value);
return statement;
}
public Statement retrieve(List<String> keys) {
Select statement = QueryBuilder.select()
.column(KEY_NAME).column(VALUE_NAME)
.from(KEYSPACE_NAME, TABLE_NAME)
.where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
return statement;
}
Incremental
State!
Collapse
aggregaIon
into
the
state
object.
Value
fox
brown
No More GroupBy!
ParIIon 1
C*
Key
Value
fox
lazy
72
ParIIon
2
hzp://architects.dzone.com/arIcles/nathan-marzs-lamda
A
TradiIonal
InterpretaIon
Speed
Layer
(Storm)
Serving Layer
Data
Stream
HBase
Batch
Layer
(Hadoop)
Impala
Grand Finale
Client
DropWizard
IncrementalCqlState
aggregate(tuple)
C*
Kaoa
Batch
Layer
(Storm)
The
Wins
Reuse
AggregaIons
and
State
Code!
To
re-compute
(or
backll)
a
dimension,
simply
re-queue!
Storm
is
the
safety
net
If
a
DW
host
fails
during
aggregaIon,
Storm
will
ll
in
the
gaps
for
all
ACKd
events.
Plug
Shout
out:
Taylor
Goetz
The
Book
Thanks