Stormnyc04 140501100434 Phpapp02

Brian
ONeill, CTO
boneill@healthmarketscience.com
@boneill42

Data Pipelines :
Improving on the Lambda
Architecture
Talk Breakdown
Topics
20%
29%
(1) MoIvaIon
(2) Polyglot Persistence
(3) AnalyIcs
(4) Lambda Architecture
31%
20%
Health Market Science - Then
What we were.
Health Market Science - Now

t
n
e
m
e
anag
M

a
t
Da
Were xing healthcare!
IntersecIng Big Data

w/ Healthcare
Data Pipelines
I/O
The Input
From government,
state boards, etc.

From the internet,
social data,
networks / graphs

From third-parIes,
medical claims

From customers,
expenses,
sales data,
beneciary informaIon,
quality scores
Data
Pipeline
The Output
Provider MasterFileTM
Expense ManagerTM
Division
Contact
(phone, fax, etc.)
RepresentaIve
Address
Expense
PracIIoner
OrganizaIon
CredenIals
SancIon
Script
Customer
Feed(s)
Customer
Master
Referrals
Drug
Provider Verification
Claims
MarketViewTM
Agile MDM
1 billion claims
per year
Sounds easy
Except...
Incomplete Capture
No foreign keys
Diering schemas
Changing schemas
ConicIng informaIon
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval
Master Data Management

Harvested
faddress F@t0
Government
Private
flicense F@t5
fsanction F@t1
fsanction F@t4
Schema Change!
Golden
Record
Why?
?s
Mark
eIng
O
p Im
izaIo
Sales
Ope
raIo
n
Our MDM Pipeline

-
-
-
Feeds
(mulIple formats,
changing over Ime)
API / FTP
IngesIon
- SemanIc Tagging
- StandardizaIon
- Data Mapping
Rules
Data Stewardship
Data ScienIsts
Business Analysts
Web Interface
IncorporaIon
- ConsolidaIon
- EnumeraIon
- AssociaIon
Logic
Insight
- Search
- Reports
- AnalyIcs
Dimensions
Our rst Pipeline
Sweet!

Dirt Simple
Lightning Fast
Highly Available
Scalable
MulI-Datacenter (DR)
Not Sweet.

How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 Wide Rows!

AOP
Triggers!

Data model to
support your
queries.
ONC : PA : 19460
9
32
74
99
12
42
$3.50
$7.00
$8.75
$1.00
$4.20
$3.17
$8.88
DOh! What about ad hoc?
Rev 2 ElasIc Search!

AOP
Triggers!
TransformaIon
DOh!
What if ES fails?
What about schema / type informaBon?

Rev 3 - Apache Storm!
Polyglot Persistence
The Right Tool for the Job
Oracle is a registered trademark

of Oracle CorporaIon and/or its
aliates. Other names may be
trademarks of their respecIve
owners.
Back to the Pipeline

DW
Kaoa
Storm
C*
ES
Titan
SQL
Design Principles
What we got:
At-least-once processing
Simple data ows
What we needed to account for:

Replays
Idempotent OperaBons!
Immutable Data!

Cassandra State (v0.4.0)

git@github.com:hmsonline/storm-cassandra.git

Storm
Cassandra
{tuple} <mapper> (ks, cf, row, k:v[])

Trident ElasIc Search (v0.3.1)

git@github.com:hmsonline/trident-elasIcsearch.git

Storm
ElasIc Search
{tuple} <mapper> (idx, docid, k:v[])

Storm Graph (v0.1.2)

Coming soon to...
git@github.com:hmsonline/storm-graph.git

for (tuple : batch)

<processor> (graph, tuple)
Storm JDBI (v0.1.14)

INTERNAL ONLY (so far)
Worth releasing?

{tuple} <mapper> (JDBC Statement)

All good!
But...
What was the average amount for a
medical claim associated with procedure
X by zip code over the last ve years?
Hadoop (<2)? Batch?
hzp://www.slideshare.net/prash1784/introducIon-to-hadoop-and-pig-15036186
Yuck. Nu Said.
Lets Pre-Compute It!

stream
.groupBy(new Field(ICD9))
.groupBy(new Field(zip))
.aggregate(new Field(amount),
new Average())
DOh!
GroupBys.
They set data in moBon!
Lesson Learned
If possible, avoid
re-parIIoning
operaIons!
(e.g. LOG.error!)
hzps://github.com/nathanmarz/storm/wiki/Trident-API-Overview
Why so hard?
DOh!

What we dont want:
LOCKS!

Whats the alternaIve?
CONSENSUS!

19 != 9
Cassandra 2.0!
hzp://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
hzp://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf

CondiIonal Updates
The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. After one has been
accepted, it will be returned to future
leaders in the promise, and the new leader
will have to re-propose it again.
hzp://www.datastax.com/dev/blog/lightweight-transacIons-in-cassandra-2-0

UPDATE value=9 WHERE word=fox IF value=6
Love CQL

CondiIonal Updates
+
Batch Statements
+
CollecIons
=
BADASS DATA MODELS
Announcing : Storm Cassandra CQL!

git@github.com:hmsonline/storm-cassandra-cql.git

{tuple} <mapper> (CQL Statement)

Trident Batching =? CQL Batching
CassandraCqlState
public void commit(Long txid) {
BatchStatement batch = new BatchStatement(Type.LOGGED);
batch.addAll(this.statements);
clientFactory.getSession().execute(batch);
}
public void addStatement(Statement statement) {
this.statements.add(statement);
}
public ResultSet execute(Statement statement){
return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
List<TridentTuple> tuples,
TridentCollector collector) {
for (TridentTuple tuple : tuples) {
Statement statement = this.mapper.map(tuple);
state.addStatement(statement);
}
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
Insert statement =
QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
statement.value(KEY_NAME, keys.get(0));
statement.value(VALUE_NAME, value);
return statement;
}
public Statement retrieve(List<String> keys) {
Select statement = QueryBuilder.select()
.column(KEY_NAME).column(VALUE_NAME)
.from(KEYSPACE_NAME, TABLE_NAME)
.where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
return statement;
}
Incremental State!
Collapse aggregaIon into the state object.
This allows the state object to aggregate with current state

in a loop unIl success.
Uses Trident Batching to perform in-memory

aggregaIon for the batch.
for (tuple : batch)
state.aggregate(tuple);
while (failed?) {
persisted_state = read(state)
aggregate(in_memory_state, persisted_state)
failed? = conditionally_update(state)
}
In-Memory AggregaIon by Key!

Key
Value
fox
brown
No More GroupBy!
ParIIon 1
C*
Key
Value
fox
lazy
72
ParIIon 2
To protect against replays

Use parBBon + batch idenBer(s) in
your condiBonal update!

BatchId + parIIonIndex consistently represents the

same data as long as:
1.Any reparIIoning you do is determinisIc (so
parIIonBy is, but shue is not)
2.You're using a spout that replays the exact same
batch each Ime (which is true of transacIonal spouts
but not of opaque transacIonal spouts)
- Nathan Marz
The Lambda Architecture
hzp://architects.dzone.com/arIcles/nathan-marzs-lamda
Lets Challenge This a Bit

because addiIonal tools and techniques cost
money and Ime.
QuesIons:
Can we solve the problem with a single tool and a
single approach?
Can we re-use logic across layers?
Or bezer yet, can we collapse layers?
A TradiIonal InterpretaIon
Speed Layer
(Storm)
Serving Layer
Data
Stream
HBase
Batch Layer
(Hadoop)
DOh! Two pipelines!
Impala
IntegraIng Web Services

We need a web service that receives an event
and provides,
an immediate acknowledgement
a high likelihood that the data is integrated very soon
a guarantee that the data will be integrated
eventually
We need an architecture that provides for,

Code / Logic and approach re-use
Fault-Tolerance
Grand Finale
The Idea : Embedding State!
Client
DropWizard
IncrementalCqlState
aggregate(tuple)
C*
Kaoa
Batch Layer
(Storm)
The Sequence of Events
The Wins
Reuse AggregaIons and State Code!
To re-compute (or backll) a dimension,
simply re-queue!
Storm is the safety net
If a DW host fails during aggregaIon, Storm will ll
in the gaps for all ACKd events.
Is there an opportunity to reuse more?

BatchingStrategy & ParIIonStrategy?
In the end, all good. =)
Plug
Shout out:
Taylor Goetz
The Book

Brian ONeill, CTO

boneill@healthmarketscience.com
@boneill42
Thanks

Stormnyc04 140501100434 Phpapp02

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Stormnyc04 140501100434 Phpapp02

Uploaded by

Copyright:

Available Formats

Brian

Health Market Science - Then

Health Market Science - Now

Were xing healthcare!

IntersecIng Big Data

(phone, fax, etc.)

Master Data Management

Our MDM Pipeline

Our rst Pipeline

Rev. 1 Wide Rows!

DOh! What about ad hoc?

Rev 2 ElasIc Search!

Rev 3 - Apache Storm!

Oracle is a registered trademark

Back to the Pipeline

What we needed to account for:

Cassandra State (v0.4.0)

{tuple} <mapper> (ks, cf, row, k:v[])

Trident ElasIc Search (v0.3.1)

{tuple} <mapper> (idx, docid, k:v[])

Storm Graph (v0.1.2)

for (tuple : batch)

Storm JDBI (v0.1.14)

{tuple} <mapper> (JDBC Statement)

Hadoop (<2)? Batch?

Lets Pre-Compute It!

Announcing : Storm Cassandra CQL!

{tuple} <mapper> (CQL Statement)

Trident Batching =? CQL Batching

This allows the state object to aggregate with current state

Uses Trident Batching to perform in-memory

In-Memory AggregaIon by Key!

To protect against replays

BatchId + parIIonIndex consistently represents the

The Lambda Architecture

Lets Challenge This a Bit

DOh! Two pipelines!

IntegraIng Web Services

We need an architecture that provides for,

The Idea : Embedding State!

The Sequence of Events

Is there an opportunity to reuse more?

In the end, all good. =)

Brian ONeill, CTO

You might also like