
Apache Flink

1. What is Apache Flink


Apache Flink is a distributed dataflow processing system.
- Flink core: a streaming dataflow engine
- Web UI for monitoring running jobs

Flink's core is a streaming dataflow engine that provides data distribution, communication, and
fault tolerance for distributed computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
1. DataStream API for unbounded streams, embedded in Java and Scala,
2. DataSet API for static data, embedded in Java, Scala, and Python, and
3. Table API with a SQL-like expression language, embedded in Java and Scala.
Flink also bundles libraries for domain-specific use cases:
1. CEP, a complex event processing library,
2. Machine Learning library, and
3. Gelly, a graph processing API and library.
You can integrate Flink easily with other well-known open source systems, both for data input
and output and for deployment.

Streaming First

High throughput and low latency stream processing with exactly-once guarantees.

Batch on Streaming
Batch processing applications run efficiently as special cases of stream processing applications.

APIs, Libraries, and Ecosystem


DataSet, DataStream, and more. Integrated with the Apache Big Data stack.

Streaming
High Performance & Low Latency
Flink's data streaming runtime achieves high throughput rates and low latency with little
configuration. (The charts on the original page, not reproduced here, show the performance of
a distributed item counting task requiring streaming data shuffles.)

Support for Event Time and Out-of-Order Events
Flink supports stream processing and windowing with Event Time semantics.
Event time makes it easy to compute over streams where events arrive out of order, and where
events may arrive delayed.
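
As a rough illustration (ours, not from the original page), here is a minimal Java DataStream
sketch of event-time windowing over out-of-order events. It assumes a recent Flink release with
the WatermarkStrategy API; the input tuples (id, event timestamp in milliseconds, value) are
made up for the demo.

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple3.of("s1", 1_000L, 21.0),   // (id, event time in ms, value)
                         Tuple3.of("s1", 65_000L, 22.5))
           .returns(Types.TUPLE(Types.STRING, Types.LONG, Types.DOUBLE))
           // Tolerate events arriving up to 30 seconds out of order.
           .assignTimestampsAndWatermarks(
               WatermarkStrategy.<Tuple3<String, Long, Double>>forBoundedOutOfOrderness(
                       Duration.ofSeconds(30))
                   .withTimestampAssigner((e, ts) -> e.f1))
           .keyBy(e -> e.f0)
           // Windows are placed on the event timestamp, not on arrival time.
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .sum(2)
           .print();

        env.execute("event-time windowing sketch");
    }
}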

Exactly-once Semantics for Stateful Computations
Streaming applications can maintain custom state during their computation.
Flink's checkpointing mechanism ensures exactly-once semantics for the state in the presence
of failures.
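
A minimal sketch (again ours, with illustrative names) of such custom state in the Java
DataStream API: a per-key running count kept in a Flink-managed ValueState, which the
checkpointing mechanism snapshots and restores on failure.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountPerKey extends RichFlatMapFunction<String, Tuple2<String, Long>> {
    private transient ValueState<Long> count;   // Flink-managed, one value per key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
            new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap(String key, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();               // null for the first event of a key
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);                      // included in the next checkpoint
        out.collect(Tuple2.of(key, updated));
    }
}

// Usage on a keyed stream: words.keyBy(w -> w).flatMap(new CountPerKey());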

Highly flexible Streaming Windows
Flink supports windows over time, count, or sessions, as well as data-driven windows.
Windows can be customized with flexible triggering conditions to support sophisticated
streaming patterns.
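
To make the window types concrete, a small sketch (Java DataStream API; the sizes and gaps
are assumed example values) showing count windows and session windows alongside the time
window used in the earlier sketch:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowKindsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> events =
            env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3))
               .returns(Types.TUPLE(Types.STRING, Types.INT));

        // Count window: emit a per-key sum after every 2 elements of that key.
        events.keyBy(e -> e.f0).countWindow(2).sum(1).print();

        // Session window: a per-key session closes after 30 s without new events.
        events.keyBy(e -> e.f0)
              .window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)))
              .sum(1)
              .print();

        env.execute("window kinds sketch");
    }
}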

Continuous Streaming Model with Backpressure
Data streaming applications are executed with continuous (long-lived) operators.
Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources.

Fault-tolerance via Lightweight Distributed Snapshots
Flink's fault tolerance mechanism is based on Chandy-Lamport distributed snapshots.
The mechanism is lightweight, allowing the system to maintain high throughput rates and
provide strong consistency guarantees at the same time.
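
Turning the snapshot mechanism on is a one-line configuration; a hedged sketch (the interval
and mode are example values):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Draw a distributed snapshot (checkpoint) every 10 seconds.
        env.enableCheckpointing(10_000);
        // Exactly-once is the default mode; set explicitly here for clarity.
        env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
        // ... then define sources, transformations, and sinks, and call env.execute().
    }
}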

Batch and Streaming in One System

One Runtime for Streaming and Batch Processing
Flink uses one common runtime for data streaming applications and batch processing
applications.
Batch processing applications run efficiently as special cases of stream processing
applications.

Memory Management
Flink implements its own memory management inside the JVM.
Applications scale to data sizes beyond main memory and
experience less garbage collection overhead.

Iterations and Delta Iterations
Flink has dedicated support for iterative computations (as in machine learning and graph
analysis).
Delta iterations can exploit computational dependencies for faster convergence.
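
A sketch of a bulk iteration in the DataSet API, adapted from the pi-estimation example in the
Flink documentation: each step refines the result of the previous one inside a single job, instead
of scheduling one job per step.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

public class PiEstimation {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        final int iterations = 10_000;

        // Start the iteration from an initial count of 0.
        IterativeDataSet<Integer> initial = env.fromElements(0).iterate(iterations);

        // One step: throw a random dart, add 1 if it falls inside the unit circle.
        DataSet<Integer> step = initial.map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer count) {
                double x = Math.random(), y = Math.random();
                return count + ((x * x + y * y < 1) ? 1 : 0);
            }
        });

        // Close the loop: feed 'step' back as the next iteration's input.
        int inCircle = initial.closeWith(step).collect().get(0);
        System.out.println("Pi ~= " + 4.0 * inCircle / iterations);
    }
}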

Program Optimizer
Batch programs are automatically optimized to exploit situations
where expensive operations (like shuffles and sorts) can be
avoided, and when intermediate data should be cached.

APIs and Libraries

Streaming Data Applications
The DataStream API supports functional transformations on data streams, with user-defined
state and flexible windows.
The example below shows how to compute a sliding histogram of word occurrences over a
data stream of texts.
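
The code figure from the original page is not reproduced here; the following is our rough Java
equivalent of the described sliding histogram (the socket source, window sizes, and
tokenization are assumptions):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SlidingWordHistogram {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)     // any text source works here
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.toLowerCase().split("\\W+")) {
                   if (!word.isEmpty()) out.collect(Tuple2.of(word, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           // Sliding histogram: counts over the last 5 minutes, updated every minute.
           .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
           .sum(1)
           .print();

        env.execute("sliding word histogram");
    }
}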

Batch Processing Applications
Flink's DataSet API lets you write beautiful, type-safe, and maintainable code in Java or Scala. It
supports a wide range of data types beyond key/value pairs, and a wealth of operators; a short
example follows.
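
A hedged sketch of a classic DataSet program in Java, counting words over a tiny inline data
set:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class BatchWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("to be or not to be")
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) out.collect(Tuple2.of(word, 1));
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .groupBy(0)    // group by the word field
           .sum(1)        // sum the per-word counts
           .print();      // print() triggers execution for DataSet programs
    }
}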

Library Ecosystem
Flink's stack offers libraries with high-level APIs for different use
cases: Machine Learning, Graph Analytics, and Relational Data
Processing.
The libraries are currently in beta status and under heavy development.

FlinkCEP - Complex event processing for Flink
FlinkCEP is the complex event processing library for Flink. It allows you to easily detect complex
event patterns in an endless stream of data. Complex events can then be constructed from
matching sequences. This gives you the opportunity to quickly get hold of what's really important
in your data.
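
A hedged sketch of what a FlinkCEP pattern looks like in Java (CEP API as of roughly Flink
1.3+; 'readings' stands for an assumed existing stream of (sensorId, temperature) tuples with
timestamps assigned):

import java.util.List;
import java.util.Map;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class OverheatAlert {
    // Two readings above 100 degrees from the same stream within 10 seconds.
    static DataStream<String> alerts(DataStream<Tuple2<String, Double>> readings) {
        Pattern<Tuple2<String, Double>, ?> pattern =
            Pattern.<Tuple2<String, Double>>begin("first")
                .where(new SimpleCondition<Tuple2<String, Double>>() {
                    @Override public boolean filter(Tuple2<String, Double> r) { return r.f1 > 100.0; }
                })
                .next("second")
                .where(new SimpleCondition<Tuple2<String, Double>>() {
                    @Override public boolean filter(Tuple2<String, Double> r) { return r.f1 > 100.0; }
                })
                .within(Time.seconds(10));

        PatternStream<Tuple2<String, Double>> matches = CEP.pattern(readings, pattern);

        // Construct a complex event (an alert string) from each matching sequence.
        return matches.select(new PatternSelectFunction<Tuple2<String, Double>, String>() {
            @Override public String select(Map<String, List<Tuple2<String, Double>>> match) {
                return "Overheating on sensor " + match.get("first").get(0).f0;
            }
        });
    }
}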

- FlinkML: machine learning library on Flink (filling the role that scikit-learn plays for Python, or
the ML packages for R)
-> Its API design follows the example of scikit-learn (Python); FlinkML brings a similar style to
Flink.
-> Implemented algorithms include CoCoA, Linear Regression, and ALS.

FlinkML - Machine Learning for Flink
FlinkML is the Machine Learning (ML) library for Flink. It is a new effort in the Flink community,
with a growing list of algorithms and contributors. With FlinkML we aim to provide scalable ML
algorithms, an intuitive API, and tools that help minimize glue code in end-to-end ML systems.
You can see more details about our goals and where the library is headed in our vision and
roadmap here.

- Gelly: graph processing library on Flink
-> Supports vertex-centric iteration in the style of Google's Pregel
-> Also supports Gather-Sum-Apply iteration

Gelly: Flink Graph API
Gelly is a Graph API for Flink. It contains a set of methods and utilities which aim to simplify the
development of graph analysis applications in Flink. In Gelly, graphs can be transformed and
modified using high-level functions similar to the ones provided by the batch processing API.
Gelly provides methods to create, transform, and modify graphs, as well as a library of graph
algorithms.
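
A small, hedged Gelly sketch in Java (API details vary across Flink releases; the edge values
here are arbitrary):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.graph.Edge;
import org.apache.flink.graph.Graph;
import org.apache.flink.types.NullValue;

public class GellySketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Build a graph from an edge list; vertices are derived automatically.
        Graph<Long, NullValue, Double> graph = Graph.fromDataSet(
            env.fromElements(new Edge<>(1L, 2L, 0.5),
                             new Edge<>(2L, 3L, 0.7),
                             new Edge<>(3L, 1L, 0.2)),
            env);

        // High-level accessors and transformations:
        graph.getDegrees().print();             // (vertex id, degree) pairs
        Graph<Long, NullValue, Double> undirected = graph.getUndirected();
        System.out.println("vertices: " + undirected.numberOfVertices());
    }
}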



Batch

A MapReduce-style job in Flink consists of operators such as map f, map g, and reduce over
key/value pairs. When map g consumes the output of map f, Flink chains g to f so that both
functions run in the same task and records pass between them directly, without serialization or
network transfer; the reduce then combines the mapped values per key.

Execution also does not stall between steps: instead of materializing the full intermediate result
of one step before starting the next, Flink pipelines records from step to step, so a downstream
step can begin processing while the upstream step is still producing output.
The client that submits a job can be a single point of failure (SPOF). In Flink, the client is only
needed to translate the program and hand it to the JobManager; once the job is running,
execution no longer depends on the client.

Like Spark, Flink runs inside the JVM, where garbage collection can become a bottleneck for
data-intensive workloads. Flink therefore splits the JVM heap into an Unmanaged Heap and a
Managed Heap. Data in the Managed Heap is kept in serialized form inside memory segments
that Flink manages itself, so the garbage collector never sees those objects; GC only has to
scan the Unmanaged Heap, which keeps GC overhead low.

Real-time Streaming

Flink's streaming is true real-time streaming, not an emulation on top of batches: latency stays
low while exactly-once guarantees are preserved. With event-time windows, an event is
assigned to a window by the time at which it occurred, not by when it arrives. For example, an
event that belongs to the 5:00-5:30 window but only arrives at 5:55 (25 minutes late) is still
counted in the correct window.

Fault tolerance uses distributed snapshots: the sources periodically inject barriers into the data
stream. When an operator has received a barrier from all of its inputs, it writes a snapshot of its
state to the state backend and acknowledges the checkpoint to the job manager. Once every
sink has seen the barrier, the checkpoint is complete. On recovery, a replayable source such
as Kafka is rewound to the offsets recorded in the last completed checkpoint, and operator
state is restored from it.

Two points of comparison with other streaming systems: delivery semantics (some systems
guarantee only at-least-once delivery, whereas Flink provides exactly-once) and latency
(micro-batch systems cannot go below their batch interval, typically seconds, while Flink's
continuous model reaches sub-second latency).

Flink vs Spark
First, what do they have in common? Flink and Spark are both general-purpose data
processing platforms and top-level projects of the Apache Software Foundation (ASF).
They have a wide field of application and are usable for dozens of big data scenarios, thanks to
extensions such as SQL-like queries (Spark: Spark SQL; Flink: MRQL), graph processing
(Spark: GraphX; Flink: Spargel (base) and Gelly (library)), machine learning (Spark: MLlib;
Flink: FlinkML), and stream processing (Spark Streaming; Flink Streaming). Both are capable
of running in standalone mode, yet many use them on top of Hadoop (YARN, HDFS). They
share strong performance due to their in-memory nature.
However, the way they achieve this variety, and the cases they are specialized in, differ.
Differences: First, two links that go into some detail on the differences between Flink and
Spark before summing it up. If you have the time, have a look at Apache Flink is the 4G of
BigData Analytics Framework and Flink and Spark Similarities and Differences.
In contrast to Flink, Spark was not capable of handling data sets larger than the RAM before
version 1.5.x.
Flink is optimized for cyclic and iterative processes by using iterative transformations on
collections. This is achieved through optimized join algorithms, operator chaining, and reuse of
partitioning and sorting. However, Flink is also a strong tool for batch processing. Flink
streaming processes data streams as true streams, i.e., data elements are immediately
"pipelined" through a streaming program as soon as they arrive. This makes it possible to
perform flexible window operations on streams. Furthermore, Flink provides a very strong
compatibility mode which makes it possible to run your existing Storm, MapReduce, and
similar code on the Flink execution engine.
Spark, on the other hand, is based on resilient distributed datasets (RDDs). This (mostly)
in-memory data structure powers Spark's functional programming paradigm. It is capable of
big batch calculations by pinning memory. Spark Streaming wraps data streams into
mini-batches, i.e., it collects all data that arrives within a certain period of time and runs a
regular batch program on the collected data. While the batch program is running, the data for
the next mini-batch is collected.

https://www.mapr.com/blog/apache-spark-vs-apache-flink-whiteboard-walkthrough
https://www.youtube.com/watch?v=OHAv6o2fCi8
