Flink's core is a streaming dataflow engine that provides data distribution, communication, and
fault tolerance for distributed computations over data streams.
Flink includes several APIs for creating applications that use the Flink engine:
1. the DataStream API for unbounded streams, embedded in Java and Scala,
2. the DataSet API for static data, embedded in Java, Scala, and Python, and
3. the Table API with a SQL-like expression language, embedded in Java and Scala.
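As an illustration of the dataflow style these APIs share, here is a minimal, Flink-free sketch in plain Python of a streaming word count (a flat-map followed by a keyed running count). All function names are illustrative and are not Flink's actual API:

```python
# A toy, Flink-free sketch of the dataflow style: records flow through
# flat-map and keyed-aggregation stages one element at a time.
from collections import defaultdict

def flat_map(stream, fn):
    for record in stream:
        yield from fn(record)

def keyed_count(stream):
    # Emit a (key, running count) update for every element,
    # as a streaming aggregation would.
    counts = defaultdict(int)
    for key in stream:
        counts[key] += 1
        yield key, counts[key]

lines = ["to be", "or not to be"]
words = flat_map(lines, lambda line: line.split())
updates = list(keyed_count(words))
print(updates[-1])  # ('be', 2)
```

Because each stage is a generator, records are processed as they arrive rather than in a batch, which is the essence of the streaming model.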
Flink also bundles libraries for domain-specific use cases:
1. CEP, a complex event processing library,
2. Machine Learning library, and
3. Gelly, a graph processing API and library.
You can easily integrate Flink with other well-known open-source systems, both for data input
and output and for deployment.
Streaming First
High throughput and low latency stream processing with exactly-once guarantees.
Batch on Streaming
Batch processing applications run efficiently as special cases of stream processing applications.
Streaming
High Performance & Low Latency
Flink's data streaming runtime achieves high throughput rates
and low latency with little configuration. The charts below show
the performance of a distributed item counting task, requiring
streaming data shuffles.
Memory Management
Flink implements its own memory management inside the JVM.
Applications scale to data sizes beyond main memory and
experience less garbage collection overhead.
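The idea can be sketched as follows, assuming a toy page size and spill policy. This is not Flink's actual MemoryManager, only an illustration of keeping records serialized in fixed-size pages within a bounded memory budget:

```python
# Toy sketch of managed memory: records are serialized into fixed-size
# byte pages; when the page budget is exhausted, the oldest page is
# "spilled" (evicted, as if written to disk). Illustrative only.
import struct

PAGE_SIZE = 32  # bytes per page (tiny, for illustration)

class PagedBuffer:
    def __init__(self, num_pages):
        self.pages = [bytearray(PAGE_SIZE)]
        self.offset = 0           # write position in the current page
        self.budget = num_pages   # maximum pages held in memory
        self.spilled = []         # pages evicted from the budget

    def append(self, value: int):
        data = struct.pack("<q", value)  # fixed 8-byte encoding
        if self.offset + len(data) > PAGE_SIZE:
            if len(self.pages) == self.budget:
                self.spilled.append(self.pages.pop(0))  # spill oldest
            self.pages.append(bytearray(PAGE_SIZE))
            self.offset = 0
        page = self.pages[-1]
        page[self.offset:self.offset + len(data)] = data
        self.offset += len(data)

buf = PagedBuffer(num_pages=2)
for v in range(12):          # 12 * 8 = 96 bytes > 2 * 32-byte pages
    buf.append(v)
print(len(buf.spilled))      # 1
```

Operating on serialized bytes in preallocated pages is what lets data sizes exceed main memory while keeping garbage-collection pressure low: the JVM sees a few long-lived byte arrays rather than millions of short-lived objects.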
Program Optimizer
Batch programs are automatically optimized to exploit situations
where expensive operations (like shuffles and sorts) can be
avoided, and when intermediate data should be cached.
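A toy version of one such decision, skipping a re-partitioning (shuffle) step when an input is already partitioned on the join key. The cost model and names here are made up for illustration and do not reflect Flink's optimizer internals:

```python
# Toy plan builder: only insert a shuffle for an input that is not
# already partitioned on the join key. Illustrative only.
def plan_join(left_partitioned_on, right_partitioned_on, join_key):
    steps = []
    if left_partitioned_on != join_key:
        steps.append("shuffle left on " + join_key)
    if right_partitioned_on != join_key:
        steps.append("shuffle right on " + join_key)
    steps.append("hash join on " + join_key)
    return steps

# Left input already partitioned by user_id: its shuffle is skipped.
print(plan_join("user_id", None, "user_id"))
# ['shuffle right on user_id', 'hash join on user_id']
```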
Library Ecosystem
Flink's stack offers libraries with high-level APIs for different use
cases: Machine Learning, Graph Analytics, and Relational Data
Processing.
The libraries are currently in beta status and under heavy development.
(Figure: a Flink streaming dataflow with parallel map operators.)
Batch
Flink treats batch programs as a special case of streaming programs and supports iterative algorithms natively: an iteration operator applies a step function to the data set repeatedly, feeding each step's result into the next until a termination criterion is met, instead of launching a separate job per iteration.
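Flink executes iterative programs by repeatedly applying a step function until a termination condition holds. A minimal, Flink-free sketch of that control flow, with a made-up step function:

```python
# Sketch of a bulk iteration: a step function is applied repeatedly
# until the result stops changing (or a maximum iteration count is hit).
def iterate(initial, step, max_iterations=100):
    current = initial
    for _ in range(max_iterations):
        nxt = step(current)
        if nxt == current:      # simple convergence criterion
            break
        current = nxt
    return current

# Example step: spread the maximum value to all elements
# (a stand-in for, e.g., propagating labels through a graph).
result = iterate([3, 1, 4, 1], lambda xs: [max(xs)] * len(xs))
print(result)  # [4, 4, 4, 4]
```

In Flink the whole loop is a single operator in the dataflow, so intermediate results stay on the workers between steps instead of being re-read for every iteration.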
(Garbled in the source: a discussion of the client as a potential single point of failure (SPOF), comparing the Flink client with those of Hadoop and Spark.)
Real-time Streaming
Flink's exactly-once guarantees rest on distributed snapshots. Checkpoint barriers are periodically injected into the data streams at the sources. When an operator has received a barrier on all of its inputs, it snapshots its current state to the configured state backend and forwards the barrier downstream; the job manager coordinates this process. Once the barriers reach the sinks, the checkpoint is complete. On failure, the job restarts from the last completed checkpoint; with a replayable source such as Kafka, the recorded offsets are restored as part of the snapshot, so records are neither lost nor counted twice.
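The interplay of barriers, snapshots, and replay can be simulated in a few lines. This toy model (all names invented, nothing from Flink's implementation) shows why a record processed just before a crash is not double-counted after recovery:

```python
# Minimal simulation of barrier-based snapshots: a counting operator
# checkpoints (next offset, count) whenever it sees a barrier; recovery
# resumes from the last completed checkpoint, not from scratch.
BARRIER = "BARRIER"

def process(stream, checkpoint=(0, 0)):
    """Count records, checkpointing at each barrier.
    checkpoint = (next_offset_to_read, count_at_that_point)."""
    start, count = checkpoint
    for offset in range(start, len(stream)):
        record = stream[offset]
        if record == BARRIER:
            checkpoint = (offset + 1, count)  # snapshot complete
        else:
            count += 1
    return count, checkpoint

stream = ["a", "b", BARRIER, "c", "d"]
# First attempt "crashes" after reading up to "c"; its result is
# discarded, and only the last completed checkpoint survives.
_, saved = process(stream[:4])
total, _ = process(stream, checkpoint=saved)
print(total)  # 4
```

Although "c" is read twice (once before the crash and once after), the recovered state comes from the checkpoint taken before "c", so every record contributes to the count exactly once.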
Flink vs Spark
First, what do they have in common? Flink and Spark are both general-purpose data
processing platforms and top-level projects of the Apache Software Foundation (ASF).
They have a wide field of application and are usable for dozens of big data scenarios,
thanks to extensions for SQL queries (Spark: Spark SQL; Flink: MRQL), graph
processing (Spark: GraphX; Flink: Spargel (base) and Gelly (library)), machine learning
(Spark: MLlib; Flink: FlinkML), and stream processing (Spark Streaming, Flink
Streaming). Both are capable of running in standalone mode, yet many users run them
on top of Hadoop (YARN, HDFS). Both owe much of their strong performance to their
in-memory nature.
However, the way they achieve this variety and the cases they are specialized on differ.
Differences: First, I'd like to provide two links that go into some detail on the differences
between Flink and Spark before summing them up. If you have the time, have a look
at "Apache Flink is the 4G of Big Data Analytics Framework" and "Flink and Spark
Similarities and Differences".
In contrast to Flink, Spark was not capable of handling data sets larger than RAM
before version 1.5.x.
Flink is optimized for cyclic or iterative processes by using iterative transformations on collections.
https://www.mapr.com/blog/apache-spark-vs-apache-flink-whiteboard-walkthrough
https://www.youtube.com/watch?v=OHAv6o2fCi8