How Flink Powers Data-Driven Applications with High Performance and Scalability

Roman Glushach
9 min read · Jul 10, 2023


Flink: High-Level Architecture Overview

Apache Flink is a distributed stream processing framework that enables fast and reliable data processing at scale. Flink is designed to handle both bounded and unbounded data streams, and to support a variety of use cases, such as event-driven applications, real-time analytics, machine learning, and streaming ETL.

Flink Performance and Scalability

Flink achieves high performance and scalability by leveraging several key features:

  • Pipelined execution: Flink executes tasks in a pipelined fashion, meaning that data is streamed from one task to the next as soon as it is available, without materializing intermediate results between stages. This reduces latency and improves throughput
  • Event time semantics: Flink supports event time semantics for processing data streams based on their original timestamps, rather than their ingestion time or processing time. This enables consistent and accurate results across different sources and sinks
  • Stateful stream processing: Flink supports stateful stream processing, meaning that tasks can maintain local state that can be accessed and updated during processing. This enables complex operations such as aggregations, joins, and windows
  • State backends: Flink supports different types of state backends that store state on different storage systems, such as memory, disk, or external databases
  • Checkpointing and recovery: Flink supports checkpointing and recovery mechanisms that ensure fault tolerance and exactly-once semantics for stateful stream processing. Checkpointing periodically saves a consistent snapshot of the application state to an external system (such as HDFS), while recovery restores the application state from a checkpoint in case of failures (a minimal configuration sketch follows this list)
  • Dynamic scaling: Flink supports rescaling applications as load or resource availability changes. Rescaling takes effect by restoring from a checkpoint or savepoint at a new parallelism; in reactive mode, Flink adjusts automatically as TaskManagers are added or removed
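
To make the checkpointing configuration concrete, here is a minimal sketch using Flink's Java DataStream API; the interval and timeout values are illustrative, not recommendations:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot all operator state every 10 seconds.
            env.enableCheckpointing(10_000);
            // Exactly-once is the default checkpointing mode; set here for clarity.
            env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
            // Give each checkpoint at most 60 seconds to complete.
            env.getCheckpointConfig().setCheckpointTimeout(60_000);
        }
    }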

Architecture

Figure: Flink high-level architecture overview

Core Concepts

Flink is based on a few core concepts that define its abstraction layer and programming model:

  • DataStream: A DataStream is an immutable sequence of data records that can be processed as a stream (unbounded) or a batch (bounded). A DataStream can be created from various sources such as files, sockets, Kafka topics, or custom functions. A DataStream can also be transformed by applying operators such as map, filter, join, window, aggregate, or custom functions, and written to various sinks such as files, databases, Kafka topics, or custom functions (a minimal pipeline sketch follows this list)
  • DataSet: A DataSet is a collection of data records that can be processed as a batch. A DataSet can be created from various sources such as files, databases, or custom functions. A DataSet can also be transformed by applying operators such as map, filter, join, groupBy, aggregate, or custom functions, and written to various sinks such as files, databases, or custom functions. (Note that the DataSet API has been deprecated in recent Flink releases in favor of unified batch execution on the DataStream and Table APIs)
  • Table: A Table is a structured or semi-structured representation of data that can be processed using SQL or Table API. A Table can be created from various sources such as DataStreams, DataSets, files, databases, or custom functions. A Table can also be transformed by applying SQL queries or Table API expressions. A Table can be written to various sinks such as DataStreams, DataSets, files, databases, or custom functions
  • Function: A Function is a user-defined piece of code that can be used to implement custom logic for sources, sinks, operators, or tables. Flink supports various types of functions such as map functions, filter functions, join functions, window functions, aggregate functions, table functions, scalar functions, or async functions
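
As a concrete example of the DataStream abstraction, here is a minimal, self-contained pipeline in Java; the elements and transformations are arbitrary placeholders:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class DataStreamSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // A bounded DataStream built from in-memory elements; real jobs would
            // typically read from files, sockets, or Kafka topics instead.
            DataStream<String> words = env.fromElements("flink", "stream", "batch");

            words
                .filter(w -> w.length() > 4)  // drop short records
                .map(String::toUpperCase)     // transform each record
                .print();                     // a simple sink

            env.execute("DataStream sketch");
        }
    }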

Flink’s layered architecture

Flink has a layered architecture that consists of four main components:

Client

The Client is the entry point for submitting Flink applications to the cluster. The Client can be a command-line interface (CLI), a web dashboard (Flink UI), a REST API (Flink REST), or a programmatic interface (Flink API). The Client communicates with the JobManager to submit the application code and configuration.

JobManager

The JobManager is the master node that coordinates the execution of Flink applications on the cluster. The JobManager consists of three sub-components:

  • JobGraph Generator: The JobGraph Generator takes the application code and configuration and translates it into a JobGraph, a logical representation of the application's data flow and operators (the JobManager then expands the JobGraph into a physical ExecutionGraph for scheduling)
  • Scheduler: The Scheduler takes the JobGraph and assigns it to one or more TaskManagers based on the available resources and parallelism settings. The Scheduler also handles failures and reschedules tasks if needed
  • Checkpoint Coordinator: The Checkpoint Coordinator periodically triggers checkpoints for the application state and coordinates with the TaskManagers to store the checkpoints in a durable storage such as HDFS or S3

TaskManager

The TaskManager is the worker node that executes the tasks of Flink applications on the cluster. The TaskManager consists of two sub-components:

  • Task Slot: The Task Slot is a unit of resource allocation that determines how many tasks a TaskManager can run concurrently. Each Task Slot represents a fixed share of the TaskManager's resources, primarily managed memory; CPU is not isolated between slots (a sample configuration follows this list)
  • Task: The Task is the actual execution unit that runs the operators and functions of Flink applications on the Task Slot. Each Task has an input buffer, an output buffer, and a state backend that manages its local state
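
The number of slots per TaskManager, and the memory they divide up, are set in the cluster configuration; a sketch of the relevant flink-conf.yaml entries, with illustrative values:

    # flink-conf.yaml (values are illustrative)
    taskmanager.numberOfTaskSlots: 4
    taskmanager.memory.process.size: 4096m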

External Systems

The External Systems are the sources and sinks that Flink applications interact with to read and write data. These can be file systems (HDFS), message brokers (Kafka), databases (MySQL), or custom systems.

Data Model

Flink’s data model is based on two core concepts: DataSet and DataStream.

A DataSet represents a static collection of data elements that can be processed in parallel using batch operators. A DataStream represents a dynamic stream of data elements that can be processed in real time using stream operators.

Both DataSet and DataStream are generic types that can hold any kind of data element, such as tuples, records, objects, or primitives. Flink provides a rich set of built-in types and serializers for common data formats, such as CSV, JSON, Avro, Parquet, etc. Users can also define their own custom types and serializers for complex or domain-specific data formats.
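
For instance, a custom type can usually be handled by Flink's built-in POJO serializer if it follows a few rules; SensorReading below is a hypothetical example:

    // A Flink-serializable POJO: public class, public no-argument constructor,
    // and fields that are public or reachable via getters and setters.
    public class SensorReading {
        public String sensorId;
        public long timestamp;     // event time in epoch milliseconds
        public double temperature;

        public SensorReading() {}  // required by the POJO serializer

        public SensorReading(String sensorId, long timestamp, double temperature) {
            this.sensorId = sensorId;
            this.timestamp = timestamp;
            this.temperature = temperature;
        }
    }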

Flink supports various types of data sources and sinks for both Dataset and DataStream. For example, users can read and write data from/to files, databases, message queues, sockets, etc. Flink also provides connectors for popular external systems, such as Kafka, HDFS, Cassandra, Elasticsearch, etc.
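
As an illustration, here is a sketch of reading a DataStream from Kafka with the KafkaSource connector; the broker address, topic, and group id are placeholders:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class KafkaSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Broker address, topic, and group id are placeholders.
            KafkaSource<String> source = KafkaSource.<String>builder()
                    .setBootstrapServers("broker:9092")
                    .setTopics("events")
                    .setGroupId("flink-demo")
                    .setStartingOffsets(OffsetsInitializer.earliest())
                    .setValueOnlyDeserializer(new SimpleStringSchema())
                    .build();

            DataStream<String> events =
                    env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
            events.print();
            env.execute("Kafka source sketch");
        }
    }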

One of the key features of Flink’s data model is that it supports both bounded and unbounded data sources. A bounded data source has a finite amount of data elements, such as a file or a database table. An unbounded data source has an infinite or unknown amount of data elements, such as a message queue or a sensor stream.

Flink determines whether a source is bounded or unbounded from the source itself: each source declares its boundedness. For example, a file source that reads a fixed set of files or a Kafka source configured with stop offsets is treated as bounded, while a socket source with no end-of-stream signal or a Kafka source that keeps consuming new records is treated as unbounded.

Depending on the type of the data source, Flink will apply different execution strategies for DataSet and DataStream. For bounded sources, Flink executes operators in batch mode, meaning that it processes the entire data set in one pass and produces a final result. For unbounded sources, Flink executes operators in streaming mode, meaning that it processes data elements as they arrive and produces incremental results.

Flink’s data model allows users to seamlessly switch between batch and stream processing without changing the code or the logic of the application. Users can simply use the same operators and APIs for both Dataset and DataStream and let Flink handle the execution mode automatically. This enables users to develop applications that can handle both static and dynamic data sources with high performance and scalability.
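
In the unified DataStream API this switch is exposed directly as a runtime execution mode; a minimal sketch:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExecutionModeSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // AUTOMATIC runs the job in BATCH mode when every source is bounded,
            // and in STREAMING mode otherwise; the pipeline code is unchanged.
            env.setRuntimeExecutionMode(RuntimeExecutionMode.AUTOMATIC);
        }
    }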

Execution Model

Flink has a pipelined execution model that enables high-throughput and low-latency processing of data streams and batches. The execution model consists of three main aspects:

Parallelism

Parallelism is the degree of concurrency that Flink can achieve for a given application. Parallelism is determined by the number of Task Slots in the cluster and the parallelism settings of the application. Flink supports two types of parallelism:

  • Horizontal Parallelism: Horizontal Parallelism is the ability to split a DataStream or a DataSet into multiple partitions and process them in parallel by multiple Tasks. It is achieved by setting an operator's parallelism and redistributing records across partitions with operators such as keyBy or rebalance
  • Vertical Parallelism: Vertical Parallelism is the ability to chain multiple operators and functions into a single Task and process them in a pipelined fashion. Flink chains consecutive operators automatically when they are connected by forward (one-to-one) channels and share the same parallelism; chaining can also be controlled explicitly (see the sketch after this list)
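
A sketch of both forms of parallelism in one pipeline; the data and keying logic are arbitrary placeholders:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(4); // default parallelism for every operator in this job

            DataStream<String> words = env.fromElements("a", "bb", "ccc", "dddd");

            words
                .keyBy(w -> w.length() % 2)  // horizontal: hash-partition records by key
                .map(String::toUpperCase)    // operators like this are chained by default
                .startNewChain()             // vertical: force a fresh operator chain here
                .filter(w -> w.length() > 1)
                .print();

            env.execute("Parallelism sketch");
        }
    }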

Fault Tolerance

Fault Tolerance is the ability to recover from failures and ensure exactly-once or at-least-once processing semantics for Flink applications. Fault Tolerance is achieved by using checkpoints and state backends. Checkpoints are consistent snapshots of the application state that are periodically triggered by the Checkpoint Coordinator and stored in a durable storage. State backends are pluggable components that manage how the state of Tasks is stored, accessed, and restored. Flink supports various state backends such as memory, RocksDB, or custom state backends.
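
A sketch of wiring checkpoints and a state backend together, using the RocksDB state backend with incremental checkpoints; the HDFS path is a placeholder:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Keep working state in RocksDB on local disk, which supports state
            // larger than memory and enables incremental checkpoints.
            env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental
            // Durable checkpoint location (the path is a placeholder).
            env.getCheckpointConfig().setCheckpointStorage("hdfs:///flink/checkpoints");
            env.enableCheckpointing(30_000);
        }
    }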

Watermarks

Watermarks are special records that carry information about the progress of event time in a DataStream. Watermarks are used to handle out-of-order events and to define event time windows for Flink applications. They are generated at the sources or by a watermark strategy and propagated downstream by operators. A watermark with timestamp t asserts that no more events with a timestamp at or below t are expected to arrive; this is what allows Flink to trigger window computations and emit results based on event time. Events that do arrive after the watermark has passed their timestamp are considered late and can be dropped, admitted within a configured allowed lateness, or redirected to a side output (a watermark-strategy sketch follows).
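
A sketch of a watermark strategy that tolerates a bounded amount of out-of-order arrival; SensorReading is the hypothetical POJO from earlier, and the five-second bound is illustrative:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;

    public class WatermarkSketch {
        // Tolerate events arriving up to 5 seconds out of order; the timestamp
        // assigner extracts event time from each record.
        static WatermarkStrategy<SensorReading> strategy() {
            return WatermarkStrategy
                    .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    .withTimestampAssigner((reading, ts) -> reading.timestamp);
        }
    }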

Flink’s main features

  • High-throughput and low-latency streaming engine: Flink can process millions of events per second with sub-second latency
  • Event-time processing and watermarking: Flink can handle out-of-order events and assign timestamps based on their actual occurrence time rather than their ingestion time
  • Sophisticated windowing and triggering: Flink can define windows of different sizes and types (e.g., tumbling, sliding, session) over streams of data and trigger computations based on various criteria (e.g., time, count, early/late arrival); a tumbling-window sketch follows this list
  • Exactly-once state consistency: Flink guarantees that the state of a streaming application is consistent with the input data even in case of failures or restarts
  • Incremental checkpointing and savepoints: Flink periodically takes snapshots of the state of a streaming application and stores them in a durable storage system. These snapshots can be used to restore the application in case of failures or to update the application without losing any data
  • Iterative algorithms: Flink supports iterative algorithms natively by allowing feedback loops in the dataflow programs
  • Layered APIs: Flink provides different levels of abstraction for users to express their logic from SQL queries to low-level functions
  • Scalable architecture: Flink can scale out to thousands of nodes and handle very large state sizes
  • Stateful computations over data streams: Stateful computations are those that maintain some form of internal state across events, such as counters, aggregations, windows, or machine learning models. Flink provides various mechanisms to manage state in a scalable and fault-tolerant way
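
A sketch of a keyed tumbling window; the (key, count) pairs are arbitrary placeholders, and processing-time windows are used to keep the example free of watermark setup:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class WindowSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical (key, count) pairs; a real job would read from a source.
            DataStream<Tuple2<String, Integer>> pairs =
                    env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

            pairs
                .keyBy(t -> t.f0)                                           // partition by key
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10))) // fixed 10s windows
                .sum(1)                                                     // aggregate the count field
                .print();

            env.execute("Window sketch");
        }
    }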

Flink’s architecture and features enable it to handle various use cases such as event-driven applications, stream and batch analytics, data pipelines and ETL (extract-transform-load), and machine learning and AI (artificial intelligence). For example:

  • Uber uses Flink to power its real-time platform, processing billions of events per day from sources such as mobile apps and GPS devices, for anomaly detection, fraud prevention, dynamic pricing, and driver incentives
  • Netflix uses Flink to process billions of events per day from its streaming service to generate recommendations, monitor quality of service, and detect anomalies
  • Alibaba uses Flink to process trillions of events per day from its e-commerce platform to provide real-time insights, personalized recommendations, and targeted advertising

Final words

Flink is a powerful and versatile framework that can handle a wide range of data-intensive applications with high performance, scalability, fault tolerance, and consistency.
