Apache Flink 1.14.0 Highlights
Core Features
- Checkpointing for Bounded Streams.
- Mixed DataStream and Table/SQL Applications in Batch Execution Mode.
- Introduction of the Hybrid Source for seamless reading across multiple sources.
- Buffer Debloating to minimize checkpoint latency.
- Fine-Grained Resource Management for dynamic Slot sizing.
- New Pulsar Connector for the DataStream API.
Unified Stream & Batch Processing
Bounded Stream Checkpointing
Enables checkpoints to be taken after some tasks have finished, plus a final checkpoint at the end of the stream to ensure all output is committed. Activate by setting execution.checkpointing.checkpoints-after-tasks-finish.enabled: true.
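A minimal flink-conf.yaml sketch of the relevant settings (the interval value here is only an example):

```yaml
# flink-conf.yaml (Flink 1.14): keep checkpointing after some tasks finish
execution.checkpointing.checkpoints-after-tasks-finish.enabled: true
# a regular checkpoint interval is still required for periodic checkpoints
execution.checkpointing.interval: 60s
```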
Hybrid Source Sequentially reads from multiple sources (e.g., S3 for historical data, then Kafka for new data) as a single logical stream.
Operational Enhancements
Buffer Debloating Automatically adjusts network buffer usage to ensure high throughput while minimizing buffered data for low checkpoint overhead.
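A sketch of the configuration keys introduced for this feature in 1.14 (the 1s target is an example value, not a recommendation):

```yaml
# flink-conf.yaml: enable buffer debloating and bound in-flight data
taskmanager.network.memory.buffer-debloat.enabled: true
# target: buffer at most roughly this much in-flight time per subtask
taskmanager.network.memory.buffer-debloat.target: 1s
```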
Fine-Grained Resource Management Enables dynamic resizing of Slots within a TaskManager.
PyFlink Improvements
- Performance gains via chaining of Python functions, similar to Java API operator chaining.
- Loopback debug mode allows execution of Python UDFs in the client process, enabling IDE breakpoint debugging.
- Support for YARN Application mode and .tgz Python archives.
Deprecations
- Legacy SQL engine and Mesos support were removed.
Apache Flink 1.15 Highlights
Operations and Management
Incremental Savepoints Savepoints become incremental when using native format with RocksDB state backend.
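Triggering a savepoint in the native format from the CLI, per the 1.15 announcement (job ID and directory are placeholders; flags may vary by version):

```shell
# trigger a savepoint in the state backend's native format;
# with the RocksDB state backend this makes the savepoint incremental
bin/flink savepoint --type native <jobId> <targetDirectory>
```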
Reactive Mode & Adaptive Scheduler Improved Reactive Mode with working job-level metrics. The Adaptive Scheduler now records failure history and speeds up scale-down operations.
Adaptive Batch Scheduler Automatically determines optimal parallelism for batch job vertices based on their input data volume. Key benefits include ease of use, adaptability to changing data, and per-vertex granularity.
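A hedged flink-conf.yaml sketch of how the scheduler was enabled in 1.15 (these key names were later renamed; the bounds and volume are example values):

```yaml
# flink-conf.yaml (1.15): opt in to the adaptive batch scheduler
jobmanager.scheduler: AdaptiveBatch
# bounds for the automatically derived per-vertex parallelism
jobmanager.adaptive-batch-scheduler.min-parallelism: 1
jobmanager.adaptive-batch-scheduler.max-parallelism: 128
# target amount of data each task should process
jobmanager.adaptive-batch-scheduler.avg-data-volume-per-task: 1gb
```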
Cross-Source Watermark Alignment Sources using the new Source API can be grouped. If one source's watermark advances too far ahead, its consumption can be paused to align with others.
SQL Version Compatibility Ensures identical SQL queries produce the same topology after a Flink version upgrade, maintaining snapshot compatibility.
Changelog State Backend Continuously uploads state changes to enable shorter end-to-end latency, predictable checkpoint intervals, and less work on recovery.
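A configuration sketch based on the 1.15 docs (the S3 path is a hypothetical placeholder):

```yaml
# flink-conf.yaml (1.15): layer the changelog state backend on top of the
# configured state backend and point it at durable storage
state.backend.changelog.enabled: true
state.backend.changelog.storage: filesystem
# assumption: an S3 bucket is used for the changelog files
dstl.dfs.base-path: s3://<bucket>/changelog
```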
Repeatable Cleanup Retries cleanup operations on failure to avoid residual job data.
OpenAPI Support REST API now follows the OpenAPI standard.
Application Mode Improvements
Enhanced support for stop-with-savepoint on job termination and improved recovery/cleanup with local state metadata persistence.
Unified Processing
Final Checkpoint Before Job End Enabled by default, ensuring a checkpoint completes before job termination.
Window Table-Valued Functions (TVF) in Batch Window TVFs are now supported in batch execution mode.
SQL Enhancements
- Failed CAST operations now throw an error by default instead of returning null.
- New JSON processing functions.
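A short Flink SQL sketch of the new behavior; TRY_CAST and JSON_VALUE are from the 1.15 SQL function set, while the literals are arbitrary examples:

```sql
-- CAST now fails on bad input; TRY_CAST restores the old null-returning behavior
SELECT TRY_CAST('not-a-number' AS INT);   -- returns NULL instead of raising an error

-- one of the new JSON processing functions: extract a value by JSON path
SELECT JSON_VALUE('{"user": {"id": 42}}', '$.user.id');
```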
Community Contributions
- New Elasticsearch Sink based on the latest Sink API, providing async writes and end-to-end consistency.
- Flink's Java API can be used with any Scala version (including Scala 3).
- PyFlink introduces a "thread" execution mode where Python UDFs run as threads within the JVM.
- Support for the CSV format and small-file compaction in the unified FileSink.
Apache Flink 1.16 Highlights
Batch Processing
SQL Gateway Extends SQL Client with multi-tenancy and pluggable Endpoints (REST API, HiveServer2). Enables submitting stream/batch/OLAP jobs via standard tools.
Hive Syntax Compatibility Enhanced Hive dialect support. Using HiveServer2 protocol with SQL Gateway auto-registers Hive Catalog, switches dialect, and uses batch mode.
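Starting the gateway with the HiveServer2 endpoint, sketched after the 1.16 docs (the Hive conf path is a placeholder; property names may differ by version):

```shell
# start the SQL Gateway with the HiveServer2 endpoint so that Hive clients
# such as Beeline can connect using standard tooling
./bin/sql-gateway.sh start \
    -Dsql-gateway.endpoint.type=hiveserver2 \
    -Dsql-gateway.endpoint.hiveserver2.catalog.hive-conf-dir=/path/to/hive/conf
```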
Join Hints Allow users to specify join strategies to override the optimizer's choice.
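For example (hint names follow the 1.16 SQL docs; the fact and dim tables are hypothetical):

```sql
-- force a broadcast join, overriding the optimizer's cost-based choice
SELECT /*+ BROADCAST(dim) */ *
FROM fact
JOIN dim ON fact.key = dim.key;

-- other supported hints: SHUFFLE_HASH, SHUFFLE_MERGE, NEST_LOOP
```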
Adaptive Hash Join Hash Join operators can automatically fall back to Sort-Merge Join at the task level if they fail, improving job stability.
Speculative Execution for Batch Detects slow tasks on problematic machines, blacklists those machines, and launches duplicate task instances elsewhere. The first instance to finish wins.
Hybrid Shuffle (Experimental) Combines the advantages of blocking and pipelined shuffle. Allows downstream tasks to start before upstream tasks finish, without requiring simultaneous execution.
Blocking Shuffle Optimizations Improvements include adaptive network buffer allocation, sequential I/O, and result partition reuse for multiple consumers.
Dynamic Partition Pruning Reduces reading of invalid partitions from partitioned tables at runtime based on data from related tables.
Stream Processing
Generalized Incremental Checkpoint Enhances Changelog State Backend with support for state migration, local recovery, file caching, switching from checkpoints, and improved monitoring.
Improved RocksDB Rescaling Uses RocksDB's range deletion for faster rescaling, improving recovery speed 2-10x for operators with large state.
State Backend Monitoring RocksDB logs now reside in Flink's log directory by default. Added DB-level statistics (e.g., block cache hit/miss counts).
Overdraft Buffers A backpressured subtask can request extra network buffers (default 5), reducing checkpoint alignment time, especially with Unaligned Checkpoints.
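The overdraft limit is configurable; a sketch based on the 1.16 configuration reference:

```yaml
# flink-conf.yaml (1.16): extra buffers a subtask may borrow per gate
# while backpressured (default 5; setting 0 disables the feature)
taskmanager.network.memory.max-overdraft-buffers-per-gate: 5
```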
Aligned Checkpoint Timeout
Checkpoints start as Aligned (AC) but switch to Unaligned (UC) if the global checkpoint duration exceeds execution.checkpointing.aligned-checkpoint-timeout. Upstream subtasks can now switch to UC independently to propagate barriers.
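A configuration sketch for this AC-to-UC switch, with key names as of 1.16 (the 30s timeout is an example value):

```yaml
# flink-conf.yaml: enable unaligned checkpoints, but start each checkpoint
# aligned and only switch to unaligned after the timeout elapses
execution.checkpointing.unaligned: true
execution.checkpointing.aligned-checkpoint-timeout: 30s
```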
Streaming Nondeterminism Detection Pre-runtime detection of potential correctness issues in complex streaming jobs, with suggestions for SQL adjustments.
Dimension Table Lookup Enhancements
Generic caching, configurable async mode (ALLOW_UNORDERED), and retry mechanisms for delayed dimension data updates.
Async I/O Retry Built-in retry mechanism for Async I/O operations.
PyFlink
"Thread" mode execution is now supported for Python DataStream API and Python Table-Valued Functions, running UDFs via JNI in the JVM.
Other Features
New SQL Syntax
USING JAR for dynamic UDF JAR loading. CREATE TABLE AS SELECT. ANALYZE TABLE for manual statistics generation.
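Hedged examples of the three statements per the 1.16 SQL docs (all identifiers and paths here are hypothetical):

```sql
-- register a UDF whose implementation is loaded from a remote JAR
CREATE FUNCTION my_lower AS 'com.example.MyLower'
    USING JAR 'hdfs:///udfs/my-udfs.jar';

-- create and populate a table in one statement
CREATE TABLE filtered_orders AS SELECT * FROM orders WHERE amount > 0;

-- collect table statistics for the cost-based optimizer
ANALYZE TABLE orders COMPUTE STATISTICS;
```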
DataStream Cache
DataStream#cache() caches transformation results in batch mode, useful for ML and interactive Python programming.
History Server & Job Info Enhanced WebUI with detailed execution time metrics, aggregated subtask statistics, environment info, and external log archive browsing.
Protobuf Format Support for Protobuf in Table API/SQL applications.
Configurable RateLimitingStrategy for Async Sink
Customizable behavior for async sinks on request failure, defaulting to AIMDScalingStrategy.
Apache Flink 1.17 Highlights
Batch Processing
Speculative Execution Now supports Sink operators. Improved detection of slow tasks.
Adaptive Batch Scheduler Now the default batch scheduler. Configuration simplified; auto-parallelism no longer requires setting global parallelism to -1. Enhanced data distribution and removed power-of-two restriction for derived parallelism.
Hybrid Shuffle Optimizations Supports adaptive batch scheduler and speculative execution. Allows reuse of intermediate data. Improved stability for large-scale deployments.
SQL Client/Gateway SQL Client supports gateway mode to submit queries to SQL Gateway. SQL statements can manage job lifecycle (e.g., show info, stop jobs).
SQL API for Batch
New Delete and Update APIs for batch mode, enabling connector implementations for row-level modifications. Extended ALTER TABLE syntax for adding/modifying/dropping columns, primary keys, and watermarks.
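A Flink SQL sketch of these batch-mode statements (table and column names are hypothetical; the target connector must implement the new row-level modification interfaces):

```sql
-- row-level modifications in batch execution mode
DELETE FROM orders WHERE status = 'CANCELLED';
UPDATE orders SET status = 'SHIPPED' WHERE order_id = 100;

-- extended ALTER TABLE syntax for schema changes
ALTER TABLE orders ADD shipped_at TIMESTAMP(3);
ALTER TABLE orders DROP legacy_flag;
```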
Hive Connector Optimizations
Automatic file compaction now works in both stream and batch modes for Hive writes, reducing small files. Native Hive aggregate functions (SUM, COUNT, etc.) in HiveModule use hash-based aggregation for performance gains.
TPC-DS & Performance
- Dynamic programming join-reorder algorithm (experimental, off by default).
- Dynamic local hash aggregation based on data distribution.
- Removal of unnecessary virtual function calls for faster execution.
Stream Processing
Streaming SQL Semantic Improvements
Experimental PLAN_ADVICE feature detects potential correctness risks (e.g., Non-Deterministic Updates) and performance optimization opportunities, providing specific suggestions.
Enhanced Watermark Alignment Addresses data skew in event-time jobs by aligning data emission across splits and partitions within a Source operator based on watermark boundaries, preventing excessive downstream buffering.
Extended StreamingFileSink Functional enhancements for file-based streaming outputs.
Unaligned Checkpoint (UC) File Management Mitigates the issue of UC creating excessive small files on HDFS. Provides REST API to manually trigger checkpoints with custom types during job runtime.
Upgrades The embedded RocksDB (used by RocksDBStateBackend) and Calcite dependencies were upgraded.
Observability & Performance
- Performance regression monitoring via Slack channel reports.
- Support for Task-level flame graphs.
- Support for a generalized delegation token framework.