· Talweg Team · Product Update  · 4 min read

Bringing Apache Flink ML to YAML: Zero-Boilerplate Streaming Machine Learning

Introducing native Apache Flink ML support in Flinkflow. Embed Estimators and Transformers directly in your YAML pipelines using dynamic reflection and StreamTableEnvironment.

Introducing native Apache Flink ML support in Flinkflow. Embed Estimators and Transformers directly in your YAML pipelines using dynamic reflection and StreamTableEnvironment.

In modern data architectures, the bridge between machine learning (ML) and real-time stream processing has always been notoriously difficult to build.

Data scientists develop powerful ML models using high-level abstractions, but operationalizing them inside stateful streaming applications historically required translating those designs into complex JVM code. Even when using Apache Flink ML—Flink’s native library for machine learning algorithms and pipelines—developers still had to write significant Java or Scala boilerplate to wire Estimators and Transformers into their streaming jobs.

Today, we are thrilled to announce a game-changing feature for Flinkflow: Native Apache Flink ML integration.

Now, you can embed native Flink ML stages directly inside your declarative Flinkflow pipeline YAML. No Java compilation, no Maven dependency hell—just clean, GitOps-ready configuration.


⚡ The Old Way vs. The Flinkflow Way

To understand the leap forward, let’s look at how you would traditionally configure a simple Flink ML stage (like a MinMaxScaler) in Java:

// Traditional Java Boilerplate
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);
Table inputTable = tEnv.fromDataStream(inputStream);

MinMaxScaler minMaxScaler = new MinMaxScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setMin(0.0)
    .setMax(1.0);

MinMaxScalerModel model = minMaxScaler.fit(inputTable);
Table outputTable = model.transform(inputTable);

With Flinkflow, this entire sequence is declared directly in your pipeline YAML:

  - type: ml
    name: scaler
    class: org.apache.flink.ml.feature.minmaxscaler.MinMaxScaler
    properties:
      inputCol: "features"
      outputCol: "scaledFeatures"
      min: 0.0
      max: 1.0

🛠️ How It Works Under the Hood

Flinkflow makes this seamless integration possible through two key architectural components:

1. Dynamic Property Mapping via Reflection

Rather than hardcoding integrations for each of Flink ML’s dozens of Estimators and Transformers, Flinkflow leverages Java reflection.

When the pipeline starts, Flinkflow:

  1. Instantiates the specified Flink ML class dynamically.
  2. Inspects its setters and configurations.
  3. Automatically maps the YAML properties keys to the corresponding Java setters (converting types such as lists, numbers, and strings on the fly).

This means that any current or future Flink ML stage—whether it’s a pre-built feature extractor like OneHotEncoder or a complex estimator like KMeans or LogisticRegression—is supported out of the box with zero code changes.

2. The StreamTableEnvironment Bridge

Flink ML algorithms operate natively on Flink’s Table API. However, Flinkflow pipelines are designed to handle flexible stream processing.

To bridge this gap, Flinkflow implements a dynamic translation layer:

  • The streaming data is converted into a temporary table using Flink’s StreamTableEnvironment.
  • The Flink ML stages (Estimators are fitted, and Transformers are executed) are run against this table.
  • The resulting transformed Table is seamlessly bridged back into Flink’s DataStream API for downstream processing.

🏗️ A Complete Example: Real-time Clustering Pipeline

Here is a complete, deployable pipeline demonstrating how to chain feature engineering and an ML algorithm (KMeans clustering) directly inside a single YAML file:

name: "Real-time Customer Segmentation Pipeline"
parallelism: 1

steps:
  - type: source
    name: web-traffic-source
    properties:
      content: |
        {"pages_visited": 2.0, "time_spent_mins": 5.0}
        | {"pages_visited": 12.0, "time_spent_mins": 45.0}
        | {"pages_visited": 1.0, "time_spent_mins": 2.0}
        | {"pages_visited": 15.0, "time_spent_mins": 55.0}

  # Step 1: Parse string JSON into separate fields
  - type: process
    name: json-parser
    language: python
    code: |
      import json
      obj = json.loads(input)
      return obj

  # Step 2: Combine fields into a single Feature Vector
  - type: ml
    name: vector-assembler
    class: org.apache.flink.ml.feature.vectorassembler.VectorAssembler
    properties:
      inputCols: ["pages_visited", "time_spent_mins"]
      outputCol: "features"

  # Step 3: Run the KMeans Estimator to predict customer segment clusters
  - type: ml
    name: kmeans-clusterer
    class: org.apache.flink.ml.clustering.kmeans.KMeans
    properties:
      k: 2
      featuresCol: "features"
      predictionCol: "segment_id"

  # Step 4: Output the results to the console
  - type: sink
    name: console-sink
    properties:
      type: console

🚀 Key Advantages

  • No Java Required: Data scientists can configure, tune, and test Flink ML pipelines using a language-agnostic YAML interface.
  • Dynamic Adaptability: Property mapping happens at runtime. If Flink ML releases a new stage or updates properties, your pipelines can adopt them immediately by updating the YAML.
  • GitOps for Machine Learning: Version control your ML stages and configurations directly in Git. Promoting a pipeline from staging to production is as simple as running a kubectl apply or approving a PR.
  • Perfect for Hybrid Pipelines: Easily mix Apache Camel integration sources, Python processors, and native Flink ML stages within a single unified topology.

🏁 Get Started Today

Want to start running machine learning directly on your real-time streams?

  1. Check out the codebase: Review the latest release details in the Flinkflow Repository.
  2. Review ML Examples: Visit our examples/ directory to see templates for classification, regression, and feature engineering.
  3. Read the Docs: Find full configuration guides for Flink ML stages on talweg.ai.

Happy stream training! 🌊🤖

Back to Blog

Related Posts

View All Posts »