Ingesting Batch Data into Easysearch Using Apache DataX

Overview

Apache DataX is an open-source framework engineered for high-throughput, stable data synchronization across heterogeneous storage systems. It supports connections to relational databases, distributed file systems, NoSQL stores, and big data warehouses. This guide details the configuration process for routing batch records into Easysearch using DataX's native Elasticsearch-compatible writer plugin.

Environment Preparation

DataX functions as a portable utility; no formal installation routine is required. Extract the release archive and verify the runtime dependencies:

  • Java Development Kit (JDK) 1.8 or later
  • Python 2.7 or 3.x

Scaffolding Job Configurations

A DataX job definition specifies the upstream reader, downstream writer, and execution tuning parameters such as channel concurrency and rate limiting. Instead of constructing JSON manifests manually, leverage the built-in generator utility to produce a baseline template:

python datax.py -r <reader_plugin> -w <writer_plugin>

Sample Synchronization Manifest

The following configuration utilizes streamreader to fabricate synthetic payloads and elasticsearchwriter to route them into Easysearch. Parameters, field mappings, and structural ordering have been adjusted to reflect production-ready patterns while preserving functional validity.

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "sliceRecordCount": 8000,
            "column": [
              { "type": "integer", "value": "73" },
              { "type": "string", "value": "cluster-telemetry-v3" },
              { "type": "date", "value": "2024-01-15T08:30:00Z" }
            ]
          }
        },
        "writer": {
          "name": "elasticsearchwriter",
          "parameter": {
            "endpoint": "http://10.0.2.45:9200",
            "accessId": "data_pipeline_user",
            "accessKey": "encrypted_token_xyz",
            "index": "metrics-ingest-target",
            "documentIdColumn": null,
            "createOnlyDocuments": false,
            "column": [
              { "name": "sensor_id", "type": "int" },
              { "name": "node_label", "type": "keyword" },
              { "name": "capture_time", "type": "date" }
            ],
            "batchSize": 2500,
            "splitter": "|"
          }
        }
      }
    ],
    "setting": {
      "speed": {
        "channel": 40,
        "byte": 2097152
      }
    }
  }
}

In this setup, streamreader serves as an ephemeral memory-based source, rapidly generating records aligned with the predefined schema. This eliminates external database dependencies during plugin validation. The writer block establishes authentication credentials, targets a specific index, and explicitly declares the destination mapping via the column array. The batchSize property dictates document bundling per bulk request.

Protocol and Certificate Handling

The elasticsearchwriter plugin does not natively support skipping TLS certificate verification. If your Easysearch deployment mandates secure HTTPS endpoints, deploy an intermediate reverse proxy such as INFINI Gateway to terminate SSL termination and expose a plaintext HTTP interface dedicated to offline synchronization workloads.

Note: Behavior around sliceRecordCount and concurrent channel allocation differs substantially across reader/writer pairs. Validate throughput characteristics in your target architecture before scaling.

Cluster Compatibility Adjustments

When integrating with Easysearch (verified on v1.9.0+), legacy API compatibility must be explicitly enabled. Append the following directive to you're easysearch.yml:

elasticsearch.api_compatibility: true

Omission of this flag frequently triggers silent provisioning failures. The target index may appear present in cluster health checks yet return empty mapping structures, resulting in subsequent write rejections.

Pipeline Execution

After saving the validated manifest (e.g., es-sink-pipeline.json), invoke the synchronization engine by referencing the file path:

python3 datax.py es-sink-pipeline.json

Up on startup, DataX detects the absence of the designated index and auto-provisions it according to the schema declared in the writer configuration. Control-plane logs confirm successful bulk delivery; in isolated performance trials, transferring approximately four hundred thousand records across fourty parallel channels concluded within eleven seconds, demonstrating efficient stream-oriented ingestion.

Tags: DataX Easysearch Bulk Data Ingestion Elasticsearch Compatible Writer Heterogeneous Data Sync

Posted on Mon, 18 May 2026 16:21:45 +0000 by slamMan