Overview
Apache DataX is an open-source framework engineered for high-throughput, stable data synchronization across heterogeneous storage systems. It supports connections to relational databases, distributed file systems, NoSQL stores, and big data warehouses. This guide details the configuration process for routing batch records into Easysearch using DataX's native Elasticsearch-compatible writer plugin.
Environment Preparation
DataX functions as a portable utility; no formal installation routine is required. Extract the release archive and verify the runtime dependencies:
- Java Development Kit (JDK) 1.8 or later
- Python 2.7 or 3.x
Scaffolding Job Configurations
A DataX job definition specifies the upstream reader, downstream writer, and execution tuning parameters such as channel concurrency and rate limiting. Instead of constructing JSON manifests manually, leverage the built-in generator utility to produce a baseline template:
python datax.py -r <reader_plugin> -w <writer_plugin>
Sample Synchronization Manifest
The following configuration utilizes streamreader to fabricate synthetic payloads and elasticsearchwriter to route them into Easysearch. Parameters, field mappings, and structural ordering have been adjusted to reflect production-ready patterns while preserving functional validity.
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 8000,
"column": [
{ "type": "integer", "value": "73" },
{ "type": "string", "value": "cluster-telemetry-v3" },
{ "type": "date", "value": "2024-01-15T08:30:00Z" }
]
}
},
"writer": {
"name": "elasticsearchwriter",
"parameter": {
"endpoint": "http://10.0.2.45:9200",
"accessId": "data_pipeline_user",
"accessKey": "encrypted_token_xyz",
"index": "metrics-ingest-target",
"documentIdColumn": null,
"createOnlyDocuments": false,
"column": [
{ "name": "sensor_id", "type": "int" },
{ "name": "node_label", "type": "keyword" },
{ "name": "capture_time", "type": "date" }
],
"batchSize": 2500,
"splitter": "|"
}
}
}
],
"setting": {
"speed": {
"channel": 40,
"byte": 2097152
}
}
}
}
In this setup, streamreader serves as an ephemeral memory-based source, rapidly generating records aligned with the predefined schema. This eliminates external database dependencies during plugin validation. The writer block establishes authentication credentials, targets a specific index, and explicitly declares the destination mapping via the column array. The batchSize property dictates document bundling per bulk request.
Protocol and Certificate Handling
The elasticsearchwriter plugin does not natively support skipping TLS certificate verification. If your Easysearch deployment mandates secure HTTPS endpoints, deploy an intermediate reverse proxy such as INFINI Gateway to terminate SSL termination and expose a plaintext HTTP interface dedicated to offline synchronization workloads.
Note: Behavior around
sliceRecordCountand concurrent channel allocation differs substantially across reader/writer pairs. Validate throughput characteristics in your target architecture before scaling.
Cluster Compatibility Adjustments
When integrating with Easysearch (verified on v1.9.0+), legacy API compatibility must be explicitly enabled. Append the following directive to you're easysearch.yml:
elasticsearch.api_compatibility: true
Omission of this flag frequently triggers silent provisioning failures. The target index may appear present in cluster health checks yet return empty mapping structures, resulting in subsequent write rejections.
Pipeline Execution
After saving the validated manifest (e.g., es-sink-pipeline.json), invoke the synchronization engine by referencing the file path:
python3 datax.py es-sink-pipeline.json
Up on startup, DataX detects the absence of the designated index and auto-provisions it according to the schema declared in the writer configuration. Control-plane logs confirm successful bulk delivery; in isolated performance trials, transferring approximately four hundred thousand records across fourty parallel channels concluded within eleven seconds, demonstrating efficient stream-oriented ingestion.