Resolving Field Invisibility in Easysearch: The Conflict Between source_reuse and ignore_above

The Problem: Searchable but Invisible Data

When utilizing advanced data compression features in Easysearch, such as source_reuse combined with ZSTD, developers may encounter a perplexing issue: a field is successfully indexed and can be found via search queries, yet the actual content is missing or "invisible" in the search results.

Consider a typical index configuration using dynamic templates to handle log messages:

{
  "settings": {
    "index": {
      "codec": "ZSTD",
      "source_reuse": "true"
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "logs": {
          "path_match": "log_msg",
          "mapping": {
            "norms": false,
            "type": "text"
          }
        }
      },
      {
        "standard_strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "type": "keyword"
              }
            }
          }
        }
      }
    ]
  }
}

In this scenario, some documents match the search criteria, yet specific fields are missing from the _source returned with the hit.

Workflow Analysis

To understand why this happens, we must examine the lifecycle of a field during indexing and retrieval:

  1. Indexing: The system processes both the text field and its keyword sub-field.
  2. Keyword Constraints: The keyword sub-field ignores any content exceeding the ignore_above threshold.
  3. Source Optimization: With source_reuse enabled, Easysearch removes data from the _source field if it determines that the data is already stored in doc_values or the inverted index.
  4. Compression: ZSTD compresses the physical files. This affects storage size but does not alter the logical data structure.
  5. Query Execution: A search hits the inverted index and identifies the document ID.
  6. Fetching: The system attempts to retrieve the field content from _source or doc_values. If both are empty, the field appears missing.
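
The lifecycle above can be sketched as a toy model in Python. This is not Easysearch's implementation: the function names index_document and fetch_field, the in-memory dictionaries, the "raw" sub-field suffix, and the rule that source_reuse always drops a value from _source are assumptions made purely for illustration. Compression (step 4) is left out because it does not change the logical data.

```python
from typing import Optional


def index_document(doc: dict, ignore_above: int, source_reuse: bool) -> dict:
    """Toy model of steps 1-3: build doc_values and the stored _source."""
    doc_values = {}
    stored_source = {}
    for field, value in doc.items():
        # Step 2: the keyword sub-field skips values longer than ignore_above.
        if len(value) <= ignore_above:
            doc_values[field + ".raw"] = value
        # Step 3 (assumed rule): with source_reuse on, the value is dropped
        # from _source on the expectation that doc_values can supply it.
        if not source_reuse:
            stored_source[field] = value
    return {"_source": stored_source, "doc_values": doc_values}


def fetch_field(stored: dict, field: str) -> Optional[str]:
    """Toy model of step 6: try _source first, then the keyword doc_values."""
    if field in stored["_source"]:
        return stored["_source"][field]
    return stored["doc_values"].get(field + ".raw")


short_doc = index_document({"message": "ok"}, ignore_above=5, source_reuse=True)
long_doc = index_document({"message": "this_is_too_long"}, ignore_above=5, source_reuse=True)

print(fetch_field(short_doc, "message"))  # value recovered from doc_values
print(fetch_field(long_doc, "message"))   # None: neither store holds the value
```

In this model the long value is lost only when both conditions hold at once; setting source_reuse=False, or raising ignore_above, makes fetch_field return the original string again.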

Deconstructing Source Reuse

The source_reuse feature is designed to reduce index size by eliminating redundancy. It is particularly effective for log data. It supports types like keyword, integer, boolean, and ip. For text fields to benefit, they must have a keyword multi-field with doc_values enabled.

Essentially, source_reuse assumes that if a field's value is stored in doc_values, it doesn't need to be kept in the _source JSON. However, if the value is stripped from _source and the doc_values are also missing, the data effectively vanishes from the retrieval path.

The Role of ignore_above

The ignore_above parameter tells a keyword field to skip any string longer than the specified limit, so the value is not added to the inverted index. Crucially, the skipped value is not written to doc_values either.

We can demonstrate this behavior by disabling _source and testing different string lengths against a low ignore_above limit.

# Create index with source disabled and a small ignore_above limit
PUT /trace_test
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "content": {
        "type": "keyword",
        "ignore_above": 5
      }
    }
  }
}

# Index two documents: one short, one long
POST /trace_test/_doc/101
{ "content": "short" }

POST /trace_test/_doc/102
{ "content": "this_is_too_long" }

# Attempt to retrieve fields via docvalue_fields
GET /trace_test/_search
{
  "docvalue_fields": ["content"]
}

In the results, document 101 will show the "short" value in the fields object. Document 102, however, will show no fields content. Because the string exceeded the 5-character limit, Easysearch skipped the doc_values generation for that specific field instance.

The "Invisibility" Trap

When source_reuse is active, it coordinates with the keyword sub-field. If a field has a keyword sub-field, source_reuse may drop the original string from the _source block, relying on the keyword's doc_values for future display.

The trap is sprung when:

  • The string length is greater than ignore_above.
  • source_reuse is enabled.

In this case:

  1. The keyword mapping rejects the string because it's too long, so no doc_values are created.
  2. The source_reuse logic removes the string from _source because it "expects" the keyword sub-field to have it.
  3. Result: The data is gone from both potential display sources.
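
To gauge exposure before enabling source_reuse, one option is to scan a sample of documents for values that meet the first condition. The following is a minimal sketch, assuming the documents are available as Python dictionaries. find_at_risk_values is a hypothetical helper, not part of any Easysearch client, and comparing len(value) with ignore_above assumes the limit is counted in characters.

```python
def find_at_risk_values(docs, fields, ignore_above):
    """Return (doc_id, field, length) for string values longer than ignore_above."""
    at_risk = []
    for doc_id, doc in docs.items():
        for field in fields:
            value = doc.get(field)
            # Only string values are subject to the keyword length limit.
            if isinstance(value, str) and len(value) > ignore_above:
                at_risk.append((doc_id, field, len(value)))
    return at_risk


# Same two documents as in the trace_test experiment.
sample = {
    "101": {"content": "short"},
    "102": {"content": "this_is_too_long"},
}

for doc_id, field, length in find_at_risk_values(sample, ["content"], ignore_above=5):
    print(f"doc {doc_id}: field '{field}' has {length} characters, above the limit")
```

Any value reported here would be dropped by the keyword mapping, so it would have no doc_values to fall back on once source_reuse removes it from _source.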

Technical Recommendations

  • Adjust Thresholds: When using source_reuse, ensure ignore_above is large enough to accommodate the longest value you expect in the field. Stick to the default (typically 256) only if your values reliably stay below it.
  • Selective Optimization: Only enable source_reuse on indices where you are certain the sub-fields will capture the necessary data for display.
  • Doc_values Requirement: Remember that source_reuse depends on doc_values. If doc_values are disabled for a field, the optimization should not be applied to it, because dropping the value from _source would leave nothing to retrieve.
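
As a rough aid for the first two recommendations, an index definition can be audited for keyword mappings that carry an ignore_above limit before it is created. This is a sketch under stated assumptions: find_keyword_limits is a hypothetical helper, the index body is assumed to be a plain Python dictionary shaped like the configuration shown earlier, and the check only lists candidate limits for review rather than deciding whether they are safe.

```python
def find_keyword_limits(node, path=""):
    """Recursively collect (path, ignore_above) for keyword mappings with a limit."""
    found = []
    if isinstance(node, dict):
        if node.get("type") == "keyword" and "ignore_above" in node:
            found.append((path, node["ignore_above"]))
        for key, child in node.items():
            found.extend(find_keyword_limits(child, f"{path}/{key}" if path else key))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            found.extend(find_keyword_limits(child, f"{path}[{index}]"))
    return found


index_body = {
    "settings": {"index": {"codec": "ZSTD", "source_reuse": "true"}},
    "mappings": {
        "dynamic_templates": [
            {"standard_strings": {
                "match_mapping_type": "string",
                "mapping": {
                    "type": "text",
                    "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
                },
            }}
        ]
    },
}

# The setting may be given as a boolean or a string, so normalize it first.
reuse_on = str(index_body["settings"]["index"].get("source_reuse", "false")).lower() == "true"
if reuse_on:
    for path, limit in find_keyword_limits(index_body["mappings"]):
        print(f"{path}: ignore_above={limit}")
```

Each reported path is a place where a value longer than the limit could end up without doc_values, so it is worth comparing the limit against the longest strings the field is expected to hold.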

Tags: Easysearch elasticsearch DataCompression indexing StorageOptimization

Posted on Sat, 04 Jul 2026 16:52:51 +0000 by dirkie