The Problem: Searchable but Invisible Data
When utilizing advanced data compression features in Easysearch, such as source_reuse combined with ZSTD, developers may encounter a perplexing issue: a field is successfully indexed and can be found via search queries, yet the actual content is missing or "invisible" in the search results.
Consider a typical index configuration using dynamic templates to handle log messages:
{
  "settings": {
    "index": {
      "codec": "ZSTD",
      "source_reuse": "true"
    }
  },
  "mappings": {
    "dynamic_templates": [
      {
        "logs": {
          "path_match": "log_msg",
          "mapping": {
            "norms": false,
            "type": "text"
          }
        }
      },
      {
        "standard_strings": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "type": "keyword"
              }
            }
          }
        }
      }
    ]
  }
}
In this scenario, certain documents return an empty source for specific fields, even though they matched the search criteria.
Workflow Analysis
To understand why this happens, we must examine the lifecycle of a field during indexing and retrieval:
- Indexing: The system processes both the text field and its keyword sub-field.
- Keyword Constraints: The keyword sub-field ignores any content exceeding the ignore_above threshold.
- Source Optimization: With source_reuse enabled, Easysearch removes data from the _source field if it determines that the data is already stored in doc_values or the inverted index.
- Compression: ZSTD compresses the physical files. This affects storage size but does not alter the logical data structure.
- Query Execution: A search hits the inverted index and identifies the document ID.
- Fetching: The system attempts to retrieve the field content from _source or doc_values. If both are empty, the field appears missing.
Deconstructing Source Reuse
The source_reuse feature is designed to reduce index size by eliminating redundancy. It is particularly effective for log data. It supports types like keyword, integer, boolean, and ip. For text fields to benefit, they must have a keyword multi-field with doc_values enabled.
Essentially, source_reuse assumes that if a field's value is stored in doc_values, it doesn't need to be kept in the _source JSON. However, if the value is stripped from _source and the doc_values are also missing, the data effectively vanishes from the retrieval path.
The Role of ignore_above
The ignore_above parameter dictates that strings longer than the specified limit will not be indexed. Crucially, in many implementations, this also means the value is not stored in doc_values.
We can demonstrate this behavior by disabling _source and testing different string lengths against a low ignore_above limit.
# Create index with source disabled and a small ignore_above limit
PUT /trace_test
{
  "mappings": {
    "_source": { "enabled": false },
    "properties": {
      "content": {
        "type": "keyword",
        "ignore_above": 5
      }
    }
  }
}

# Index two documents: one short, one long
POST /trace_test/_doc/101
{ "content": "short" }

POST /trace_test/_doc/102
{ "content": "this_is_too_long" }

# Attempt to retrieve fields via docvalue_fields
GET /trace_test/_search
{
  "docvalue_fields": ["content"]
}
In the results, document 101 will show the "short" value in the fields object. Document 102, however, will show no fields content. Because the string exceeded the 5-character limit, Easysearch skipped the doc_values generation for that specific field instance.
The "Invisibility" Trap
When source_reuse is active, it coordinates with the keyword sub-field. If a field has a keyword sub-field, source_reuse may drop the original string from the _source block, relying on the keyword's doc_values for future display.
The trap is sprung when:
- The string length is greater than ignore_above.
- source_reuse is enabled.
In this case:
- The keyword mapping rejects the string because it's too long, so no doc_values are created.
- The source_reuse logic removes the string from _source because it "expects" the keyword sub-field to have it.
- Result: The data is gone from both potential display sources.
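The interaction can be modeled in a few lines of Python. This is only a toy sketch of the behavior described in this article, not Easysearch code: the functions index_document and fetch are hypothetical names, and the rule that source_reuse drops the value without checking its length is an assumption taken from the description above.

```python
from typing import Optional


def index_document(value: str, ignore_above: int, source_reuse: bool) -> dict:
    """Model what is kept for one text field that has a keyword sub-field."""
    # The keyword sub-field skips values longer than ignore_above,
    # so no doc_values entry is written for them.
    doc_values = value if len(value) <= ignore_above else None

    # Assumption (from the article's description): source_reuse drops the
    # value from _source whenever a keyword sub-field exists, without
    # checking whether that sub-field actually accepted the value.
    source = None if source_reuse else value

    return {"_source": source, "doc_values": doc_values}


def fetch(stored: dict) -> Optional[str]:
    """Return the displayable value, trying _source first, then doc_values."""
    if stored["_source"] is not None:
        return stored["_source"]
    return stored["doc_values"]


short_msg = "disk ok"
long_msg = "x" * 300  # longer than the 256-character limit used below

visible = fetch(index_document(short_msg, ignore_above=256, source_reuse=True))
vanished = fetch(index_document(long_msg, ignore_above=256, source_reuse=True))
kept = fetch(index_document(long_msg, ignore_above=256, source_reuse=False))

print(visible)   # the short value comes back from doc_values
print(vanished)  # None: neither _source nor doc_values holds the value
print(kept == long_msg)  # True: without source_reuse, _source still has it
```

In this model, only the combination of a too-long value and source_reuse leaves both retrieval paths empty, which matches the two conditions listed above.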
Technical Recommendations
- Adjust Thresholds: When using source_reuse, ensure ignore_above is set to a value large enough to accommodate your expected data, or stick to the default (typically 256) only if your values are known to fit within it.
- Selective Optimization: Only enable source_reuse on indices where you are certain the sub-fields will capture the necessary data for display.
- Doc_values Requirement: Remember that source_reuse depends on doc_values. If doc_values are disabled for a field, the optimization will not (and should not) occur, or it may lead to data retrieval issues.
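One way to act on the first recommendation is to check sample documents against the mapping before enabling source_reuse. The sketch below is a hypothetical helper, not part of Easysearch: oversized_values and the field name message are illustrative, and it assumes you already have the resolved properties section of the mapping as a Python dict. It only inspects top-level string fields and compares Python string length against ignore_above.

```python
def oversized_values(properties: dict, docs: list) -> list:
    """List values that a keyword sub-field would skip because of ignore_above.

    Returns tuples of (document position, sub-field path, value length, limit).
    """
    findings = []
    for field, spec in properties.items():
        # Multi-fields such as "raw" are declared under the "fields" key.
        for sub_name, sub_spec in spec.get("fields", {}).items():
            limit = sub_spec.get("ignore_above")
            if sub_spec.get("type") != "keyword" or limit is None:
                continue
            for position, doc in enumerate(docs):
                value = doc.get(field)
                if isinstance(value, str) and len(value) > limit:
                    findings.append(
                        (position, f"{field}.{sub_name}", len(value), limit)
                    )
    return findings


# Hypothetical mapping and sample documents for illustration only.
properties = {
    "message": {
        "type": "text",
        "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
    }
}
docs = [{"message": "disk ok"}, {"message": "e" * 300}]

findings = oversized_values(properties, docs)
for position, path, length, limit in findings:
    print(f"doc {position}: {path} has length {length}, limit is {limit}")
```

Any finding means at least one value would be skipped by the keyword sub-field, so either the threshold should be raised or source_reuse should stay off for that index.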