Handling Bulk Excel Data Ingestion with Java
When building enterprise applications, developers often encounter the need to process large datasets stored in Excel files. Directly loading such data into memory using conventional methods can lead to performance degradation or out-of-memory errors. This article explores an optimized approach for importing substantial Excel content using Java, focusing on efficient resource management and scalability.
Leveraging Apache POI for Streamed Excel Processing
The Apache POI library is widely used for reading and writing Microsoft Office documents in Java. While the standard Workbook interface works well for small files, it loads the entire document into memory. For handling large spreadsheets, especially those exceeding tens of thousands of rows, it's better to use the event-based API in org.apache.poi.xssf.eventusermodel, which processes the sheet XML incrementally without building a full in-memory representation.
Maven Dependency Configuration
To enable Excel file processing, include the following dependency in your pom.xml:
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.4</version>
</dependency>
Implementing a Memory-Efficient Importer
Below is a refined implementation that reads an XLSX file row by row using XSSFReader and SAX-style event handling:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStrings;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StreamingExcelImporter extends DefaultHandler {

    // In POI 5.x, XSSFReader returns the SharedStrings interface; it is null
    // when the workbook has no shared strings part.
    private SharedStrings sharedStrings;
    private List<String> currentRow;
    private final StringBuilder characterData = new StringBuilder();
    private String cellType;        // value of the "t" attribute on the current <c> element
    private boolean isValueActive;  // true while inside <v>, or <t> of an inline string

    public void process(String filePath) throws Exception {
        // Opening by path in read-only mode avoids buffering the whole file,
        // which OPCPackage.open(InputStream) has to do.
        try (OPCPackage pkg = OPCPackage.open(filePath, PackageAccess.READ)) {
            XSSFReader reader = new XSSFReader(pkg);
            sharedStrings = reader.getSharedStringsTable();

            // XMLReaderFactory is deprecated; use a namespace-aware SAXParserFactory
            // so that localName is populated in the callbacks below.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(this);

            // Only the first sheet is read in this example.
            Iterator<InputStream> sheets = reader.getSheetsData();
            if (!sheets.hasNext()) {
                return;
            }
            try (InputStream sheetStream = sheets.next()) {
                parser.parse(new InputSource(sheetStream));
            }
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if ("row".equals(localName)) {
            currentRow = new ArrayList<>();
        } else if ("c".equals(localName)) {
            // Remember the cell type and reset the buffer for the new cell.
            cellType = attributes.getValue("t");
            characterData.setLength(0);
        } else if ("v".equals(localName)
                || ("t".equals(localName) && "inlineStr".equals(cellType))) {
            // Collect text only from the value element, not from formulas (<f>).
            isValueActive = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("v".equals(localName) || "t".equals(localName)) {
            isValueActive = false;
        } else if ("c".equals(localName)) {
            // Cells without a value are added as empty strings. Cells that are
            // absent from the XML are not padded, so columns can shift in sparse rows.
            currentRow.add(convertCellValue(characterData.toString()));
        } else if ("row".equals(localName)) {
            // Process complete row
            handleRow(currentRow);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (isValueActive) {
            characterData.append(ch, start, length);
        }
    }

    private String convertCellValue(String rawValue) {
        // Only cells typed t="s" hold an index into the shared strings table;
        // numeric and other values are returned unchanged.
        if ("s".equals(cellType) && sharedStrings != null && !rawValue.isEmpty()) {
            try {
                return sharedStrings.getItemAt(Integer.parseInt(rawValue)).getString();
            } catch (NumberFormatException e) {
                return rawValue;
            }
        }
        return rawValue;
    }

    private void handleRow(List<String> row) {
        // Example: print tab-separated values
        System.out.println(String.join("\t", row));
        // Replace with business logic: save to DB, validate, etc.
    }

    public static void main(String[] args) {
        StreamingExcelImporter importer = new StreamingExcelImporter();
        try {
            importer.process("data.xlsx");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This implementation avoids loading the entire workbook into memory by parsing the underlying XML structure directly. It uses a callback-driven model where each row is processed as it’s read, making it suitable for files with hundreds of thousands of entries.
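One consequence of parsing the sheet XML directly is that cells with no content are typically omitted from a row, so a value's position in the list may not match its spreadsheet column. The following is a minimal sketch of one way to realign values, assuming each c element carries an r attribute such as "C7" (Excel writes it, but the file format treats it as optional). SparseRowBuilder and its methods are hypothetical names introduced for illustration, not part of Apache POI.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Apache POI): aligns sparse cells by column. */
final class SparseRowBuilder {

    private final List<String> cells = new ArrayList<>();

    /** Converts the letter part of a reference such as "AB12" to a zero-based column index. */
    static int columnIndex(String cellRef) {
        int col = 0;
        int i = 0;
        // Column letters form a base-26 number where A=1 ... Z=26.
        while (i < cellRef.length() && Character.isLetter(cellRef.charAt(i))) {
            col = col * 26 + (Character.toUpperCase(cellRef.charAt(i)) - 'A' + 1);
            i++;
        }
        if (i == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }

    /** Places a value at the column named by cellRef, padding skipped columns with "". */
    void put(String cellRef, String value) {
        int target = columnIndex(cellRef);
        while (cells.size() < target) {
            cells.add("");
        }
        if (cells.size() == target) {
            cells.add(value);
        } else {
            cells.set(target, value);
        }
    }

    /** Returns a copy of the row built so far. */
    List<String> toList() {
        return new ArrayList<>(cells);
    }
}
```

In the importer, this could be wired in by saving attributes.getValue("r") when a c element starts and calling put with that reference when it ends. POI also ships org.apache.poi.ss.util.CellReference, which can parse references and may be preferable to a hand-written conversion.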
Performance Considerations
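- Memory Usage: The streaming method keeps memory consumption roughly independent of the number of rows, since only the current row is held at a time. The shared strings table is still loaded in full, so workbooks with many distinct text values need proportionally more heap.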
- Data Handling: Instead of storing all rows, consider forwarding them immediately to a database via batch inserts or reactive streams.
- Error Resilience: Wrap critical sections in try-catch blocks and implement checkpointing for long-running imports.
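The batching and checkpointing points above can be combined in a small buffer that sits between the row callback and the storage layer. This is a sketch under stated assumptions rather than a definitive design: BatchingRowSink is a hypothetical class, and the flush action is left as a callback so that the actual persistence (for example JDBC batch statements) and the checkpoint storage stay out of the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

/** Hypothetical helper: buffers rows and hands them off in fixed-size batches. */
final class BatchingRowSink {

    private final int batchSize;
    // Receives the batch and the number of the last row it contains (1-based),
    // which the caller can persist as a checkpoint after a successful write.
    private final BiConsumer<List<List<String>>, Long> flushAction;
    private final List<List<String>> buffer = new ArrayList<>();
    private long rowsSeen;

    BatchingRowSink(int batchSize, BiConsumer<List<List<String>>, Long> flushAction) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.flushAction = flushAction;
    }

    /** Adds one row and flushes automatically once the batch is full. */
    void accept(List<String> row) {
        buffer.add(new ArrayList<>(row)); // copy, in case the caller reuses its list
        rowsSeen++;
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Hands off any buffered rows; call once more after parsing finishes. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushAction.accept(new ArrayList<>(buffer), rowsSeen);
        buffer.clear();
    }
}
```

With this in place, handleRow would call accept for each row and the importer would call flush after parsing completes. On restart, an import could skip rows up to the last recorded checkpoint before resuming writes.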