investigate why OoxmlStrictHelper.getStylesTable needs the temp part

Excel Streaming Reader

This is a fork of monitorjbl/excel-streaming-reader.

This implementation supports Apache POI 5.x and only supports Java 8 and above. v2.3.x supports POI 4.x.

Sample

This implementation has some extra features

OOXML Strict format support (see below)
More methods are implemented. Some require that features are enabled in the StreamingReader.Builder instance because they might have an additional overhead.
Check Builder implementation to see what options are available.

Used By

Include

To use it, add this to your POM:

<dependencies>
  <dependency>
    <groupId>com.github.pjfanning</groupId>
    <artifactId>excel-streaming-reader</artifactId>
    <version>5.0.1</version>
  </dependency>
</dependencies>

You may also need to add a dependency on poi-shared-strings.

Usage

The package name is different from the monitorjbl/excel-streaming-reader jar. The code is very similar.

import com.github.pjfanning.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size (in bytes) to use when reading InputStream to file (defaults to 1024)
        .open(is);            // InputStream or File for XLSX file (required)

Once you've done this, you can then iterate through the rows and cells like so:

for (Sheet sheet : workbook){
  System.out.println(sheet.getSheetName());
  for (Row r : sheet) {
    for (Cell c : r) {
      System.out.println(c.getStringCellValue());
    }
  }
}

Or open a sheet by name or index:

Sheet sheet = workbook.getSheet("My Sheet")

The StreamingWorkbook is an autocloseable resource, and it's important that you close it to free the filesystem resource it consumed. With Java 8, you can do this:

try (
        InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
        Workbook workbook = StreamingReader.builder()
          .rowCacheSize(100)
          .bufferSize(4096)
          .open(is)
){
  for (Sheet sheet : workbook){
    System.out.println(sheet.getSheetName());
    for (Row r : sheet) {
      for (Cell c : r) {
        System.out.println(c.getStringCellValue());
      }
    }
  }
}

You may access cells randomly within a row, as the entire row is cached. However, there is no way to randomly access rows. As this is a streaming implementation, only a small number of rows are kept in memory at any given time.

Temp File Shared Strings

By default, the /xl/sharedStrings.xml data for your xlsx is stored in memory and this might cause memory problems.

You can use the setUseSstTempFile(true) option to have this data stored in a temp file (a H2 MVStore). There is also a setEncryptSstTempFile(true) option if you are concerned about having the raw data in a cleartext temp file.

  Workbook workbook = StreamingReader.builder()
          .setUseSstTempFile(true)
          .setEncryptSstTempFile(false)
          .setFullFormatRichText(true) //if you want the rich text formatting as well as the text
          .open(is);

You will need to add a dependency on poi-shared-strings.

Temp File Comments

As with shared strings, comments are stored in a separate part of the xlsx file and by default, excel-streaming-reader does not read them. You can configure excel-streaming-reader to read them and choose whether you want them stored in memory or in a temp file while reading the xlsx file.

  Workbook workbook = StreamingReader.builder()
          .setReadComments(true)
          .setUseCommentsTempFile(true)
          .setEncryptCommentsTempFile(false)
          .setFullFormatRichText(true) //if you want the rich text formatting as well as the text
          .open(is);

You will need to add a dependency on poi-shared-strings.

Reading Very Large Excel Files

excel-streaming-reader uses some Apache POI code under the hood. That code uses memory and/or temp files to store temporary data while it processes the xlsx. With very large files, you will probably want to favour using temp files.

With StreamingReader.builder(), do not set setAvoidTempFiles(true). You should also consider, tuning POI settings too. In particular, consider setting these properties:

  org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(16384); //16KB
  org.apache.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(true);

Modifying Existing Large Xlsx Files

Streaming copy with SXSSF

Supported Methods

Not all POI Cell and Row functions are supported. The most basic ones are (Cell.getStringCellValue(), Cell.getColumnIndex(), etc.), but don't be surprised if you get a NotSupportedException on the more advanced ones.

I'll try to add more support as time goes on, but some items simply can't be read in a streaming fashion. Methods that require dependent values will not have said dependencies available at the point in the stream in which they are read.

This is a brief and very generalized list of things that are not supported for reads:

Recalculating Formulas - you will get values that Excel cached in the xlsx when the file was saved
Macros

OOXML Strict format

This library focuses on spreadsheets in OOXML Transitional format - despite the name, this format is more widely used. The wikipedia entry on OOXML formats has a good description.

From version 3.0.2, the standard streaming code will also try to read OOXML Strict format.
- support is still evolving, it is recommended you use the latest available excel-streaming-reader version if you are interested in supporting OOXML Strict format
Version 3.2.0 drops StreamingReader.Builder convertFromOoXmlStrict option (previously deprecated) as this is supported by default now.

Logging

This library uses SLF4j logging. This is a rare use case, but you can plug in your logging provider and get some potentially useful output. POI 5.1.0 switched to Log4j 2.x for logging. If you need logs from both libraries, you will need to use one of the bridge jars to map slf4j to log4j or vice versa.

Implementation Details

This library will take a provided InputStream and output it to the file system. The stream is piped safely through a configurable-sized buffer to prevent large usage of memory. Once the file is created, it is then streamed into memory from the file system.

The reason for needing the stream being outputted in this manner has to do with how ZIP files work. Because the XLSX file format is basically a ZIP file, it's not possible to find all of the entries without reading the entire InputStream.

This is a problem that can't really be gotten around for POI, as it needs a complete list of ZIP entries. The default implementation of reading from an InputStream in POI is to read the entire stream directly into memory. This library works by reading out the stream into a temporary file. As part of the auto-close action, the temporary file is deleted.

If you need more control over how the file is created/disposed of, there is an option to initialize the library with a java.io.File. This file will not be written to or removed:

File f = new File("/path/to/workbook.xlsx");
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    
        .bufferSize(4096)     
        .open(f);

This library will ONLY work with XLSX files. The older XLS format is not capable of being streamed.

Contributing

	/**
	* Returns a new streaming iterator to loop through rows. This iterator is not
	* guaranteed to have all rows in memory, and any particular iteration may
	* trigger a load from disk to read in new data.
	*
	* @return the streaming iterator
	*/
	@Override
	public Iterator<Row> iterator() {
	return new StreamingRowIterator();
	}

	class StreamingRowIterator implements Iterator<Row> {
	public StreamingRowIterator() {
	if(rowCacheIterator == null) {
	if(!hasNext()) {
	LOG.debug("there appear to be no rows");
	}
	}
	}

	@Override
	public boolean hasNext() {
	return (rowCacheIterator != null && rowCacheIterator.hasNext()) \|\| getRow();
	}

	@Override
	public Row next() {
	try {
	return rowCacheIterator.next();
	} catch(NoSuchElementException nsee) {
	//see https://github.com/monitorjbl/excel-streaming-reader/issues/176
	if (hasNext()) {
	return rowCacheIterator.next();
	}
	throw nsee;
	}
	}

	@Override
	public void remove() {
	throw new NotSupportedException();
	}
	}

	//Iterate over the loaded streams
	int i = 0;
	for(Map.Entry<PackagePart, InputStream> entry : sheetStreams.entrySet()) {
	XMLEventReader parser = getXmlInputFactory().createXMLEventReader(entry.getValue());
	sheets.add(new StreamingSheet(
	sheetProperties.get(i++).get("name"),
	new StreamingSheetReader(this, entry.getKey(), sst, stylesTable,
	sheetComments.get(entry.getKey()), parser, use1904Dates, builder.getRowCacheSize())));
	}

	@Override
	public Iterator<Row> iterator() {
	return reader.iterator();
	}

pjfanning / excel-streaming-reader Goto Github PK

excel-streaming-reader's Introduction

Excel Streaming Reader

Used By

Include

Usage

Temp File Shared Strings

Temp File Comments

Reading Very Large Excel Files

Modifying Existing Large Xlsx Files

Supported Methods

OOXML Strict format

Logging

Implementation Details

Contributing

excel-streaming-reader's People

Contributors

Stargazers

Watchers

Forkers

excel-streaming-reader's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs