pjfanning / excel-streaming-reader

This project forked from monitorjbl/excel-streaming-reader

An easy-to-use implementation of a streaming Excel reader using Apache POI

License: Apache License 2.0

Java 100.00%
excel poi apache-poi java excel-streaming-reader xlsx xlsx-parser

excel-streaming-reader's Introduction

Excel Streaming Reader

This is a fork of monitorjbl/excel-streaming-reader.

This implementation supports Apache POI 5.x and only supports Java 8 and above. v2.3.x supports POI 4.x.

This implementation has some extra features:

  • OOXML Strict format support (see below)
  • More methods are implemented. Some require that features are enabled in the StreamingReader.Builder instance because they might have an additional overhead.
  • Check Builder implementation to see what options are available.

Include

To use it, add this to your POM:

<dependencies>
  <dependency>
    <groupId>com.github.pjfanning</groupId>
    <artifactId>excel-streaming-reader</artifactId>
    <version>4.3.0</version>
  </dependency>
</dependencies>  

Usage

The package name is different from the monitorjbl/excel-streaming-reader jar. The code is very similar.

import com.github.pjfanning.xlsx.StreamingReader;

InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size (in bytes) to use when reading InputStream to file (defaults to 1024)
        .open(is);            // InputStream or File for XLSX file (required)

Once you've done this, you can then iterate through the rows and cells like so:

for (Sheet sheet : workbook){
  System.out.println(sheet.getSheetName());
  for (Row r : sheet) {
    for (Cell c : r) {
      System.out.println(c.getStringCellValue());
    }
  }
}

Or open a sheet by name or index:

Sheet sheet = workbook.getSheet("My Sheet");

The StreamingWorkbook is an autocloseable resource, and it's important that you close it to free the filesystem resource it consumed. With Java 8, you can do this:

try (
        InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
        Workbook workbook = StreamingReader.builder()
          .rowCacheSize(100)
          .bufferSize(4096)
          .open(is)
){
  for (Sheet sheet : workbook){
    System.out.println(sheet.getSheetName());
    for (Row r : sheet) {
      for (Cell c : r) {
        System.out.println(c.getStringCellValue());
      }
    }
  }
}

You may access cells randomly within a row, as the entire row is cached. However, there is no way to randomly access rows. As this is a streaming implementation, only a small number of rows are kept in memory at any given time.

Temp File Shared Strings

By default, the /xl/sharedStrings.xml data for your xlsx is stored in memory and this might cause memory problems.

You can use the setUseSstTempFile(true) option to have this data stored in a temp file (an H2 MVStore). There is also a setEncryptSstTempFile(true) option if you are concerned about having the raw data in a cleartext temp file.

  Workbook workbook = StreamingReader.builder()
          .setUseSstTempFile(true)
          .setEncryptSstTempFile(false)
          .setFullFormatRichText(true) //if you want the rich text formatting as well as the text
          .open(is);

Temp File Comments

As with shared strings, comments are stored in a separate part of the xlsx file and by default, excel-streaming-reader does not read them. You can configure excel-streaming-reader to read them and choose whether you want them stored in memory or in a temp file while reading the xlsx file.

  Workbook workbook = StreamingReader.builder()
          .setReadComments(true)
          .setUseCommentsTempFile(true)
          .setEncryptCommentsTempFile(false)
          .setFullFormatRichText(true) //if you want the rich text formatting as well as the text
          .open(is);

Reading Very Large Excel Files

excel-streaming-reader uses some Apache POI code under the hood. That code uses memory and/or temp files to store temporary data while it processes the xlsx. With very large files, you will probably want to favour using temp files.

With StreamingReader.builder(), do not set setAvoidTempFiles(true). You should also consider tuning the POI settings. In particular, consider setting these properties:

  org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.setThresholdBytesForTempFiles(16384); //16KB
  org.apache.poi.openxml4j.opc.ZipPackage.setUseTempFilePackageParts(true);

Supported Methods

Not all POI Cell and Row functions are supported. The most basic ones are supported (Cell.getStringCellValue(), Cell.getColumnIndex(), etc.), but don't be surprised if you get a NotSupportedException on the more advanced ones.

I'll try to add more support as time goes on, but some items simply can't be read in a streaming fashion. Methods that require dependent values will not have said dependencies available at the point in the stream in which they are read.

This is a brief and very generalized list of things that are not supported for reads:

  • Recalculating Formulas - you will get values that Excel cached in the xlsx when the file was saved
  • Macros

OOXML Strict format

This library focuses on spreadsheets in OOXML Transitional format. Despite the name, this format is more widely used than OOXML Strict. The Wikipedia entry on OOXML formats has a good description.

  • From version 3.0.2, the standard streaming code will also try to read OOXML Strict format.
    • support is still evolving, it is recommended you use the latest available excel-streaming-reader version if you are interested in supporting OOXML Strict format
  • Version 3.2.0 drops StreamingReader.Builder convertFromOoXmlStrict option (previously deprecated) as this is supported by default now.

Logging

This library uses SLF4j logging. This is a rare use case, but you can plug in your logging provider and get some potentially useful output. POI 5.1.0 switched to Log4j 2.x for logging. If you need logs from both libraries, you will need to use one of the bridge jars to map slf4j to log4j or vice versa.

Implementation Details

This library will take a provided InputStream and output it to the file system. The stream is piped safely through a configurable-sized buffer to prevent large usage of memory. Once the file is created, it is then streamed into memory from the file system.

The stream has to be written out in this manner because of how ZIP files work. Because the XLSX file format is basically a ZIP file, it's not possible to find all of the entries without reading the entire InputStream.

This is a problem that can't really be gotten around for POI, as it needs a complete list of ZIP entries. The default implementation of reading from an InputStream in POI is to read the entire stream directly into memory. This library works by reading out the stream into a temporary file. As part of the auto-close action, the temporary file is deleted.
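The copy-then-open approach described above can be sketched with only the JDK. This is an illustration of the idea, not the library's actual code: the class and method names (TempFileZipSketch, listEntriesViaTempFile, sampleZip) are hypothetical, and the sample archive merely imitates the entry names of an XLSX file.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class TempFileZipSketch {

    // Builds a tiny ZIP in memory that imitates the layout of an XLSX file.
    static byte[] sampleZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : new String[] {"[Content_Types].xml", "xl/workbook.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Pipes the stream to a temp file through a fixed-size buffer, lists the
    // ZIP entries from the file, and deletes the temp file afterwards.
    static List<String> listEntriesViaTempFile(InputStream in, int bufferSize) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("workbook-", ".xlsx");
            try (OutputStream out = Files.newOutputStream(tmp)) {
                byte[] buffer = new byte[bufferSize];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
            }
            List<String> names = new ArrayList<>();
            try (ZipFile zip = new ZipFile(tmp.toFile())) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // best-effort cleanup in this sketch
                }
            }
        }
    }
}
```

Only one buffer's worth of the archive is held in memory while copying, which is the point of the bufferSize option mentioned earlier.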

If you need more control over how the file is created/disposed of, there is an option to initialize the library with a java.io.File. This file will not be written to or removed:

File f = new File("/path/to/workbook.xlsx");
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)    
        .bufferSize(4096)     
        .open(f);

This library will ONLY work with XLSX files. The older XLS format is not capable of being streamed.

Contributing

excel-streaming-reader's Issues

Decoding unicode characters without rich text formatting

I have an application handling very large excel files. I am using the TEMP_FILE_BACKED shared Strings implementation.
I would like to avoid using the setFullFormatRichText option, as I do not need any format details and as I see a non-negligible performance drop when using it.

Everything is working great, except I sometimes get some encoded unicode characters, mostly _x000D_.

Is there any way to get the text decoded without the overhead of the rich text format option?
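One possible workaround, sketched here under assumptions, is to post-process the plain string yourself. OOXML escapes characters as _xHHHH_ (four hex digits), and a literal underscore that would otherwise look like an escape is written as _x005F_. The class below is hypothetical and is not part of excel-streaming-reader; it would be applied to whatever Cell.getStringCellValue() returns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OoxmlEscapeDecoder {

    // Matches OOXML character escapes such as _x000D_ (carriage return).
    private static final Pattern ESCAPE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    static String decode(String raw) {
        if (raw == null || raw.indexOf("_x") < 0) {
            return raw; // nothing to decode
        }
        Matcher m = ESCAPE.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or '\' in the decoded character
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("first line_x000D_\nsecond line").contains("\r"));
    }
}
```

Because matches are consumed left to right without overlap, the escaped form _x005F_x000D_ decodes to the literal text _x000D_ rather than to a carriage return.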

Consider saving file to ByteArrayInputStream instead of temp file

The reason to write to a file is the reduced memory consumption.
However, the zip file is usually very much smaller than the amount of RAM you would need to open the entire workbook with POI.
So if you gave the option not to write the temp file but instead save it into RAM as a byte array, you would have more memory usage than today, but still much less than standard POI.
That, of course, should be optional and would allow working in environments where writing files is forbidden.

Exception using DataFormatter from POI 4.1

Code using
org.apache.poi.ss.usermodel.DataFormatter.formatCellValue(Cell, FormulaEvaluator) when Cell is com.github.pjfanning.xlsx.impl.StreamingCell is working with POI 4.0 but is failing with POI 4.1 (java.lang.UnsupportedOperationException)

POI 4.1 added the following code, which was not there in POI 4.0:

private boolean isDate1904(Cell cell) {
    if (cell != null && cell.getSheet().getWorkbook() instanceof Date1904Support) {
        return ((Date1904Support) cell.getSheet().getWorkbook()).isDate1904();
    }
    return false;
}

cell.getSheet().getWorkbook() is throwing the exception (see com.github.pjfanning.xlsx.impl.StreamingSheet.getWorkbook())

support active cell

appears in sheet xml - child of root element (worksheet) -- seems to be before sheetData so good from streaming perspective

<selection activeCell="E20" sqref="E20"/>
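As a rough illustration of why this is streaming-friendly, the sketch below uses the JDK's StAX reader to pick up the activeCell attribute and stops as soon as sheetData starts. The class name ActiveCellSketch is hypothetical and this is not how the library is implemented; the sample XML in the usage is a simplified worksheet.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ActiveCellSketch {

    // Returns the activeCell attribute of the first <selection> element found
    // before <sheetData>, or null if there is none.
    static String findActiveCell(String sheetXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(sheetXml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("selection".equals(name)) {
                            return reader.getAttributeValue(null, "activeCell");
                        }
                        if ("sheetData".equals(name)) {
                            return null; // row data reached, stop before reading it
                        }
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
    }
}
```

The early return on sheetData means the row data is never parsed just to find the selection.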

change to using checked exceptions

Change the exceptions like ReadException and OpenException to not subclass RuntimeException and make them subclass Exception instead.

This will need a major release as it is not a backward compatible change.

See #227

Support for cached formula results in StreamingCell.getBooleanCellValue

When reading a cell value that is a formula with boolean type I'm currently getting a NotSupportedException originating from com.github.pjfanning.xlsx.impl.StreamingCell.getBooleanCellValue(StreamingCell.java:219)

Sample file:
boolean_formula.xlsx

Cached formula results are read successfully from cells of other types.
Is there some inherent problem with making it work for booleans?

Happy to take a stab at it if you're open to PRs.

Best Wishes
Slawo

How to read an XLSX file without waiting for its entire download before processing it

Hey, I have a question. How can we process an XLSX file stored remotely (in this example, at a URL) without waiting for the code to download it entirely first? For large files it takes a long time.

Example:

InputStream is = new URL("https://filebin.net/qe5ynsl7ikmzap8a/LINEITEM_6M.xlsx").openStream();
Workbook workbook = StreamingReader
		.builder()
		.rowCacheSize(100)
		.bufferSize(4096) 
		.open(is); 
for (Sheet sheet : workbook) {
	System.out.println(sheet.getSheetName());
	for (Row r : sheet) {
		for (Cell c : r) {
			System.out.println(c.getStringCellValue());
		}
	}
}

Blank cells are ignored by parser

Hello PJ. Thanks for pointing me to your fork.
I'm trying to parse an xlsx file which has some blank cells.
My initial thought was that they would be returned as null or empty strings, but it seems they are completely ignored by the parser.
I'm using this library from Clojure, but that shouldn't be an issue in theory.

Here's my code

(ns xsls-test.core
  (:require
   [clojure.java.io :as io])
  (:import
   [com.github.pjfanning.xlsx StreamingReader]
   [org.apache.poi.ss.usermodel Cell CellType]))


(defn cell-data [^Cell cell]
  (let [cell-type (.getCellType cell)]
    (cond
      (= cell-type CellType/NUMERIC) (.getNumericCellValue cell)
      (= cell-type CellType/BOOLEAN) (.getBooleanCellValue cell)
      :otherwise (.getStringCellValue cell))))


(let [stream   (io/input-stream (io/resource "trialling_sergey_change.xlsx"))
      workbook (-> (StreamingReader/builder)
                   (.rowCacheSize 100)
                   (.bufferSize 4096)
                   (.open stream))
      sheet    (first workbook)]
  (->> (seq sheet)
       (mapv (fn [row]
               (let [cells (seq row)]
                 (mapv cell-data cells))))))

And it gives me this result:

[["person" "role'" "depart-ment"]
 ["harbs" "head" "devOps" "kkmjksnkja"]
 ["ridders" "product"]
 ["pascal" "head" "afsaaa" "product"]
 ["Dos" "legend" "engineering"]]

The actual file looks like this:

[screenshot: Screenshot 2023-10-25 at 11 44 39]

I'm using the latest version of the library [com.github.pjfanning/excel-streaming-reader "4.2.0"]

Is there a way of parsing empty cells as nulls?
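One way to think about the question: the cell iterator only yields cells that exist in the file, but each cell knows its column (the README lists Cell.getColumnIndex() as supported), so a caller can place values by column index and leave the gaps as null. The sketch below shows only that padding step using plain JDK types. The Map stands in for the cells of one row keyed by column index, and the class name and the column positions in the usage are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SparseRowPadding {

    // Expands the cells that are present (keyed by zero-based column index)
    // into a dense list of the given width, with null for every missing cell.
    static List<String> pad(Map<Integer, String> cellsByColumn, int width) {
        List<String> dense = new ArrayList<>(width);
        for (int i = 0; i < width; i++) {
            dense.add(null);
        }
        for (Map.Entry<Integer, String> cell : cellsByColumn.entrySet()) {
            int column = cell.getKey();
            if (column >= 0 && column < width) {
                dense.set(column, cell.getValue());
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        Map<Integer, String> row = new TreeMap<>();
        row.put(0, "ridders");
        row.put(3, "product"); // assumed position, for illustration only
        System.out.println(pad(row, 4));
    }
}
```

With the real library, the map would be filled while iterating a Row, using each cell's column index as the key.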

look at optimising workbook.getSheetAt

Currently, the code preps all sheets for use (without fully reading the sheet data). There might be performance gains for workbook.getSheetAt if the prep of the sheet was lazy - so that each sheet was prepped only when it was needed.

Illegal reflective access warning

When using StreamingReader to open our file, we are seeing the following warnings:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.github.pjfanning.xlsx.XmlUtils (file:/Users/username/.m2/repository/com/github/pjfanning/excel-streaming-reader/2.0.0/excel-streaming-reader-2.0.0.jar) to method com.sun.org.apache.xerces.internal.util.SecurityManager.setEntityExpansionLimit(int)
WARNING: Please consider reporting this to the maintainers of com.github.pjfanning.xlsx.XmlUtils
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

Code causing warning:

val workBook = StreamingReader.builder()
            .rowCacheSize(1000)
            .bufferSize(4096)
            .open(fileInputStream)

Would be good to fix to ensure it isn't denied in future releases 👍

Formula evaluation

I have the following block of code (regular POI):

...
FormulaEvaluator formulaEvaluator = new XSSFFormulaEvaluator((XSSFWorkbook) book);
...
Row row = iter.next();
for (int i = 0; i < rowSize; i++) {
  Cell c = row.getCell(i, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
  formulaEvaluator.evaluate(c);
  val[i] = dataFormatter.formatCellValue(c, formulaEvaluator);
}

How can I do something like that with this library please? How should I correctly evaluate the formula?
Thank you!

cell.getDateCellValue(); --> Cell type cannot be CELL_TYPE_STRING

When opening an Excel file with Date formatted cells as XSSF Workbook, I can get a Date like:

Date valueDate = cell.getDateCellValue();

When using the Stream Reader on the same workbook, the same code will throw:

java.lang.IllegalStateException: Cell type cannot be CELL_TYPE_STRING
	at com.github.pjfanning.xlsx.impl.StreamingCell.getDateCellValue(StreamingCell.java:202)

Using Apache POI 5.0.0 and Excel-streaming-reader 2.3.6

workbook.getSheetAt(0).getWorkbook() returns null

Hello,

loving your library :)

But when I get a sheet reference from a workbook and then try to get the workbook back, I get null as the result.

workbook.getSheetAt(0).getWorkbook() // null

I would expect to get a reference to the "workbook" I used for my .getSheetAt(0) call.

That makes it impossible for me to close the workbook after I have processed the sheet later in my code (I do not have a direct reference to the workbook there, only to the sheet).

New version of the reader.

Hello,

Our NexusIQ flagged a vulnerability in a referenced Apache library, commons-text. However, I also noticed it is already updated in the repository. It concerns the commons-text library, which is referenced through poi-shared-strings, going from version 1.9.0 to 1.10.0.

Is it possible to create a new version containing the latest libraries?

Sincerely,

Ewout

support shared formulas

monitorjbl#200

The Excel file has the cached values for the formula evaluation anyway and this support only helps with getting access to the formula string (ie getCellValue already works but getCellFunction may not work without this fix).

When you add a formula in Excel and then drag the cell over a range of cells - Excel often stores the formula on only the first cell and then the other cells have a reference to that original cell.

The XML for the sheet can look like this.

      <c r="B2">
        <f t="shared" ref="B2:B20" si="1">A2/100</f>
        <v>0.02</v>
      </c>
      <c r="B3">
        <f t="shared" si="1" />
        <v>0.03</v>
      </c>

The formula in B2 is reused in B3. The si="1" attribute is used to track that the formula for B3 is based on the formula in B2 but is shifted by one row. So the B2 formula is A2/100 and the B3 formula is derived to be A3/100.

The support added for this fix relies on the fact that Excel normally puts the shared formula on the first cell that uses the shared formula (the top left cell in the cell range). excel-streaming-reader streams one cell at a time (row first, then each column within that row before moving on to the next row).

If you have a workbook that stores the shared formula on a cell that is not the first one, you can read the sheet twice. You can get the shared formulas as a Map after the first pass and then you can do a 2nd pass after pre-setting the shared formulas on the sheet created for the 2nd pass. See https://github.com/pjfanning/excel-streaming-reader/blob/main/src/test/java/com/github/pjfanning/xlsx/StreamingReaderTest.java (test called testIteratingRowsOnSheetTwice) - the support for doing 2 passes is a little bit awkward but it is feasible if you follow that example.
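To make the "shifted by one row" derivation concrete, here is a toy sketch that shifts simple relative references in a formula string. It is an assumption-laden illustration, not the library's or POI's implementation: it only handles plain A1-style references, and deliberately leaves anything containing '$', function names and sheet-qualified references untouched. The class name SharedFormulaShift is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SharedFormulaShift {

    // Simple relative references such as A2 or BC17; the lookarounds skip
    // absolute references ($A$2) and function names such as LOG10(.
    private static final Pattern REF =
            Pattern.compile("(?<![A-Za-z0-9_$])([A-Z]{1,3})([0-9]+)(?![A-Za-z0-9_(])");

    static String shift(String formula, int rowOffset, int colOffset) {
        Matcher m = REF.matcher(formula);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int col = columnIndex(m.group(1)) + colOffset;
            int row = Integer.parseInt(m.group(2)) + rowOffset;
            m.appendReplacement(out, columnName(col) + row);
        }
        m.appendTail(out);
        return out.toString();
    }

    // A -> 0, Z -> 25, AA -> 26
    static int columnIndex(String name) {
        int index = 0;
        for (char ch : name.toCharArray()) {
            index = index * 26 + (ch - 'A' + 1);
        }
        return index - 1;
    }

    // 0 -> A, 25 -> Z, 26 -> AA
    static String columnName(int index) {
        StringBuilder sb = new StringBuilder();
        for (int n = index + 1; n > 0; n = (n - 1) / 26) {
            sb.insert(0, (char) ('A' + (n - 1) % 26));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // B3 is one row below the master cell B2, so shift by (1, 0).
        System.out.println(shift("A2/100", 1, 0)); // prints A3/100
    }
}
```

For a cell in the shared range, the offsets are the row and column distance between that cell and the cell that carries the formula text.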

Migrating from monitorjbl 2.1.0 to pjfanning latest version, backward compatibility

I was migrating code from monitorjbl 2.1.0 to latest version of pjfanning.
After deploying the change, I saw the exception below for some file reads:
Exception while reading file
java.lang.IllegalStateException: This cell has a shared formula and it seems setReadSharedFormulas has been set to false or the formula can't be evaluated

I did not see this exception with monitorjbl 2.1.0, so is the latest pjfanning version not backward compatible with monitorjbl 2.1.0?

Workbook open doesn't declare any possible thrown errors?

So I'm replacing standard POI workbooks with this to avoid a memory issue in my application, and I replaced the workbook building with

workbook = StreamingReader.builder().open(inputStream);

This was inside a try catch block that would catch an IOException if the file couldn't be opened. Now the class won't compile because the builder open method doesn't declare any possible thrown errors.

This method opens a file or interprets an InputStream, so there's no way it's completely safe and unable to throw errors. Why aren't they declared?

EDIT: The exceptions are thrown in the subclass StreamingWorkbookReader but I think they still should be declared as throws in the open methods of the builder
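The compile error comes from a Java rule: a catch clause for a checked exception such as IOException is rejected when nothing in the try block can throw it. The sketch below reproduces the situation with JDK classes only. The open method here is a hypothetical stand-in for a method that wraps I/O failures in an unchecked exception, which is how unchecked exception types behave; it is not the library's code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class UncheckedOpenSketch {

    // Hypothetical stand-in for an open() that declares no checked exceptions:
    // the IOException is wrapped in an unchecked exception instead.
    static int open(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String tryOpen(InputStream in) {
        try {
            return "first byte: " + open(in);
        } catch (RuntimeException e) {
            // Writing catch (IOException e) here would not compile, because
            // open() cannot throw a checked IOException.
            return "failed: " + e.getCause().getMessage();
        }
    }
}
```

Callers that previously caught IOException therefore need to catch the unchecked type (or RuntimeException) instead.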

add option to remove boiler-plate on legacy comments

Excel comments are now stored as threaded comments, but the comments are also added to the xlsx in the legacy format. Older versions of Excel will only expect the legacy comments. Excel adds a boiler-plate intro to the comment that is best removed.

[Threaded comment]

Your version of Excel allows you to read this threaded comment; however, any edits to it will get removed if the file is opened in a newer version of Excel. Learn more: https://go.microsoft.com/fwlink/?linkid=870924

Comment:
    Real comment here

Unit tests are failing if system time zone is not UTC.

I am building on Mac and my timezone is CST.

com.github.pjfanning.xlsx.impl.StreamingSheetReaderTest > testStrictDates FAILED
    java.lang.AssertionError: expected:<Sat Feb 27 18:00:00 CST 2021> but was:<Sun Feb 28 00:00:00 CST 2021>
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at com.github.pjfanning.xlsx.impl.StreamingSheetReaderTest.testStrictDates(StreamingSheetReaderTest.java:31)

If I set my timezone to UTC then the build succeeds (with tons of warnings and exceptions though).

Expected unit tests to be working everywhere.
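The six-hour gap in the assertion is what you would expect if one side of the comparison builds "midnight on 28 Feb 2021" in UTC and the other builds it in the system default zone. The sketch below shows that effect with java.time. It assumes "CST" means US Central time (America/Chicago), and the class name is hypothetical; it does not claim to show the actual cause in the test.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

class TimeZoneSensitivity {

    // The same wall-clock value maps to different instants in different zones.
    static Instant midnightIn(ZoneId zone) {
        return LocalDateTime.of(2021, 2, 28, 0, 0).atZone(zone).toInstant();
    }

    // Distance between midnight in UTC and midnight in US Central time.
    static Duration gap() {
        return Duration.between(
                midnightIn(ZoneOffset.UTC),
                midnightIn(ZoneId.of("America/Chicago")));
    }

    public static void main(String[] args) {
        System.out.println(gap()); // prints PT6H
    }
}
```

A test that pins the zone explicitly on both sides, or compares LocalDateTime values, gives the same result on every machine.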

My revision:

~/work/dev/excel-streaming-reader$ git show -q
commit f01f40387100ec0a15efabb3520776c3b4d15190 (HEAD -> main, origin/main, origin/HEAD)
Author: PJ Fanning <[email protected]>
Date:   Wed Oct 20 21:43:31 2021 +0100

    Create .fossa.yml
~/work/dev/excel-streaming-reader$

Read large xlsx file with blank/empty cells

The aim is to read large Excel files with this library, but some cells in the files are empty. When I use the algorithm shown in the project, it does not read these blank cells (because it is written with an iterator). How can I fix this problem? Rows need to be read with their blank cells; otherwise, the data read is inconsistent with the Excel file. Thanks.

ZipPackage takes a lot of memory

[image: zippackage]
As you can see in the picture.
xlsx file: 97 MB, 1,000,000 rows, 25 columns.
This is my code:
StreamingReader.builder()
        .rowCacheSize(50)
        .bufferSize(1024)
        .open(file);

How can I avoid this, or will this be optimized in the next release?
Thanks in advance for your response.

StreamingSheet cannot be re-iterated

@pjfanning thanks for forking and maintaining monitorjbl/excel-streaming-reader.

This issue is from the original monitorjbl/excel-streaming-reader (monitorjbl#97).

Essentially, the StreamingSheet breaks the concept of an Iterable - Iterators are not meant to be reusable, hence Iterable#iterator() should provide a new "fresh" Iterator on each call.

StreamingSheet#iterator() creates a new StreamingRowIterator on each call, but StreamingRowIterator uses the StreamingSheetReader's row cache:

@Override
public Iterator<Row> iterator() {
  return reader.iterator();
}

/**
 * Returns a new streaming iterator to loop through rows. This iterator is not
 * guaranteed to have all rows in memory, and any particular iteration may
 * trigger a load from disk to read in new data.
 *
 * @return the streaming iterator
 */
@Override
public Iterator<Row> iterator() {
  return new StreamingRowIterator();
}

class StreamingRowIterator implements Iterator<Row> {
  public StreamingRowIterator() {
    if (rowCacheIterator == null) {
      if (!hasNext()) {
        LOG.debug("there appear to be no rows");
      }
    }
  }

  @Override
  public boolean hasNext() {
    return (rowCacheIterator != null && rowCacheIterator.hasNext()) || getRow();
  }

  @Override
  public Row next() {
    try {
      return rowCacheIterator.next();
    } catch (NoSuchElementException nsee) {
      //see https://github.com/monitorjbl/excel-streaming-reader/issues/176
      if (hasNext()) {
        return rowCacheIterator.next();
      }
      throw nsee;
    }
  }

  @Override
  public void remove() {
    throw new NotSupportedException();
  }
}

At the moment the only way to get a fresh Iterator (eg. to iterate through a Sheet more than once) is to reload the entire Workbook, which is really inefficient.

I'm wondering if this can be reworked/restructured by moving the StreamingSheetReader to StreamingRowIterator.

This would probably require moving the XMLEventReader and StreamingSheetReader initialization to StreamingRowIterator:

//Iterate over the loaded streams
int i = 0;
for (Map.Entry<PackagePart, InputStream> entry : sheetStreams.entrySet()) {
  XMLEventReader parser = getXmlInputFactory().createXMLEventReader(entry.getValue());
  sheets.add(new StreamingSheet(
      sheetProperties.get(i++).get("name"),
      new StreamingSheetReader(this, entry.getKey(), sst, stylesTable,
          sheetComments.get(entry.getKey()), parser, use1904Dates, builder.getRowCacheSize())));
}

Would you be open to a PR for this?
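The Iterable contract discussed in this issue can be shown in isolation. In the sketch below, a hypothetical ReiterableSketch asks a Supplier for a brand-new iterator on every call, which is the role that re-creating the sheet's XML reader would play in the proposal. None of these names exist in the library, and a List stands in for the underlying sheet stream.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

class ReiterableSketch<T> implements Iterable<T> {

    // Called once per iteration to re-open the underlying source.
    private final Supplier<Iterator<T>> source;

    ReiterableSketch(Supplier<Iterator<T>> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        return source.get(); // a fresh iterator each time
    }

    static int count(Iterable<?> items) {
        int n = 0;
        for (Object ignored : items) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3");

        // Shares one iterator across calls, so a second pass sees nothing.
        Iterator<String> shared = rows.iterator();
        Iterable<String> singleUse = () -> shared;
        System.out.println(count(singleUse) + " then " + count(singleUse));

        // Re-opens the source on each call, so every pass sees all rows.
        Iterable<String> reusable = new ReiterableSketch<>(rows::iterator);
        System.out.println(count(reusable) + " then " + count(reusable));
    }
}
```

The trade-off is that every pass pays the cost of re-opening and re-reading the source, which is still cheaper than reloading the whole workbook.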

Issue with getNumericFormatIndex method in StreamingCell.java class

Hi Team,

As part of our project we are using this library to read the excel files.
But as part of reading the numeric format index of a cell, we are unable to get the method getNumericFormatIndex() in StreamingCell.java as the access modifier is default.
Can you please make it public, so that we can access this method.

Short getNumericFormatIndex() { return numericFormatIndex; }

Saving Workbook as Non-Strict OOXML file

Hi,

I'm using this lib to open a Strict OOXML Excel Workbook. However, I actually need to process the Excel file with a different library which we are using already (it's even more complicated as we are basically using a low-code platform). Long-story short, I cannot further process the file with your library, I want to use it for reading the strict ooxml file and store it as a "normal" xlsx file (with a different name?) so I can import the usual way.

Right now, I'm at the point where I can read the file:

File xlsxFile = new File(filename);
if (!xlsxFile.exists()) {
    System.err.println("Not found or not a file: " + xlsxFile.getPath());
    return;
}

try (Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)
        .bufferSize(4096)
        .setSharedStringsImplementationType(SharedStringsImplementationType.TEMP_FILE_BACKED)
        .open(xlsxFile)) {
    for (Sheet sheet : workbook) {
        System.out.println("Sheet: " + sheet.getSheetName());
    }
}

I have two questions:

  1. How can I store the file (optional: with a different filename) as a normal, non-strict XLSX file?
  2. Is there a way to first check if the file is actually a strict ooxml file? Asking because we get different Excel files for further processing, some are strict ooxml, some not.

Thank you very much!

row.getRowStyle()

Could you please implement this method:

row.getRowStyle()

It would be a nice feature.

Duto

Exception parsing excel with formulas

3.2.3 version

java.lang.IllegalStateException: EvaluationNames are not supported in excel-streaming-reader
at com.github.pjfanning.xlsx.impl.BaseEvaluationWorkbook.getName(BaseEvaluationWorkbook.java:84) ~[excel-streaming-reader-3.2.3.jar:?]
at com.github.pjfanning.xlsx.impl.CurrentRowEvaluationWorkbook.getName(CurrentRowEvaluationWorkbook.java:19) ~[excel-streaming-reader-3.2.3.jar:?]
at org.apache.poi.ss.formula.FormulaParser.function(FormulaParser.java:1307) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parseNonRange(FormulaParser.java:900) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parseRangeable(FormulaParser.java:494) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parseRangeExpression(FormulaParser.java:325) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parseSimpleFactor(FormulaParser.java:1539) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.percentFactor(FormulaParser.java:1497) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.powerFactor(FormulaParser.java:1484) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.Term(FormulaParser.java:1858) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.additiveExpression(FormulaParser.java:1985) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.concatExpression(FormulaParser.java:1969) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.comparisonExpression(FormulaParser.java:1926) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.intersectionExpression(FormulaParser.java:1899) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.unionExpression(FormulaParser.java:1880) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:2027) ~[poi-5.1.0.jar:5.1.0]
at org.apache.poi.ss.formula.FormulaParser.parse(FormulaParser.java:173) ~[poi-5.1.0.jar:5.1.0]
at com.github.pjfanning.xlsx.impl.StreamingSheetReader.handleEvent(StreamingSheetReader.java:361) ~[excel-streaming-reader-3.2.3.jar:?]
at com.github.pjfanning.xlsx.impl.StreamingSheetReader.getRow(StreamingSheetReader.java:125) ~[excel-streaming-reader-3.2.3.jar:?]
at com.github.pjfanning.xlsx.impl.StreamingSheetReader.access$400(StreamingSheetReader.java:41) ~[excel-streaming-reader-3.2.3.jar:?]
at com.github.pjfanning.xlsx.impl.StreamingSheetReader$StreamingRowIterator.hasNext(StreamingSheetReader.java:687) ~[excel-streaming-reader-3.2.3.jar:?]
...

Fix for some atypical excel files

We had trouble handling some excel files in our spark/scala data pipelines, and below might be the root cause.
The excels appear to be legal, given the fact that excel opens them, and schema definitions do not seem to contradict.
If we do read the excels we get stack traces like this:

Exception in thread "main" java.lang.NumberFormatException: For input string: "19149 "
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:580)
	at java.lang.Integer.parseInt(Integer.java:615)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getFormatterForType(StreamingRowIterator.java:507)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.formattedContents(StreamingRowIterator.java:495)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.handleEvent(StreamingRowIterator.java:317)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.getRow(StreamingRowIterator.java:121)
	at com.github.pjfanning.xlsx.impl.StreamingRowIterator.<init>(StreamingRowIterator.java:103)
	at com.github.pjfanning.xlsx.impl.StreamingSheetReader.iterator(StreamingSheetReader.java:265)
	at com.github.pjfanning.xlsx.impl.StreamingSheetReader.getFirstRowNum(StreamingSheetReader.java:165)
	at com.github.pjfanning.xlsx.impl.StreamingSheet.getFirstRowNum(StreamingSheet.java:104)
	at com.crealytics.spark.excel.v2.CellRangeAddressDataLocator.rowIndices(DataLocator.scala:106)
	at com.crealytics.spark.excel.v2.CellRangeAddressDataLocator.readFrom(DataLocator.scala:84)
	at com.crealytics.spark.excel.v2.ExcelHelper.getSheetData(ExcelHelper.scala:140)
	at com.crealytics.spark.excel.v2.ExcelHelper.parseSheetData(ExcelHelper.scala:160)
	at com.crealytics.spark.excel.v2.ExcelTable.infer(ExcelTable.scala:77)
	at com.crealytics.spark.excel.v2.ExcelTable.inferSchema(ExcelTable.scala:48)
	at org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:70)

Note the blank after the 19149.
extra_whitespace.xlsx

I created a small/minimal excel that is attached here to demonstrate the issue, along with some fix. I could create a pull request but I have read creating an issue is preferred in this project first. The problem can be demonstrated with this test:

@Test
public void testGetFirstRowNum() throws Exception {
  try (
      InputStream is = new FileInputStream("src/test/resources/extra_whitespace.xlsx");
      Workbook wb = StreamingReader.builder().open(is)
  ) {
    int firstRow = wb.getSheetAt(0).getFirstRowNum();
    assertEquals(0, firstRow);
  }
}

Best regards, Alex
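
The NumberFormatException in the trace above can be reproduced with the JDK alone, because Integer.parseInt does not accept trailing whitespace. The following is a minimal sketch of one possible fix, assuming that trimming the raw value is acceptable. The class and the helper names (`parseLenient`, `strictParseFails`) are hypothetical and are not taken from the library.

```java
class TrimBeforeParse {
    /** Hypothetical helper: parses an integer, tolerating surrounding whitespace. */
    static int parseLenient(String raw) {
        return Integer.parseInt(raw.trim());
    }

    /** Returns true if Integer.parseInt rejects the raw string as-is. */
    static boolean strictParseFails(String raw) {
        try {
            Integer.parseInt(raw);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String raw = "19149 "; // value from the stack trace, including the trailing space
        System.out.println("strict parse fails: " + strictParseFails(raw));
        System.out.println("lenient parse: " + parseLenient(raw));
    }
}
```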

Apache Drill has issue with release 3.2.0

@pjfanning
The latest release of the streaming reader seems to have broken Drill. When running queries against Excel files, we get the following error.

(java.lang.NoClassDefFoundError) Could not initialize class com.github.pjfanning.xlsx.impl.StreamingWorkbookReader
    com.github.pjfanning.xlsx.StreamingReader$Builder.open():372
    org.apache.drill.exec.store.excel.ExcelBatchReader.openFile():232
    org.apache.drill.exec.store.excel.ExcelBatchReader.open():196
    org.apache.drill.exec.store.excel.ExcelBatchReader.open():62
    org.apache.drill.exec.physical.impl.scan.framework.ManagedScanFramework.open():208
    org.apache.drill.exec.physical.impl.scan.file.FileScanFramework.open():276
    org.apache.drill.exec.physical.impl.scan.framework.ShimBatchReader.open():75
    org.apache.drill.exec.physical.impl.scan.ReaderState.open():229
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.nextAction():273
    org.apache.drill.exec.physical.impl.scan.ScanOperatorExec.next():229
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.doNext():201
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.start():179
    org.apache.drill.exec.physical.impl.protocol.OperatorDriver.next():129
    org.apache.drill.exec.physical.impl.protocol.OperatorRecordBatch.next():149
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():111
    org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():59
    org.apache.drill.exec.physical.impl.limit.LimitRecordBatch.innerNext():93
    org.apache.drill.exec.record.AbstractRecordBatch.next():170
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():111
    org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():59
    org.apache.drill.exec.record.AbstractRecordBatch.next():170
    org.apache.drill.exec.record.AbstractRecordBatch.next():119
    org.apache.drill.exec.record.AbstractRecordBatch.next():111
    org.apache.drill.exec.record.AbstractUnaryRecordBatch.innerNext():59
    org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():85
    org.apache.drill.exec.record.AbstractRecordBatch.next():170
    org.apache.drill.exec.physical.impl.BaseRootExec.next():103
    org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
    org.apache.drill.exec.physical.impl.BaseRootExec.next():93
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():323
    org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():310
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1762
    org.apache.drill.exec.work.fragment.FragmentExecutor.run():310
    org.apache.drill.common.SelfCleaningRunnable.run():38
    java.util.concurrent.ThreadPoolExecutor.runWorker():1149
    java.util.concurrent.ThreadPoolExecutor$Worker.run():624
    java.lang.Thread.run():748 (state=,code=0)
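
As general JVM background (not a diagnosis of this particular report): a NoClassDefFoundError with the message "Could not initialize class ..." usually means the class's static initializer already failed on an earlier access, and that earlier failure (an ExceptionInInitializerError) carries the real cause. The sketch below reproduces the pattern with a hypothetical class `Broken` whose static initializer always throws. It has nothing to do with the actual contents of StreamingWorkbookReader.

```java
import java.util.ArrayList;
import java.util.List;

class StaticInitDemo {
    /** Hypothetical class whose static initialization always fails. */
    static class Broken {
        static final int VALUE = compute();

        private static int compute() {
            throw new IllegalStateException("simulated failure in static initializer");
        }
    }

    /** Accesses Broken twice and returns the errors seen, in order. */
    static List<Throwable> touchTwice() {
        List<Throwable> seen = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            try {
                int ignored = Broken.VALUE; // triggers (or re-attempts) class initialization
            } catch (LinkageError e) {
                seen.add(e);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // On a fresh JVM the first entry is an ExceptionInInitializerError (with the
        // real cause attached) and the second is a NoClassDefFoundError.
        for (Throwable t : touchTwice()) {
            System.out.println(t.getClass().getName() + ": " + t.getMessage());
        }
    }
}
```

When this pattern shows up in a server log, searching further back for the first ExceptionInInitializerError for the same class tends to reveal the underlying problem.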

Reading first row from first sheet before full file available?

Thank you for this library. Is it reasonable to expect this library to enable access to the first row in the first sheet, when the file is not yet completely available (still sequentially arriving)? Based on the nature of the error, it seems like the underlying ZIP might disallow this.

When I try the following, I get the resulting exception.

List<String> list = new LinkedList<>();
InputStream in = <...>
try (Workbook workbook = StreamingReader.builder().open(in)) {
    Sheet sheet = workbook.getSheetAt(0);
    Row r = sheet.getRow(0);
    for (Cell c : r) {
        list.add(c.getStringCellValue());
    }
}

org.apache.poi.openxml4j.exceptions.InvalidOperationException: Can't open the specified file: '<...>.xlsx'
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:137)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:252)
at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:201)
at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:117)
at com.github.pjfanning.xlsx.impl.StreamingWorkbookReader.init(StreamingWorkbookReader.java:93)
at com.github.pjfanning.xlsx.StreamingReader$Builder.open(StreamingReader.java:247)
<...>
Caused by: java.util.zip.ZipException: zip END header not found
at java.base/java.util.zip.ZipFile$Source.zerror(ZipFile.java:1585)
at java.base/java.util.zip.ZipFile$Source.findEND(ZipFile.java:1439)
at java.base/java.util.zip.ZipFile$Source.initCEN(ZipFile.java:1448)
at java.base/java.util.zip.ZipFile$Source.<init>(ZipFile.java:1249)
at java.base/java.util.zip.ZipFile$Source.get(ZipFile.java:1211)
at java.base/java.util.zip.ZipFile$CleanableResource.<init>(ZipFile.java:701)
at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:240)
at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:171)
at java.base/java.util.zip.ZipFile.<init>(ZipFile.java:185)
at org.apache.poi.openxml4j.util.ZipSecureFile.<init>(ZipSecureFile.java:105)
at org.apache.poi.openxml4j.opc.internal.ZipHelper.openZipFile(ZipHelper.java:158)
at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:135)
<...>
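
The "zip END header not found" cause is consistent with how java.util.zip.ZipFile works: it locates the end-of-central-directory record at the end of the archive before reading any entry, so a file that is still arriving cannot be opened this way. The sketch below demonstrates this with the JDK only. It builds a small ZIP in memory, then tries to open both the complete bytes and a truncated copy. The entry name and class name are illustrative assumptions, and the example does not exercise the streaming reader itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

class TruncatedZipDemo {
    /** Builds a tiny one-entry ZIP archive in memory. */
    static byte[] buildZip() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml")); // illustrative name
            zip.write("<worksheet/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Writes the bytes to a temp file and reports whether ZipFile can open it. */
    static boolean canOpen(byte[] data) {
        File tmp = null;
        try {
            tmp = File.createTempFile("zip-demo", ".zip");
            Files.write(tmp.toPath(), data);
            try (ZipFile zip = new ZipFile(tmp)) {
                return zip.size() > 0;
            }
        } catch (ZipException e) {
            return false; // the archive structure could not be read
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.delete();
            }
        }
    }

    public static void main(String[] args) {
        byte[] full = buildZip();
        byte[] partial = Arrays.copyOf(full, full.length / 2); // simulates a file still arriving
        System.out.println("complete file opens: " + canOpen(full));
        System.out.println("truncated file opens: " + canOpen(partial));
    }
}
```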

Invalid Automatic-Module-Name

v3.4.1 cannot be referenced in a module-info.java file because the Automatic-Module-Name 'com.github.pjfanning.excel-streaming-reader' is not valid (I think dashes are not allowed).
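
The guess about dashes can be checked with the JDK itself (Java 9 or later): module names must be dot-separated Java identifiers, and ModuleDescriptor.newAutomaticModule rejects anything else with an IllegalArgumentException. The following is a small sketch of that check. The dash-free name in it is a hypothetical alternative used only for comparison and is not necessarily the name the project adopted.

```java
import java.lang.module.ModuleDescriptor;

class ModuleNameCheck {
    /** Returns true if the JDK accepts the string as an automatic module name. */
    static boolean isLegalModuleName(String name) {
        try {
            ModuleDescriptor.newAutomaticModule(name);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Name quoted in the report above: the last segment contains dashes.
        System.out.println(isLegalModuleName("com.github.pjfanning.excel-streaming-reader"));
        // Hypothetical dash-free alternative, for comparison only.
        System.out.println(isLegalModuleName("com.github.pjfanning.excelstreamingreader"));
    }
}
```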

High CPU utilisation spikes recorded

Hi team,
I am trying to read large xlsx files of up to around 112 MB. I can read them properly locally and on the staging server, but my server, which normally stays at around 30-40% CPU usage, shows frequent CPU spikes of 90-98% while reading the file. This means that reading the file with this jar accounts for around 60-69% CPU utilisation by itself.
My usage of the jar is as below:

InputStream is = null;
Workbook workbook = null;

int minLineDataLength = columnCount;

try {
    is = new FileInputStream(new File(filePath));

    workbook = StreamingReader.builder()
        .rowCacheSize(100)
        .bufferSize(4096)
        .password(password)
        .open(is);

how to find out that a numeric value is empty

I have empty numeric cells. How can I find out that a cell is empty? In this case, getNumericCellValue returns 0.0, but that is not what is in the Excel document.
