murphygao / sun-wordtable-read

This project forked from suncht/sun-wordtable-read

Reads the content of complex tables in Word documents

License: Apache License 2.0

Batchfile 0.01% Java 53.32% XSLT 46.67%

sun-wordtable-read's Introduction

sun-wordtable-read

Reads the content of complex tables in Word documents. Docx documents (Word 2007 and later) are supported; pre-2007 Doc documents are not yet supported.

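Background:

At work I needed to read table content from Word documents. The tables carry business meaning and follow certain rules, so the table text cannot simply be read as-is. Instead, the table cells are traversed and read row by row and column by column.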

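Table rules:

  1. A table may have a header, and the header also carries business meaning.
  2. One row represents one business record, and a record may span several rows.
  3. Columns may be merged across columns or across rows.
  4. Cells may contain images, math formulas, nested tables, files, and so on.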
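The merge rules above can be illustrated with a minimal sketch that expands spanning cells into a rectangular grid. The names `SpanCell`, `SpanGrid`, and `SpanGridDemo` are hypothetical and are not taken from the project; the sketch assumes each cell already knows its row span and column span and that cell text is never null.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical cell model: text plus how many rows and columns the cell spans. */
final class SpanCell {
    final String text;
    final int rowSpan;
    final int colSpan;

    SpanCell(String text, int rowSpan, int colSpan) {
        this.text = text;
        this.rowSpan = rowSpan;
        this.colSpan = colSpan;
    }
}

final class SpanGrid {
    /**
     * Expands rows of spanning cells into a rectangular grid in which every
     * covered position repeats the text of the cell that owns it.
     */
    static String[][] expand(List<List<SpanCell>> rows, int columnCount) {
        String[][] grid = new String[rows.size()][columnCount];
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (SpanCell cell : rows.get(r)) {
                // Skip positions already claimed by a cell spanning down from an earlier row.
                while (c < columnCount && grid[r][c] != null) {
                    c++;
                }
                for (int dr = 0; dr < cell.rowSpan && r + dr < rows.size(); dr++) {
                    for (int dc = 0; dc < cell.colSpan && c + dc < columnCount; dc++) {
                        grid[r + dr][c + dc] = cell.text;
                    }
                }
                c += cell.colSpan;
            }
        }
        return grid;
    }
}

final class SpanGridDemo {
    /** Made-up sample: "Contact" spans two columns, "Alice" spans two rows. */
    static List<List<SpanCell>> sampleRows() {
        List<List<SpanCell>> rows = new ArrayList<>();
        rows.add(Arrays.asList(new SpanCell("Name", 1, 1), new SpanCell("Contact", 1, 2)));
        rows.add(Arrays.asList(new SpanCell("Alice", 2, 1),
                new SpanCell("phone", 1, 1), new SpanCell("123", 1, 1)));
        rows.add(Arrays.asList(new SpanCell("email", 1, 1), new SpanCell("a@example.org", 1, 1)));
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : SpanGrid.expand(sampleRows(), 3)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Repeating the owning cell's text in every covered position is only one possible convention; a real reader might instead store a reference to the owning cell so that merged regions can be told apart from duplicated values.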

For example, the following table: [image: example table]

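Design:

  1. Read the table data from the Word document into an in-memory mapping table, then convert the mapping table into actual business table data through a custom read strategy.
  2. A unified in-memory mapping table hides how the Word document is actually read, so developers only need to care about the conversion to business data.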
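The two-stage design described above might look roughly like the following sketch. `MemoryTable`, `ReadStrategy`, and `HeaderRowStrategy` are hypothetical names chosen for illustration and do not reflect the project's actual classes; the example strategy assumes the first row is a header and each later row is one business record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory mapping table: a rectangular grid of cell text. */
final class MemoryTable {
    private final String[][] cells;

    MemoryTable(String[][] cells) {
        this.cells = cells;
    }

    int rowCount() {
        return cells.length;
    }

    int columnCount() {
        return cells.length == 0 ? 0 : cells[0].length;
    }

    String text(int row, int column) {
        return cells[row][column];
    }
}

/** Hypothetical read strategy: turns the mapping table into business records of type T. */
interface ReadStrategy<T> {
    List<T> convert(MemoryTable table);
}

/** Example strategy: the first row is the header, every later row becomes one record. */
final class HeaderRowStrategy implements ReadStrategy<Map<String, String>> {
    @Override
    public List<Map<String, String>> convert(MemoryTable table) {
        List<Map<String, String>> records = new ArrayList<>();
        for (int r = 1; r < table.rowCount(); r++) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = 0; c < table.columnCount(); c++) {
                record.put(table.text(0, c), table.text(r, c));
            }
            records.add(record);
        }
        return records;
    }
}

final class StrategyDemo {
    /** Made-up sample data standing in for a table already read from a document. */
    static MemoryTable sampleTable() {
        return new MemoryTable(new String[][] {
                {"Item", "Quantity"},
                {"Bolt", "40"},
                {"Nut", "25"}
        });
    }

    public static void main(String[] args) {
        // prints [{Item=Bolt, Quantity=40}, {Item=Nut, Quantity=25}]
        System.out.println(new HeaderRowStrategy().convert(sampleTable()));
    }
}
```

Because the strategy sees only `MemoryTable`, the same conversion code would work whether the grid was filled from a Docx file, a converted Doc file, or test data.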

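Current status:

  1. Only Word 2007 and later documents are currently supported. The text of table cells can be read, along with images, math formulas, nested tables, and embedded attachment objects (except OLE embedded objects of type PPT, WORD, and EXCEL).
  2. General complex tables that follow regular rules are supported.
  3. Pre-2007 Doc documents are not yet supported, because no API for table cell merge information has been found in POI so far. (A solution now exists and is actively being worked on.) Current workaround: for compatibility with pre-2007 Doc documents, use jodconverter 3.0 + LibreOffice 5.3 to convert the Doc document to Docx first, then read the table content. Note: LibreOffice supports Docx documents directly, whereas OpenOffice does not and requires the AccessODF plugin.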
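The convert-then-read workaround can be approximated without jodconverter by calling LibreOffice's headless command line directly. The sketch below is not the project's jodconverter-based code; it assumes a `soffice` executable is on the PATH, and the class name `DocToDocx` is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

final class DocToDocx {
    /** Builds the LibreOffice command line; "soffice" is assumed to be on the PATH. */
    static List<String> command(Path doc, Path outDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "docx",
                "--outdir", outDir.toString(),
                doc.toString());
    }

    /** Path of the file LibreOffice is expected to write: same base name, .docx extension. */
    static Path expectedOutput(Path doc, Path outDir) {
        String name = doc.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot < 0 ? name : name.substring(0, dot);
        return outDir.resolve(base + ".docx");
    }

    /** Runs the conversion and returns the expected output path. */
    static Path convert(Path doc, Path outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(doc, outDir)).inheritIO().start();
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("soffice exited with code " + exit);
        }
        // A zero exit code is not a guarantee; callers may want to check that the file exists.
        return expectedOutput(doc, outDir);
    }

    public static void main(String[] args) {
        // Only prints the command, so this runs even where LibreOffice is not installed.
        System.out.println(command(Paths.get("report.doc"), Paths.get("out")));
    }
}
```

Starting a new LibreOffice process per file is simple but slow; a managed, long-running office process, which is what jodconverter provides, is likely the better fit for batch conversion.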

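Planned features:

  1. Handle OLE embedded objects of type PPT, WORD, and EXCEL.
  2. Reading pre-2007 Doc documents (in progress). (Docx and Doc documents differ when cells are parsed: Docx exposes row merge, column merge, and column width, whereas Doc exposes only row merge and column width, with no column merge.)
  3. Common functionality for importing directly into a target (for example, database tables or Excel).
  4. Reading large Word files, and performance optimization strategies.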
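Regarding the Docx merge information mentioned in item 2: in the underlying WordprocessingML, column merges are recorded as `w:gridSpan` and row merges as `w:vMerge` inside each cell's `w:tcPr`. The sketch below reads those two properties with the JDK's DOM parser only. It is not how the project itself reads them, the class name `MergeInfo` and the sample XML are made up for illustration, and nested tables are deliberately ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class MergeInfo {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Made-up two-row table: one column-merged cell and one row-merged pair. */
    static final String SAMPLE =
            "<w:tbl xmlns:w=\"" + W + "\">"
            + "<w:tr>"
            + "<w:tc><w:tcPr><w:gridSpan w:val=\"2\"/></w:tcPr><w:p/></w:tc>"
            + "<w:tc><w:tcPr><w:vMerge w:val=\"restart\"/></w:tcPr><w:p/></w:tc>"
            + "</w:tr>"
            + "<w:tr>"
            + "<w:tc><w:p/></w:tc><w:tc><w:p/></w:tc>"
            + "<w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>"
            + "</w:tr>"
            + "</w:tbl>";

    /** Returns one description per w:tc in document order, e.g. "gridSpan=2 vMerge=none". */
    static List<String> describeCells(String tableXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(tableXml.getBytes(StandardCharsets.UTF_8)));
            List<String> result = new ArrayList<>();
            NodeList cells = doc.getElementsByTagNameNS(W, "tc");
            for (int i = 0; i < cells.getLength(); i++) {
                Element cell = (Element) cells.item(i);
                result.add("gridSpan=" + gridSpan(cell) + " vMerge=" + vMerge(cell));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse table XML", e);
        }
    }

    /** Number of grid columns the cell covers; 1 when w:gridSpan is absent. */
    static int gridSpan(Element cell) {
        // Note: this also searches nested tables, which this sketch does not handle.
        NodeList nodes = cell.getElementsByTagNameNS(W, "gridSpan");
        if (nodes.getLength() == 0) {
            return 1;
        }
        return Integer.parseInt(((Element) nodes.item(0)).getAttributeNS(W, "val"));
    }

    /** "restart" starts a vertical merge, "continue" extends it, "none" means no merge. */
    static String vMerge(Element cell) {
        NodeList nodes = cell.getElementsByTagNameNS(W, "vMerge");
        if (nodes.getLength() == 0) {
            return "none";
        }
        String val = ((Element) nodes.item(0)).getAttributeNS(W, "val");
        // A w:vMerge element without w:val continues the merge started above.
        return val.isEmpty() ? "continue" : val;
    }

    public static void main(String[] args) {
        describeCells(SAMPLE).forEach(System.out::println);
    }
}
```

In a real Docx file this XML lives in `word/document.xml` inside the zip archive.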

Contributors: suncht, yaoer
