GithubHelp home page GithubHelp logo

pyreports_crawler's Introduction

pyreports_crawler

版本

v0.2

运行环境

Python 2.7

pdfminer

jieba

功能

  • 从研报和年报获取正文
  • 利用TextRank算法从研报中提取带有权重的关键词
  • 用研报关键词提取年报关键句子并标权重

改进空间

  • 文件夹导入
  • 从结果里去除特殊标点
  • 训练集文件夹和年报文件夹内无文件时报错
  • 如果训练集变动,对关键词词典进行更新
  • 按行业分类将分析报告和年报分类
  • 从结果中排除冗余文字
    • 改进LTTextBox的比较条件
    • 创建冗余项排除列表
    • 找到合适的统计学习模型来检测冗余项
  • 从正文提取含有关键数据的句子
  • 排除页眉页脚
  • 提取图片
  • 修改为面向对象性编程
  • 训练自定义jieba分词词典
    • 为自定义jieba词典的新单词标注词性
  • 从正文提取关键词和对应数据并导入数据库
  • 使用 Tesseract 或者 ABBYY FineReader Engine进行OCR

其他需求

  • 从年报里提取关键词
    • TextRank权重
    • 词频
  • 优先提取“董事会报告”或“经营情况讨论分析”

Version

v0.2

Requirements

Python 2.7

pdfminer

jieba

Feature

  • Extract context from analytic reports and annual reports.
  • Use TextRank to extract keywords with weight from analytic reports.
  • Use keywords from analytic reports to extract context in annual reports.

Room for improvement

  • Processing in directory.
  • Clean some special Chinese marks from result.
  • Raise ERROR if there is nothing in training and report directory.
  • Update keywords dictionary if training set is changed.
  • Classify analytic reports and annual reports by sectors.
  • Exclude redundant text from result.
    • Improve comparison conditions in LTTextBox (determine_obj_text()).
    • Create exclusion list for redundant keywords.
    • Find and use ML model to detect redundant.
  • Extract sentences those contain important data from context.
  • Page header and page footer excludes.
  • Image Extraction.
  • OOP.
  • Train a custom jieba dictionary for text segmentation.
    • Add POS to new words in the custom jieba dictionary.
  • Extract keywords and data from context and import to database.
  • OCR by using Tesseract or ABBYY FineReader Engine.

Other Requests

  • Extract keywords from annaul report
    • TextRank weight
    • Word count
  • Extract 'Directors' Report' or 'Business Conditions Analysis' first.

pyreports_crawler's People

Contributors

mindzy avatar

Stargazers

hopperhuang avatar ben avatar

Watchers

 avatar ben avatar

Forkers

leolle

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.