base-docs-textminig-samples's Introduction

Dataset for training algorithms that identify document metadata

Text Mining

Document Key by Category

  • 0001 to 0010 - Service contracts
  • 0011 to 0020 - Lease contracts
  • 0021 to 0025 - Cooperation contracts
  • 0026 to 0030 - Product/equipment sales contracts
  • 0031 to 0035 - Complementary laws
  • 0036 to 0039 - Laws
  • 0040 to 0045 - Decrees
  • 0046 to 0050 - Ordinances
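The ranges above can be turned into a small lookup. The following is a minimal sketch, not part of the repository: the function name `category_for` and the English labels are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a zero-padded document ID to its category.
category_for() {
  local id="$1"
  # Strip leading zeros so that IDs such as 0008 compare as plain decimals.
  while [ "${id#0}" != "$id" ]; do id="${id#0}"; done
  id="${id:-0}"
  # Reject anything that is not purely numeric.
  case "$id" in *[!0-9]*) echo "unknown"; return 1 ;; esac
  if   [ "$id" -lt 1 ] || [ "$id" -gt 50 ]; then echo "unknown"; return 1
  elif [ "$id" -le 10 ]; then echo "service contract"
  elif [ "$id" -le 20 ]; then echo "lease contract"
  elif [ "$id" -le 25 ]; then echo "cooperation contract"
  elif [ "$id" -le 30 ]; then echo "product/equipment sales contract"
  elif [ "$id" -le 35 ]; then echo "complementary law"
  elif [ "$id" -le 39 ]; then echo "law"
  elif [ "$id" -le 45 ]; then echo "decree"
  else echo "ordinance"
  fi
}
```

For example, `category_for 0012` would report a lease contract, and an ID outside 0001 to 0050 yields `unknown` with a non-zero exit status.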

Script

The script /extrai-text.sh converts files for text extraction.

File Naming

  • .pdf - Original files
  • .pdf.tiff - Files converted to images (simulates digital scanning)
  • .pdf.txt - Files used for text mining
  • .pdf.txt.meta - Extracted metadata (implementation still to be done)

Meta File Layout

File delimited by ; (semicolon)

  • ID - ID of the document
  • METADADO - Name of the metadata field (others may be added beyond those already present in the example file 0001.pdf.txt.meta)
  • SEQUENCE - Sequence number, for cases where the same metadata field can be found more than once
  • VALUE - Value of the metadata field found.
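A semicolon-delimited file with these four fields can be read line by line in the shell. The sketch below is an illustration only: the function name `meta_values` and the metadata name `PARTY` in the usage note are hypothetical, and it assumes one record per line with no header row.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print SEQUENCE and VALUE (tab-separated) for every
# record in a .meta file whose METADADO field matches the requested name.
meta_values() {
  local meta_file="$1" wanted="$2"
  local id name seq value
  # IFS=';' splits each line into the four fields; the last variable keeps
  # the rest of the line, so a VALUE containing ';' is not truncated.
  while IFS=';' read -r id name seq value || [ -n "$id" ]; do
    if [ "$name" = "$wanted" ]; then
      printf '%s\t%s\n' "$seq" "$value"
    fi
  done < "$meta_file"
  return 0
}
```

Called as `meta_values 0001.pdf.txt.meta PARTY`, it would list each occurrence of that field together with its sequence number.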

Data Sources

Portal da Transparência de Curitiba

Consulta On-Line da Legislação Municipal de Curitiba

Extracting Text

PDF to TIF

  • convert -density 600 0001.pdf 0001.tiff or

convert -density 600 0001.pdf -depth 8 -strip -background white -alpha off 0001.tiff

  • Converting to TXT

tesseract 0001.tiff 0001 -l por
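The two steps can be chained for a whole directory. This is a hedged sketch and not the contents of the repository's extrai-text.sh, which is not shown here: the function name `extract_text` and the `DRY_RUN` switch are assumptions, and the output names follow the .pdf.tiff / .pdf.txt naming convention rather than the single-file commands above.

```shell
#!/usr/bin/env bash
# Hypothetical batch sketch. With DRY_RUN=1 the commands are only printed.
extract_text() {
  local pdf="$1"
  local run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then
    run="echo"
  fi
  # Step 1: rasterize the PDF at 600 dpi into <name>.pdf.tiff (ImageMagick).
  $run convert -density 600 "$pdf" -depth 8 -strip -background white -alpha off "$pdf.tiff" || return 1
  # Step 2: OCR with the Portuguese language data; tesseract appends .txt
  # to the output base, producing <name>.pdf.txt.
  $run tesseract "$pdf.tiff" "$pdf" -l por
}

# Example: preview the commands for every PDF in the current directory.
# for pdf in *.pdf; do DRY_RUN=1 extract_text "$pdf"; done
```

Running it without `DRY_RUN` requires ImageMagick and Tesseract with the `por` language data installed.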


base-docs-textminig-samples's People

Contributors

marciojv
