Topic: text-extraction

👇 Here are 197 public repositories matching this topic...

abhinaba-ghosh / any-text

Get text content from any file

User: abhinaba-ghosh

text-extraction text-extractor file-reader text reader

ad-freiburg / pdftotext-plus-plus

A fast and accurate command line tool for extracting text from PDF files.

Organization: ad-freiburg

Home Page: https://pdftotext.cs.uni-freiburg.de

pdf c-plus-plus cli document-analysis metadata-extraction text-extraction

adbar / trafilatura

Python & command-line tool to gather text on the Web: Crawling & scraping, content extraction, metadata. TXT, Markdown, CSV & XML output.

User: adbar

Home Page: https://trafilatura.readthedocs.io

web-scraping text-extraction nlp html2text news text-mining crawler text-cleaning text-preprocessing article-extractor

altabeh / tesseract-ocr-wrapper

This is a highly efficient Python wrapper for tesseract-ocr.

User: altabeh

tesseract-ocr text-extraction multiprocessing leptonica xpdf

aman-zishan / textextractor2.0

🔥 This web app extracts text from an image.

User: aman-zishan

Home Page: https://textextractor2.herokuapp.com

python3 flask-application website image-processing text-extraction hacktoberfest

amenezes / aiopytesseract

A Python asyncio wrapper for Tesseract-OCR.

User: amenezes

ocr tesseract asyncio tesseract-ocr optical-character-recognition text-extraction pdftotext pytesseract pytesseract-ocr

archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

Organization: archivesunleashed

Home Page: https://aut.docs.archivesunleashed.org/

spark hadoop webarchives analysis apache-spark scala digital-humanities pyspark dataframe big-data-analytics

arxa / video_text_detection

Bachelor Thesis | Text extraction from complex video scenes

User: arxa

text-extraction video image-processing javafx opencv junit gradle testfx

asepmaulanaismail / pdf-to-txt-python

Simple PDF-to-text conversion with Python using PDFtk and PyPDF2

User: asepmaulanaismail

python python3 pdf pdftk pypdf2 text-extraction pdf-extractor pdf-to-text

bookieio / breadability

A reworked version of the https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

Organization: bookieio

Home Page: https://bookieio.github.io/breadability/

python text-mining text-extraction html-extraction html-extractor html-parsing

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

User: cdown

srt subtitle subtitles subtitles-parsing text-extraction python mit-license subtitle-parser subtitle-fixer tools

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

User: chrismattmann

tika-server python tika-python tika-server-jar parser-interface parse translation-interface usc text-extraction mime

ckorzen / pdf-text-extraction-benchmark

A project about benchmarking and evaluating existing PDF extraction tools on their semantic abilities to extract the body texts from PDF documents, especially from scientific articles.

User: ckorzen

arxiv benchmark evaluation tex pdf extraction text-extraction

docwire / docwire

DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality.

Organization: docwire

Home Page: https://docwire.io

api c cli cpp linux macos parsing shell terminal windows

dotfurther / opendiscoversdk

.NET 6 API for document file format identification, text/metadata/attachment/embedded object/sensitive item (PII/PHI)/entity extraction.

User: dotfurther

Home Page: https://www.dotfurther.com

text extraction text-extraction sdk metadata embedded-objects file-identification file-format-detection file-deduplication microsoft-office

dwatteau / scummtr

Fan translation tools for SCUMM engine games

User: dwatteau

scumm translation i18n l10n scummtr game-translation fan-translation text-extraction scummfont scummrp

flairnlp / fundus

A very simple news crawler with a funny name

Organization: flairnlp

cc-news commoncrawl corpus crawler news-crawler news-scraping nlp python rss scraper

fourdigits / wagtail_textract

Text extraction for Wagtail document search

Organization: fourdigits

wagtail textract text-extraction tesseract django search

gamemaker1 / office-text-extractor

Yet another library to extract text from MS Office and PDF files

User: gamemaker1

Home Page: https://npm.im/office-text-extractor

text-extraction get-text parser ms-office ms-word ms-excel ms-powerpoint xlsx docx pptx

govind-s-b / pdf-to-text-chroma-search

Python scripts that convert PDF files to text, split them into chunks, and store their vector representations using GPT4All embeddings in a Chroma DB. Also provides a script to query the Chroma DB for similarity search based on user input.

User: govind-s-b

chromadb pdf-processing similarity-search text-extraction vector-embeddings

greed2411 / tokyo

tokyo, a REST API, when given any type of document 📄, Identifies mime-type 🧐. Suggests extension 🦔. Alas Extracts text 💪.

User: greed2411

document-processing apache-tika clojure ring mime-types extension text-parsing text-parser extract-text filetype

hscspring / pnlp

NLP pre-/post-processing tools.

User: hscspring

nlp chinese-nlp preprocessing text-processing nlp-preprocess text-cleaning text-extraction nlp-enhancer concurrency normalization

icij / datashare

A self-hosted search engine for documents.

Organization: icij

Home Page: https://datashare.icij.org

named-entity-recognition text-extraction extract investigative-journalism elasticsearch datashare docker web-gui

ingmarboeschen / jatsdecoder

A text extraction and manipulation toolset for NISO-JATS coded XML files

User: ingmarboeschen

cermine niso-jats pubmedcentral r text-extraction text-mining xml-files

iscc / mobi

Python-based software to unpack KindleGen-generated ebooks

Organization: iscc

text-extraction mobi kindle

jonathanraiman / wikipedia_ner

📖 Labeled examples from wiki dumps in Python

User: jonathanraiman

wikipedia python named-entity-recognition dataset text-extraction

lu4p / cat

Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

User: lu4p

text-extraction docx2txt rtf-to-text odt2txt cross-platform go golang textextracting cat pdftotext

miso-belica / justext

Heuristic-based boilerplate removal tool

User: miso-belica

Home Page: https://pypi.python.org/pypi/jusText

python text-extraction html-parser html-parsing

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

User: miso-belica

Home Page: https://miso-belica.github.io/sumy/

python lsa textteaser html-page summarizer pagerank-algorithm reduction text-extraction html-extraction html-extractor

mknz / mirusan

A PDF collection reader with built-in full-text search engine

User: mknz

electron elm full-text-search pdf pdf-viewer python search-engine text-extraction whoosh

mrgrd56 / textractor-translator

Translate visual novels in real time

User: mrgrd56

textractor electron games text-extraction translation visual-novel javascript typescript anime textractor-extension translator

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

User: nainiayoub

Home Page: https://share.streamlit.io/nainiayoub/pdf-text-data-extractor/main/app.py

pdf-to-text streamlit streamlit-webapp text-extraction python ocr ocr-python ocr-text-reader pdf

owenorcan / yirabot-crawler

YiraBot: Simplifying Web Scraping for All. A user-friendly tool for developers and enthusiasts, offering command-line ease and Python integration. Ideal for research, SEO, and data collection.

User: owenorcan

beginner-friendly big-data-analytics command-line-tool data-extraction data-mining html-parser machine-learning open-source python3 seo-analysis