This is a work in progress. See the project status page for details.
conda create -n mmda python=3.8
pip install -r requirements.txt
- SymbolScraper - Apache 2.0
  - Quoted from their README: "From the main directory, issue `make`. This will run the Maven build system, download dependencies, etc., compile source files and generate `.jar` files in `./target`. Finally, a bash script `bin/sscraper` is generated, so that the program can be easily used in different directories."
- PDFPlumber - MIT License
- Grobid - Apache 2.0
- PDF2Image - MIT License
In this example, we use the `SymbolScraperParser` to convert a PDF into text and the `PDF2ImageRasterizer` to convert that same PDF into page images.
from typing import List
from mmda.parsers.symbol_scraper_parser import SymbolScraperParser
from mmda.rasterizers.rasterizer import PDF2ImageRasterizer
from mmda.types.document import Document
from mmda.types.image import PILImage
# PDF to text
ssparser = SymbolScraperParser(sscraper_bin_path='...')
doc: Document = ssparser.parse(input_pdf_path='...pdf')
# PDF to images
pdf2img_rasterizer = PDF2ImageRasterizer()
images: List[PILImage] = pdf2img_rasterizer.rasterize(input_pdf_path='...pdf', dpi=72)
# attach those images to the document
doc.annotate_images(images=images)
You can convert a Document into a JSON object.
import os
import json
# usually, you'll probably want to save the text & images separately:
with open('...json', 'w') as f_out:
    json.dump(doc.to_json(with_images=False), f_out, indent=4)
os.makedirs('.../', exist_ok=True)
for i, image in enumerate(doc.images):
    image.save(os.path.join('.../', f'{i}.png'))
# you can also save images as base64 strings within the JSON object
with open('...json', 'w') as f_out:
    json.dump(doc.to_json(with_images=True), f_out, indent=4)
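The `with_images=True` path stores each page image inside the JSON object itself. As a rough, self-contained illustration of that idea (this is the general technique, not necessarily mmda's exact encoding), raw image bytes can be round-tripped through base64 so they survive JSON serialization:

```python
import base64
import json

# Toy stand-in for a rasterized page; real bytes would come from a PIL image.
fake_png_bytes = b"\x89PNG\r\n\x1a\n...page pixels..."

# Encode bytes -> ASCII-safe string so the image can live inside JSON.
doc_dict = {
    "symbols": "Language Models as Knowledge Bases?...",
    "images": [base64.b64encode(fake_png_bytes).decode("ascii")],
}
serialized = json.dumps(doc_dict)

# Decode on load: string -> original bytes, ready to hand back to PIL.
loaded = json.loads(serialized)
restored = base64.b64decode(loaded["images"][0])
assert restored == fake_png_bytes
```

The trade-off is file size: base64 inflates the payload by roughly a third, which is why saving images as separate PNG files is usually preferable.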
You can create a Document from its saved output.
from mmda.types.image import PILImage, pilimage
# directly from a JSON. This should also handle the case where `images` were serialized as base64 strings.
with open('...json') as f_in:
    doc_dict = json.load(f_in)
doc = Document.from_json(doc_dict=doc_dict)
# if you saved your images separately, then you'll want to reconstruct them & re-attach
images: List[PILImage] = []
for i, page in enumerate(doc.pages):
    image_path = os.path.join(outdir, f'{i}.png')
    assert os.path.exists(image_path), f'Missing file for page {i}'
    image = pilimage.open(image_path)
    images.append(image)
doc.annotate_images(images=images)
The minimum requirement for a `Document` is its `.symbols` field, which is just a `<str>`. For example:
doc.symbols
> "Language Models as Knowledge Bases?\nFabio Petroni1 Tim Rockt..."
But the real usefulness of this library comes when you have multiple different ways of segmenting `.symbols`. For example, segmenting the paper into Pages, and then each Page into Rows:
for page in doc.pages:
    print(f'\n=== PAGE: {page.id} ===\n\n')
    for row in page.rows:
        print(row.symbols)
> ...
> === PAGE: 5 ===
> ['tence x, s′ will be linked to s and o′ to o. In']
> ['practice, this means RE can return the correct so-']
> ['lution o if any relation instance of the right type']
> ['was extracted from x, regardless of whether it has']
> ...
This example shows two nice aspects of this library:
- `Document` provides iterables for different segmentations of `symbols`. Options include things like `pages, tokens, rows, sents, paragraphs, sections, ...`. Not every Parser will provide every segmentation, though. For example, `SymbolScraperParser` only provides `pages, tokens, rows`. More on how to obtain other segmentations later.
- Each one of these segments (in our library, we call them `SpanGroup` objects) is aware of (and can access) other segment types. For example, you can call `page.rows` to get all Rows that intersect a particular Page. Or you can call `sent.tokens` to get all Tokens that intersect a particular Sentence. Or you can call `sent.rows` to get the Row(s) that intersect a particular Sentence. These indexes are built dynamically when the `Document` is created and each time a new `DocSpan` type is loaded. In the extreme, one can do:
for page in doc.pages:
    for paragraph in page.paragraphs:
        for sent in paragraph.sents:
            for row in sent.rows:
                for token in sent.tokens:
                    pass
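Under the hood, cross-type lookups like `sent.rows` amount to interval intersection over character spans in `symbols`. A minimal, self-contained sketch of that idea (toy classes for illustration, not mmda's actual implementation):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToySpanGroup:
    """A segment identified by a half-open [start, end) span over symbols."""
    id: int
    start: int
    end: int


def intersecting(query: ToySpanGroup, candidates: List[ToySpanGroup]) -> List[ToySpanGroup]:
    """Return every candidate whose span overlaps the query's span."""
    return [c for c in candidates if c.start < query.end and query.start < c.end]


# Two "pages" covering symbols [0, 100) and [100, 200) ...
pages = [ToySpanGroup(0, 0, 100), ToySpanGroup(1, 100, 200)]
# ... and four "rows"; row 2 straddles the page boundary.
rows = [
    ToySpanGroup(0, 0, 40),
    ToySpanGroup(1, 40, 90),
    ToySpanGroup(2, 90, 130),
    ToySpanGroup(3, 130, 200),
]

# A lookup like page.rows corresponds to an intersection query like this:
print([r.id for r in intersecting(pages[0], rows)])  # [0, 1, 2]
print([r.id for r in intersecting(pages[1], rows)])  # [2, 3]
```

Building such an index once per field at `Document`-creation time is what makes the nested accessors above cheap to call repeatedly.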
You can check which fields are available in a Document via:
doc.fields
> ['pages', 'tokens', 'rows']
Not all Documents will have all segmentations available at creation time. You may need to load new fields to an existing `Document`.
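As a rough sketch of the pattern (toy classes only; check the mmda source for the actual API, which may differ), loading a new field means attaching a new named list of span groups to the document and refusing to clobber an existing one:

```python
from typing import Dict, List, Tuple


class ToyDocument:
    """Minimal stand-in for a Document: symbols plus named span fields."""

    def __init__(self, symbols: str):
        self.symbols = symbols
        self._fields: Dict[str, List[Tuple[int, int]]] = {}

    @property
    def fields(self) -> List[str]:
        return list(self._fields)

    def annotate(self, **fields: List[Tuple[int, int]]) -> None:
        # Each value is a list of (start, end) character spans over symbols.
        for name, spans in fields.items():
            if name in self._fields:
                raise ValueError(f'field {name!r} already loaded')
            self._fields[name] = spans


doc = ToyDocument('Language Models as Knowledge Bases?')
doc.annotate(tokens=[(0, 8), (9, 15), (16, 18), (19, 28), (29, 35)])
doc.annotate(sents=[(0, 35)])
print(doc.fields)  # ['tokens', 'sents']
```

In mmda, the span groups for a new field would typically come from running an additional Parser or Predictor over the same PDF.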
TBD...
We currently don't support any nice tools for mutating the data in a `Document` once it's been created, aside from loading new data. Do so at your own risk.
TBD...