
paulrschrater / design2align-scanner


This project forked from learningequality/design2align-scanner


Uses a scan of a curriculum book or pdf to generate a csv structure

License: MIT License

Python 1.25% Jupyter Notebook 98.75%

design2align-scanner's Introduction

Curriculum Scanner

Uses a scan of a curriculum book or pdf to generate a csv structure

Getting Started

You'll need to install the following libraries first:

Next, install the requirements:

	pip install -r requirements.txt

You are now ready to start using the CurriculumScanner class!

The CurriculumScanner Class

The CurriculumScanner class reads the data produced by running process_scans.py. To use this class, first instantiate an object:

from scanner import CurriculumScanner

scanner = CurriculumScanner("<path-to-file>")

Note: you may get an error if the Google Vision API hasn't been run on the specified file. To resolve this, run CurriculumScanner.process("path-to-file"), which accepts png, jpg, and pdf files. To change the detection settings, update config.py.

CurriculumScanner.pages

If you would like to access the data directly, you can use the scanner.pages attribute. The data is formatted as follows:

[
	{
	    "columns": [
	      [starting point, ending point],
	      ...
	    ],
	    "file": "path/to/page/data.json",
	    "image": "path/to/image.png",
	    "boxes": "path/to/image/with/boxes.png"
	}
]

Individual page data

Each item in this list will have a path to the corresponding page data, which is the serialized version of the data returned from the Google Vision API. The basic hierarchy is as follows:

{
	"pages": [{
		"blocks": [{
			"paragraphs": [{
				"words": [{
					"symbols": {...}
				}]
			}]
		}]
	}]
}
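
As a rough illustration of walking this hierarchy, the sketch below rebuilds the text of each paragraph from a page data dict. It assumes that symbols is a list of dicts that each carry a "text" key, as in a Google Vision API response; the paragraph_texts helper and the sample data are hypothetical and are not part of the scanner module.

```python
def paragraph_texts(page_data):
    """Yield the text of each paragraph in a serialized page data dict."""
    for page in page_data.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                words = []
                for word in paragraph.get("words", []):
                    # Assumption: each symbol is a dict with a "text" key.
                    symbols = word.get("symbols", [])
                    words.append("".join(s.get("text", "") for s in symbols))
                yield " ".join(words)


# Hypothetical sample data that follows the hierarchy shown above.
sample = {
    "pages": [{
        "blocks": [{
            "paragraphs": [{
                "words": [
                    {"symbols": [{"text": c} for c in "Unit"]},
                    {"symbols": [{"text": "1"}]},
                ]
            }]
        }]
    }]
}

texts = list(paragraph_texts(sample))
print(texts)
```

Joining words with a single space is a simplification; the real data may carry explicit break information that this sketch ignores.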

If you would like to see each structure's bounds, a visual guide is generated for each page; its path is stored in the boxes field of that page's entry in scanner.pages. The boxes are color-coded:

  • Blocks = red
  • Paragraphs = blue
  • Words = yellow

Accessing the data

To access this data, you can use the scanner.get_page_data(page_number) method. Alternatively, if you are iterating through the pages, you can use the scanner.get_next_page() method.

for page in scanner.get_next_page():
    # Do something with page data
    pass

In some cases, you may want to access the blocks on a page in a particular reading order. To do this, you can use the scanner.get_blocks_by_order(page_number, order=BlockOrder) method.

from scanner import BlockOrder

for page_number, page in enumerate(scanner.get_next_page()):
  blocks = scanner.get_blocks_by_order(page_number, order=BlockOrder.LEFTRIGHT)

The BlockOrder enum has the following options:

  • TOPBOTTOM to read from top to bottom
  • BOTTOMTOP to read from the bottom to the top
  • LEFTRIGHT to read from left to right
  • RIGHTLEFT to read from right to left
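
To make the four orderings concrete, here is a minimal sketch of one way such a sort could work. It is not the library's implementation: the Order enum, the sort_blocks and box helpers, and the assumption that each block carries a "vertices" list of x/y dicts are all hypothetical.

```python
from enum import Enum


class Order(Enum):
    """Hypothetical stand-in for the four BlockOrder options."""
    TOPBOTTOM = "topbottom"
    BOTTOMTOP = "bottomtop"
    LEFTRIGHT = "leftright"
    RIGHTLEFT = "rightleft"


def box(x0, y0, x1, y1):
    """Build a four-vertex rectangle from two opposite corners."""
    return [{"x": x0, "y": y0}, {"x": x1, "y": y0},
            {"x": x1, "y": y1}, {"x": x0, "y": y1}]


def sort_blocks(blocks, order):
    """Sort blocks by the top or left edge of their bounding box."""
    def top(block):
        return min(v["y"] for v in block["vertices"])

    def left(block):
        return min(v["x"] for v in block["vertices"])

    vertical = order in (Order.TOPBOTTOM, Order.BOTTOMTOP)
    reverse = order in (Order.BOTTOMTOP, Order.RIGHTLEFT)
    return sorted(blocks, key=top if vertical else left, reverse=reverse)


# Hypothetical blocks: "a" is upper right, "b" is lower left.
blocks = [
    {"name": "a", "vertices": box(50, 10, 90, 30)},
    {"name": "b", "vertices": box(0, 40, 40, 60)},
]

top_down = [b["name"] for b in sort_blocks(blocks, Order.TOPBOTTOM)]
left_right = [b["name"] for b in sort_blocks(blocks, Order.LEFTRIGHT)]
print(top_down, left_right)
```

Sorting on a single edge ignores multi-column layouts, where a combination of column and vertical position would likely be needed.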

Additional Methods

text_within

If you would like to get the text that is within a certain boundary, use scanner.text_within:

	# To get text within the box (1, 2) (1, 3) (5, 2) (5, 3) on the first page
	text = scanner.text_within(0, x0=1, y0=2, x1=5, y1=3)
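
The README does not say whether a word must lie fully inside the boundary or only overlap it. The sketch below illustrates the stricter reading (full containment) on hypothetical word data; the inside helper and the "vertices" layout are assumptions, not the scanner's own code.

```python
def inside(vertices, x0, y0, x1, y1):
    """Return True if every vertex lies within the rectangle (x0, y0)-(x1, y1)."""
    return all(x0 <= v["x"] <= x1 and y0 <= v["y"] <= y1 for v in vertices)


# Hypothetical words, each with a four-vertex bounding box.
words = [
    {"text": "Grade", "vertices": [{"x": 1, "y": 2}, {"x": 3, "y": 2},
                                   {"x": 3, "y": 3}, {"x": 1, "y": 3}]},
    {"text": "7", "vertices": [{"x": 6, "y": 2}, {"x": 7, "y": 2},
                               {"x": 7, "y": 3}, {"x": 6, "y": 3}]},
]

# Same box as in the example above: x from 1 to 5, y from 2 to 3.
selected = " ".join(w["text"] for w in words if inside(w["vertices"], 1, 2, 5, 3))
print(selected)
```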

find_text_matches

If you would like to find where a text appears across the pages, you can use the scanner.find_text_matches(text) method.

	matches = scanner.find_text_matches('text to find')

This will return a list describing where each match is found:

[
  {
    "page": int,
    "block": int,
    "paragraph": int,
    "word": int,
    "bounds": [
      {"x": int, "y": int},
      {"x": int, "y": int},
      {"x": int, "y": int},
      {"x": int, "y": int}
    ]
  }
]
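
Because every match carries its page index, the results are easy to regroup. The sketch below groups a list shaped like the one above by page; matches_by_page and the sample values are hypothetical, and the bounds are omitted for brevity.

```python
from collections import defaultdict


def matches_by_page(matches):
    """Group match dicts by their "page" value."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[match["page"]].append(match)
    return dict(grouped)


# Hypothetical matches in the documented shape (bounds left out).
sample_matches = [
    {"page": 0, "block": 1, "paragraph": 0, "word": 3},
    {"page": 2, "block": 0, "paragraph": 1, "word": 0},
    {"page": 0, "block": 4, "paragraph": 2, "word": 5},
]

grouped = matches_by_page(sample_matches)
for page in sorted(grouped):
    print(page, len(grouped[page]))
```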

find_regex_matches

If you would like to find where a regex appears across the pages, you can use the scanner.find_regex_matches(regex) method.

	matches = scanner.find_regex_matches(r'\d+\.\d+\.\d+\.')

This will return a list describing where each match is found:

[
  {
    "page": int,
    "block": int,
    "paragraph": int,
    "word": int,  # If a word matches the regex
    "text": str,
    "bounds": [
      {"x": int, "y": int},
      {"x": int, "y": int},
      {"x": int, "y": int},
      {"x": int, "y": int}
    ]
  }
]
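
To check what the example pattern matches before running it over a whole scan, it can be tried on plain strings with Python's re module. The sample strings below are made up for illustration.

```python
import re

# The pattern from the example above: three dot-separated numbers
# followed by a trailing dot, such as "1.2.3.".
pattern = re.compile(r"\d+\.\d+\.\d+\.")

samples = ["1.2.3. Fractions", "Section 10.4.12.", "Version 1.2", "3.4.5"]
hits = [s for s in samples if pattern.search(s)]
print(hits)
```

Note that "3.4.5" is not matched because the pattern requires the final dot.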

draw_boxes

If you would like to draw boxes where the OCR bounds are, use the scanner.draw_boxes(page_number) method.

	image = scanner.draw_boxes(0)  # Draw boxes for page 0

This will return an image with the boxes drawn, which can be shown with image.show().

If you would like to draw other boxes, you can create a dict with the relevant bounds data. For instance:

bound = {
	'vertices': [
		{'x': int, 'y': int},
		{'x': int, 'y': int},
		{'x': int, 'y': int},
		{'x': int, 'y': int},
	]
}

image = scanner.get_page_image(0)

scanner.draw_boxes(image, bound)

detect_columns

To get column x_ranges, you may use the scanner.detect_columns(page_number) method.

  columns = scanner.detect_columns(0)  # Get column ranges for page 0

For example, if the page has two columns, the data may look something like this:

[
	(0, 100),   # First column spans from 0-100px
	(120, 200), # Second column spans from 120-200px
]

Please note: Column detection isn't guaranteed to work for all pages.
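
Once the column ranges are known, a common next step is deciding which column a given x coordinate falls into. The following is a small sketch on the sample ranges above; column_index is a hypothetical helper, not a method of CurriculumScanner.

```python
def column_index(columns, x):
    """Return the index of the column whose x range contains x, or None."""
    for index, (start, end) in enumerate(columns):
        if start <= x <= end:
            return index
    return None  # x falls in a gutter or outside every column


columns = [(0, 100), (120, 200)]

print(column_index(columns, 50))
print(column_index(columns, 150))
print(column_index(columns, 110))
```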

design2align-scanner's People

Contributors

jayoshih, kollivier, jamalex, ivanistheone
