dcs-training / image-to-tech-text-extraction

License: Other

Jupyter Notebook 100.00%

Image to Tech: Introduction to Text Extraction

Welcome to the Readme file for the CDCS Workshop 'Image to Tech: Introduction to Text Extraction'. Below you can find all the details that you will need to get ready and attend this class.

Overview

Do you need to analyse text in your research, but only have images or scanned PDFs to work with? This training session is the perfect introduction to extracting text from images into a machine-readable data format. It will give you an overview of text extraction and how Optical Character Recognition (OCR) works in practice, how you can plan your projects, and what potential challenges you might face. We’ll use free, out-of-the-box options that are widely available, and you will leave the workshop prepared to use text extraction in your own projects. This resource is designed to introduce you to the basics of text extraction and will include:

An introduction to text extraction
The history of text extraction
How text extraction works in practice, from document selection to cleaning
How to approach text extraction processes
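
To give a flavour of the cleaning stage mentioned above, here is a minimal Python sketch of tidying raw OCR output. It is an illustration only and is not taken from the workshop notebook: the function name clean_ocr_text and the sample string are hypothetical, and it uses only the Python standard library.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output (hypothetical helper, not from the workshop notebook)."""
    # Rejoin words that were hyphenated across a line break: "Recog-\nnition" -> "Recognition".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces into single spaces.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop paragraphs that ended up empty.
    return "\n\n".join(p for p in cleaned if p)


# Made-up sample standing in for text returned by an OCR tool.
sample = "Optical  Character Recog-\nnition turns images\ninto text.\n\n\nSecond   paragraph."
print(clean_ocr_text(sample))
```

Note that rejoining hyphenated line endings can also merge words that are genuinely hyphenated (for example "well-" at the end of one line and "known" at the start of the next), so real projects usually check cleaned text against the original image.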

What do I need to know?

This is a beginner-level workshop and does not require any prior programming knowledge or background in text extraction processes.

What do you need for this class?

A laptop
A Wi-Fi connection

Materials

This repository contains a Google Colab file that can be used for independent study to learn more about text extraction, and that will also be used in the workshop to show some basic code in action. Open the file in Colab and log in with a Google account. Select the 'Copy to Drive' option to save an editable copy to your own Google Drive.

The Sample Images folder contains images of text and PDF files that can be used for this workshop, although feel free to bring your own images of text to work with if you would like.

License and Authors

All material collected here is free to use, but it is covered by a license.

The author of this repository is Ash Charlton. The PowerPoint presentation has been adapted from a previous workshop on OCR by Lucia Michielin and Jessica Witte.
