Welcome to the README for the CDCS workshop 'Image to Text: Introduction to Text Extraction'. Below you will find everything you need to prepare for and attend the class.
Do you need to analyse text in your research, but only have images or scanned PDFs to work with? This training session is an introduction to extracting text from images into a machine-readable format. It gives you an overview of text extraction and of how Optical Character Recognition (OCR) works in practice, how to plan your projects, and what challenges you might face. We’ll use free, out-of-the-box options that are widely available, and you will leave the workshop prepared to use text extraction in your own projects.
This resource is designed to introduce you to the basics of text extraction and will include:
- An introduction to text extraction
- The history of text extraction
- How text extraction works in practice, from document selection to cleaning
- How to approach text extraction processes
This is a beginner-level workshop and does not require any prior programming knowledge or background in text extraction. You will need:
- A laptop
- A Wi-Fi connection
This repository contains a Google Colab notebook that will be used in the workshop to show some basic code in action, and that you can also use for independent study to learn more about text extraction. Click the logo below and log in with a Google account, then select the 'Copy to Drive' option to save your own editable copy.
The Sample Images folder contains images of text and PDF files that can be used for this workshop, but feel free to bring your own images of text to work with.
All material collected here is free to use, but it is covered by a license.
The author of this repository is Ash Charlton. The PowerPoint presentation has been adapted from a previous workshop on OCR by Lucia Michielin and Jessica Witte.