Image to Tech: Introduction to Text Extraction

Welcome to the README for the CDCS workshop 'Image to Tech: Introduction to Text Extraction'. Below you will find everything you need to prepare for and attend this class.

Overview

Do you need to analyse text in your research, but only have images or scanned PDFs to work with? This training session introduces the process of extracting text from images into a machine-readable data format. It gives you an overview of text extraction, how Optical Character Recognition (OCR) works in practice, how to plan your projects, and what challenges you might face. We’ll use free, out-of-the-box options that are widely available, and you will leave the workshop prepared to use text extraction in your own projects. This resource is designed to introduce you to the basics of text extraction and will include:

  • An introduction to text extraction
  • The history of text extraction
  • How text extraction works in practice, from document selection to cleaning
  • How to approach text extraction processes
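
To give a flavour of the cleaning step mentioned above, here is a minimal Python sketch. It is not taken from the workshop notebook: the function name `clean_ocr_text` and the sample string are hypothetical, and the rules it applies (rejoining words hyphenated across line breaks, collapsing stray whitespace) are only two common tidy-ups assumed for illustration.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output (hypothetical helper, for illustration only)."""
    # Rejoin words split across lines, e.g. "extrac-\ntion" -> "extraction".
    # Assumption: a hyphen at a line end is a line-break hyphen, which will
    # also merge genuinely hyphenated words that happen to break there.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with a single blank line.
    return "\n\n".join(p for p in cleaned if p)


# Made-up sample imitating typical OCR output.
sample = "Optical Character Recog-\nnition  turns images\ninto text.\n\n\nIt is   rarely perfect."
print(clean_ocr_text(sample))
```

Real projects usually need rules tailored to the source material, so treat this as a starting point rather than a complete cleaning pipeline.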

What do I need to know?

This is a beginner-level workshop and does not require any prior programming knowledge or background in text extraction processes.

What do you need for this class?

  • A laptop
  • A Wi-Fi connection

Materials

This repository contains a Google Colab file that will be used in the workshop to show some basic code in action and can also be used for independent study to learn more about text extraction. Click the logo below and log in with a Google account. Select the 'Copy to Drive' option to save your own copy that you can edit.

GoogleColab

The Sample Images folder contains images of text and PDF files that can be used for this workshop, but feel free to bring your own images of text to work with.

License and Authors

All material collected here is free to use, but it is covered by a license.

License: CC BY-NC 4.0

The author of this repository is Ash Charlton. The PowerPoint presentation has been adapted from a previous workshop on OCR by Lucia Michielin and Jessica Witte.
