
kata-robert-burns's Introduction

The Robert Burns Kata

Introduction

Robert Burns was a Scottish poet and lyricist, widely regarded as the national poet of Scotland. His poetry celebrated the common people, country life, and the customs and traditions of Scotland. Burns Night, held on January 25th, celebrates his life and work.

Task

You are given a directory of text files containing poetry written by Robert Burns. Your task is to write a script that takes the file paths of the Burns poems as input and calculates the type-token ratio (TTR) for each poem. TTR is a measure of lexical diversity, or lexical richness (for more context see https://en.wikipedia.org/wiki/Lexical_density). It is calculated by dividing the number of unique words in a text by the total number of words in the text.

For example, if a text contains 100 words and has 40 unique words, the TTR would be 0.4. The TTR can provide insight into how varied the vocabulary of a text is, with a higher TTR indicating a more varied vocabulary.
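The arithmetic in the example above can be checked with a few lines of Python. This is only a sketch: the function name type_token_ratio and the made-up tokens are assumptions for illustration, not part of the starter code.

```python
def type_token_ratio(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# 100 tokens drawn from 40 distinct made-up words, mirroring the example above.
tokens = [f"word{i % 40}" for i in range(100)]
print(type_token_ratio(tokens))  # 0.4
```

Returning 0.0 for an empty token list is a design choice made here to avoid a division by zero; raising an error would be equally reasonable.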

A starter script is provided. It uses argparse and a main function to receive the file path argument, open the file, and print a result.

Input

A file path to the poem text file.

Output

A float value representing the TTR of the poem.

Hints

  • To tokenize the text of each poem, you can use Python's built-in string methods such as split() and replace(), or the re library.
  • Tokenization should be case-insensitive and handle punctuation correctly (e.g. "The" and "the," should be treated as the same word).
  • To calculate the TTR of a poem, you will first need to tokenize the text of the poem and then calculate the number of unique words and the total number of words in the poem.
  • The starter script uses the argparse library to read the poem's file path from the command line, and then opens that file.
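One possible way to follow the tokenization hints is sketched below with the re library. The names WORD_RE, tokenize and poem_ttr are hypothetical, and the pattern is an assumption: it keeps only the ASCII letters a-z, allows a straight apostrophe inside a word, and drops everything else, so curly apostrophes, accented letters and trailing apostrophes (as in "wi'") would need extra handling.

```python
import re

# Runs of letters, optionally joined by internal apostrophes (e.g. "o'er").
# Assumption: one reasonable pattern, not a rule required by the kata.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens, dropping punctuation."""
    return WORD_RE.findall(text.lower())


def poem_ttr(text):
    """Tokenize the text, then divide unique tokens by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = 'The cat saw the dog; "the," said the cat.'
print(tokenize(sample))
print(poem_ttr(sample))
```

Lowercasing before matching is what makes "The" and "the," collapse into the same token.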

Bonuses

Implement a simple stemmer to preprocess the text before calculating the TTR. A stemmer is a tool that strips suffixes from words so that related forms share one stem. For example, a simple stemmer might reduce both "running" and "runner" to "run".
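A deliberately crude suffix-stripping stemmer might look like the sketch below. The suffix list, the minimum stem length and the consonant-undoubling rule are all assumptions chosen for illustration. This is not the Porter algorithm or any library's stemmer, and it will produce odd stems for many words.

```python
# Hypothetical suffix list, checked in this order (longer suffixes first).
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")
MIN_STEM = 3  # never leave fewer than three letters behind


def simple_stem(word):
    """Strip the first matching suffix, then undo a doubled final consonant."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            stem = word[: -len(suffix)]
            # "runn" -> "run"; leave doubled vowels, "l" and "s" alone.
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word


for word in ("running", "runner", "runs", "the"):
    print(word, "->", simple_stem(word))
```

Applying simple_stem to every token before counting unique words will usually lower the TTR, because inflected forms collapse into one stem.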

Another preprocessing step that can be included is removing stopwords, which are common words that do not contain much meaning (e.g. "the", "and", "is"). Removing stopwords can help to focus on the more meaningful words in the text.
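Stopword removal can be sketched as a simple filter over the tokens. The STOPWORDS set below is a tiny hand-picked assumption for illustration. A real list would be much longer, and for Burns it would probably need Scots forms as well.

```python
# Hypothetical, deliberately short stopword list.
STOPWORDS = frozenset({"the", "and", "is", "a", "an", "of", "to", "in", "it", "that"})


def remove_stopwords(tokens):
    """Return the tokens that are not in the stopword set, keeping their order."""
    return [token for token in tokens if token not in STOPWORDS]


tokens = ["the", "rose", "is", "red", "and", "the", "rose", "is", "sweet"]
print(remove_stopwords(tokens))  # ['rose', 'red', 'rose', 'sweet']
```

The tokens are assumed to be lowercased already, so the set only needs lowercase entries.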

The filenames are not the poem titles, just a sanitized version of each poem's first line. Once you have tokenized, stemmed words with the stopwords removed, could you devise better filenames?

Even More Ideas

Building a word frequency distribution: Once the text has been tokenized, you can build a word frequency distribution, which is a list of the words in the text along with the number of times each word appears. This can give a sense of the most common words used in the poem.
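A frequency distribution like the one described above can be built with collections.Counter from the standard library. The token list here is made up for illustration. In practice it would come from your own tokenizer.

```python
from collections import Counter

# Made-up tokens standing in for a tokenized poem.
tokens = ["red", "red", "red", "rose", "rose", "love"]

freq = Counter(tokens)          # maps each word to its number of occurrences
print(freq.most_common(2))      # [('red', 3), ('rose', 2)]
print(len(freq) / len(tokens))  # the TTR falls out of the same counts: 0.5
```

Because the number of keys in the Counter equals the number of unique words, the same object can serve both the TTR calculation and the frequency listing.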

Part of Speech Tagging: Part-of-speech tagging, also known as grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition and its context. This can give a sense of the grammatical structure of the text.

Named Entity Recognition: Named-entity recognition (NER) is the process of identifying named entities in text and classifying them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
