By the end of this class you should be able to:
- Explain the advantages and disadvantages of using text as data
- Compute the number of characters and words within a piece of text
- Define the terms unigram, token, stopword, word stem and TF-IDF
- Transform a dataset of text into a dataset of tokens
- Remove standard and customized stopwords from a dataset of tokens
- Visualize the most common words used in a dataset of tokens
- Lemmatize a set of tokens and explain the advantages of doing so
- Compute a TF-IDF metric on a set of tokens
- Explain the output of a statistical command that produces TF-IDF scores
- Visualize word clouds that show common or distinguishing words between two categories of text
To follow along, clone a copy of this repository to your own PC using Git:
git clone https://github.com/tisem-digital-marketing/smwa-computing-lecture-intro-text.git
Once you have cloned the files, open the cloned repository in RStudio as an RStudio project and use the empty R scripts to follow along with the lecture as we work through the material.
At the conclusion of the class, the course instructor's scripts are made available in the branch instructor.
Recall that you can switch between branches using the git checkout <BRANCHNAME> command in a terminal (or git switch <BRANCHNAME> in Git 2.23 and later).
Thus, to switch to the instructor branch:
git checkout instructor
And to switch back to the branch that you worked through live in class:
git checkout main
NOTE: Git may refuse to switch branches if you have uncommitted changes that the switch would overwrite. Before you switch branches, be sure to commit any changes to the files.
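To rehearse committing before switching branches without touching your class files, the sketch below builds a throwaway repository in a temporary directory. It is an illustration only: the file name scratch.R, the commit messages, and the placeholder user name and email are assumptions, and the local instructor branch is a stand-in for the real one in this repository.

```shell
# Build a throwaway repository (everything here is a hypothetical stand-in).
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q
git symbolic-ref HEAD refs/heads/main        # make sure the first branch is called main
git config user.email "student@example.com"  # placeholder identity, local to this repo
git config user.name "Student"

# First commit: an "empty" starter script, then a stand-in instructor branch.
echo "# starter script" > scratch.R
git add scratch.R
git commit -q -m "Add empty starter script"
git branch instructor                        # creates the branch, does not switch to it

# Simulate in-class work on main, then commit it before switching.
echo 'library(tidytext)' >> scratch.R
git status --short                           # lists scratch.R as modified
git add scratch.R
git commit -q -m "Save in-class work"

# With a clean working tree, switching back and forth is safe.
git checkout -q instructor
git checkout -q main
current_branch="$(git rev-parse --abbrev-ref HEAD)"
echo "$current_branch"                       # prints main
```

In the cloned lecture repository only the middle steps apply: check git status, add and commit your changes, then check out the branch you want.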
This lecture makes use of additional R packages:
- janitor
- readr
- dplyr
- tibble
- tidyr
- ggplot2
- stringr
- tidytext
- textstem
- tokenizers
- reshape2
- wordcloud
Install these packages before coming to class.
- Module Maintainer: Lachlan Deer (@lachlandeer)
- Course: Social Media and Web Analytics
- Institute: Dept of Marketing, Tilburg University
- Current Version: 2022 edition
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Deer, Lachlan. 2022. Social Media and Web Analytics: Computing Lecture 2 - Introduction to Text as Data Methods. Tilburg University. https://github.com/tisem-digital-marketing/smwa-computing-lecture-intro-text
@misc{smwa-compllecture02-2022,
title={"Social Media and Web Analytics: Computing Lecture 2 - Introduction to Text as Data"},
author={Lachlan Deer},
year={2022},
url = "https://github.com/tisem-digital-marketing/smwa-computing-lecture-intro-text"
}