The friends from bensoltoff

friends's Introduction

Hi there 👋

I am a Lecturer in Information Science at Cornell University. I teach learners data science, research design, data communication, and web design.

I enjoy working with individuals from all backgrounds interested in applying computational methods to extract knowledge and inferences from a wide range of domains. I have training and experience in data science and policy evaluation. In addition to my academic work, I offer consulting services for organizations interested in data science, including research and analysis as well as tailored workshops teaching programming skills to your employees.

For more information about me and my ongoing work, check out my personal website.

friends's Issues

Need to deduplicate character names

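Character names are recorded inconsistently across transcripts: some appear in all capital letters (e.g. ROSS, JOEY), and others are abbreviated (e.g. RACH, CHAN, MNCA, PHOE).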

## tidy.R
## 6/11/18 BCS
## Convert Friends transcripts to tidytext data frame

library(tidyverse)
#> ── Attaching packages ──────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.5     
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.5     
#> ✔ tidyr   0.8.1          ✔ stringr 1.3.1     
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ─────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(stringr)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding
library(tidytext)

# get list of file names
episodes <- list.files(path = "season", full.names = TRUE)

## remove duplicate episodes (outtakes and uncut versions)
episodes <- episodes[!str_detect(episodes, pattern = "outtakes|uncut")]

# function to scrape a transcript and convert to tidytext data frame
tidy_episode <- function(url) {
  ## collect text into a corpus
  episode_corpus <- read_html(url) %>%
    html_nodes("p") %>%
    html_text(trim = TRUE)
  
  # convert to one-line-per-row data frame
  episode_lines <- data_frame(line = episode_corpus) %>%
    # detect scene transitions
    mutate(scene = str_detect(line, "Scene"),
           scene_num = cumsum(scene)) %>%
    # remove scene transition lines
    filter(scene_num != 0,
           !scene) %>%
    select(-scene) %>%
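    # determine which character is speaking; "\\w+:" captures only the last
    # word before a colon, so a label such as "Mr. Geller:" is recorded as "Geller"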
    mutate(character = str_extract(line, "\\w+:"),
           character = str_remove(character, ":"),
           line = str_remove(line, "\\w+:")) %>%
    # remove lines that are not speech
    filter(!is.na(character))
  
  # convert to tidytext data frame
  episode_tidy <- episode_lines %>%
    unnest_tokens(output = word,
                  input = line)
  
  return(episode_tidy)
}

# scrape all the transcripts
episodes_tidy_all <- data_frame(episode = episodes) %>%
  # safely tokenize each episode
  mutate(tidy = map(episode, safely(tidy_episode)),
         tidy_results = transpose(tidy)$result) %>%
  # expand to one row per token
  unnest(tidy_results) %>%
  mutate(episode = parse_number(episode))

count(episodes_tidy_all, character, sort = TRUE) %>%
  filter(n > 250) %>%
  print(n = Inf)
#> # A tibble: 107 x 2
#>     character         n
#>     <chr>         <int>
#>   1 Rachel       104393
#>   2 Ross         103089
#>   3 Joey          95977
#>   4 Chandler      91589
#>   5 Monica        87359
#>   6 Phoebe        85938
#>   7 Geller         4144
#>   8 Mike           3716
#>   9 Janice         2985
#>  10 Emily          2145
#>  11 ROSS           2119
#>  12 Charlie        1934
#>  13 David          1855
#>  14 Director       1799
#>  15 Frank          1646
#>  16 Paul           1532
#>  17 Pete           1512
#>  18 Green          1478
#>  19 Amy            1472
#>  20 JOEY           1394
#>  21 Carol          1356
#>  22 Tag            1349
#>  23 Richard        1337
#>  24 CHANDLER       1217
#>  25 Mona           1076
#>  26 Gunther        1067
#>  27 Joshua         1067
#>  28 Woman          1011
#>  29 Jill           1002
#>  30 All            1001
#>  31 Gary            979
#>  32 Eric            948
#>  33 Doug            943
#>  34 Joanna          920
#>  35 Kathy           916
#>  36 Janine          882
#>  37 Erica           865
#>  38 Susan           795
#>  39 RACHEL          794
#>  40 RACH            793
#>  41 MNCA            789
#>  42 Elizabeth       787
#>  43 MONICA          772
#>  44 Guy             758
#>  45 Cecilia         754
#>  46 CHAN            697
#>  47 Man             671
#>  48 Ursula          655
#>  49 Waltham         648
#>  50 Sr              644
#>  51 Teacher         643
#>  52 Treeger         639
#>  53 Kate            625
#>  54 Alice           607
#>  55 Bing            601
#>  56 Steve           592
#>  57 Melissa         584
#>  58 Zelner          583
#>  59 PHOEBE          579
#>  60 Tribbiani       563
#>  61 Interviewer     558
#>  62 Waiter          557
#>  63 Kim             549
#>  64 Sandy           539
#>  65 PHOE            532
#>  66 Long            520
#>  67 2               513
#>  68 Mark            509
#>  69 Will            492
#>  70 Nurse           481
#>  71 Parker          480
#>  72 Doctor          474
#>  73 Tim             469
#>  74 Gavin           467
#>  75 Danny           465
#>  76 Benjamin        462
#>  77 Sarah           458
#>  78 Roy             450
#>  79 Note            448
#>  80 1               436
#>  81 note            414
#>  82 Estelle         389
#>  83 Barry           385
#>  84 Dina            385
#>  85 Julie           369
#>  86 Ethan           368
#>  87 Earl            367
#>  88 Amanda          359
#>  89 guy             353
#>  90 Roger           341
#>  91 Chloe           331
#>  92 Tommy           325
#>  93 Receptionist    318
#>  94 Salesman        317
#>  95 Helena          304
#>  96 Ben             302
#>  97 FBOB            294
#>  98 Donny           290
#>  99 Franzblau       285
#> 100 Mindy           278
#> 101 Bonnie          275
#> 102 Leslie          273
#> 103 Lauren          271
#> 104 JADE            257
#> 105 Saj             257
#> 106 MIKE            256
#> 107 Laura           255
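
One possible way to collapse the capitalized and abbreviated variants is sketched below. It is not part of tidy.R: the helper `normalize_character()` and the `abbreviations` lookup are hypothetical names, and the mapping from each abbreviation to a full name is an assumption based only on the counts above. Labels the table does not explain (such as FBOB) are left unmapped.

```r
library(stringr)

# Hypothetical lookup table: abbreviations seen in the counts above,
# written in title case. The target names are an assumption.
abbreviations <- c(
  "Rach" = "Rachel",
  "Chan" = "Chandler",
  "Mnca" = "Monica",
  "Phoe" = "Phoebe"
)

# Hypothetical helper: standardize a character vector of speaker names
normalize_character <- function(x) {
  # convert to title case so "ROSS" and "Ross" share one label
  x_title <- str_to_title(x)
  # replace known abbreviations, leave every other name unchanged
  is_abbrev <- x_title %in% names(abbreviations)
  x_title[is_abbrev] <- abbreviations[x_title[is_abbrev]]
  unname(x_title)
}

normalize_character(c("ROSS", "Ross", "RACH", "MNCA", "Geller"))
#> [1] "Ross"   "Ross"   "Rachel" "Monica" "Geller"
```

In the script above, this could be applied with `mutate(character = normalize_character(character))` before the call to `count()`. It does not address surnames such as "Geller" or "Green", which come from the speaker regular expression rather than from capitalization.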

Some transcripts need a different CSS selector
