friends's Introduction

Hi there 👋

I am a Lecturer in Information Science at Cornell University. I teach learners data science, research design, data communication, and web design.

I enjoy working with individuals from all backgrounds who are interested in applying computational methods to extract knowledge and inferences from a wide range of domains. I have training and experience in data science and policy evaluation. In addition to my academic work, I offer consulting services for organizations interested in data science, including research and analysis as well as tailored workshops that teach programming skills to your employees.

For more information about me and my ongoing work, check out my personal website.

friends's People

Contributors

bensoltoff, fangj

friends's Issues

Need to deduplicate character names

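Some names appear in all capital letters (e.g. ROSS, JOEY), and others appear as abbreviations (e.g. RACH, CHAN), so the same character is counted under several different labels.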

## tidy.R
## 6/11/18 BCS
## Convert Friends transcripts to tidytext data frame

library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1.9000     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2          ✔ dplyr   0.7.5
#> ✔ tidyr   0.8.1          ✔ stringr 1.3.1
#> ✔ readr   1.1.1          ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(stringr)
library(rvest)
#> Loading required package: xml2
#> 
#> Attaching package: 'rvest'
#> The following object is masked from 'package:purrr':
#> 
#>     pluck
#> The following object is masked from 'package:readr':
#> 
#>     guess_encoding
library(tidytext)

# get list of file names
episodes <- list.files(path = "season", full.names = TRUE)

## remove duplicate episodes
episodes <- episodes[!str_detect(episodes, pattern = "outtakes|uncut")]

# function to scrape a transcript and convert to tidytext data frame
tidy_episode <- function(url) {
  ## collect text into a corpus
  episode_corpus <- read_html(url) %>%
    html_nodes("p") %>%
    html_text(trim = TRUE)
  
  # convert to one-line-per-row data frame
  episode_lines <- tibble(line = episode_corpus) %>%
    # detect scene transitions
    mutate(scene = str_detect(line, "Scene"),
           scene_num = cumsum(scene)) %>%
    # remove scene transition lines
    filter(scene_num != 0,
           !scene) %>%
    select(-scene) %>%
    # determine which character is speaking
    mutate(character = str_extract(line, "\\w+:"),
           character = str_remove(character, ":"),
           line = str_remove(line, "\\w+:")) %>%
    # remove lines that are not speech
    filter(!is.na(character))
  
  # convert to tidytext data frame
  episode_tidy <- episode_lines %>%
    unnest_tokens(output = word,
                  input = line)
  
  return(episode_tidy)
}

# scrape all the transcripts
episodes_tidy_all <- tibble(episode = episodes) %>%
  # safely tokenize each episode
  mutate(tidy = map(episode, safely(tidy_episode)),
         tidy_results = transpose(tidy)$result) %>%
  # expand to one row per token
  unnest(tidy_results) %>%
  mutate(episode = parse_number(episode))

count(episodes_tidy_all, character, sort = TRUE) %>%
  filter(n > 250) %>%
  print(n = Inf)
#> # A tibble: 107 x 2
#>     character         n
#>     <chr>         <int>
#>   1 Rachel       104393
#>   2 Ross         103089
#>   3 Joey          95977
#>   4 Chandler      91589
#>   5 Monica        87359
#>   6 Phoebe        85938
#>   7 Geller         4144
#>   8 Mike           3716
#>   9 Janice         2985
#>  10 Emily          2145
#>  11 ROSS           2119
#>  12 Charlie        1934
#>  13 David          1855
#>  14 Director       1799
#>  15 Frank          1646
#>  16 Paul           1532
#>  17 Pete           1512
#>  18 Green          1478
#>  19 Amy            1472
#>  20 JOEY           1394
#>  21 Carol          1356
#>  22 Tag            1349
#>  23 Richard        1337
#>  24 CHANDLER       1217
#>  25 Mona           1076
#>  26 Gunther        1067
#>  27 Joshua         1067
#>  28 Woman          1011
#>  29 Jill           1002
#>  30 All            1001
#>  31 Gary            979
#>  32 Eric            948
#>  33 Doug            943
#>  34 Joanna          920
#>  35 Kathy           916
#>  36 Janine          882
#>  37 Erica           865
#>  38 Susan           795
#>  39 RACHEL          794
#>  40 RACH            793
#>  41 MNCA            789
#>  42 Elizabeth       787
#>  43 MONICA          772
#>  44 Guy             758
#>  45 Cecilia         754
#>  46 CHAN            697
#>  47 Man             671
#>  48 Ursula          655
#>  49 Waltham         648
#>  50 Sr              644
#>  51 Teacher         643
#>  52 Treeger         639
#>  53 Kate            625
#>  54 Alice           607
#>  55 Bing            601
#>  56 Steve           592
#>  57 Melissa         584
#>  58 Zelner          583
#>  59 PHOEBE          579
#>  60 Tribbiani       563
#>  61 Interviewer     558
#>  62 Waiter          557
#>  63 Kim             549
#>  64 Sandy           539
#>  65 PHOE            532
#>  66 Long            520
#>  67 2               513
#>  68 Mark            509
#>  69 Will            492
#>  70 Nurse           481
#>  71 Parker          480
#>  72 Doctor          474
#>  73 Tim             469
#>  74 Gavin           467
#>  75 Danny           465
#>  76 Benjamin        462
#>  77 Sarah           458
#>  78 Roy             450
#>  79 Note            448
#>  80 1               436
#>  81 note            414
#>  82 Estelle         389
#>  83 Barry           385
#>  84 Dina            385
#>  85 Julie           369
#>  86 Ethan           368
#>  87 Earl            367
#>  88 Amanda          359
#>  89 guy             353
#>  90 Roger           341
#>  91 Chloe           331
#>  92 Tommy           325
#>  93 Receptionist    318
#>  94 Salesman        317
#>  95 Helena          304
#>  96 Ben             302
#>  97 FBOB            294
#>  98 Donny           290
#>  99 Franzblau       285
#> 100 Mindy           278
#> 101 Bonnie          275
#> 102 Leslie          273
#> 103 Lauren          271
#> 104 JADE            257
#> 105 Saj             257
#> 106 MIKE            256
#> 107 Laura           255
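One possible way to collapse the duplicates visible in the counts above is to expand known abbreviations through a lookup table and then harmonize capitalization. The following is only a sketch under stated assumptions: the `abbreviations` table and the `normalize_character()` helper are hypothetical names that are not part of the repository, the mapping of each abbreviation to a full name (for example `MNCA` to Monica) is a guess based on the output, and the toy tibble stands in for `episodes_tidy_all`.

```r
library(dplyr)
library(stringr)

# Hypothetical lookup table: abbreviations seen in the counts above, mapped to
# the full name each one is ASSUMED to stand for.
abbreviations <- c(
  RACH = "Rachel",
  MNCA = "Monica",
  CHAN = "Chandler",
  PHOE = "Phoebe"
)

# Hypothetical helper: expand known abbreviations, then convert to title case
# so that "ROSS" and "Ross" end up under the same label.
normalize_character <- function(x) {
  # look up each name; names without an entry come back as NA and keep
  # their original value via coalesce()
  expanded <- coalesce(unname(abbreviations[x]), x)
  str_to_title(expanded)
}

# Toy data standing in for the `character` column of episodes_tidy_all
toy <- tibble(
  character = c("Ross", "ROSS", "RACH", "Rachel", "MNCA", "guy", "Guy")
)

toy_counts <- toy %>%
  mutate(character = normalize_character(character)) %>%
  count(character, sort = TRUE)

print(toy_counts)
```

This approach would not merge labels that come from last names or titles (such as `Geller`, `Green`, or `Sr`), since those are ambiguous between several characters and would likely need a change to the speaker-extraction pattern instead.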
