emilhvitfeldt / smltar

Manuscript of the book "Supervised Machine Learning for Text Analysis in R" by Emil Hvitfeldt and Julia Silge

Home Page: https://smltar.com

License: Other

TeX 72.40% CSS 6.46% R 20.97% JavaScript 0.16% HTML 0.02%
supervised-machine-learning text-analysis bookdown

smltar's People

Contributors

dcossyleon, emilhvitfeldt, fellennert, juliasilge, pursuitofdatascience, rivaquiroga, tmstauss

smltar's Issues

regex appendix

As horrible as it sounds, we should have one. It should be super short, with lots of references to where to find the information.

Stemming dumb

normalize .bib files

Right now we have two .bib files (book.bib and references.bib). It would make more sense to merge everything into one file for now.

tokenization - discussion on tokenizing by spaces

We commonly use spaces as a tokenizing boundary. But this is not always the best choice.

It might behave poorly with double spaces, accidental or otherwise. It also does not fit every language:

  • Chinese doesn't use spaces between words
  • Vietnamese uses spaces within words
  • German combines multiple words into one

Spaces in words also change over time: "today" used to be written as two words, "to day".
https://www.etymonline.com/word/today
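
The points above can be seen in a minimal sketch. It is written in Python purely for illustration; the function names and the example strings are assumptions chosen for this sketch and do not come from the book. It compares splitting on every single space with splitting on runs of whitespace.

```python
import re


def split_on_single_space(text: str) -> list[str]:
    """Split on every single space character, keeping empty strings."""
    return text.split(" ")


def split_on_whitespace_runs(text: str) -> list[str]:
    """Treat any run of whitespace as one boundary, so no empty tokens appear."""
    return re.findall(r"\S+", text)


# Hypothetical example strings, one per problem listed above.
examples = {
    "double space": "to  day",          # two spaces between the words
    "Chinese": "我喜欢机器学习",          # a sentence with no spaces at all
    "Vietnamese": "học sinh",           # one word ("student") written with a space
    "German": "Donaudampfschifffahrt",  # several words joined into one
}

for label, text in examples.items():
    print(label, split_on_single_space(text), split_on_whitespace_runs(text))
```

Splitting on a single space turns the double space into an empty-string token, while splitting on whitespace runs avoids that. Neither approach helps with the other three cases: the Chinese sentence and the German compound each come back as one token, and the Vietnamese word is cut in two.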

Wrongly fit a model

Try fitting a model using the wrong pre-training data (law / fairy tales / advertisements / news).

Fit a model trained on the wrong language. This should be easy to run but would (hopefully) give weird results.
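
One hedged way to see why a mismatched training domain might give weird results is to count how many tokens in the new text were never seen in training. The sketch below is in Python for illustration only. The helper names (`tokenize`, `oov_rate`) and the toy sentences are hypothetical, and the out-of-vocabulary rate is an assumed stand-in for the mismatch, not a method taken from the book.

```python
def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def oov_rate(train_text: str, new_text: str) -> float:
    """Fraction of tokens in new_text that never occur in train_text."""
    vocabulary = set(tokenize(train_text))
    tokens = tokenize(new_text)
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


# Toy, made-up sentences standing in for a fairy tale corpus and a legal corpus.
fairytale_train = "once upon a time a princess lived in a castle"
fairytale_new = "a princess lived in a castle"
law_new = "the tenant shall pay rent in advance"

print("same domain:", oov_rate(fairytale_train, fairytale_new))
print("wrong domain:", oov_rate(fairytale_train, law_new))
```

In this toy setup every token of the fairy tale sentence is already in the vocabulary, while six of the seven tokens in the legal sentence are not, so a model built on that vocabulary would have almost nothing to work with.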

Lemmatization

Look into expanding the lemmatization section, or possibly moving it to further reading.

Talk about license

The license in the repo is currently CC0, which has no attribution or non-commercial requirements. As is, this would let somebody else take the book content and sell it themselves. We probably want to change this to something like CC-BY-NC-SA.
