embeddings's Introduction

Project Title: Embeddings as a Fuller Measure of Complexity for Children's Books

1. Part 1: Review of Embedding Structure

This project presents a novel, unsupervised learning approach to assessing the complexity of children's literature. The approach uses the theory of embedding scatter to examine two corpora of children's texts: a collection of books from Project Gutenberg available via Kaggle datasets, and a collection of lore from the Brown library. When the embeddings are visualized in reduced dimensions with t-SNE, clear patterns emerge.

2. Part 2: Development of an Alternative Algorithm

We developed an alternative algorithm to assess text complexity and tested it against the industry-standard Flesch-Kincaid grade level. Our approach yielded an R-squared score of about 97% on the training set, significantly higher than industry-leading products, which typically achieve around 85%; the test-set score is reported in the Summary of Results.
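For readers unfamiliar with the baseline, the following is a minimal sketch of the Flesch-Kincaid grade level formula, 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The function names and the vowel-group syllable counter are illustrative assumptions, not the code used in this project; production tools usually rely on a pronunciation dictionary instead.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("cake"), but keep "-le" endings ("little").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of `text` using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    print(flesch_kincaid_grade("The cat sat on the mat."))
```

Very simple sentences can score below zero, which is expected behaviour for this formula rather than a bug.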

3. Summary of Models

We used two models in this project. The first is Word2Vec, a predictive, shallow neural network model that generates word embeddings. The second is a Random Forest model, a machine learning model that predicts the complexity of a sentence from features derived from the word embeddings and other sentence characteristics.

4. Hyperparameter Optimization

The primary hyperparameter optimized in this project was the number of trees used in the Random Forest model.

5. Summary of Results

Part 1: Very clear t-SNE visualisations emerged, particularly when we changed the unit of vectorization from a fixed number of tokens to whole documents. Our snake embedding picture is below and can be recreated with the code.

[Figure: t-SNE visualisation of the embeddings]
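The change in vectorization unit described in Part 1 can be sketched as two ways of grouping tokens before they are embedded. This is an illustrative sketch only: the function names, the whitespace tokenizer, and the choice to pool tokens across documents in the fixed-window variant are assumptions, not the project's actual code.

```python
def units_by_fixed_tokens(documents, size):
    """Pool all tokens and cut them into consecutive windows of `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    tokens = [tok for doc in documents for tok in doc.split()]
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def units_by_document(documents):
    """Keep each document as a single unit, regardless of its length."""
    return [doc.split() for doc in documents]


if __name__ == "__main__":
    docs = ["once upon a time", "the end"]
    print(units_by_fixed_tokens(docs, 3))  # windows may straddle document boundaries
    print(units_by_document(docs))         # one unit per document
```

With fixed windows a unit can mix text from two books, whereas document-level units keep each point in the plot tied to a single text.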

Part 2: We used the Part 1 findings to design a new algorithm that predicts the complexity of children's texts. The core innovation was the use of word embeddings in addition to sentence length and word length. This novel approach demonstrated a high degree of accuracy in predicting the complexity of children's literature, outperforming existing methods. The use of word embeddings to capture semantic and syntactic information about the words in the texts proved particularly effective, highlighting the potential of this approach for a wide range of applications in text analysis and natural language processing. However, we saw very clearly that the embeddings are most effective on similar corpora, so for use in K-12 education the datasets should be carefully selected.
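As a rough illustration of combining word embeddings with sentence length and word length, the sketch below builds one feature vector per sentence. The toy embedding table, the mean-pooling step, and the function name are hypothetical; in the project the vectors would come from the trained Word2Vec model, and the exact feature set may differ.

```python
import re
from statistics import mean

# Hypothetical 2-dimensional embeddings, standing in for a trained Word2Vec model.
TOY_EMBEDDINGS = {
    "the": [0.1, 0.0],
    "cat": [0.4, 0.2],
    "sat": [0.3, 0.1],
}


def sentence_features(sentence, embeddings, dim):
    """Return [mean embedding..., sentence length in words, mean word length]."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        raise ValueError("sentence contains no words")
    # Words missing from the embedding table are skipped (an assumption).
    known = [embeddings[t] for t in tokens if t in embeddings]
    if known:
        mean_vec = [mean(vec[i] for vec in known) for i in range(dim)]
    else:
        mean_vec = [0.0] * dim
    return mean_vec + [float(len(tokens)), float(mean(len(t) for t in tokens))]


if __name__ == "__main__":
    print(sentence_features("The cat sat.", TOY_EMBEDDINGS, dim=2))
```

Feature vectors of this shape could then be passed to a regressor such as the Random Forest model described in the Summary of Models.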

R-squared for the new training set: 0.978437

R-squared for the new test set: 0.784729
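For reference, R-squared is computed as 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below is a plain-Python illustration with made-up numbers; the function name is an assumption, and the project's scores were presumably produced by a library routine rather than this code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when all true values are equal")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Illustrative grade levels only, not data from this project.
    print(r_squared([1, 2, 3, 4], [1.5, 2, 3, 3.5]))
```

Applying the same function to held-out data is what allows the training and test scores to be compared directly.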

6. Hardware requirements

Some parts of the project require a GPU to execute, and the relevant code has been commented out. All other code was executed on an Apple M2.

Contact Details

David Roberts [email protected]
