erictleung / 2017-new-coder-survey

:beginner: Code to help clean and format the 2017 New Coder Survey by freeCodeCamp

R 99.31% Makefile 0.69%
coder-survey data data-cleaning dplyr freecodecamp

2017-new-coder-survey's Introduction

Hi there, I'm Eric! 👋

Instructions for living a life. Pay attention. Be astonished. Tell about it.

- Mary Oliver, American poet

  • 🧑‍💻 Currently: Data Scientist (full-time), Engineer (at heart), and Educator (in practice)
  • 🔭 I'm currently working on writing math articles for freeCodeCamp, and fun data side projects.
  • 🌱 I'm currently learning about Emacs Lisp to build my own packages, marketing, Econometrics with R, Bayesian statistics, and Causal inference as personal/professional endeavors for the future.
  • 📚 Currently reading: The Diamond Age: Or, a Young Lady's Illustrated Primer by Neal Stephenson.
  • 👯 I'm looking to collaborate on open source tools that empower individuals to solve problems and learn.
  • 💬 Ask me about open science, open-source culture, data science, healthcare reform, and education reform.
  • 😄 Pronouns: he/him
  • ⚡ Fun fact: I like writing with fountain pens on good stationery and notebooks.
  • 📫 How to reach me: Twitter or LinkedIn

Last updated: 2024-01-30

Twitter • Website • Email • Wikipedia • Mastodon • LinkedIn • StackOverflow • Threads • Quora • Tableau • DEV

2017-new-coder-survey's People

Contributors

erictleung

2017-new-coder-survey's Issues

Clean job interests

This year, respondents could select multiple answers for job interests. To make the answers easier to interpret, convert the text into binary (boolean) columns, one per job interest.
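As a rough sketch of that conversion in base R, the function below splits each multi-select answer and builds one logical column per distinct choice. The comma separator, the function name `to_boolean_columns`, and the sample answers are assumptions for illustration; the actual survey export may use a different delimiter.

```r
# Hypothetical helper: turn multi-select text answers into boolean columns.
# Assumes choices within one answer are separated by commas.
to_boolean_columns <- function(x, sep = ",\\s*") {
  choices <- strsplit(as.character(x), sep)   # one character vector per respondent
  all_choices <- unlist(choices)
  options <- sort(unique(all_choices[!is.na(all_choices)]))

  result <- data.frame(row.names = seq_along(x))
  for (opt in options) {
    result[[opt]] <- vapply(
      choices,
      # Keep NA for respondents who left the question blank
      function(ch) if (all(is.na(ch))) NA else opt %in% ch,
      logical(1)
    )
  }
  result
}

# Made-up example answers
job_interest <- c("Front-End Web Developer, Data Scientist",
                  "Data Scientist",
                  NA)
flags <- to_boolean_columns(job_interest)
```

Keeping `NA` for blank answers (rather than `FALSE`) preserves the difference between "not interested" and "did not answer".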

Generalize removal of obvious outliers

I've removed a couple outliers already, but here I want to add a function to remove other obvious outliers.

A list of some things to look for:

  • Part 2
    • ExpectedEarning == "xxxxx"
    • MoneyForLearning has more than three zeros, look at two or more
    • Age == 120
    • Gender Other has some irrelevant answers
    • ChildrenNumber > 50
    • Income > 100,000,000
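One way such a function might look, using only the numeric rules from the list above (Age, ChildrenNumber, Income). The function name and the sample data frame are hypothetical, and the text-based checks (ExpectedEarning, MoneyForLearning, Gender) are left out because they need manual review.

```r
# Hypothetical helper: drop rows matching the numeric outlier rules listed above.
# Assumes the data frame has Age, ChildrenNumber and Income columns.
remove_obvious_outliers <- function(df) {
  # A comparison with NA yields NA; treat that as "not flagged" so that
  # unanswered questions do not cause a row to be dropped.
  flagged <- function(cond) !is.na(cond) & cond

  is_outlier <- flagged(df$Age == 120) |
    flagged(df$ChildrenNumber > 50) |
    flagged(df$Income > 100000000)

  df[!is_outlier, , drop = FALSE]
}

# Made-up example data
survey <- data.frame(
  Age = c(25, 120, 31, NA),
  ChildrenNumber = c(0, 1, 60, 2),
  Income = c(40000, 50000, 30000, NA)
)
cleaned <- remove_obvious_outliers(survey)
```

Collecting the rules in one function makes it straightforward to add further checks later without touching the rest of the cleaning script.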

Make column data types uniform between parts before joining

  • Check if columns in the second part have "undefined" in the columns
  • Check if columns in the second part need to be changed from yes/no to 1/0
  • Check for truncated answers
  • Standardize data types between data sets to allow joining in next step
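A minimal sketch of the first two checks, assuming "undefined" should become a missing value and that yes/no columns should become integer 1/0. The function name `standardize_column` is hypothetical, and detecting truncated answers is not covered here.

```r
# Hypothetical helper: normalize one column before joining the two parts.
standardize_column <- function(x) {
  x <- as.character(x)
  x[x %in% "undefined"] <- NA          # treat the literal text "undefined" as missing

  lowered <- tolower(x)
  is_yes_no <- all(is.na(x) | lowered %in% c("yes", "no"))
  if (is_yes_no && any(!is.na(x))) {
    # yes -> 1, no -> 0, missing stays NA
    return(as.integer(lowered == "yes"))
  }
  x                                     # leave other columns as character
}

# Made-up example column
standardized <- standardize_column(c("Yes", "no", "undefined"))
```

Applying the same function to every shared column in both parts (for example with `lapply`) is one way to end up with matching types before the join.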

Create separate clean data set with threshold of answered questions

Some individuals left many answers blank, so it might be convenient to have a separate data set containing only individuals who answered more than xx% of the questions.

(Image: histogram of the percent of questions answered per respondent in part 2)

From the plot, a reasonable general cutoff is to keep people who answered at least 50% of the survey. The plot covers only the second part, so this threshold may change after the data sets have been merged.

Code to generate the plot:

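library(dplyr)  # provides the %>% pipe
# Percent of non-missing answers per respondent (row) in part 2
part2 %>%
    apply(MARGIN = 1, FUN = function(x) 100 * sum(!is.na(x)) / length(x)) %>%
    hist(main = "Distribution of Percent of Questions Answered in Part 2",
         xlab = "Percent Answered (%)")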
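Building the separate data set could then be a simple row filter. The sketch below assumes a 50% threshold, taken from the discussion above, and uses a hypothetical function name and made-up data; the final threshold may differ once both parts are merged.

```r
# Hypothetical helper: keep respondents who answered at least `threshold`
# (as a fraction between 0 and 1) of the questions.
filter_by_completeness <- function(df, threshold = 0.5) {
  answered <- rowMeans(!is.na(df))      # fraction of non-missing answers per row
  df[answered >= threshold, , drop = FALSE]
}

# Made-up example data: rows answer 75%, 50% and 0% of the questions
example <- data.frame(
  a = c(1, NA, NA),
  b = c(1, 2, NA),
  c = c(1, NA, NA),
  d = c(NA, 4, NA)
)
mostly_complete <- filter_by_completeness(example, threshold = 0.5)
```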

Clean up "Other" answer for "Which online learning resources have you found helpful?"

  • Find the relatively most common answers given (total = 1107, so keep answers given more than about 10 times, roughly 1%)
    • YouTube
    • SoloLearn
    • Microsoft Virtual Academy
    • Books
    • CodeCombat
    • Codefights
    • Code School
    • DataCamp
    • Exercism
    • Frontend Masters
    • GitHub
    • Google
    • Laracasts
    • Launch School
    • OpenClassroom
    • reddit
    • scotch.io
    • TheNewBoston
    • tutorialspoint
    • Watch and Code
    • Wes Bos
  • Create new column for popular answers
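The two steps above can be sketched in base R as follows. Both function names are hypothetical, the minimum count of 10 is taken from the note above, and matching is done by case-insensitive substring search, which is an assumption and may need refinement for short names such as "Google".

```r
# Hypothetical helper: list free-text answers that occur more than `min_count` times.
common_answers <- function(other, min_count = 10) {
  counts <- table(tolower(trimws(other)))   # NA answers are dropped by table()
  names(counts)[counts > min_count]
}

# Hypothetical helper: one logical column per popular resource.
flag_popular_answers <- function(other, popular) {
  out <- data.frame(row.names = seq_along(other))
  for (name in popular) {
    # fixed = TRUE so that "." in names like "scotch.io" is matched literally
    out[[name]] <- grepl(tolower(name), tolower(other), fixed = TRUE)
  }
  out
}

# Made-up example answers
other <- c("youtube tutorials", "Books and YouTube", NA, "scotchXio")
flags <- flag_popular_answers(other, c("YouTube", "Books", "scotch.io"))
```

Blank answers come out as `FALSE` in every column here; switching them to `NA` is a design choice left open.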

Clean expected earnings

  • Remove dollar signs
  • If value is in the form of xx.000, then convert to xx000 (i.e. remove the decimal place)
  • Deal with ranges of numbers e.g. "80k-100k"
  • Remove unnecessary answers e.g. "over $200 monthly"
  • No need to normalize values (e.g. low values to thousands) - let end user decide
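A possible sketch of these rules in base R. Several choices here are assumptions rather than decisions recorded in the issue: thousands separators (commas) are stripped, a trailing "k" is read as thousands, a range is replaced by its midpoint, and anything that still is not a plain number becomes `NA`. The function names are hypothetical.

```r
# Hypothetical helper: parse "60000" or "80k" into a number, NA otherwise.
parse_amount <- function(s) {
  is_valid <- grepl("^[0-9]+(\\.[0-9]+)?k?$", s)
  has_k <- grepl("k$", s)
  value <- suppressWarnings(as.numeric(sub("k$", "", s)))
  value <- ifelse(has_k, value * 1000, value)   # assumption: "k" means thousands
  ifelse(is_valid, value, NA_real_)
}

# Hypothetical helper: clean a vector of free-text expected earnings.
clean_expected_earning <- function(x) {
  x <- tolower(trimws(as.character(x)))
  x <- gsub("[$,]", "", x)                      # remove dollar signs (and commas)
  x <- sub("^([0-9]+)\\.000$", "\\1000", x)     # xx.000 -> xx000
  parts <- strsplit(x, "-", fixed = TRUE)       # split ranges such as "80k-100k"
  vapply(parts, function(p) {
    if (length(p) == 0 || length(p) > 2) return(NA_real_)
    mean(parse_amount(trimws(p)))               # midpoint of a range (assumption)
  }, numeric(1))
}

# Made-up example answers
earnings <- clean_expected_earning(
  c("$50.000", "80k-100k", "over $200 monthly", "60000")
)
```

In line with the last point above, low values are left untouched; only the explicit "k" suffix is expanded.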
