
teomandi / dit-bigdata-project1


In this project I work with Apache Spark on a dataset of real and fake job postings. The project also uses machine learning in Python to predict whether a job posting is fake or real.



Project 1

Big Data subject.

  • Implemented by:
    • Theodoros Mandilaras
    • cs2.190018
  • MSc DIT/EKPA 2019-2020

Simple Statistics in Scala

First, we load the CSV file:

scala> val df = spark.read.options(Map("header" -> "true", "escape" -> "\"", "inferSchema" -> "true")).csv("/absolute/path/to/fake_job_postings.csv")
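To see what these read options do, the following minimal sketch applies the same options to a small temporary file. The file contents, the column names other than `fraudulent`, and the names `tmp` and `toyDf` are hypothetical stand-ins, not part of the project. It assumes Spark is on the classpath; inside spark-shell, `getOrCreate()` reuses the existing session.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("csv-load-sketch").getOrCreate()

// Hypothetical two-row file: the description field contains doubled quotes.
val tmp = Files.createTempFile("toy_postings", ".csv")
val content = "title,description,fraudulent\n" +
  "Engineer,\"Build \"\"great\"\" things\",0\n" +
  "Clerk,Filing,1\n"
Files.write(tmp, content.getBytes("UTF-8"))

val toyDf = spark.read
  .options(Map(
    "header" -> "true",      // first line holds the column names
    "escape" -> "\"",        // a doubled quote inside a quoted field is a literal quote
    "inferSchema" -> "true"  // guess column types instead of reading everything as string
  ))
  .csv(tmp.toString)

// Shows the inferred type of each column.
toyDf.printSchema()
```

Running `printSchema()` on the real `df` is a quick way to check that `fraudulent` was inferred as a numeric column before filtering on it.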

a

Number of lines in the CSV file

scala> df.count()

res0: Long = 17881

b

Number of fake job postings

scala> val fake_jobs = df.filter("fraudulent == 1")
scala> fake_jobs.count()

res5: Long = 866

c

Number of real job postings

scala> val real_jobs = df.filter("fraudulent == 0")
scala> real_jobs.count()

res5: Long = 17014
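The two filtered counts above sum to 17880, one fewer than the total returned by `df.count()`. A grouped count is one way to see every distinct value of the label column in a single pass, including any value that is neither 0 nor 1. The sketch below uses hypothetical toy rows (`toy`, `labelCounts`) rather than the real CSV, and assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("label-counts-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy rows standing in for fake_job_postings.csv.
val toy = Seq(("Engineer", 0), ("Data entry", 1), ("Analyst", 0)).toDF("title", "fraudulent")

// One aggregation returns the number of rows for every label value.
val labelCounts = toy.groupBy("fraudulent").count()
  .collect()
  .map(r => r.getInt(0) -> r.getLong(1))
  .toMap

println(labelCounts)
```

On the real data, `df.groupBy("fraudulent").count().show()` would print the same breakdown as a table.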

d

Top-10 most required education (e.g. Bachelor’s Degree) in fake job postings

scala> fake_jobs.groupBy("required_education").count().sort($"count".desc).show(10)

+--------------------+-----+
|  required_education|count|
+--------------------+-----+
|                null|  451|
|High School or eq...|  170|
|   Bachelor's Degree|  100|
|         Unspecified|   61|
|     Master's Degree|   31|
|Some High School ...|   20|
|       Certification|   19|
|    Associate Degree|    6|
|        Professional|    4|
|Some College Cour...|    3|
+--------------------+-----+
only showing top 10 rows

e

Top-10 most required education in real job postings

scala> real_jobs.groupBy("required_education").count().sort($"count".desc).show(10)
+--------------------+-----+
|  required_education|count|
+--------------------+-----+
|                null| 7654|
|   Bachelor's Degree| 5045|
|High School or eq...| 1910|
|         Unspecified| 1336|
|     Master's Degree|  385|
|    Associate Degree|  268|
|       Certification|  151|
|Some College Cour...|   99|
|        Professional|   70|
|          Vocational|   49|
+--------------------+-----+
only showing top 10 rows
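In both rankings the largest group is `null`, and longer values such as "High School or eq..." are cut off because `show` truncates strings to 20 characters by default. A possible variation, sketched here on hypothetical toy rows (`toyJobs`, `ranked`), filters out missing values before grouping and prints the full cell contents. It assumes Spark is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a session already exists; getOrCreate() reuses it.
val spark = SparkSession.builder().master("local[*]").appName("education-ranking-sketch").getOrCreate()
import spark.implicits._

// Hypothetical toy rows; None models a posting with no required_education value.
val toyJobs = Seq[(String, Option[String])](
  ("Engineer", Some("Bachelor's Degree")),
  ("Clerk", None),
  ("Cashier", Some("High School or equivalent")),
  ("Analyst", Some("Bachelor's Degree")),
  ("Typist", None)
).toDF("title", "required_education")

val ranked = toyJobs
  .filter($"required_education".isNotNull)  // drop postings with no stated requirement
  .groupBy("required_education").count()
  .sort($"count".desc)

// truncate = false prints full cell values instead of cutting them at 20 characters.
ranked.show(10, truncate = false)
```

Whether to drop the `null` group depends on the question being asked: keeping it shows how many postings state no requirement at all, while dropping it ranks only the stated requirements.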
