imsanjoykb / data-processing

Scrape Job Portal Data

Home Page: https://imsanjoykb.github.io/

License: Apache License 2.0

Python 100.00%
web-scraping automation web-crawler job-porta indeed-scraping job-search-website linkedin-scraper glassdoor-scraper monster

data-processing's Introduction

### Sanjoy Kumar Biswas

### Scrape Job portal data

Step 1 :

First, go through the problem statement to determine which domain's data I need to collect, and which columns and information I need to gather by scraping the site.

Step 2 :

Find the set of URLs from which I will scrape the information. For this problem statement I will choose job portal websites such as indeed.com, monster.com, and linkedin.com.

Step 3 :

Inspect the web page: A web page stores a lot of information, and I don't need all of it. Based on the problem statement, I inspect the page and highlight the HTML elements that hold the information I need, to find out which data I want to extract.

Step 4 :

Extract the data from the web page: Now extract the data from those websites. For web scraping I will use the Python library BeautifulSoup.
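A minimal sketch of this step, assuming the inspected page wraps each job in a `div.job-card` with child elements for the title, company, and city. The markup, class names, and the helpers `text_or_none` and `parse_jobs` are hypothetical; each real portal uses its own structure, which has to be read off the page in Step 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real job portals use different, frequently changing
# tag and class names, which have to be found by inspecting the page.
SAMPLE_HTML = """
<div class="job-card">
  <h2 class="job-title">Data Engineer</h2>
  <span class="company">Example GmbH</span>
  <span class="city">Berlin</span>
</div>
<div class="job-card">
  <h2 class="job-title">ML Engineer</h2>
  <span class="company">Sample AG</span>
</div>
"""


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if absent."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


def parse_jobs(html):
    """Turn each job card in the HTML into a dict of the wanted fields."""
    soup = BeautifulSoup(html, "html.parser")
    jobs = []
    for card in soup.select("div.job-card"):
        jobs.append({
            "job_title": text_or_none(card, "h2.job-title"),
            "company_name": text_or_none(card, "span.company"),
            "city": text_or_none(card, "span.city"),
        })
    return jobs


jobs = parse_jobs(SAMPLE_HTML)
```

In a real run the HTML would come from an HTTP response rather than a constant. Returning `None` for a missing element keeps incomplete cards in the result, so they can be handled during preprocessing instead of raising an error mid-scrape.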

Step 5 :

After scraping, I will preprocess every dataset. For example, Date-Time: different datasets may use different date-time formats, so I will convert them all to one common format.
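One way to sketch the date-time step with the standard library is to try a list of known input formats and emit ISO 8601 dates. The formats in `KNOWN_FORMATS` and the function name `normalize_date` are assumptions for illustration; the actual formats depend on what each portal returns.

```python
from datetime import datetime

# Assumed input formats; extend this tuple as new portals are added.
# "%d/%m/%Y" assumes day-first dates, which is an arbitrary choice here.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def normalize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


dates = [normalize_date(d) for d in ["2021-03-05", "05.03.2021", "05/03/2021", "yesterday"]]
```

Relative strings such as "yesterday" come back as `None` in this sketch, so they can be dropped or handled separately along with the other null values.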

Drop unnecessary columns. Drop null and duplicate rows. Scale all the datasets.
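The column and row cleanup can be sketched in plain Python as below, assuming each scraped dataset is a list of dicts. The function `clean_rows`, the `ad_id` column, and the sample rows are hypothetical.

```python
def clean_rows(rows, keep_columns, required_columns):
    """Keep selected columns, drop rows missing a required value, drop duplicates.

    required_columns must be a subset of keep_columns.
    """
    cleaned = []
    seen = set()
    for row in rows:
        # Keep only the wanted columns; absent ones become None.
        slim = {col: row.get(col) for col in keep_columns}
        # Drop rows where a required column is null or empty.
        if any(slim[col] in (None, "") for col in required_columns):
            continue
        # Drop exact duplicates of rows already kept.
        key = tuple(slim[col] for col in keep_columns)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(slim)
    return cleaned


# Made-up sample rows; "ad_id" stands in for an unnecessary column.
rows = [
    {"company_name": "Example GmbH", "job_title": "Data Engineer", "city": "Berlin", "ad_id": "a1"},
    {"company_name": "Example GmbH", "job_title": "Data Engineer", "city": "Berlin", "ad_id": "a2"},
    {"company_name": None, "job_title": "ML Engineer", "city": "Munich", "ad_id": "a3"},
]
cleaned = clean_rows(
    rows,
    keep_columns=("company_name", "job_title", "city"),
    required_columns=("company_name", "job_title"),
)
```

Columns are dropped before duplicates are checked, so the first two rows, which differ only in `ad_id`, collapse into one.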

Not every dataset has all the columns, so I will align the columns across all scraped datasets so that every dataset contains the same columns and can be merged easily. For example: indeed.com [German] contains the columns company_name, job_title, city, years of experience, salary [let's say the missing column is skills]

Monster.com [German] contains the columns company_name, job_title, city, year_of_experience, skills [let's say the missing column is salary]

Next I will determine which columns are common to those two (or any number of) datasets.

For job portal scraping, the must-have columns are job_title and company_name.

Now I will define the model data columns, that is, the information we need after the merge.

Model columns: company_name, job_title, city, year_of_experience, skills, salary

Then I will merge the two datasets on those columns.
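A minimal sketch of the alignment and merge, reading "merge" as stacking the rows of both datasets under the shared model columns (a key-based join would look different). It assumes the experience column has already been renamed to `year_of_experience` in both datasets; the `source` column, the function `align_to_model`, and the sample rows are hypothetical additions.

```python
MODEL_COLUMNS = ["company_name", "job_title", "city",
                 "year_of_experience", "skills", "salary"]


def align_to_model(rows, source):
    """Give every row exactly the model columns, filling gaps with None."""
    aligned = []
    for row in rows:
        record = {col: row.get(col) for col in MODEL_COLUMNS}
        # Hypothetical extra column recording which portal the row came from.
        record["source"] = source
        aligned.append(record)
    return aligned


# Made-up sample rows: the indeed.com row lacks skills, the monster.com row lacks salary.
indeed_rows = [{"company_name": "Example GmbH", "job_title": "Data Engineer",
                "city": "Berlin", "year_of_experience": 3, "salary": 65000}]
monster_rows = [{"company_name": "Sample AG", "job_title": "ML Engineer",
                 "city": "Munich", "year_of_experience": 5, "skills": "python, sql"}]

merged = align_to_model(indeed_rows, "indeed.com") + align_to_model(monster_rows, "monster.com")
```

After this step every row has the same keys, and the values still missing (`skills` for indeed.com, `salary` for monster.com) are explicit `None` entries.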

Step 6 :

Fill missing values [skills] for indeed.com: To fill the missing rows I will apply a machine learning model. The monster.com dataset has a skills column, so I will train the model on it and then apply the trained model to the indeed.com dataset.

Fill missing values [salary] for monster.com: For the missing salary column in the monster.com data I will use a regression machine learning algorithm to fill in the rows. The indeed.com data serves as the training set because it has a salary column.
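The salary case can be sketched with a single-feature linear regression written in plain Python. This is only a stand-in: the README does not say which regression algorithm or which input features are used, so predicting `salary` from `year_of_experience` alone is an assumption, as are the function names, the `salary_imputed` flag, and the sample numbers.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept.

    Requires at least two distinct x values, otherwise sxx is zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fill_missing_salary(train_rows, target_rows):
    """Fit on rows that have a salary, then predict it where it is missing."""
    xs = [r["year_of_experience"] for r in train_rows]
    ys = [r["salary"] for r in train_rows]
    slope, intercept = fit_line(xs, ys)
    filled = []
    for row in target_rows:
        row = dict(row)  # copy so the input rows stay unchanged
        if row.get("salary") is None:
            row["salary"] = slope * row["year_of_experience"] + intercept
            row["salary_imputed"] = True  # hypothetical flag for predicted values
        filled.append(row)
    return filled


# Made-up numbers: indeed.com rows carry a salary, the monster.com row does not.
indeed_rows = [
    {"year_of_experience": 1, "salary": 40000},
    {"year_of_experience": 3, "salary": 50000},
    {"year_of_experience": 5, "salary": 60000},
]
monster_rows = [{"year_of_experience": 4, "salary": None}]
filled = fill_missing_salary(indeed_rows, monster_rows)
```

Flagging predicted values keeps them distinguishable from scraped ones in later analysis. Filling `skills` is a different kind of problem (predicting labels rather than a number) and is not covered by this sketch.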

data-processing's Issues

Indeed 403

I just tried this script to scrape an Indeed page. I also tried adding and changing headers, cookies, and proxies, and even tried a VPN. Nothing helped.

Is there a code update that makes scraping work again?
