web-scraping's Introduction

Web Scraping

Practicing web scraping with the legislative reports on the University of California web page

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

url = 'http://www.ucop.edu/operating-budget/budgets-and-reports/legislative-reports/2017-18-legislative-session.html'
# Request the content of the web page
result = requests.get(url)
cont = result.content

# Parse the response as a Beautiful Soup object
bSoup = BeautifulSoup(cont, "lxml")
# After inspecting the page source: the report tables sit inside this div
result = bSoup.find("div", {'class': 'list-land', 'id': 'content'})

# Find the tables in the HTML
tables = result.find_all('table')
# Flat list holding the text of every cell
data = []

# Since <tr> defines a row in HTML, collect every <tr> of the first table
rows = tables[0].find_all('tr')

# Grab every HTML cell in each row
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        # First text node of the cell; None when the cell has no text
        text = td.find(string=True)
        data.append(text)
        print(text)
1.
09/01/17
2018-19 (EDU 92493 - 92496-2017) Capital Expenditures
2.
09/01/17
9th Amended List of Proposed Energy Projects
3.
(in 2018)
Expenditures for Instruction (biennial) (pdf)
4.
11/01/17
Instruction and Research Space Summary & Analysis (pdf)
5.
11/01/17
Utilization of Classroom and Teaching Laboratories (biennial) (pdf)
6.
11/01/17
Five Year Capital Outlay Plan for State Funds (Capital Financial Plan 2017-27) (pdf)
7.
11/30/17
Innovation and Entrepreneurship Expansion (pdf)
8.
11/30/17
Number of Pupils who Attended a LCFF School: # Admitted to UC, and Enrollment disaggregated by Campus (pdf)
9.
12/01/17
Use of One-time Funds to Support Best Practices in Equal Employment Opportunity in Faculty Employment (pdf)
10.
12/01/17


11.
12/01/17
Project Savings Funded from Capital Outlay Bond Funds (pdf)
12.
12/01/17
Streamlined Capital Projects Funded from Capital (pdf)
13.
12/31/17
Firearm-related Violence Research (pdf)
14.
1/01/18
Contracts with Medical Laboratories
15.
01/01/18
Annual General Obligation Bonds Accountability (pdf)
16.
01/01/18
Small Business Utilization (pdf)
17.
01/10/18
Institutional Financial Aid Programs - Preliminary (pdf)
18.
01/10/18
Summer Enrollment (pdf)
19.
01/15/18
Contracting Out for Services at Newly Developed Facilities (pdf)
20.
02/01/18
Capital Expenditures Progress Report (EDU 92493 - 92496-2017) (pdf)
21.
02/01/18
Statewide Energy Projects (SEP) - Progress (pdf)
22.
02/01/18
Working Families Student Fee Transparency and Accountability Act (pdf)
23.
03/01/18
Student Transfers (pdf)
24.
03/01/18
Entry Level Writing Requirement (ELWR) (pdf)
25.
03/15/18
Performance Outcome Measures (pdf)
26.
03/31/18
Annual Student Financial Support (pdf)
27.
04/01/18
Unique Statewide Pupil Identifier (pdf)
28.
05/15/18
Receipt and Use of Lottery Funds (pdf)
29.
TBD
Draft Long Range Development Plan (LRDP) and LRDP EIR (pdf)
 
Future Reports
None
30.
09/01/18
2019-20 (EDU 92493 - 92496-2017) Capital Expenditures (pdf)
31.
09/01/18
10th Amended List of Proposed Energy Projects (pdf)
32.
09/01/18
Support Services/College Readiness (pdf)
33.
10/1/18
Expenditures for Istruction (biennial) (pdf)
34.
11/01/18
Instruction and Research Space Summary and Analysis (pdf)
35.
11/01/18
Utilization of Classroom and Teaching Laboratories (biennial) (pdf)
36.
11-30-18
Five Year Capital Outlay Plan for State Funds (Capital Financial Plan 2018-28) (pdf)
37.
12-31-20
Breast Cancer Research Program (pdf)
38.
12-31-20
Cigarette and Tobacco Products Surtax Research Program (pdf)
39.
01-01-21
California Subject Matter Programs (CSMP) (pdf)
40.
04-01-21
California State Summer School for Mathematics and Science (COSMOS) Program Outcomes (pdf)
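The scraping loop flattens every cell into one list, so any later step has to recover each date from its position relative to the title. A minimal sketch of an alternative that keeps each row's cells together might look like the following. The helper name `rows_to_records` and the sample `rows` (loosely copied from the printed output) are assumptions for illustration. With Beautiful Soup, a nested list of this shape could be built by collecting the text of `tr.find_all('td')` for each row.

```python
def rows_to_records(rows):
    """Turn per-row cell text into (date, report) pairs.

    Assumes each usable row holds three cells: item number, date, title.
    Rows with a different shape or an empty title are skipped.
    """
    records = []
    for cells in rows:
        if len(cells) != 3:
            continue
        _number, date, title = cells
        if not title or not title.strip():
            continue
        # Replace non-breaking spaces, as the original code does.
        records.append((date, title.replace(u'\xa0', u' ').strip()))
    return records


# Hypothetical sample: cell text for a few rows, modelled on the output above.
rows = [
    ["1.", "09/01/17", "2018-19 (EDU 92493 - 92496-2017) Capital Expenditures"],
    ["10.", "12/01/17", ""],             # row whose title cell is empty
    [" ", "Future Reports", None],       # section heading row, no title text
    ["13.", "12/31/17", "Firearm-related Violence Research (pdf)"],
]

records = rows_to_records(rows)
for date, title in records:
    print(date, title)
```

Unlike a filter on the string "pdf", this sketch also keeps reports whose titles lack a "(pdf)" suffix, which may or may not be what is wanted.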
# Set up empty lists
reports = []
date = []
# enumerate tracks the position of each item in data
for index, item in enumerate(data):
    if item and "pdf" in item:
        # The date is the cell just before the report title
        date.append(data[index-1])
        # To avoid unicode errors: https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string
        reports.append(item.replace(u'\xa0', u' '))

date = Series(date)
reports = Series(reports)
legislative_df = pd.concat([date, reports], axis=1)
legislative_df.columns = ['Date', 'Reports']
# Show every row after the first
legislative_df[1:]
    Date      Reports
1   11/01/17  Instruction and Research Space Summary & Analy...
2   11/01/17  Utilization of Classroom and Teaching Laborato...
3   11/01/17  Five Year Capital Outlay Plan for State Funds ...
4   11/30/17  Innovation and Entrepreneurship Expansion (pdf)
5   11/30/17  Number of Pupils who Attended a LCFF School: #...
6   12/01/17  Use of One-time Funds to Support Best Practice...
7   12/01/17  Project Savings Funded from Capital Outlay Bon...
8   12/01/17  Streamlined Capital Projects Funded from Capit...
9   12/31/17  Firearm-related Violence Research (pdf)
10  01/01/18  Annual General Obligation Bonds Accountability...
11  01/01/18  Small Business Utilization (pdf)
12  01/10/18  Institutional Financial Aid Programs - Prelimi...
13  01/10/18  Summer Enrollment (pdf)
14  01/15/18  Contracting Out for Services at Newly Develope...
15  02/01/18  Capital Expenditures Progress Report (EDU 9249...
16  02/01/18  Statewide Energy Projects (SEP) - Progress (pdf)
17  02/01/18  Working Families Student Fee Transparency and ...
18  03/01/18  Student Transfers (pdf)
19  03/01/18  Entry Level Writing Requirement (ELWR) (pdf)
20  03/15/18  Performance Outcome Measures (pdf)
21  03/31/18  Annual Student Financial Support (pdf)
22  04/01/18  Unique Statewide Pupil Identifier (pdf)
23  05/15/18  Receipt and Use of Lottery Funds (pdf)
24  TBD       Draft Long Range Development Plan (LRDP) and L...
25  09/01/18  2019-20 (EDU 92493 - 92496-2017) Capital Expen...
26  09/01/18  10th Amended List of Proposed Energy Projects ...
27  09/01/18  Support Services/College Readiness (pdf)
28  10/1/18   Expenditures for Istruction (biennial) (pdf)
29  11/01/18  Instruction and Research Space Summary and Ana...
30  11/01/18  Utilization of Classroom and Teaching Laborato...
31  11-30-18  Five Year Capital Outlay Plan for State Funds ...
32  12-31-20  Breast Cancer Research Program (pdf)
33  12-31-20  Cigarette and Tobacco Products Surtax Research...
34  01-01-21  California Subject Matter Programs (CSMP) (pdf)
35  04-01-21  California State Summer School for Mathematics...
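
The Date column mixes several notations: slashes (11/01/17), hyphens (11-30-18), an unpadded day (10/1/18), and the placeholder TBD. As a possible follow-up step, the strings could be normalized into real dates. The sketch below uses only the standard library; the names `parse_report_date` and `DATE_FORMATS` are hypothetical, and the two formats are an assumption based on the values shown above.

```python
from datetime import datetime

# Formats seen in the Date column above: slashes or hyphens, two-digit years.
DATE_FORMATS = ("%m/%d/%y", "%m-%d-%y")


def parse_report_date(raw):
    """Return a datetime.date for a recognised date string, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    # Placeholders such as "TBD" do not match any format.
    return None


for raw in ["11/01/17", "10/1/18", "11-30-18", "TBD"]:
    print(raw, "->", parse_report_date(raw))
```

Applied to the DataFrame, something like `legislative_df['Date'].map(parse_report_date)` would produce a column of dates with None wherever the source text was not a date.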
