
jakobtorben / intelligent-public-web-data-extraction


Imperial College London Advanced Data Science Team - Intelligent public web data extraction project with Refinitiv

Python 11.08% Jupyter Notebook 88.92%

intelligent-public-web-data-extraction's Introduction

Intelligent-public-web-data-extraction

To run the spider, first clone the repository:

git clone https://github.com/jakobtorben/Intelligent-public-web-data-extraction
cd ManWebScraper

To run the HSBC crawler, use the command:

scrapy crawl extract_board -o HSBC_board.csv

Similarly, for Unilever:

scrapy crawl Unilever_board -o Unilever_board.csv

Abstract Syntax Tree

The AST folder explores using abstract syntax trees to measure the difference between parsers. The file parser_diff.py is a similarity checker built from the core functionality of the Python package pycode_similar. It provides two methods for calculating the similarity between two functions:

  • UnifiedDiff: Normalises each function's nodes and then finds the difference between them.
  • TreeDiff: Uses zss.distance to compute the distance between two ordered trees, defined as the weighted number of edit operations needed to transform one tree into the other.

An example at the bottom of the file shows how to use the code to compute the similarity.
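
As a rough illustration of the UnifiedDiff idea, the sketch below uses only the standard library (ast and difflib), not parser_diff.py itself. The helper names node_lines and similarity, and the three toy parsers, are hypothetical and not taken from the repository. Each function is reduced to the sequence of AST node type names in the order ast.walk visits them (a crude stand-in for normalisation), and the score is the ratio of matching entries.

```python
import ast
import difflib


def node_lines(source):
    """Return one entry per AST node, using only the node type name.

    Dropping names and constants is a crude stand-in for normalisation.
    """
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


def similarity(source_a, source_b):
    """Ratio in [0, 1] of matching node entries between two snippets."""
    matcher = difflib.SequenceMatcher(
        None, node_lines(source_a), node_lines(source_b)
    )
    return matcher.ratio()


# Toy parsers, made up for this example.
parser_a = "def parse(response):\n    return response.css('li').get()\n"
parser_b = "def parse(resp):\n    return resp.css('article').get()\n"
parser_c = (
    "def parse(response):\n"
    "    for row in response.css('tr'):\n"
    "        yield row\n"
)

print(similarity(parser_a, parser_b))  # same structure, different names
print(similarity(parser_a, parser_c))  # different structure
```

Because names and string literals are discarded, parser_a and parser_b are indistinguishable under this scheme, which is the behaviour the normalisation issue further down sets out to change.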

intelligent-public-web-data-extraction's People

Contributors

jakobtorben, namiyousef, ricardomokhtari


intelligent-public-web-data-extraction's Issues

Make pycode_similar symmetric

For the TreeDiff method, the distance is calculated from the number of inserts, removes, and updates performed. In the pycode_similar package, inserts are treated differently from removes: an insert always has zero cost. As a result, if you swap the two parsers, whatever was removed the first time is inserted the second time at no cost, which makes the distance non-symmetric.

I fixed this by using the same string distance function for both insert and remove, with the arguments in reverse order. Specifically, I changed

res = zss.distance(
    a.func_node, b.func_node, _get_children,
    lambda node: 0,  # insert cost
    lambda node: _str_dist(_get_label(node), ''),  # remove cost
    lambda _a, _b: _str_dist(_get_label(_a), _get_label(_b)),  # update cost
)

into

res = zss.distance(
    a.func_node, b.func_node, _get_children,
    lambda node: _str_dist('', _get_label(node)),  # insert cost
    lambda node: _str_dist(_get_label(node), ''),  # remove cost
    lambda _a, _b: _str_dist(_get_label(_a), _get_label(_b)),  # update cost
)

After this modification, swapping the two parsers gives almost identical results. A small difference remains because the similarity percentage is calculated as the number of equal nodes divided by the total number of nodes. Swapping the parsers changes which nodes count as removes and which as inserts, so the percentage differs slightly.
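
The effect of a zero insert cost can be seen in a simplified setting. The sketch below is not the zss tree edit distance: it is a plain weighted edit distance over flat sequences of node labels, written with the standard library only. The functions label_cost and update are assumed stand-ins for _str_dist, whose definition is not shown here, and the label sequences are made up.

```python
def edit_distance(a, b, insert_cost, remove_cost, update_cost):
    """Weighted edit distance for turning label sequence a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + remove_cost(a[i - 1])
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + insert_cost(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + remove_cost(a[i - 1]),
                dist[i][j - 1] + insert_cost(b[j - 1]),
                dist[i - 1][j - 1] + update_cost(a[i - 1], b[j - 1]),
            )
    return dist[-1][-1]


def label_cost(label):
    # Assumed stand-in for _str_dist(label, ''): cost grows with label length.
    return len(label)


def update(x, y):
    # Assumed stand-in for _str_dist(x, y): free if equal, else the longer length.
    return 0 if x == y else max(len(x), len(y))


shorter = ["Call", "Name"]
longer = ["Call", "Name", "Attribute"]

# Original scheme: inserts are free, removes are not.
old_ab = edit_distance(shorter, longer, lambda n: 0, label_cost, update)
old_ba = edit_distance(longer, shorter, lambda n: 0, label_cost, update)

# Modified scheme: an insert costs the same as the matching remove.
new_ab = edit_distance(shorter, longer, label_cost, label_cost, update)
new_ba = edit_distance(longer, shorter, label_cost, label_cost, update)

print("free inserts:     ", old_ab, old_ba)
print("symmetric costs:  ", new_ab, new_ba)
```

With free inserts the two directions disagree, because the extra label is inserted at no cost one way and removed at full cost the other way. With matching insert and remove costs, both directions give the same distance.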

Remove normalisation for pycode_similar

I looked at how to remove the normalisation in the pycode_similar implementation, which strips the arguments from function calls. Those arguments are exactly what we are interested in.

I removed this normalisation by commenting out the code in the class 'BaseNodeNormalizer' that deletes the relevant parts. In addition, the comparison used in the zss.distance call only checks whether the names of the nodes are equal, not their values, so a string argument matches any other string. I changed this to compare the actual characters of the string, and did the same for attributes.

After these changes, the nodes response.css("li.directors-index__item") and response.css("article") are treated as different, as are person.css("not relevant").get() and person.css("not relevant").attrib['title'].
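
The difference between comparing node names and comparing node values can be sketched with the standard ast module. This is an illustration only, not the modified pycode_similar code; type_labels and value_labels are hypothetical helpers.

```python
import ast


def type_labels(expr):
    """Label every node by its type name only (values are ignored)."""
    tree = ast.parse(expr, mode="eval")
    return [type(node).__name__ for node in ast.walk(tree)]


def value_labels(expr):
    """Label nodes by type plus the value they carry, where there is one."""
    labels = []
    for node in ast.walk(ast.parse(expr, mode="eval")):
        name = type(node).__name__
        if isinstance(node, ast.Constant):
            name += f":{node.value!r}"  # literal, e.g. the CSS selector string
        elif isinstance(node, ast.Attribute):
            name += f":{node.attr}"     # attribute name, e.g. css, get, attrib
        labels.append(name)
    return labels


a = "response.css('li.directors-index__item')"
b = "response.css('article')"
c = "person.css('not relevant').get()"
d = "person.css('not relevant').attrib['title']"

print(type_labels(a) == type_labels(b))    # names only: the selectors are invisible
print(value_labels(a) == value_labels(b))  # values included: the selectors differ
print(value_labels(c) == value_labels(d))
```

When only type names are compared, the two selector calls look identical because every string literal is just a Constant node. Including the literal and attribute values in the label is what lets the comparison tell them apart.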

Extract board members from 3 different companies

What type of data to extract is yet to be defined.

Acceptance criteria:

  • Demonstrate a method for extracting board members from Unilever, HSBC and AstraZeneca
  • At this point, use only present-day (current) information

Generate similarity baselines using pycode_similar

Goal: Calculate similarity scores between parsers using the pycode_similar module and generate plots

Acceptance criteria:

  • Generate a heat map or scatter plot of the pairwise similarity between the parsers we have written
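
One possible shape for this baseline is sketched below. It does not call pycode_similar: the similarity function is an assumed stand-in based on ast and difflib, and the parser sources and names are placeholders rather than the project's real spiders. The plotting helper assumes matplotlib is installed and is only defined here, not called.

```python
import ast
import difflib


def similarity(source_a, source_b):
    """Stand-in score: ratio of matching AST node type names."""
    def labels(src):
        return [type(n).__name__ for n in ast.walk(ast.parse(src))]

    matcher = difflib.SequenceMatcher(None, labels(source_a), labels(source_b))
    return matcher.ratio()


def pairwise_matrix(parsers):
    """Return (names, matrix) where matrix[i][j] scores parser i against j."""
    names = list(parsers)
    matrix = [[similarity(parsers[a], parsers[b]) for b in names] for a in names]
    return names, matrix


def plot_heatmap(names, matrix, path="similarity.png"):
    """Draw the matrix as a heat map; requires matplotlib."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(matrix, vmin=0.0, vmax=1.0)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha="right")
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.colorbar(image, ax=ax, label="similarity")
    fig.tight_layout()
    fig.savefig(path)


# Placeholder parser sources, made up for this example.
parsers = {
    "parser_1": "def parse(r):\n    return r.css('li').get()\n",
    "parser_2": "def parse(r):\n    return r.css('article').get()\n",
    "parser_3": "def parse(r):\n    for row in r.css('tr'):\n        yield row\n",
}
names, matrix = pairwise_matrix(parsers)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

Calling plot_heatmap(names, matrix) would write the heat map to similarity.png. To use the real scores, swap the stand-in similarity function for a call into pycode_similar.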
