Imperial College London Advanced Data Science Team - Intelligent public web data extraction project with Refinitiv


intelligent-public-web-data-extraction's Introduction

Intelligent-public-web-data-extraction

To run the spider, first clone the repository:

git clone https://github.com/jakobtorben/Intelligent-public-web-data-extraction
cd ManWebScraper

To run the HSBC crawler, run the command:

scrapy crawl extract_board -o HSBC_board.csv

Similarly, for Unilever:

scrapy crawl Unilever_board -o Unilever_board.csv

Abstract Syntax Tree

The AST folder explores using abstract syntax trees to measure the difference between parsers. The file parser_diff.py is a similarity checker built from the core functionality of the Python package pycode_similar. It includes two methods to calculate the similarity between two functions:

UnifiedDiff: Finds the difference in nodes after normalising the function's nodes.
TreeDiff: Uses zss.distance from the zss package to find the distance between two ordered trees, defined as the weighted number of edit operations needed to transform one tree into the other.

At the bottom of the file is an example of how the code can be used to find the similarity.
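
As a rough illustration of the UnifiedDiff idea (this is not the code in parser_diff.py), the sketch below uses only the standard library modules ast and difflib. It assumes that "normalising" means keeping node types and discarding names and literal values. The three parser sources and the helper names node_lines, similarity and diff_lines are hypothetical.

```python
import ast
import difflib

# Hypothetical parser functions, held as source text so that no scraping
# framework is needed to run the example.
PARSER_A = '''
def parse(response):
    for person in response.css("li.item"):
        yield {"name": person.css("h3::text").get()}
'''

PARSER_B = '''
def parse_board(response):
    for member in response.css("article"):
        yield {"name": member.css("h2::text").get()}
'''

PARSER_C = '''
def parse(response):
    return [row.attrib["title"] for row in response.css("tr")]
'''


def node_lines(source):
    """Return one indented line per AST node, keeping only the node type."""
    lines = []

    def visit(node, depth):
        lines.append("  " * depth + type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child, depth + 1)

    visit(ast.parse(source), 0)
    return lines


def similarity(source_a, source_b):
    """Ratio in [0, 1] of matching normalised node lines."""
    matcher = difflib.SequenceMatcher(None, node_lines(source_a), node_lines(source_b))
    return matcher.ratio()


def diff_lines(source_a, source_b):
    """Unified diff of the two normalised node listings."""
    return list(difflib.unified_diff(node_lines(source_a), node_lines(source_b), lineterm=""))


print(similarity(PARSER_A, PARSER_B))
print(similarity(PARSER_A, PARSER_C))
print("\n".join(diff_lines(PARSER_A, PARSER_C)))
```

Because names and string literals are dropped, PARSER_A and PARSER_B look identical under this normalisation even though they use different CSS selectors. That loss of detail is what the "Remove normalisation for pycode_similar" issue addresses.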

intelligent-public-web-data-extraction's People


Forkers

oserban

intelligent-public-web-data-extraction's Issues

Create parsers for Coca Cola

Website: https://www.coca-colahellenic.com

Success criteria:

3 different parsers referring to different website structures (1 current, 2 historical)

Create parsers for Shell

Website: https://www.shell.com/

Success criteria:

3 different parsers referring to different website structures (1 current, 2 historical)

Look into popular methods for representing ASTs in Machine Learning

Success criteria:

find at least one method of representing ASTs as vectors

Implement WASTK Paper

Goal: Create a working implementation for calculating the weighted abstract syntax tree kernel (WASTK) similarity metric, from https://downloads.hindawi.com/journals/sp/2017/7809047.pdf

Acceptance criteria:

Demonstration of calculated scores between parsers

Explore methods for web scraping, such as Beautiful Soup

Acceptance criteria:

Decide on the package/method that we will use for scraping the websites

Look into the suffix method of representing ASTs

Success criteria:

create a function that converts an AST to a vectorised form using the 'suffix' method
apply function to at least one parser

Generate dataset of trigger words using page paragraphs

We will explore methods of extracting trigger words from page text

Recreate heatmap plots with and without normalisation for all parsers

See issue #19

Make pycode_similar symmetric

For the TreeDiff method, the distance is calculated from the number of inserts, removes, and updates performed. In the pycode_similar package, inserts are treated differently from removes: an insert always has zero cost. As a result, if you swap the two parsers, what was removed last time is inserted this time, which makes the distance non-symmetric.

I fixed this by using the same string distance function for both insert and remove, with the arguments in reverse order. Specifically, I changed

res = zss.distance(a.func_node, b.func_node, _get_children,
                   lambda node: 0,  # insert cost
                   lambda node: _str_dist(_get_label(node), ''),  # remove cost
                   lambda _a, _b: _str_dist(_get_label(_a), _get_label(_b)))  # update cost

into

res = zss.distance(a.func_node, b.func_node, _get_children,
                   lambda node: _str_dist('', _get_label(node)),  # insert cost
                   lambda node: _str_dist(_get_label(node), ''),  # remove cost
                   lambda _a, _b: _str_dist(_get_label(_a), _get_label(_b)))  # update cost

After this modification, swapping the two parsers gives almost identical results. A small difference remains because the similarity percentage is calculated as the number of equal nodes divided by the total number of nodes. Swapping the parsers changes which nodes are removed and which are inserted, so the percentage differs slightly.
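
The effect of the insert cost can be reproduced without zss. The sketch below is a small, unoptimised ordered tree edit distance written from the textbook forest recursion; it is not the zss implementation. The trees, the tuple representation (label, children) and label_dist, a toy stand-in for _str_dist, are assumptions made for illustration.

```python
from functools import lru_cache


def label_dist(a, b):
    """Toy stand-in for _str_dist: 0 for equal labels, else the longer length."""
    return 0 if a == b else max(len(a), len(b))


def tree_edit_distance(tree_a, tree_b, insert_cost, remove_cost, update_cost):
    """Ordered tree edit distance; a tree is a (label, children_tuple) pair."""

    @lru_cache(maxsize=None)
    def forest_dist(f1, f2):
        if not f1 and not f2:
            return 0
        if not f1:
            # Only inserts remain: insert the root of the rightmost tree of f2.
            label, children = f2[-1]
            return forest_dist(f1, f2[:-1] + children) + insert_cost(label)
        if not f2:
            # Only removes remain: remove the root of the rightmost tree of f1.
            label, children = f1[-1]
            return forest_dist(f1[:-1] + children, f2) + remove_cost(label)
        (l1, c1), (l2, c2) = f1[-1], f2[-1]
        return min(
            forest_dist(f1[:-1] + c1, f2) + remove_cost(l1),
            forest_dist(f1, f2[:-1] + c2) + insert_cost(l2),
            forest_dist(c1, c2) + forest_dist(f1[:-1], f2[:-1]) + update_cost(l1, l2),
        )

    return forest_dist((tree_a,), (tree_b,))


# Hypothetical trees: the second has two extra child nodes.
small = ("call", (("css", ()),))
large = ("call", (("css", ()), ("get", ()), ("attrib", ())))


def remove(label):
    return label_dist(label, "")


def free_insert(label):
    return 0  # original behaviour: inserts cost nothing


def paid_insert(label):
    return label_dist("", label)  # modified behaviour: mirrors the remove cost


old_forward = tree_edit_distance(small, large, free_insert, remove, label_dist)
old_backward = tree_edit_distance(large, small, free_insert, remove, label_dist)
new_forward = tree_edit_distance(small, large, paid_insert, remove, label_dist)
new_backward = tree_edit_distance(large, small, paid_insert, remove, label_dist)

print(old_forward, old_backward)
print(new_forward, new_backward)
```

With free inserts, going from the small tree to the large one costs nothing while the reverse direction pays for two removes. Once inserts cost the same as the matching removes, both directions give the same distance.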

Create parsers for GSK

Extract board members with history from Unilever using archived websites

Acceptance criteria:

Extract three historical snapshots using archived websites

Look into the infix method of representing ASTs

Success criteria:

create a function that converts an AST to a vectorised form using the 'infix' method
apply function to at least one parser

Generate dataset of trigger words using page titles

We will explore how words in a page's <title> tag correlate with the board information we are looking for
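
One possible starting point, shown here only as a sketch, is to pull the text of each page's <title> tag with the standard library html.parser module and count word frequencies. The HTML snippets and the names TitleParser and title_words are hypothetical, and word counting is just one way to look for trigger words.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def title_words(html):
    """Lower-cased alphabetic words from the page title."""
    parser = TitleParser()
    parser.feed(html)
    return re.findall(r"[a-z]+", parser.title.lower())


# Hypothetical pages; real input would come from the crawled sites.
pages = [
    "<html><head><title>Board of Directors | Example Corp</title></head>"
    "<body><p>Our board</p></body></html>",
    "<html><head><title>Our Board and Leadership</title></head></html>",
    "<html><head><title>Contact us</title></head></html>",
]

counts = Counter(word for html in pages for word in title_words(html))
print(counts.most_common(3))
```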

Remove normalisation for pycode_similar

I looked at how to remove the normalisation in the pycode_similar implementation. The normalisation strips the arguments from function calls, and these arguments are what we are interested in.

I removed this normalisation by commenting out the parts of the class 'BaseNodeNormalizer' that delete the relevant fields. In addition, the comparison used with zss.distance only checks whether the names of the nodes are equal, not their values, so any string argument matches any other string. I changed this to check whether the actual characters in the strings are equal, and did the same for attributes.

After these changes, the node response.css("li.directors-index__item") is treated as different from response.css("article"), and the node person.css("not relevant").get() is treated as different from person.css("not relevant").attrib['title'].
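
The difference between comparing node names and comparing node values can be seen with the standard library ast module alone. The sketch below is not the pycode_similar code; node_labels is a hypothetical helper that labels each node either by type only or by type plus its string, attribute or identifier.

```python
import ast


def node_labels(expr, with_values):
    """Label every AST node of a Python expression, in ast.walk order."""
    labels = []
    for node in ast.walk(ast.parse(expr, mode="eval")):
        label = type(node).__name__
        if with_values:
            if isinstance(node, ast.Constant):
                label += ":" + repr(node.value)  # literal value, e.g. a selector
            elif isinstance(node, ast.Attribute):
                label += ":" + node.attr         # attribute name, e.g. css
            elif isinstance(node, ast.Name):
                label += ":" + node.id           # identifier, e.g. response
        labels.append(label)
    return labels


a = 'response.css("li.directors-index__item")'
b = 'response.css("article")'
c = 'person.css("not relevant").get()'
d = 'person.css("not relevant").attrib["title"]'

names_only_equal = node_labels(a, False) == node_labels(b, False)
with_values_equal = node_labels(a, True) == node_labels(b, True)

print(names_only_equal, with_values_equal)
print(node_labels(c, True) == node_labels(d, True))
```

With names only, the two response.css(...) calls produce the same labels. Once values are included, the selector strings tell them apart.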

Research methods for computing difference between two ASTs

Goal: demonstration of method(s) to compute difference between 2 ASTs. If multiple methods are found, perhaps quantify the advantages/disadvantages of different approaches

Acceptance criteria: