edgi-govdata-archiving / web-monitoring-processing

Tools for accessing, "diff"-ing, and analyzing archived web pages

Home Page: https://edgi-govdata-archiving.github.io/web-monitoring-processing

License: GNU General Public License v3.0

Languages: Python 20.81%, HTML 79.05%, Dockerfile 0.14%
Topics: web-monitoring, gsoc-2017

web-monitoring-processing's Introduction

Code of Conduct | Project Status Board

⚠️ This project is no longer maintained. ⚠️ It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

web-monitoring-processing

A component of the EDGI Web Monitoring Project.

Overview of this component's tasks

This component is intended to hold various backend tools serving different tasks:

  1. Query external sources of captured web pages (e.g. Internet Archive, Page Freezer, Sentry), and formulate a request for importing their version and page metadata into web-monitoring-db.
  2. Query web-monitoring-db for new Changes, analyze them in an automated pipeline to assign priority and/or filter out uninteresting ones, and submit this information back to web-monitoring-db.

Development status

Working and Under Active Development:

  • A Python API to the web-monitoring-db Rails app in web_monitoring.db
  • Python functions and a command-line tool for importing snapshots from the Internet Archive into web-monitoring-db.
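For example, an Internet Archive import from the command line looks like this (the invocation below is taken verbatim from an issue report later on this page; the site and agency values are examples):

    wm import ia envirodatagov.org --site edgi --agency edgi

And a minimal sketch of the Python API — Client.from_env and list_pages are assumed names here, not a documented contract; check the docstrings in web_monitoring/db.py for the actual interface:

    # Hypothetical usage sketch; method names are assumptions.
    from web_monitoring import db

    client = db.Client.from_env()   # reads credentials from the environment
    pages = client.list_pages()     # JSON-like response from web-monitoring-db
    for page in pages['data']:
        print(page['uuid'], page['url'])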

Legacy projects that may be revisited:

Installation Instructions

  1. Get Python 3.7 or later. If you don't have the right version, we recommend using conda or pyenv to install it. (You don't need admin privileges to install or use them, and they won't interfere with any other installations of Python already on your system.)

  2. Install libxml2 and libxslt. (This package uses lxml, which requires your system to have the libxml2 and libxslt libraries.)

    On MacOS, use Homebrew:

    brew install libxml2
    brew install libxslt

    On Debian Linux:

    apt-get install libxml2-dev libxslt-dev

    On other systems, the packages might have slightly different names.

  3. Install the package.

    pip install -r requirements.txt
    python setup.py develop
  4. Copy the file .env.example to .env and supply any local configuration info you need. (Only some of the package's functionality requires this.) Apply the configuration:

    source .env
  5. See module comments and docstrings for more usage information. Also see the command line tool wm, which is installed with the package. For help, use

    wm --help
  6. To run the tests or build the documentation, first install the development requirements.

    pip install -r requirements-dev.txt
  7. To build the docs:

    cd docs
    make html
  8. To run the tests:

    python run_tests.py

    Any additional arguments are passed through to py.test.
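    For example, to run only tests whose names match a keyword and stop at the first failure — -x and -k are standard py.test options, and the keyword here is illustrative:

    python run_tests.py -x -k internetarchive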

Releases

We try to make sure the code in this repo’s main branch is always in a stable, usable state, but occasionally coordinated functionality may be written across multiple commits. If you are depending on this package from another Python program, you may wish to install from the release branch instead:

$ pip install git+https://github.com/edgi-govdata-archiving/web-monitoring-processing@release

You can also list the git+https: URL above in a pip requirements file.
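For example, a requirements file entry for the release branch is just that same URL on its own line:

git+https://github.com/edgi-govdata-archiving/web-monitoring-processing@release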

We usually create merge commits on the release branch that note the PRs included in the release or any other relevant notes (e.g. Release #302 and #313.).

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for all their contributions! See our contributing guidelines to find out how you can help.

Contributions Name
💻 ⚠️ 🚇 📖 💬 👀 Dan Allan
💻 Vangelis Banos
💻 📖 Chaitanya Prakash Bapat
💻 ⚠️ 🚇 📖 💬 👀 Rob Brackett
💻 Stephen Buckley
💻 📖 📋 Ray Cha
💻 ⚠️ Janak Raj Chadha
💻 Autumn Coleman
💻 Luming Hao
🤔 Mike Hucka
💻 Stuart Lynn
💻 ⚠️ Julian Mclain
💻 Allan Pichardo
📖 📋 Matt Price
💻 Mike Rotondo
📖 Susan Tan
💻 ⚠️ Fotis Tsalampounis
📖 📋 Dawn Walker

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

License & Copyright

Copyright (C) 2017-2021 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.


web-monitoring-processing's Issues

Design ETL pipeline

From @ambergman on February 11, 2017 22:44

What do we need from the incoming data?

Copied from original issue: edgi-govdata-archiving/filtration#1

Coordination with Python projects on the archiving side

It might be nice to be consistent about packaging choices and technology choices, where it makes sense to do so. Here are points off the top of my head:

Currently, we:

  • use a requirements.txt read by the setup.py to make dependencies discoverable

I plan to:

  • use CircleCI
  • use versioneer to keep web_monitoring.__version__, the setup.py version, and git tag in sync with one another
  • use sphinx for API documentation, possibly autogenerating the prose docs from Markdown pending more discussion

attn @jeffreyliu

Docker issues and documentation

While running the newspaper module, the following errors cropped up:

1. Docker build Argument mismatch

docker build -t yay

Error

docker build" requires exactly 1 argument(s).
See 'docker build --help'.

Usage:  docker build [OPTIONS] PATH | URL | -

Build an image from a Dockerfile

2. Docker Daemon error

docker build -t yay .

Error

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

3. No such file / Directory

[root@chai web-monitoring-processing]# docker build -t yay .

Error

unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /home/chai/GSOC/web-monitoring-processing/Dockerfile: no such file or directory

Export image changes

From @hellowendy on February 12, 2017 0:13

  • identify deletion/change of images
  • extract image content, e.g. image title, captions

Copied from original issue: edgi-govdata-archiving/filtration#10

Diff-match-patch seems to be working poorly

This might just be that we need to twiddle the settings a bit, but we are getting pretty poor diff results from the current diff-match-patch library. In this example from the UI, we should not just have two big blocks. Also odd is that the defaults for this library and go-calc-diff are the same, but their results are very different:

[Screenshot: side-by-side diff results ("dmp-side-by-side")]

Implementation in this repo on left, go-calc-diff on right.

I briefly tried switching to diff-match-patch-python, and it seems to get nice results. Check out the switch-to-dmp-python branch.

Key Error on PageFreezer object instantiation

Referring - Pagefreezer_Python_module/Readme.md

Python script (pagefree.py, opened with $ vim pagefree.py):

from PageFreezer import PageFreezer

url_old='https://raw.githubusercontent.com/edgi-govdata-archiving/pagefreezer-cli/master/archives/falsepos-num-views-a.html'
url_new='https://raw.githubusercontent.com/edgi-govdata-archiving/pagefreezer-cli/master/archives/falsepos-num-views-b.html'
pf = PageFreezer(url_old, url_new, api_key='')
pf.dataframe
pf.to_csv('results.csv')
pf.full_html_changes()
pf.diff_pairs()

A KeyError indicating that the query result was unsuccessful in finding 'result':

Traceback (most recent call last):
  File "pagefree.py", line 5, in <module>
    pf = PageFreezer(url_old, url_new, api_key='')
  File "/home/chai/GSOC/PageFreezer.py", line 13, in __init__
    self.run_query()
  File "/home/chai/GSOC/PageFreezer.py", line 27, in run_query
    self.query_result = result.json()['result']
KeyError: 'result'

NameError: global name undefined

The file PageFreezer.py refers to a global name a that hasn't been defined anywhere:

Traceback (most recent call last):
  File "pagefree.py", line 12, in <module>
    print pf.full_html_changes()
  File "/home/chai/GSOC/PageFreezer.py", line 44, in full_html_changes
    display(HTML(a['output']['html']))
NameError: global name 'a' is not defined

Experiment with other ways of computing diffs

Ideally this work should match the interface of PF.

f(html1: string, html2: string) -> dict where the output dict has the same keys as the PF result (or a superset of those keys)
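For illustration, a minimal sketch of a differ with this interface, built on difflib from the standard library. The output shape ({'output': {'diffs': [...]}}) is inferred from the PF tracebacks quoted elsewhere on this page; treat the exact keys as assumptions:

    import difflib

    def html_diff(html1: str, html2: str) -> dict:
        # Compare two HTML strings and return a PF-shaped result dict.
        matcher = difflib.SequenceMatcher(None, html1, html2)
        diffs = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != 'equal':
                diffs.append({'change': op,
                              'old': html1[i1:i2],
                              'new': html2[j1:j2]})
        return {'output': {'diffs': diffs}}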

Exception handling for DB

I've tried importing data from IA and got an error:

wm import ia envirodatagov.org --site edgi --agency edgi
importing: 31 versions [01:13,  2.51s/ versions]
Traceback (most recent call last):
  File "/home/kmadejski/.virtualenvs/wm-processing/bin/wm", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/home/kmadejski/Projects/web-monitoring-processing/scripts/wm", line 6, in <module>
    main()
  File "/home/kmadejski/Projects/web-monitoring-processing/web_monitoring/cli.py", line 99, in main
    site=arguments['<site>'])
  File "/home/kmadejski/Projects/web-monitoring-processing/web_monitoring/cli.py", line 28, in import_ia
    return post_versions_batched(versions)
  File "/home/kmadejski/Projects/web-monitoring-processing/web_monitoring/cli.py", line 51, in post_versions_batched
    assert res.ok
AssertionError

It turned out to be the wrong password, but better error reporting than assert res.ok would help a lot in debugging such issues.

How about creating an APIError base class and subclassing it based on the error received from the db, as well as adding general error handling (a kind of middleware) in https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/db.py
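One possible shape for that hierarchy (a sketch, not the repo's actual code):

    class WebMonitoringDbError(Exception):
        """Base class for errors returned by web-monitoring-db."""

    class UnauthorizedError(WebMonitoringDbError):
        """401: bad or missing credentials."""

    def raise_for_response(res):
        # Informative replacement for `assert res.ok`.
        if res.status_code == 401:
            raise UnauthorizedError(f'401 from {res.url}: check your credentials')
        if not res.ok:
            raise WebMonitoringDbError(f'{res.status_code} from {res.url}: {res.text[:200]}')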

Fix tests and CI builds

Tests appear to be broken right now, which makes them not-super-useful. Working tests would be pretty helpful in verifying PRs like #68. This also prevents CI from being helpful.

Further, the CircleCI test script doesn’t work right now. It looks like that’s just because coverage is not listed in test-requirements.txt, but we should make sure there aren’t any other issues, either.

UnicodeEncodeError: 'ascii' codec can't encode characters

While converting the dataframe to CSV,

  • in Python 2, by default it encodes as ascii
  • in Python 3, by default as utf-8

Since a few users may still use Python 2, it might throw the following UnicodeEncodeError:

Traceback (most recent call last):
  File "pagefree.py", line 11, in <module>
    pf.to_csv('resu.csv')
  File "/home/chai/GSOC/PageFreezer.py", line 48, in to_csv
    self.dataframe.to_csv(filename)
  File "/usr/lib64/python2.7/site-packages/pandas/core/frame.py", line 1381, in to_csv
    formatter.save()
  File "/usr/lib64/python2.7/site-packages/pandas/formats/format.py", line 1475, in save
    self._save()
  File "/usr/lib64/python2.7/site-packages/pandas/formats/format.py", line 1576, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/lib64/python2.7/site-packages/pandas/formats/format.py", line 1602, in _save_chunk
    lib.write_csv_rows(self.data, ix, self.nlevels, self.cols, self.writer)
  File "pandas/lib.pyx", line 1135, in pandas.lib.write_csv_rows (pandas/lib.c:20015)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2040-2041: ordinal not in range(128)
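A likely fix (assuming the pandas to_csv call quoted in the traceback) is to pass an explicit encoding, which makes Python 2 behave like Python 3's default:

    # Write UTF-8 regardless of the interpreter's default codec.
    self.dataframe.to_csv(filename, encoding='utf-8')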

Important change classification - text features

Based on the plan described in edgi-govdata-archiving/web-monitoring/issues/67, I'm laying out a tentative plan for the first pass of important change classification by looking at text changes. Separate issues will be opened for modifications over time.

To begin with, we need to do two things:

These two tasks can be tackled independently.

After creating a dataset, it will be fed to the model, which will then be evaluated.
The evaluation will be done in two ways:

  1. Dataset validation - A common Machine Learning practice is to use a part of the dataset as a test/validation set to evaluate the model.
  2. Feedback from analysts - The results of the model will be shared with analysts who'll evaluate the performance of the model.
  • Evaluate model and improve it

The entire process follows a general life cycle described in the figure below.
[Figure: machine learning life cycle ("ml_lifecycle")]

Updates on the first two tasks soon.

Score Cluster Quality [ETL step 5]

From @vidkum1 on February 11, 2017 23:56

Doing a grid search for the optimum number of clusters and clustering method (e.g., k-means, PAM, etc.) based on cluster quality. We could use something like a similarity score to determine how similar the points are within the clusters.

Copied from original issue: edgi-govdata-archiving/filtration#8
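A sketch of that grid search over the number of clusters, assuming scikit-learn and a precomputed feature matrix X (e.g. TF-IDF vectors of the diffs); silhouette score is one concrete choice for the similarity-based quality measure mentioned above:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k(X, candidates=range(2, 11)):
        # Score each candidate cluster count and return the best one.
        scores = {}
        for k in candidates:
            labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
            scores[k] = silhouette_score(X, labels)
        return max(scores, key=scores.get), scores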

Experiment with prioritization

A lot of people are excited to work on automated filtering/prioritization. It will be easier to work in parallel and compare results if we have a common interface. How about this:

  • You are given an unordered collection of Diffs. Each one has a uuid.
  • You are expected to return a mapping of each uuid to a float between 0 and 1, where 1 is interesting and 0 is junk.

Any other suggestions?
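As a concrete (toy) example of the interface — .uuid and .text are assumed Diff attributes, and raw diff length stands in for a real scoring model:

    def prioritize(diffs):
        # Map each diff's uuid to a score in [0, 1]; here, longer = more interesting.
        longest = max((len(d.text) for d in diffs), default=1) or 1
        return {d.uuid: len(d.text) / longest for d in diffs}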

Data ingestion [ETL step 0]

From @nadesai on February 11, 2017 23:51

For each subdomain, need to ingest:

  • source diff (raw HTML changes)
  • bytes added
  • bytes removed
  • enclosing HTML tag
  • context (HTML page source before/after). (Open question here - how much context is sufficient? Would a page "summary" do, or just the enclosing paragraph, instead of the full page source?)
  • time diff is recorded (may not be relevant due to noise in when pages are scraped)
  • Analyst feedback/annotations

Copied from original issue: edgi-govdata-archiving/filtration#5

Discuss splitting differs.py into multiple modules

Following up on @Mr0grog's comment on #59, I generally prefer one ~100-line module to several ~10-line modules, but I'm open for discussion. All of the "public" functions in differs.py are differs; all of the "private" functions (whose names begin with an underscore) are utilities used by one or more differs.

Add HTML diff for rendering

The most common diff view that analysts use in Versionista is the side-by-side highlighted diff:

[Screenshot: Versionista side-by-side highlighted diff, 2017-07-26]

It’s often hard to tell exactly what is meaningfully changing from a content perspective when looking at a source code diff—especially if, like many of the analysts, you are not a web developer who knows HTML really well. This view makes it very clear what is changing and how it relates to the page as a user sees it.

The main idea here is to be able to generate two versions of the HTML page—one with removed content highlighted (probably wrapped in <del> tags) and one with added content highlighted (probably wrapped in <ins> tags). This means the diff needs to be structurally aware and not start an insertion or deletion in the middle of a markup for a tag.

For example, it should handle scenarios like:

<a href="/careers/interns-post-docs.html">Interns & Postdocstoral<br/>Opportunities</a>
                  --------                ----------       -+++++++++++++++++++++++

Which can’t really be rendered as a web page because the diff intersects with HTML code. Instead we should get something like:

<a href="/careers/interns-post-docs.html">Interns & Postdocs</a><a href="/careers/post-docs.html">Postdoctoral<br/>Opportunities</a>
----------------------------------------------------------------++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Or, even fancier, data about non-visual changes included (Versionista isn’t this fancy, but this would really help analysts):

<a href="/careers/interns-post-docs.html" data-wm-change="Link URL changed from '/careers/interns-post-docs.html' to '/careers/post-docs.html'">Interns & PostdocsPostdoctoral<br/>Opportunities</a>
                                                                                                                                                ------------------++++++++++++++++++++++++++++++

Towards an algorithm for Prioritization

Introduction

Based on a prima facie understanding of the domain of environmental science, I would like to propose a mathematical model of sorts (more of a concept) that could help us develop an algorithm for prioritizing meaningful changes.

Gist

Priority will be given on the basis of Ratings and Confidence/Trust

Rating

The value given to a word (and progressively to the sentence and the page) on the basis of domain knowledge.

For example, after intense study, analysts might come up with the following:

Words | Rating | Type
a, an, the | 0.0 | stopwords
tom, dick, harry | 0.0 | layman
amazing, beautiful, destructive, fatal, life-saving, unique | 0.3 | adjectives
small, big, less, more, grand, tiny | 0.4 | variables
rain, water, wind, humidity, cyclone, tornado, flood, hurricane, summer, snow, fog | 0.5 | generic natural phenomena
donald j trump, barack obama, ben bernanke | 0.7 | important personalities
mm, cm, hours, days, seconds | 0.8 | units of measurement
high, low, up, down, left, right, north, east, west, south, south-east, south-west, north-east, north-west | 0.8 | directions
50%, 90%, 10% | 0.9 | quantifiers / numbers
carbon dioxide, carbon monoxide, ozone, ethane, methane, sulfur, sulfuric acid, sulfurous acid, carbonates, sulfates, epoxy | 1.0 | chemical compounds

Confidence

Confidence is a factor, updated over time, that represents the trust in and accuracy of a particular rated word.

  • By that I mean, over the course of time, every website and every change would be documented and carefully curated.
  • This would enable us to gauge confidence for a particular change from a specific website.
    E.g., if website X is found to always produce changes rated 0.2–0.6, it can be given lower confidence (say 40%).
    On the contrary, another website Y that consistently produces changes rated 1 can be given higher confidence (say 75%).

Priority Calculation

By basic logic,

updation_value = (rating) * (confidence)

(Positive sentiment)

Priority_new = Priority_old + updation_value

(Negative sentiment)

Priority_new = Priority_old - updation_value

After assigning priorities to every change, sort them in descending order (highest value first).

Deduction

The higher the priority value, the more important the change; inversely, the lower the priority value, the more negligible the change.

If this is found worthy, I would like to document the entire process and add it to our repository.
@b5 @danielballan @dcwalk

Note - All these values are just placeholders. They are subject to analyst and peer review.
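A direct transcription of the update rule above, for concreteness (the values stay placeholders):

    def update_priority(priority_old, rating, confidence, positive=True):
        # updation_value = rating * confidence; added for positive sentiment,
        # subtracted for negative sentiment.
        updation_value = rating * confidence
        return priority_old + updation_value if positive else priority_old - updation_value

    def rank(changes_with_priorities):
        # Sort (change, priority) pairs in descending order, highest first.
        return sorted(changes_with_priorities, key=lambda pair: pair[1], reverse=True)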

Create a service to diff two PDF files

We have a simplistic service for displaying diffs between two HTML pages (https://github.com/edgi-govdata-archiving/go-calc-diff), but we also see a lot of PDFs on government websites and would love to have a similar service for visualizing the diff between two versions of a PDF.

This should be a simple web service that takes two query arguments:

  • a: A URL for the “before” version of the PDF
  • b: A URL for the “after” version of the PDF

It can take any additional arguments that might make sense. It can produce an image, an HTML page, a PDF, or anything that can be rendered by most web browsers as an HTTP response.

If you need it to function in a different way to be feasible, let’s talk about it! We can make other interfaces work so long as they can be accessible as a web service.

Some open source libraries for diffing PDFs that might be useful:

KeyError: 'diffs' in PageFreezer output

Referring - Pagefreezer_Python_module/Readme.md

Python script (pagefree.py, opened with $ vim pagefree.py):

from PageFreezer import PageFreezer

url_old='https://raw.githubusercontent.com/edgi-govdata-archiving/pagefreezer-cli/master/archives/falsepos-num-views-a.html'
url_new='https://raw.githubusercontent.com/edgi-govdata-archiving/pagefreezer-cli/master/archives/falsepos-num-views-b.html'
pf = PageFreezer(url_old, url_new, api_key='ABCDEFGH') #changed the API key for privacy and security
pf.dataframe
pf.to_csv('results.csv')
pf.full_html_changes()
pf.diff_pairs()

A KeyError indicating that the query result is unable to find "diffs":

Traceback (most recent call last):
  File "pagefree.py", line 5, in <module>
    pf = PageFreezer(url_old, url_new, api_key)
  File "/home/chai/GSOC/PageFreezer.py", line 14, in __init__
    self.parse_to_df()
  File "/home/chai/GSOC/PageFreezer.py", line 35, in parse_to_df
    for diff in self.query_result['output']['diffs']:
KeyError: 'diffs'


Understand and document current deployment process

The diff server that is part of this repo is currently deployed on Google Cloud. I believe @danielballan did that deployment entirely by hand. We need to get a better understanding of how that deployment works and document it so that nobody is stymied if one person is out. See, for example, web-monitoring-versionista-scraper’s docs: https://github.com/edgi-govdata-archiving/web-monitoring-versionista-scraper/blob/master/deployment.md

This is not about rethinking deployment or coming up with a better process. We should do that, too, but we first need to understand and clarify what we’ve already got.

No matching distribution found for sqlalchmey

As per the instructions in the developer documentation in this repository:

# pip install -r requirements.txt

Requirement already satisfied: requests in /usr/lib/python2.7/site-packages (from -r requirements.txt (line 1))
Collecting sqlalchmey (from -r requirements.txt (line 2))
  Could not find a version that satisfies the requirement sqlalchmey (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for sqlalchmey (from -r requirements.txt (line 2))

(The package name is misspelled in requirements.txt; the actual PyPI package is sqlalchemy.)

Add diff for changes only (html and text)

Based on the discussion during the dev call on Wednesday: sometimes analysts and developers prefer to view only the changes instead of searching the entire page for them. It would be helpful if there were an option for this in our diffing server.
This option is already available in Versionista.

Sample Category: Single-phrase "Smoking gun"

From @ambergman on February 11, 2017 23:11

Binary categorization, with 0 or 1, of particular words or entire phrases being removed from a given "smoking gun" list.

This list can be generated later through user input

Copied from original issue: edgi-govdata-archiving/filtration#2
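A sketch of that binary categorization (the list entries are placeholders until the user-generated list exists):

    SMOKING_GUNS = ['climate change', 'greenhouse gas']  # placeholder entries

    def smoking_gun_flag(removed_text):
        # 1 if any listed word or phrase appears in the removed text, else 0.
        text = removed_text.lower()
        return int(any(phrase in text for phrase in SMOKING_GUNS))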

capture analyst feedback

From @aleatha on February 12, 2017 00:04

Analyst feedback on diffs should be stored, and used to calculate the weight of significant keywords

Copied from original issue: edgi-govdata-archiving/filtration#9

Implement test of pagefreezer diff

I added a stub for testing the pagefreezer diff in #68, but it still needs to be implemented. It looks like we'll need to mock/stub requests in order to do so. (Or maybe just mock web_monitoring.pagefreezer.compare or web_monitoring.pagefreezer.PageFreezer.run_query?)

Feature Extraction [ETL step 3]

From @abelsouza on February 11, 2017 23:49

  • Tokenization;
    -- HTML tags (e.g., <head>, <body>, <tr>/<td>);
  • TF + IDF;
    -- Term frequency: raw frequency per term;
    -- Inverse document frequency: how common or rare a term is across all documents;
    --- E.g., the logarithmically scaled inverse fraction of the documents that contain the word, obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient;
  • Analyst feedback;
    -- Domain specialist
Copied from original issue: edgi-govdata-archiving/filtration#4
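For reference, scikit-learn's TfidfVectorizer implements this style of log-scaled TF-IDF out of the box (a sketch; the documents list stands in for the actual diff texts):

    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = ['text of one change', 'text of another change']  # placeholders
    vectorizer = TfidfVectorizer()           # log-scaled IDF by default
    X = vectorizer.fit_transform(documents)  # rows = documents, columns = terms
    terms = vectorizer.get_feature_names_out()  # requires scikit-learn >= 1.0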

Add conclusive tests for new code

While simple test cases are a good place to start, they often do not cover the complexities of real-world data. As I have observed while working with data from different websites, testing on real-world examples leads to a better, more generalized solution.
We can't cover or predict all the cases we'll face beforehand, but better tests considerably reduce the number of cases where our code doesn't work properly.
I'm open to suggestions on how to tackle this.

Need a function that downloads raw captured HTML from Internet Archive

It should:

  • Check that URL is archived by the Internet Archive
  • Retrieve a list of the URIs and capture timestamps of all versions captured by the IA.
  • Formulate an 'Import' request for web-monitoring-db and POST it to the app.

It does not need to harvest any HTML from IA (as previously stated on an early version of this GH issue). We can just store the IA URI in our database; we don't need to maintain our own copy of it.
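A sketch of the listing step using the Wayback Machine's public CDX API (the CDX endpoint and field names are real; formulating and POSTing the web-monitoring-db import request is omitted):

    import requests

    def list_ia_captures(url):
        # Query the CDX index for every capture of `url`.
        res = requests.get('http://web.archive.org/cdx/search/cdx',
                           params={'url': url, 'output': 'json'})
        # An empty body means the URL is not archived by the IA.
        rows = res.json() if res.text.strip() else []
        if not rows:
            return
        header, captures = rows[0], rows[1:]
        for row in captures:
            record = dict(zip(header, row))
            yield record['timestamp'], record['original']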

What do we need for Machine Learning?

I hope someone can write a better description of what we need from a machine-learning component in this repository. Please feel free to edit this description directly.

Need a function that traverses the local PF data dump

PageFreezer provides a big cache of HTML files organized into ZIP files with XML metadata specifying time of capture.

This function should:

  • Traverse a directory of these files
  • Insert a row into Snapshots referencing that file
  • Insert a row into Page if this is our first time seeing this URL

Use versioneer

Versioneer auto-generates the __version__ attribute based on the most recent git tag.

Extract signals from diff result

Summary of ideas (copied from PROPOSALS.md from NYC event)

Per row:

  • type of change
  • contains date
  • contains non-visible tag
  • contains any tag
  • contains number
  • contains link tag
  • total characters changed
  • (maybe!) NLP metrics like "edit distance"

Per document:

  • total characters changed
  • history (timing) of changes
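A sketch of the per-row extraction for a few of these signals (row is assumed to be the text of one changed region):

    import re

    def row_signals(row):
        return {
            'contains_date': bool(re.search(r'\b\d{4}-\d{2}-\d{2}\b', row)),
            'contains_any_tag': bool(re.search(r'<[^>]+>', row)),
            'contains_link_tag': bool(re.search(r'<a[\s>]', row)),
            'contains_number': bool(re.search(r'\d', row)),
            'total_characters_changed': len(row),
        }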

Rank change clusters based on TF/IDF and analyst feedback

From @aleatha on February 11, 2017 23:52

Once changes have been clustered, they should be ranked to allow the analyst to prioritize which changes to examine first.

Ranks are calculated by:

  • calculating the k most representative words from each cluster (for example, by using entropy gain)
  • weighting the words using a combination of TF/IDF and analyst feedback weights.

Copied from original issue: edgi-govdata-archiving/filtration#7
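A sketch of the first step, with mean TF-IDF weight within a cluster standing in for entropy gain (X is a TF-IDF matrix over all changes, labels a NumPy array of cluster ids, terms the vocabulary):

    import numpy as np

    def top_k_words(X, labels, terms, cluster, k=5):
        # Mean TF-IDF weight of each term over the rows in this cluster.
        mean_weights = np.asarray(X[labels == cluster].mean(axis=0)).ravel()
        top = mean_weights.argsort()[::-1][:k]
        return [(terms[i], mean_weights[i]) for i in top]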

API call issue in Page Freezer Python Module

Hi, I tried running the PageFreezer IPython notebook on my machine and faced some issues.
The uploaded notebook already shows a KeyError when a request is sent.
https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/page_freezer_python_module/PageFreezer.ipynb

I tried debugging the issue and realised that the API call does not return the expected result.

[Screenshot: unexpected API response]

I'm trying to figure out the details of the issue.
If someone has an idea of what the problem could be, please take a look.
@dcwalk, can someone help me with this?

Obtain unaltered copies of archived pages from the Wayback Machine

As noted in passing in #3, the Internet Archive Wayback Machine inserts an HTML toolbar and special JS into the archived pages it serves. These inserted code snippets are delineated with HTML comments, but it's not cleanly done.

  • Is there some way to request the raw archived page -- or, even better, the original server response complete with headers?
  • If not, is it possible to carefully snip out the added HTML?

The goal is to obtain an unaltered copy of the archived page that can be hashed and compared / de-duplicated against archives collected by other services. This, of course, requires byte-level fidelity.
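One answer to the first question: the Wayback Machine supports an id_ ("identity") flag appended to the timestamp in an archive URL, which returns the capture without the inserted toolbar or rewritten links, and the original server's response headers are echoed back as X-Archive-Orig-* headers. For example (timestamp and URL illustrative):

    https://web.archive.org/web/20170101000000id_/https://www.epa.gov/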

Create a stash of interesting PageFreezer diff responses

Since querying PageFreezer's public API takes ~10 seconds, it would be useful to build an archive of PageFreezer responses for analysis. I'm picturing a script in this repo that, when run, generates a ZIP archive. We would distribute that archive as a build product, not something that we keep in git itself.

Two months ago we talked about pulling classification information from the monitoring team's spreadsheets into our efforts, but it was undergoing a QA review. Has that review concluded? attn @trinberg @ambergman @mayaad

Related issues: edgi-govdata-archiving/web-monitoring#16, edgi-govdata-archiving/web-monitoring#30, #6

Analysis of insignificant changes dictionary

The analyst team maintains a dictionary of insignificant changes, which they have created after analysing thousands of changes across various webpages.
To create filters that can identify insignificant changes and their category, the dictionary had to be thoroughly analysed. This included looking at the diffs in multiple modes (original page, source only, text only) with the help of Versionista's diffing interface.

After spending a considerable amount of time looking at these diffs, I was able to notice some patterns in the frequently occurring changes and started working on filters.

These included the following categories of changes:

  1. Date/ Time changes
  2. Embedded social media feeds
  3. Contact info changes

These were the categories that followed patterns identifiable just by looking at the structure. The filter only assigns these changes a low priority; it does not delete them from the data.

More patterns can be added after a more thorough study of the dictionary as more changes are added to it regularly.

The current filtering work can be found here - https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/master/web_monitoring/filtering.py
