
newsfeedback

Tool for extracting and saving news article metadata at regular intervals. It utilizes Beautiful Soup 4, trafilatura and Selenium to extract and, if desired, filter article metadata across three different pipelines depending on a site's structure.

Note: 🏗 This tool and its README are currently under construction 🏗


💻 Installation and Usage

If you use pipx, you can install with pipx install newsfeedback. Alternatively, you can install via pip: pip install newsfeedback. There you go! You can now run newsfeedback --help and the commands outlined below. To run the tests, type pytest.

📦 Getting Started - Default

"Out the box", newsfeedback retrieves a list of homepages to be extracted from the default homepage config file. It is recommended to proceed with extracting metadata with these unchanged settings once to get acquainted with the functionalities of the tool.

  1. After installing newsfeedback, run newsfeedback pipeline-picker -u '[LINK OF YOUR CHOICE]'. This URL must be in the config file.
  2. Check the output folder (default: newsfeedback/output) for the CSV - this will be the structure of your exported CSVs.
  3. If satisfied, proceed to run newsfeedback get-data, adding -t [INTEGER] to specify the interval (in hours) at which newsfeedback is to grab data.

Note: This defaults to every 6 hours and extracts data from all default homepage URLs in the config. If you wish to extract data from only one URL, add it to a custom config with `newsfeedback add-homepage-url` and then re-run Step 3.

💻 Running newsfeedback on a server

If you want to run newsfeedback on a dedicated server, please make sure that you have Chrome installed on said server. Otherwise, you may be met with a Chrome binary error when using the Pur Abo pipeline. If you are met with regularly occurring timeouts while using the Pur Abo pipeline, your server may not have enough memory. It seems that at least 2GB are needed.

🗂 Commands

newsfeedback --help

Get an overview of the command line commands.

newsfeedback add-homepage-url

Add a homepage URL to your config file (and thus to your metadata extraction workflow) via prompt.

newsfeedback generate-config

Generate a new user config. Prompts the user to select either metadata or homepage as the type of config and then clones the default settings into the new user config file.

If a user-generated homepage config already exists, missing default homepage URLs will be copied into the user-generated config file.

If a user-generated metadata config already exists, the user will be met with an error and prompted to adjust settings manually in the config.

newsfeedback pipeline-picker

-u → URL of the website you wish to extract metadata from.
-o → output folder (default: newsfeedback/output)

Extracts article links and metadata from a homepage that is stored in the config file. A typical use case is a trial run for the extraction of data from a newly added homepage.

newsfeedback get-data

-t → interval in hours; newsfeedback extracts the metadata and links once every -t (default: 6) hours.

Using the homepages listed in the user config file (or the default config file, should the former not exist), metadata is extracted.

🎨 Customizing your parameters

Extraction and filtering pipelines

beautifulsoup : using Beautiful Soup 4, URLs are extracted from the homepage HTML. As this initial URL collection is very broad, subsequent filtering is recommended. This is the most comprehensive pipeline for URL collection, but it also has a higher share of irrelevant URLs, especially if filtering is not turned on (a minimal sketch of the idea follows this list).

  • high URL success retrieval rates
  • high rates of irrelevant URLs
  • filtering recommended
trafilatura : using trafilatura, articles are extracted from the given homepage URL. Success rates depend on the homepage HTML structure, but the quota of irrelevant URLs is very low.
  • URL retrieval success depends on homepage HTML structure
  • low rates of irrelevant URLs
  • filtering is not needed
purabo : if a news portal requires access via a Pur Abo / data tracking consent (e.g. ZEIT Online or heise), the consent button for the latter must be clicked via Selenium so that the article URLs can be collected. The Pur Abo pipeline continues with the same functionalities as the beautifulsoup pipeline once the consent button has been clicked. Note: oftentimes, article URLs can still be retrieved, as the page is loaded behind the overlay, so only use this pipeline if others fail.
  • only needed for very few homepages
  • dependent on Selenium and the Beautiful Soup pipeline
  • high rates of irrelevant URLs
  • filtering recommended
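
To illustrate the general idea behind the beautifulsoup pipeline, here is a minimal sketch of homepage link collection. It assumes requests for fetching and a simple same-site prefix rule as a stand-in for the whitelist; neither is necessarily what newsfeedback does internally.

```python
# Minimal sketch of beautifulsoup-style URL collection (not newsfeedback's actual code).
# Assumes the `requests` package for fetching; newsfeedback may fetch pages differently.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_article_urls(homepage_url, filter_on=True):
    html = requests.get(homepage_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Grab every link on the page - this is why the raw collection is so broad.
    urls = {urljoin(homepage_url, a["href"]) for a in soup.find_all("a", href=True)}
    if filter_on:
        # Hypothetical whitelist rule: keep only links that stay on the same site.
        urls = {u for u in urls if u.startswith(homepage_url)}
    return sorted(urls)

print(collect_article_urls("https://www.example.com/")[:10])
```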

Filters apply to URLs only. newsfeedback's filters are based on a simple whitelist, with the eventual goal of allowing user additions to the whitelist rules. As this tool is still in its infancy, these filters are far from sophisticated ☺

Once article URLs have been extracted and, if need be, filtered, metadata is extracted with trafilatura.bare_extraction.
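
As a rough illustration of that last step, the sketch below pulls a small set of metadata fields for one article URL with trafilatura. The field names follow the list further below; the exact return type of bare_extraction (dict vs. document object) depends on the trafilatura version, so the sketch handles both.

```python
# Minimal sketch of per-article metadata extraction with trafilatura
# (illustrative only; newsfeedback's own pipeline handles this internally).
import trafilatura

def extract_metadata(article_url, wanted=("title", "url", "description", "date")):
    downloaded = trafilatura.fetch_url(article_url)
    if downloaded is None:
        return None
    extracted = trafilatura.bare_extraction(downloaded, url=article_url)
    # trafilatura 1.x returns a dict here; newer versions return a document
    # object that can be converted with as_dict().
    if not isinstance(extracted, dict):
        extracted = extracted.as_dict()
    return {key: extracted.get(key) for key in wanted}

print(extract_metadata("https://www.example.com/some-article"))
```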

Adding data to the config file

If you wish to generate a custom config file, run newsfeedback add-homepage-url and follow the instructions. You will be asked for the URL and the desired pipeline (either beautifulsoup, trafilatura or purabo). This creates a new user config and adds the desired URL to it. If you wish to extract metadata from the default homepages as well, please run newsfeedback generate-config and select homepage, as this copies the missing URLs into the user-generated config. newsfeedback will automatically refer to the user-generated config, if present, as the standard config for data collection.

Changing the types of metadata collected

By default, newsfeedback collects an article's title, url, description and date. If you wish to collect other categories of metadata, simply generate a user config file with newsfeedback generate-config and then manually adjust the settings within this file. Possible categories of metadata are: title, author, url, hostname, description, sitename, date, categories, tags, fingerprint, id, license, body, comments, commentsbody, raw_text, text, language. Note that not all websites provide all categories.


[Rahel Winter and Felix Victor Münch](mailto:[email protected], [email protected]) under MIT.


newsfeedback's Issues

Allow users to filter post-extraction

Rubrics and other recurring pages are extracted every day due to the structure of the homepages. Users should be able to add these URLs to a blacklist, so as not to needlessly clutter up their datasets.

Additionally, a blacklist of title keywords/phrases provided by us might be a nice idea, which users can decide to use or not (with examples of the dataset before and after filtering for visual decision-making). This would be especially useful for the bigger news outlets, as I am assuming many people will be tracking those.

On top of that, we should keep track of all these filters in a list, regardless of size of platform, as it would surely improve and solidify our filtering methods long-term.
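
A possible shape for such a post-extraction filter, sketched with pandas and hypothetical blacklist values (this is not implemented in newsfeedback yet; only the title and url columns are taken from the documented CSV structure):

```python
# Sketch of a possible post-extraction blacklist filter (not yet part of newsfeedback).
import pandas as pd

URL_BLACKLIST = ["https://www.example.com/newsletter"]    # hypothetical rubric pages
TITLE_KEYWORD_BLACKLIST = ["Podcast", "Horoskop"]          # hypothetical title keywords

def apply_blacklists(df: pd.DataFrame) -> pd.DataFrame:
    keep = ~df["url"].isin(URL_BLACKLIST)
    for keyword in TITLE_KEYWORD_BLACKLIST:
        keep &= ~df["title"].str.contains(keyword, case=False, na=False)
    return df[keep]

articles = pd.read_csv("newsfeedback/output/example.csv")  # hypothetical export
print(apply_blacklists(articles).head())
```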

Create pipelines by chaining functions

As the various functions rely on each other, streamline the processes by creating chained pipelines that build on the outputs of the functions involved.

Add tests for various metadata configurations

The click implementation defaults to ['title', 'date', 'url', 'description'] if nothing is named for metadata-wanted. Test whether other available metadata can also be retrieved and document the options available so that users know what they can get (see the trafilatura documentation).

Add and test scheduling function

Set up a scheduling function and store the rhythm in a config file.

@FlxVctr where should we store this information? An entirely new config file or within the website config file?
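
For reference, the interval-based collection could look roughly like the following. The schedule package and the collect_all_homepages helper are assumptions for this sketch, not the actual implementation.

```python
# Rough sketch of an hourly-interval scheduling loop (assumed, not newsfeedback's code).
import time
import schedule  # third-party package: pip install schedule

def collect_all_homepages():
    # Placeholder for the actual extraction run over all configured homepages.
    print("collecting metadata ...")

interval_hours = 6  # would be read from a config file once the storage question is settled
schedule.every(interval_hours).hours.do(collect_all_homepages)

while True:
    schedule.run_pending()
    time.sleep(60)
```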

Bypass "Pur Abo" Barriers by clicking the accept button

A few pages do not load content until the visitor has granted permission to use their data in exchange for free site usage (Pur Abo) - use Selenium to click on the button and move past this restriction.

Idea: go via driver.find_element(By.LINK_TEXT, "AKZEPTIEREN UND WEITER"), but this needs to be sufficiently tested and adjusted for different cases (literally: upper- and lowercase, but also different URLs). The former can surely be sorted out via regex, the latter adjusted accordingly.
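
A minimal sketch of that idea with Selenium; the headless Chrome setup and the plain try/except are assumptions, and the case/URL handling described above is still missing.

```python
# Sketch of clicking the "Pur Abo" consent button with Selenium (untested idea from above).
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # assumes Chrome is installed on the machine
driver = webdriver.Chrome(options=options)

driver.get("https://www.zeit.de/")
try:
    # Exact link text and casing vary per site; this still needs proper handling.
    driver.find_element(By.LINK_TEXT, "AKZEPTIEREN UND WEITER").click()
except NoSuchElementException:
    pass  # button not found - the page may already be usable behind the overlay

html = driver.page_source                   # hand this HTML over to the beautifulsoup pipeline
driver.quit()
```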

Add Pur Abo bypass to existing extraction pipelines

The Pur Abo bypass needs to be combined with the preexisting extraction functions. This might be a bit finicky due to Selenium being in the mix, but it is manageable nonetheless. Important to remember: these functions will have to be the default if a ZEIT URL is given as input, else users won't be able to access the articles.

Add empty dataframe warning for non-Pur-Abo platforms / Rework infos/warnings

In order to give users an easier overview of successful and erroneous data extraction runs, the infos, warnings and errors provided by newsfeedback should be streamlined and reworked, especially for 'middle of the list' platforms that produce empty dataframes.

A verbosity option could be helpful in this case.

Improve tests and functions dealing with config reading/writing

Currently, default-based tests are thrown off if a user-generated config exists. Config writing is not covered by a test yet. This issue therefore encapsulates the following areas:

  • config writing
    --> more importantly: making tests unintrusive to running collections; perhaps through a placeholder URL that does not actually lead anywhere and gets deleted again after testing
    --> add a config retriever option that immediately chooses the default config, even if user config exists (could be good for testing)
    --> implement format guidelines that throw errors if a user does not follow them (e.g. www. .de instead of https://www. .de/); a small sketch of such a check follows this issue
    --> make a TEST!
  • TEST: correct config being chosen
  • Rinse and repeat for metadata configs

In this vein, also test whether PK's idea of having the default configs as .py files with dicts in them works and stops problems from popping up.
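
One conceivable shape for the format check mentioned above, sketched with a hypothetical validate_homepage_url helper (not part of newsfeedback):

```python
# Sketch of a possible homepage URL format check (hypothetical helper).
from urllib.parse import urlparse

def validate_homepage_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Homepage URL must start with http(s)://, got: {url}")
    if not url.endswith("/"):
        url += "/"  # normalise the trailing slash instead of rejecting outright
    return url

print(validate_homepage_url("https://www.example.de"))  # -> https://www.example.de/
```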

Fix issues with text and comment extraction

Just some basic string tidying - get the line breaks out and fix other issues that prevent longer texts from being stored in a single CSV cell instead of spanning multiple lines.
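
A minimal example of the kind of tidying meant here, assuming the extracted text arrives as a plain string before it is written to the CSV:

```python
# Sketch of flattening extracted text so it fits into a single CSV cell.
import csv

def tidy_text(text: str) -> str:
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())

row = {"title": "Example", "text": tidy_text("First paragraph.\n\nSecond\tparagraph.")}
with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "text"])
    writer.writeheader()
    writer.writerow(row)
```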

Resolve date_parser/pytz warning

PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html date_obj = stz.localize(date_obj)

Warning taken from the tool testing private repository. It doesn't impact functionality but is quite annoying. So far it has only popped up when extracting metadata from https://www.badische-zeitung.de/
