
newsfeedback

Tool for extracting and saving news article metadata at regular intervals. It utilizes Beautiful Soup 4, trafilatura and Selenium to extract and, if desired, filter article metadata across three different pipelines depending on a site's structure.

Note: 🏗 This tool and its README are currently under construction 🏗


💻 Installation and Usage

If you use pipx, you can install with pipx install newsfeedback. Alternatively, you can install via pip: pip install newsfeedback. There you go! You can now run newsfeedback --help and the commands outlined below. To run the tests, type pytest.

📦 Getting Started - Default

"Out the box", newsfeedback retrieves a list of homepages to be extracted from the default homepage config file. It is recommended to proceed with extracting metadata with these unchanged settings once to get acquainted with the functionalities of the tool.

  1. After installing newsfeedback, run newsfeedback pipeline-picker -u '[LINK OF YOUR CHOICE]'. This URL must be in the config file.
  2. Check the output folder (default: newsfeedback/output) for the CSV - this will be the structure of your exported CSVs.
  3. If satisfied, proceed to run newsfeedback get-data, adding -t [INTEGER] to specify the interval (in hours) at which newsfeedback is to grab data.

Note: This defaults to every 6 hours and extracts data from all default homepage URLs in the config. If you wish to extract data from only one URL, add it to a custom config with `newsfeedback add-homepage-url` and then re-run Step 3.

💻 Running newsfeedback on a server

If you want to run newsfeedback on a dedicated server, please make sure that you have Chrome installed on said server. Otherwise, you may be met with a Chrome binary error when using the Pur Abo pipeline. If you are met with regularly occurring timeouts while using the Pur Abo pipeline, your server may not have enough memory. It seems that at least 2GB are needed.

🗂 Commands

newsfeedback --help

Get an overview of the command line commands.

newsfeedback add-homepage-url

Add a homepage URL to your config file (and thus to your metadata extraction workflow) via prompt.

newsfeedback generate-config

Generate a new user config. Prompts the user to select either metadata or homepage as the type of config and then clones the default settings into the new user config file.

If a user-generated homepage config already exists, missing default homepage URLs will be copied into the user-generated config file.

If a user-generated metadata config already exists, the user will be met with an error and prompted to adjust settings manually in the config.

newsfeedback pipeline-picker

-u → URL of the website you wish to extract metadata from.
-o → output folder (default: newsfeedback/output)

Extracts article links and metadata from a homepage that is stored in the config file. A typical use case is a trial run for the extraction of data from a newly added homepage.

newsfeedback get-data

-t → interval in hours; newsfeedback extracts the metadata and links once every -t (default: 6) hours.

Using the homepages listed in the user config file (or the default config file, should the former not exist), metadata is extracted.

🎨 Customizing your parameters

Extraction and filtering pipelines

beautifulsoup : using Beautiful Soup 4, URLs are extracted from the homepage HTML. As this initial URL collection is very broad, subsequent filtering is recommended. This is the most comprehensive pipeline for URL collection, but it also has a higher share of irrelevant URLs, especially if filtering is not turned on (a minimal sketch of the idea follows this list).

  • high URL success retrieval rates
  • high rates of irrelevant URLs
  • filtering recommended
trafilatura : using trafilatura, articles are extracted from the given homepage URL. Success rates depend on the homepage HTML structure, but the quota of irrelevant URLs is very low.
  • URL retrieval success depends on homepage HTML structure
  • low rates of irrelevant URLs
  • filtering is not needed
purabo : if a news portal requires access via a Pur Abo / data tracking consent (e.g. ZEIT Online or heise), the consent button for the latter must be clicked via Selenium so that the article URLs can be collected. The Pur Abo pipeline continues with the same functionalities as the beautifulsoup pipeline once the consent button has been clicked. Note: oftentimes, article URLs can still be retrieved, as the page is loaded behind the overlay, so only use this pipeline if others fail.
  • only needed for very few homepages
  • dependent on Selenium and the Beautiful Soup pipeline
  • high rates of irrelevant URLs
  • filtering recommended
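
To illustrate the general idea behind the beautifulsoup pipeline, here is a minimal sketch of homepage link collection. It assumes requests for fetching and a simple same-site prefix rule as a stand-in for the whitelist; neither is necessarily what newsfeedback does internally.

```python
# Minimal sketch of beautifulsoup-style URL collection (not newsfeedback's actual code).
# Assumes the `requests` package for fetching; newsfeedback may fetch pages differently.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_article_urls(homepage_url, filter_on=True):
    html = requests.get(homepage_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Grab every link on the page - this is why the raw collection is so broad.
    urls = {urljoin(homepage_url, a["href"]) for a in soup.find_all("a", href=True)}
    if filter_on:
        # Hypothetical whitelist rule: keep only links that stay on the same site.
        urls = {u for u in urls if u.startswith(homepage_url)}
    return sorted(urls)

print(collect_article_urls("https://www.example.com/")[:10])
```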

Filters apply to URLs only. newsfeedback's filters are based on a simple whitelist, with the eventual goal of allowing user additions to the whitelist rules. As this tool is still in its infancy, these filters are far from sophisticated ☺

Once article URLs have been extracted and, if need be, filtered, metadata is extracted with trafilatura.bare_extraction.
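
As a rough illustration of that last step, the sketch below pulls a small set of metadata fields for one article URL with trafilatura. The field names follow the list further below; the exact return type of bare_extraction (dict vs. document object) depends on the trafilatura version, so the sketch handles both.

```python
# Minimal sketch of per-article metadata extraction with trafilatura
# (illustrative only; newsfeedback's own pipeline handles this internally).
import trafilatura

def extract_metadata(article_url, wanted=("title", "url", "description", "date")):
    downloaded = trafilatura.fetch_url(article_url)
    if downloaded is None:
        return None
    extracted = trafilatura.bare_extraction(downloaded, url=article_url)
    # trafilatura 1.x returns a dict here; newer versions return a document
    # object that can be converted with as_dict().
    if not isinstance(extracted, dict):
        extracted = extracted.as_dict()
    return {key: extracted.get(key) for key in wanted}

print(extract_metadata("https://www.example.com/some-article"))
```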

Adding data to the config file

If you wish to generate a custom config file, run newsfeedback add-homepage-url and follow the instructions. You will be asked for the URL and the desired pipeline (either beautifulsoup, trafilatura or purabo). This creates a new user config and adds the desired URL to it. If you wish to extract metadata from the default homepages as well, please run newsfeedback generate-config and select homepage, as this copies the missing URLs into the user-generated config. newsfeedback will automatically refer to the user-generated config, if present, as the standard config for data collection.

Changing the types of metadata collected

By default, newsfeedback collects an article's title, url, description and date. If you wish to collect other categories of metadata, simply generate a user config file with newsfeedback generate-config and then manually adjust the settings within this file. Possible categories of metadata are: title, author, url, hostname, description, sitename, date, categories, tags, fingerprint, id, license, body, comments, commentsbody, raw_text, text, language. Note that not all websites provide all categories.


[Rahel Winter and Felix Victor Münch](mailto:[email protected], [email protected]) under MIT.


newsfeedback's Issues

Allow users to filter post-extraction

Rubrics and other recurring pages are extracted every day due to the structure of the homepages. Users should be able to add these URLs to a blacklist, so as not to needlessly clutter up their datasets.

Additionally, a blacklist of title keywords/phrases provided by us might be a nice idea, which users can decide to use or not (with examples of the dataset before and after filtering for visual decision-making). This would be especially useful for the bigger news outlets, as I am assuming many people will be tracking those.

On top of that, we should keep track of all these filters in a list, regardless of size of platform, as it would surely improve and solidify our filtering methods long-term.
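
A possible shape for such a post-extraction filter, sketched with pandas and hypothetical blacklist values (this is not implemented in newsfeedback yet; only the title and url columns are taken from the documented CSV structure):

```python
# Sketch of a possible post-extraction blacklist filter (not yet part of newsfeedback).
import pandas as pd

URL_BLACKLIST = ["https://www.example.com/newsletter"]    # hypothetical rubric pages
TITLE_KEYWORD_BLACKLIST = ["Podcast", "Horoskop"]          # hypothetical title keywords

def apply_blacklists(df: pd.DataFrame) -> pd.DataFrame:
    keep = ~df["url"].isin(URL_BLACKLIST)
    for keyword in TITLE_KEYWORD_BLACKLIST:
        keep &= ~df["title"].str.contains(keyword, case=False, na=False)
    return df[keep]

articles = pd.read_csv("newsfeedback/output/example.csv")  # hypothetical export
print(apply_blacklists(articles).head())
```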

Create pipelines by chaining functions

As the various functions rely on each other, streamline the processes by creating chained pipelines that build on the outputs of the functions involved.

Add tests for various metadata configurations

The click implementation defaults to ['title', 'date', 'url', 'description'] if nothing is named for metadata-wanted. Test whether other available metadata can also be retrieved and document the options available so that users know what they can get (see the trafilatura documentation).

Add and test scheduling function

Set up a scheduling function and store the rhythm in a config file.

@FlxVctr where should we store this information? An entirely new config file or within the website config file?
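
For reference, the interval-based collection could look roughly like the following. The schedule package and the collect_all_homepages helper are assumptions for this sketch, not the actual implementation.

```python
# Rough sketch of an hourly-interval scheduling loop (assumed, not newsfeedback's code).
import time
import schedule  # third-party package: pip install schedule

def collect_all_homepages():
    # Placeholder for the actual extraction run over all configured homepages.
    print("collecting metadata ...")

interval_hours = 6  # would be read from a config file once the storage question is settled
schedule.every(interval_hours).hours.do(collect_all_homepages)

while True:
    schedule.run_pending()
    time.sleep(60)
```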

Bypass "Pur Abo" Barriers by clicking the accept button

A few pages do not load content until the visitor has granted permission to use their data in exchange for free site usage (Pur Abo) - use Selenium to click on the button and move past this restriction.

Idea: go via driver.find_element(By.LINK_TEXT, "AKZEPTIEREN UND WEITER"), but this needs to be sufficiently tested and adjusted for different cases (literally: upper- and lowercase, but also different URLs). The former can surely be sorted out via regex, the latter adjusted accordingly.
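
A minimal sketch of that idea with Selenium; the headless Chrome setup and the plain try/except are assumptions, and the case/URL handling described above is still missing.

```python
# Sketch of clicking the "Pur Abo" consent button with Selenium (untested idea from above).
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # assumes Chrome is installed on the machine
driver = webdriver.Chrome(options=options)

driver.get("https://www.zeit.de/")
try:
    # Exact link text and casing vary per site; this still needs proper handling.
    driver.find_element(By.LINK_TEXT, "AKZEPTIEREN UND WEITER").click()
except NoSuchElementException:
    pass  # button not found - the page may already be usable behind the overlay

html = driver.page_source                   # hand this HTML over to the beautifulsoup pipeline
driver.quit()
```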

Add Pur Abo bypass to existing extraction pipelines

The Pur Abo bypass needs to be combined with the preexisting extraction functions. This might be a bit finicky due to Selenium being in the mix, but it is manageable nonetheless. Important to remember: these functions will have to be the default if a ZEIT URL is given as input, else users won't be able to access the articles.

Add empty dataframe warning for non-Pur-Abo platforms / Rework infos/warnings

In order to give users an easier overview of successful and erroneous data extraction runs, the infos, warnings and errors provided by newsfeedback should be streamlined and reworked, especially for 'middle of the list' platforms that produce empty dataframes.

A verbosity option could be helpful in this case.

Improve tests and functions dealing with config reading/writing

Currently, default-based tests are thrown off if a user-generated config exists. Config writing is not covered by a test yet. This issue therefore encapsulates the following areas:

  • config writing
    --> more importantly: making tests unintrusive to running collections; perhaps through a placeholder URL that does not actually lead anywhere and gets deleted again after testing
    --> add a config retriever option that immediately chooses the default config, even if user config exists (could be good for testing)
    --> implement format guidelines that throw errors if a user does not follow them (e.g. www. .de instead of https://www. .de/); a small sketch of such a check follows this issue
    --> make a TEST!
  • TEST: correct config being chosen
  • Rinse and repeat for metadata configs

In this vein, also test whether PK's idea of having the default configs as .py files with dicts in them works and stops problems from popping up.
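
One conceivable shape for the format check mentioned above, sketched with a hypothetical validate_homepage_url helper (not part of newsfeedback):

```python
# Sketch of a possible homepage URL format check (hypothetical helper).
from urllib.parse import urlparse

def validate_homepage_url(url: str) -> str:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Homepage URL must start with http(s)://, got: {url}")
    if not url.endswith("/"):
        url += "/"  # normalise the trailing slash instead of rejecting outright
    return url

print(validate_homepage_url("https://www.example.de"))  # -> https://www.example.de/
```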

Fix issues with text and comment extraction

Just some basic string tidying - get the line breaks out and fix other issues that prevent longer texts from being stored in a single CSV cell instead of spanning multiple lines.
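
A minimal example of the kind of tidying meant here, assuming the extracted text arrives as a plain string before it is written to the CSV:

```python
# Sketch of flattening extracted text so it fits into a single CSV cell.
import csv

def tidy_text(text: str) -> str:
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())

row = {"title": "Example", "text": tidy_text("First paragraph.\n\nSecond\tparagraph.")}
with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "text"])
    writer.writeheader()
    writer.writerow(row)
```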

Resolve date_parser/pytz warning

PytzUsageWarning: The localize method is no longer necessary, as this time zone supports the fold attribute (PEP 495). For more details on migrating to a PEP 495-compliant implementation, see https://pytz-deprecation-shim.readthedocs.io/en/latest/migration.html date_obj = stz.localize(date_obj)

Warning taken from the tool testing private repository. It doesn't impact functionality but is quite annoying. So far it has only popped up when extracting metadata from https://www.badische-zeitung.de/
