iunetsci / hoaxy-backend

Backend component for Hoaxy, a tool to visualize the spread of claims and fact checking

Home Page: http://hoaxy.iuni.iu.edu/

License: GNU General Public License v3.0

Languages: Python 98.43%, JavaScript 0.07%, Dockerfile 1.32%, Makefile 0.09%, Shell 0.09%

hoaxy-backend's Introduction

UPDATE 2022-11-10: This software is no longer maintained and is being archived.

Disclaimer

The name Hoaxy is a trademark of Indiana University. Neither the name "Hoaxy" nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

Introduction

This document describes how to set up the hoaxy backend on your system.

Hoaxy is a platform for tracking the diffusion of claims and their verification on social media. Hoaxy is composed of two parts: a web app frontend and a backend. This repository covers the backend part of Hoaxy, which currently supports tracking of social media shares from Twitter. To obtain the frontend, please go to:

http://github.com/iunetsci/hoaxy-frontend

Before Starting

Python Environment

Hoaxy has been upgraded to Python 3 under Ubuntu. We are currently testing with Python 3.7.

The recommended installation method is to use a virtual environment; we recommend Anaconda for setting one up. You could use the setuptools script directly by running python setup.py install, but this is not recommended unless you are an expert Linux user, because some dependencies (e.g. NumPy) need to be compiled and may fail to build.

Anaconda provides pre-compiled packages for all dependencies needed to install Hoaxy. In the following, our instructions assume that you are using Anaconda. Here is an example of how to create and use a Python environment with conda.

  1. Create a new Python virtual environment named hoaxy with Python 3.7:

    conda create -n hoaxy python=3.7
  2. Activate it (note that you must activate the environment before running any other Python-related command):

    source activate hoaxy

Most Linux distributions ship with their own Python version. After activation, you are using the newly created Python environment, which is separate from the system one. For the new environment, the actual Python executable is located at /ANACONDA_INSTALLATION_HOME/envs/ENV_NAME/bin/python, where ANACONDA_INSTALLATION_HOME is the installation home of your Anaconda and ENV_NAME is the name of the environment (here hoaxy). Please be aware that you must activate the environment before calling any other Python-related command.
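
You can confirm that the environment is active by checking which Python interpreter is in use; the expected path below assumes the default Anaconda layout described above:

which python
# expected: /ANACONDA_INSTALLATION_HOME/envs/hoaxy/bin/python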

Lucene

Hoaxy uses Apache Lucene for indexing and searching. The Python wrapper pylucene is used to interface Hoaxy with Lucene. Unfortunately pylucene is available via neither conda nor pip, so you will have to compile it yourself.

  1. Download pylucene 7.6.0 (pylucene-7.6.0-src.tar.gz), the version we have tested.

  2. Follow these instructions to compile and install pylucene. Please note that building the package is a time-consuming task. Also, do not forget to activate the Python environment, otherwise pylucene will be installed under the system Python!

We found the following tips made the pylucene compilation instructions a bit easier to follow:

  • To build pylucene, you need the GCC compiler; GCC 5 or higher is recommended.
  • If you are getting GCC-related errors, add the following exports in your shell:
    • export JCC_ARGSEP=";"
    • export JCC_CFLAGS="-v;-fno-strict-aliasing;-Wno-write-strings;-D__STDC_FORMAT_MACROS"
  • You can use cd instead of pushd and popd.
  • pylucene supports Oracle JDK 1.8.
  • pylucene needs Apache Ant 1.8.2 or higher.
  • You will need the packages default-jre, default-jdk, python-dev, and ant installed on the system (via apt-get if on Ubuntu).
  • Do not sudo anything during installation. Using sudo will prevent Lucene from being installed in the correct Anaconda and venv directories.
  • Two files need to be modified for your system, setup.py and Makefile. In these two files, the following three variables need to be set to reflect your installation and the virtual environment: java, ant, and python (for the venv); see the sketch after this list.
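
For example, on a Linux system with Oracle JDK 8 and the hoaxy conda environment, the configuration section of the pylucene Makefile might look like the sketch below. The paths are illustrative, not canonical; use the values that match your own installation, and set the corresponding JDK path in jcc's setup.py as well:

ANT=JAVA_HOME=/usr/lib/jvm/java-8-oracle /usr/bin/ant
PYTHON=/ANACONDA_INSTALLATION_HOME/envs/hoaxy/bin/python
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=10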

PostgreSQL

Hoaxy uses PostgreSQL to store all its data because of its ability to handle JSON data natively. Support for the JSON data type was introduced in version 9.3 of Postgres, but we recommend any version >= 9.4, which supports binary JSON (JSONB), a binary data type with significantly faster performance than the plain JSON type.

Please install and configure PostgreSQL. Once the database server is ready, you need to create a user and a new database. To do so, connect to the DBMS with the postgres user:

sudo -u postgres psql

You will be taken to the interactive console of Postgres. Issue the following commands:

-- create a normal role named 'hoaxy' with your own safe password
CREATE USER hoaxy PASSWORD 'insert.your.safe.password.here';

-- alternatively you can issue the following command
CREATE ROLE hoaxy PASSWORD 'insert.your.safe.password.here' LOGIN;

-- create a database named 'hoaxy'
CREATE DATABASE hoaxy;

-- give role 'hoaxy' the privileges to manage database 'hoaxy'
ALTER DATABASE hoaxy OWNER TO hoaxy;

-- or you can grant all privileges on database 'hoaxy' to role 'hoaxy'
GRANT ALL PRIVILEGES ON DATABASE hoaxy TO hoaxy;
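
To verify that the role and database were created correctly, you can try connecting with the new credentials (this assumes PostgreSQL is listening on localhost and accepts password authentication):

psql -h localhost -U hoaxy -d hoaxy -c 'SELECT 1;'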

Twitter Streaming API

Hoaxy tracks shares of claims and fact-checking articles from the Twitter stream. To do so, it uses the filter method of the Twitter Streaming API. You must create at least one Twitter app and obtain its Access Token, Access Token Secret, Consumer Key, and Consumer Secret. Follow these instructions to create a new app and to generate all tokens. If you want the Botometer feature, you need a second Twitter app with its own authentication keys.
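
Hoaxy drives the stream internally via hoaxy sns --twitter-streaming (see below), but the minimal sketch that follows, assuming tweepy 3.x (installed later in these instructions), shows how the filter endpoint tracks domain keywords. The credentials and domains are placeholders:

import tweepy  # assumes tweepy 3.x; StreamListener was removed in tweepy 4

class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        # In Hoaxy, matching tweets are stored in the database instead of printed.
        print(status.id_str, status.text)

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream = tweepy.Stream(auth, PrintListener())
stream.filter(track=["example-claim-site.com", "example-factcheck.org"])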

Web Parser API

Hoaxy relies on two third-party libraries to parse and extract the content of web documents. These libraries take care of removing all markup, as well as discarding comments, ads, and site navigation text. The two libraries we use are newspaper3k (https://newspaper.readthedocs.io/en/latest/) and Mercury (https://www.npmjs.com/package/@postlight/mercury-parser). Both are installed locally.

For the Mercury parser, you need to install Node.js first. Follow the instructions at https://nodejs.org/en/ to install Node on your system. Then follow the instructions at https://www.npmjs.com/package/@postlight/mercury-parser to install the Mercury parser with npm. Finally, copy hoaxy/node_scripts/parse_with_mercury.js into the node_modules directory where the Mercury parser was installed; see the example below.
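
For example, assuming a local (non-global) npm installation and that you are copying the script from a checkout of this repository:

npm install @postlight/mercury-parser
cp hoaxy/node_scripts/parse_with_mercury.js node_modules/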

Rapid API (Optional)

This is needed if you want to use the web front end (see below) or if you want to provide a REST API with, among other features, full-text search. Rapid API takes care of authentication and rate limiting, thus protecting your backend from heavy request loads. To set up Rapid API, you must create an account on the Rapid API Marketplace and create an API key.

Botometer (Optional)

This is needed if you want to integrate Botometer within Hoaxy to provide social bot scores for the most influential and most active accounts. The Botometer API is served via Rapid API and requires access to the Twitter REST API to fetch data about Twitter users. Botometer is integrated within Hoaxy through its Python bindings, see:

https://github.com/IUNetSci/botometer-python

for more information.

Web Front End (Optional)

If you want to show visualizations similar to the ones on the official Hoaxy website, then you should grab a copy of the hoaxy-frontend package at:

http://github.com/iunetsci/hoaxy-frontend

If you want to use this system purely to collect data, this step is optional.

Installation & Configuration Steps

These steps assume that all prerequisites have been satisfied (see the section above).

  1. Use conda to install all remaining dependencies (remember to activate your Python environment first):

    conda install docopt Flask gunicorn networkx pandas psycopg2 python-dateutil pytz pyyaml scrapy simplejson SQLAlchemy sqlparse tabulate

    Some of the packages are not available as official conda packages; use pip to install them:

    pip install tweepy ruamel.yaml newspaper3k demjson
  2. Clone the hoaxy repository from Github:

    git clone git@github.com:IUNetSci/hoaxy-backend.git

    If you get an error about SSL certificates, you may need to temporarily set the environment variable GIT_SSL_NO_VERIFY=1 to download the repo from GitHub.

  3. CD into the package folder:

    cd hoaxy-backend
  4. If you are not going to use Rapid API, you will need to edit the file hoaxy/backend/api.py to remove the authenticate_rapidapi decorator from the flask routes.

  5. Install the package:

    python setup.py install
  6. You can now set up hoaxy. A user-friendly command line interface is provided. For the full list of commands, type hoaxy --help from the command prompt.

  7. Use the hoaxy config command to get a list of sample files.

    hoaxy config [--home=YOUR_HOAXY_HOME]

    The following sample files will be generated and placed into the configuration folder (default: ~/.hoaxy/) with default values:

    • conf.sample.yaml

      The main configuration file.

    • domains_claim.sample.txt

      List of domains of claim websites; this is a simpler alternative to sites.yaml.

    • domains_factchecking.sample.txt

      List of domains of fact-checking websites; this is a simpler alternative to sites.yaml.

    • sites.sample.yaml

      Configuration of all domains to track. Allows fine control of all crawling options.

    • crontab.sample.txt

      Crontab to automate backend operation via the Cron daemon.

    By default, all configuration files will go under ~/.hoaxy/ unless you set the HOAXY_HOME environment variable or pass the --home switch to hoaxy config.

    If you get an error while running hoaxy config, you can simply go under hoaxy/data/samples and manually copy its contents to your HOAXY_HOME. Make sure to remove the .sample part from the extension (e.g. conf.sample.yaml -> conf.yaml).

  8. Rename these sample files, e.g. conf.sample.yaml to conf.yaml; an example is shown below.
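
    For example, assuming the default configuration folder ~/.hoaxy/ (cp keeps the samples around for reference; mv works as well):

    cp ~/.hoaxy/conf.sample.yaml ~/.hoaxy/conf.yaml
    cp ~/.hoaxy/sites.sample.yaml ~/.hoaxy/sites.yaml
    cp ~/.hoaxy/domains_claim.sample.txt ~/.hoaxy/domains_claim.txt
    cp ~/.hoaxy/domains_factchecking.sample.txt ~/.hoaxy/domains_factchecking.txt
    cp ~/.hoaxy/crontab.sample.txt ~/.hoaxy/crontab.txt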

  9. Configure Hoaxy for your needs. You may want to edit at least the following files:

    • conf.yaml is the main configuration file.

      Search for *** REQUIRED *** in conf.yaml to find settings that must be configured, including database login information, Twitter access tokens, the Mercury parser location, etc.

    • domains_claim.txt, domains_factchecking.txt and sites.yaml are site data files that specify which domains to track.

      The domains_* files offer a simple way to specify sites: each line is the primary domain of a site. If you want finer control over the sites, you can provide a sites.yaml file instead; please check the sites.yaml manual.

    • crontab.txt is the input for automating all tracking operations via Cron.

      Please check the crontab manual for more information on Cron.

  10. Finally, initialize all database tables and load the information on the sites you want to track:

    hoaxy init

How to Start the Backend for the First Time

Please follow these steps to start all Hoaxy backend services. Remember to run these only after activating the virtual environment!

Note: The order of these steps is important! You need to fetch the articles before building the Lucene index, and you need the index before starting the API; this last step is only needed if you want to enable the REST API for searching.

  1. Fetch only the latest article URLs:

    hoaxy crawl --fetch-url --update

    This will collect only the latest articles from specified domains.

  2. (Optional) Fetch all article URLs:

    hoaxy crawl --fetch-url --archive

    This will do a deep crawl of all domains to build a comprehensive archive of all articles available on the specified sites.

    Note: This is a time consuming operation!

  3. Fetch the body of articles:

    hoaxy crawl --fetch-html

    You may pass --limit to avoid making this step too time consuming when automating via cron.

  4. Parse articles via the Mercury API:

    hoaxy crawl --parse-article

    You may pass --limit to avoid making this step too time consuming when automating via cron.

  5. Start streaming from Twitter:

    hoaxy sns --twitter-streaming

    This is a non-interactive process and you should run it as a background service.

  6. Build the Lucene index:

    hoaxy lucene --index
  7. (Optional) Run the API:

    # Set these to sensible values
    HOST=localhost
    PORT=8080
    gunicorn -w 2 --timeout=120 -b ${HOST}:${PORT} --error-logfile gunicorn_error.log hoaxy.backend.api:app

Automated Deployment

After you have run the backend for the first time, Hoaxy will be ready to track new articles and new tweets. The following steps are needed if you want to run the backend in a fully automated fashion. Hoaxy needs to perform three kinds of tasks:

  1. Cron tasks. These are periodic tasks, like crawling the RSS feeds of websites, fetching newly collected URLs, and parsing the articles of newly collected URLs. To run them, you need to install the crontab for Hoaxy. The following will install a completely new crontab (i.e. it will replace any existing crontab):

    crontab crontab.txt

    Note: we recommend being mindful of the capacity of your crawling processes. Depending on the speed of your Internet connection, you will want to use the --limit option when calling the crawling commands:

    hoaxy crawl --fetch-html --limit=10000

    The example above limits fetching to 10,000 articles per hour (the default in the crontab). You will need to edit crontab.txt and reinstall it for this change to take effect.
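
    The generated crontab.sample.txt is the authoritative reference; as an illustration only, hourly entries in crontab.txt typically look like the following (schedules, environment name, and limits are placeholders to adapt):

     0 * * * * source activate hoaxy && hoaxy crawl --fetch-url --update > /dev/null
     15 * * * * source activate hoaxy && hoaxy crawl --fetch-html --limit=10000 > /dev/null
     30 * * * * source activate hoaxy && hoaxy crawl --parse-article --limit=10000 > /dev/null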

  2. Real-time tracking tasks. These include collecting tweets from the Twitter Streaming API. Once started, the process keeps running. To manage it, we recommend using supervisor. The following is an example supervisor configuration (please replace all uppercase variables with sensible values):

    [program:hoaxy_stream]
    directory=/PATH/TO/HOAXY
    # You can add your path environment here
    # Example, /home/USER/anaconda3/envs/hoaxy/bin
    environment=PATH=PYTHON_BIN_PATH:%(ENV_PATH)s
    command=hoaxy sns --twitter-streaming
    user=USER_NAME
    stopsignal=INT
    stdout_logfile=NONE
    stderr_logfile=NONE
    ; Use the following when this switches to being served by gunicorn, or if the
    ; task can't be restarted cleanly
    ; killasgroup=true
    ; set autorestart=true for any exitcode
    ; autorestart=true
    
  3. Hoaxy API. We recommend using supervisor to control this process too (please replace all uppercase variables with sensible values):

    [program:hoaxy_backend]
    directory=/PATH/TO/HOAXY
    environment=PATH=PYTHON_BIN_PATH:%(ENV_PATH)s
    command=gunicorn -w 6 --timeout=120 -b HOST:PORT --error-logfile gunicorn_error.log hoaxy.backend.api:app
    user=USER_NAME
    stderr_logfile=NONE
    stdout_logfile=NONE
    ; Use the following when this switches to being served by gunicorn, or if the
    ; task can't be restarted cleanly
    killasgroup=true
    stopasgroup=true
    

Frequently Asked Questions

Do you have a general overview of the Hoaxy architecture?

Please check the hoaxy system architecture in the documentation. You can also see the early prototype of the Hoaxy system presented in the following paper:

@inproceedings{shao2016hoaxy,
  title={Hoaxy: A platform for tracking online misinformation},
  author={Shao, Chengcheng and Ciampaglia, Giovanni Luca and Flammini, Alessandro and Menczer, Filippo},
  booktitle={Proceedings of the 25th International Conference Companion on World Wide Web},
  pages={745--750},
  year={2016},
  organization={International World Wide Web Conferences Steering Committee}
}

Can I specify a sub-domain to track from Twitter (e.g. foo.website.com)?

Hoaxy works by filtering from the full Twitter stream only those tweets that contain URLs from specific domains. If you specify a website as, e.g., www.domain.com, the www. part will be automatically discarded. Likewise, any subdomain (e.g. foo.website.com) will be discarded too. This limitation is due to the way the filter endpoint of the Twitter API works.

Can I specify a specific path to track from Twitter (e.g. website.com/foo/)?

For the same reason that it cannot track sub-domains, Hoaxy cannot track tweets sharing URLs with a specific path (e.g. domain.com/foobar/) either.

However, when it comes to crawling domains, Hoaxy allows fine control of the type of web documents to fetch, and it is possible to crawl only certain parts of a website. Please refer to the sites.yaml configuration file.

Can I specify alternate domains for the same website?

Most sites are accessible from just one domain, and often the domain reflects the colloquial site name. However, there are cases where the same site can be accessed from multiple domains. For example, the claim site DC Gazette owns two domains, thedcgazette.com and dcgazette.com. When you make an HTTP request to dcgazette.com, it will be redirected to thedcgazette.com.

Thus we call thedcgazette.com the primary domain and dcgazette.com the alternate. You must provide the primary domain of a site; alternate domains are optional. This matters because, when crawling, we need to know the scope of our crawl, which is constrained by domains.

How does Hoaxy crawl news articles?

Crawling of articles happens over three stages:

  1. Collecting URLs.

    URLs are collected by two separate processes: first, from tweets matching the domains we are monitoring; second, by fetching new articles from news sites. The corresponding commands are:

    hoaxy sns --twitter-streaming

    for collecting URLs from tweets, and:

    hoaxy crawl --fetch-url (--update | --archive)

    for fetching URLs from RSS feeds and/or direct crawling (with either the --update or --archive option).

  2. Fetch the HTML.

    At this stage, we try to fetch the raw HTML page of all collected URLs. Short URLs (e.g. bit.ly) are also resolved at this time. To avoid duplication, we use the "canonical" form of a URL to represent the set of URLs that refer to the same page; URL parameters are kept, with the exception of those starting with utm_*, which are used by Google Analytics (see the sketch after this list). The corresponding command is:

    hoaxy crawl --fetch-html
  3. Parsing the HTML.

    At this stage, we try to extract the article text from HTML documents. Hoaxy relies on a third-party API service to do so. You may want to implement your own parser, or use an already existing package (e.g. python-goose). The corresponding command is:

    hoaxy crawl --parse-article
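
As an illustration of the canonicalization idea (not Hoaxy's actual implementation), the Python sketch below lowercases the host, drops the fragment, and strips Google Analytics utm_* parameters while keeping the rest of the query string:

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalize(url):
    # Illustrative only: normalize the host, drop the fragment, strip utm_* params.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    params = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
              if not k.lower().startswith("utm_")]
    return urlunsplit((scheme, netloc.lower(), path, urlencode(params), ""))

print(canonicalize("https://Example.com/story?id=7&utm_source=twitter"))
# -> https://example.com/story?id=7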

How does Hoaxy treat different tweet types (e.g. retweet, replies)?

Currently Hoaxy can only monitor one platform, Twitter. On Twitter there are different types of tweets for different behaviors, e.g., retweets and replies. To identify the type of a tweet, Hoaxy employs a simple set of heuristics. Please see the types of tweet manual.

hoaxy-backend's People

Contributors

benabus, chathuriw, filmenczer, glciampaglia, lucmski, shaochengcheng, vlulla, zacmilano


hoaxy-backend's Issues

Convert Hoaxy to Python 3

To do:

  • Test current code for Hoaxy-Python3
    • Test cron jobs
    • API Testing; be sure to use keys in header of request (@benabus); make any fixes if needed
  • Deploy Python3 code to production (on burns) (switch production instance to Python3)
  • #6 (Article extraction pipeline; replace Mercury) <-- DEADLINE!

As part of the new extraction pipeline (#6) we want to run goose3, which is written in Python 3. We discussed the issue and it looks like there are no external dependencies that cannot be moved to Python 3. So the task is to convert all source code to Python 3 using the 2to3 utility. As part of it, we also need to update all external packages to versions that support Python 3. In particular, we want to make sure that scrapy is updated to a recent version, so that CertificateError exceptions are caught (as in the error below):

Error during info_callback
Traceback (most recent call last):
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._checkHandshakeStatus()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/twisted/protocols/tls.py", line 335, in _checkHandshakeStatus
    self._tlsConnection.do_handshake()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1425, in do_handshake
    result = _lib.SSL_do_handshake(self._ssl)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/OpenSSL/SSL.py", line 917, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1151, in infoCallback
    return wrapped(connection, where, ret)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/scrapy/core/downloader/tls.py", line 52, in _identityVerifyingInfoCallback
    verifyHostname(connection, self._hostnameASCII)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/service_identity/pyopenssl.py", line 44, in verify_hostname
    cert_patterns=extract_ids(connection.get_peer_certificate()),
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/service_identity/pyopenssl.py", line 102, in extract_ids
    if c[0] == b"CN"]
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/service_identity/_common.py", line 161, in __init__
    _validate_pattern(self.pattern)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/service_identity/_common.py", line 406, in _validate_pattern
    .format(cert_pattern)
service_identity.exceptions.CertificateError: Certificate's DNS-ID '*' hast too few host components for wildcard usage.

(The same "Error during info_callback" traceback repeats several more times in the crawler log.)

Consider new search fields

As part of the migration to Python 3 #20, we had to make some changes to the indexing class of Hoaxy. It would be nice to see if there are additional fields that would be worth indexing. For example, we might want to search only by domain (e.g. breitbart, infowars), or by language (e.g. Portuguese, Spanish).

(Dynamic) updates of source list

Background. So far Hoaxy has been collecting tweets that include links to a pre-defined list of source domains. To do so, each domain is included as a keyword for the POST statuses/filter endpoint of the Twitter streaming API.

Problem. This list can only be updated manually, and it does not take into account that not all domains may generate traffic (domains may go offline), and that one may want to prioritize domains that a) appear in multiple lists and b) generate more traffic.

Solution. To overcome these limitations, a new cron job will be added that estimates the number of tweets for each domain, using a call to the search API. We will also add a table that keeps track of multiple lists of websites, in order to get an idea of how much consensus there is about individual source domains. Finally, another cron job will select only lists with minimum consensus, rank them by the estimated traffic, and update the tweet collection filter accordingly.

Tasks. The following tasks are needed:

  • Script to estimate the weekly traffic of each source, to be run as a cron job
  • Table keeping track of each source domain and in how many lists of fact-checkers it is included, and whether the source is "enabled" for data collection, and tags from each source (this last bit needs an extra table).
  • Script to update the list of sources based on the data from the traffic estimation script.

hoaxy init on afp.com fails

When running hoaxy init with a domains_factchecking.txt that contains the following line

www.afp.com

I get the following error

(hoaxy) hoaxyuser@hoaxydeback:/root$ hoaxy init
2017-07-04 11:06:34,155 - hoaxy(init) - INFO: Creating database tables:
2017-07-04 11:06:34,155 - hoaxy(init) - WARNING: Ignore existed tables
2017-07-04 11:06:34,182 - hoaxy(init) - INFO: Inserting platforms if not exist
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Trying to load site data:
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Claim domains /home/hoaxyuser/.hoaxy/domains_claim.txt found
2017-07-04 11:06:34,215 - hoaxy(init) - INFO: Sending HTTP requests to infer base URLs ...
2017-07-04 11:06:42,714 - hoaxy(init) - INFO: Fact checking domains /home/hoaxyuser/.hoaxy/domains_factchecking.txt found
2017-07-04 11:06:42,714 - hoaxy(init) - INFO: Sending HTTP requests to infer base URLs ...
2017-07-04 11:06:44,110 - hoaxy(init) - ERROR: HTTPConnectionPool(host='afp.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa71187c490>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))
2017-07-04 11:06:44,114 - hoaxy(init) - ERROR: HTTPSConnectionPool(host='afp.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fa711897c50>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))
2017-07-04 11:06:44,210 - hoaxy(init) - WARNING: line 18 'www.afp.com', domain inactive!
2017-07-04 11:06:44,210 - hoaxy(init) - ERROR: Please fix the warnings or errors above! Edit domains, or use --ignore-redirected to handle redirected domains', or Use --ignore-inactive or --force-inactive  to handle inactive domains

However, when visiting the domain in my browser, everything seems to work fine.

A second issue is that afp.com is actually publishing in English, but we would like to access the German version that is available at afp.com/de. However, hoaxy only accepts domains and not URLs. Any workaround for that?

Docker compose not working

I tried running the instructions in the README in /docker but ran into several issues.

First, the download link for lucene changed: https://www.apache.org/dist/lucene/pylucene/pylucene-7.7.1-src.tar.gz should be https://downloads.apache.org/lucene/pylucene/pylucene-7.7.1-src.tar.gz.

After changing this I am able to get a bit further in the docker-compose up command but run into an issue related to ivy and ant.

/usr/src/pylucene/lucene-java-7.7.1/lucene/common-build.xml:449: /root/.ant/lib does not exist.

Total time: 0 seconds
Buildfile: /usr/src/pylucene/lucene-java-7.7.1/lucene/build.xml

-ivy-bootstrap1:
    [mkdir] Created dir: /root/.ant/lib
     [echo] installing ivy 2.4.0 to /root/.ant/lib
      [get] Getting: http://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
      [get] To: /root/.ant/lib/ivy-2.4.0.jar
      [get] Error opening connection java.io.IOException: Server returned HTTP response code: 501 for URL: http://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
      [get] Error opening connection java.io.IOException: Server returned HTTP response code: 501 for URL: http://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
      [get] Error opening connection java.io.IOException: Server returned HTTP response code: 501 for URL: http://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
      [get] Can't get http://repo1.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar to /root/.ant/lib/ivy-2.4.0.jar

-ivy-bootstrap2:
     [echo] installing ivy 2.4.0 to /root/.ant/lib
      [get] Getting: http://uk.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar
      [get] To: /root/.ant/lib/ivy-2.4.0.jar
      [get] Error getting http://uk.maven.org/maven2/org/apache/ivy/ivy/2.4.0/ivy-2.4.0.jar to /root/.ant/lib/ivy-2.4.0.jar

-ivy-checksum:
 [checksum] Could not find file /root/.ant/lib/ivy-2.4.0.jar to generate checksum for.

BUILD FAILED
/usr/src/pylucene/lucene-java-7.7.1/lucene/common-build.xml:527: Could not find file /root/.ant/lib/ivy-2.4.0.jar to generate checksum for.

Total time: 0 seconds
make: *** [Makefile:206: ivy] Error 1

An aside, but potentially somewhat related: the main repo README mentions that hoaxy has only been tested on 7.6.0, but the Dockerfile for hoaxy is trying to install 7.1.1.

Check dashboard botscores

The cron job that gets the Botometer scores to put into the dashboard can fail (returning n/a) possibly because of RapidAPI or Botometer API rate issues. Slow down the requests and/or add a check that retries after a while.

default crontab contains errors

The crontab that hoaxy produces when creating a fresh config directory is erroneous. The excerpt below contains a command hoaxy site --crawl ..., which, however, should actually be hoaxy crawl --fetch-url ...

# hoaxy: fetch article update every hour
0 * * * * source activate hoaxy && hoaxy site --crawl --fetch-url --update > /dev/null

Python3 Deployment

After the transition to Python3:

  • Check the scrapy version; make sure that the error with source activate hoaxy-backend && hoaxy --console-log-level=critical crawl --fetch-html --limit=40000 does not reappear
    • This error is caused by a bug in the old version of scrapy that occurs when the host name of the crawled HTTPS website is an IP address. It is reported in scrapy/scrapy#3029 and fixed in newer versions.
    • Check whether the installed scrapy is the newest version, and use scrapy shell to test whether we can crawl a URL like https://104.238.181.150
  • Update master branch, switch production back to master
  • Switch production back to burns, restart all cron jobs
  • look into disk usage on burns/lenny and for truthy user
  • clear up /nobackup, test databases on lenny, back up folders on lenny and burns, etc.
  • reboot carl to bring GFS back online
  • Update README instructions
    • Document the changes making new system incompatible with Python 2
    • Add requirements for Mercury parser
    • We need to specify/include versions for all the Python packages
    • Pin networkx to 1.10, otherwise the API has issues due to a deprecated method (thanks to Francesco Pierri) #30
  • Stop test instance on Lenny and clean up the DB
  • Once remaining bugs are fixed, parser is updated, docs are updated and AMI is available, contact Francesco (Poly Milano) and Emanuele (ISI/Barcelona) about new version availability (@filmenczer and @glciampaglia)

Control panel

For non-technical users, such as newsrooms, we would like a more friendly way (compared to CLI) to modify the config file. Idea: a control panel to manage backend settings like domain lists and API keys, on the fly.

Avoid duplicated article text in database

Currently Hoaxy extracts all the hyperlinks in each tweet collected from the Twitter stream and puts them in the url table. Hoaxy parses each raw URL so collected and stores the full HTML of each raw URL in the url table, along with its canonical URL. This creates a lot of duplicate content and is not an efficient use of space.

To overcome this, we will alter two tables. We will remove the html column from the url table, and add it to the article table, which is the one with the canonical URL of the article. PRIORITY: 2

Steps:

  • pre-update. SQL script to add new columns to table article: html, status_code

  • update of hoaxy-backends, mainly affected modules:

    • hoaxy.database.models
    • hoaxy.crawl.items
    • hoaxy.crawl.pipelines
    • hoaxy.crawl.spiders.article
    • hoaxy.crawl.spider.html
    • hoaxy.crawl.spider.url
  • post-update. Python script to migrate old tables: url and article

I (@shaochengcheng) am working on the second step now. I prefer to handle it solely myself, because there are so many small things to take care of.

Stop including screen_name in Network API results

Since screen_name is mutable in Twitter, results from the network endpoint are often stale. This causes several problems, e.g. https://github.com/IUNetSci/hoaxy-botometer/issues/230

By removing the screen name we would force the frontend to only look up users by numeric ID, which is the right thing to do. This change would break the frontend but in doing so it would help us figure out the source of bugs like https://github.com/IUNetSci/hoaxy-botometer/issues/230

Problems with duplicate and missing results

Currently the front-end gets weird results from the back-end:

  • There are articles missing (eg the top one for "vaccines" query)
  • There are duplicates of the same article (so that clicking on one selects both)
  • There are near-duplicates, eg differing in 'http' vs 'https' or escaped & in URL

In the article table, the column group_id is meant to identify multiple copies of the same article (articles with the same title). Lucene should index only one article among those with the same group_id.

The Lucene search function has a duplicate filter to avoid having duplicate results.

One or both of the above must have broken in the new version.

Invalid transaction is causing TopArticles API to fail

The TopArticles endpoint is failing intermittently. It fails once every two or three queries. As a result Hoaxy is showing an error on the front page, instead of the lists of top popular claims and fact-checks.

  • The log of Hoaxy shows that when there is an error we get this exception (see below). It looks like something related to an invalid transaction.
  • The error pops up around 9 AM 06/01. Looking further in the logs, it looks like that between 6 AM and 8:30 AM there was a space issue ("no space left on device" errors). Wondering if the error is related to that.
  • Looking at top20_articles_monthly it looks like it hasn't been updated since 05/08
StatementError: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back [SQL: u'SELECT max(top20_article_monthly.upper_day) AS max_1 \nFROM top20_article_monthly'] [parameters: [{}]]
2018-06-01 14:04:45,055 - hoaxy(api) - ERROR: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back [SQL: u'SELECT max(top20_article_monthly.upper_day) AS max_1 \nFROM top20_article_monthly'] [parameters: [{}]]
Traceback (most recent call last):
  File "/home/data/apps/hoaxy-backend/hoaxy/backend/api.py", line 582, in query_top_articles
    df = db_query_top_articles(engine, **q_kwargs)
  File "/home/data/apps/hoaxy-backend/hoaxy/ir/search.py", line 895, in db_query_top_articles
    upper_day = get_max(session, Top20ArticleMonthly.upper_day)
  File "/home/data/apps/hoaxy-backend/hoaxy/database/functions.py", line 105, in get_max
    return q.scalar()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2785, in scalar
    ret = self.one()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2756, in one
    ret = self.one_or_none()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2726, in one_or_none
    ret = list(self)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2797, in __iter__
    return self._execute_and_instances(context)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2820, in _execute_and_instances
    result = conn.execute(querycontext.statement, self._params)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 945, in execute
    return meth(self, multiparams, params)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 263, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1053, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1121, in _execute_context
    None, None)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1393, in _handle_dbapi_exception
    exc_info
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 202, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1114, in _execute_context
    conn = self._revalidate_connection()
  File "/u/truthy/miniconda3/envs/hoaxy-backend/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 424, in _revalidate_connection
    "Can't reconnect until invalid "
StatementError: (sqlalchemy.exc.InvalidRequestError) Can't reconnect until invalid transaction is rolled back [SQL: u'SELECT max(top20_article_monthly.upper_day) AS max_1 \nFROM top20_article_monthly'] [parameters: [{}]]

Ignore false matches by default, change configuration on private instance

We need to make sure that tweets that match the filter of the Twitter Streaming API but do not contain actual links are dropped and not added to the database. There is an option for this in the configuration file. We should make dropping the default.

@shaochengcheng should also update the configuration file of our own private instance at hoaxy.iuni.iu.edu

Mercury path config file

We fixed the absolute path for the js code that calls the Mercury parser, but for robustness, this should be placed in the config file.

Networkx issues

  • If we use the latest version of networkx alongside the Python 3 code, networkx seems to break

Docker build fail on jcc install

docker-compose up failed on line 96 of Dockerfile

    && JCC_JDK=/usr/lib/jvm/default-jvm python setup.py install \


Error message:
/opt/conda/compiler_compat/ld: cannot find -lpython3.8m

This seems to be a Python 3.8 compatibility issue.

I added python=3.7.5 to the conda install at line 47 to fix it, and the image built successfully.

Speed of query retrieval

We noticed that the search from the article search engine (Lucene) is often very slow. Can you please do some tests with random queries? We should find the bottleneck: is it Lucene? the database? the API? the network?

Add AltNews.in And BoomLive.in as fact checkers

Hello!

India has the largest number of social media users, so consequently the problem of fake news has been the worst in India.
India Has a Public Health Crisis. It’s Called Fake News.

When I tried searching Hoaxy on https://hoaxy.iuni.iu.edu/ I couldn't find India-specific content even when I used India-specific search terms in "Articles". It seems that India-specific fact-checkers haven't been added to Hoaxy.

I would like to request that you please add 1. AltNews.in and 2. BoomLive.in - both are certified by the International Fact Checking Network. [0][1]

Thank you!

[0] https://ifcncodeofprinciples.poynter.org/application/public/pravda-media-foundation/D27BB43D-D8FC-F85B-1C25-2AF73DF3A12C
[1] https://ifcncodeofprinciples.poynter.org/application/public/boom/BEE99226-33F7-4B33-9B78-F0F98F51E991

One-time update of source list

@yangkcatiu will update the list based on current literature.

Restart?

Update GDoc (so FAQ does not need to be updated)

Facebook interface? (and CI)

This is a really impressive project, and also in this era a genuinely important one. I have two questions:

  • is there an appetite for making it also look at Facebook information the way it does for Twitter?
  • is there an appetite for a continuous integration setup?

Tweet URL

The tweet URLs stored in the file that Hoaxy generates seem to link to a version of a tweet that only has the first 140 characters, instead of the new maximum of 280 characters. For example:

This is the link saved by Hoaxy: https://twitter.com/DonGoyoOficial/status/1151497327330365441

This is the result of that link (notice the RT at the beginning and the 3 dots at the end):

[screenshot]

Here is the same tweet on that user's account:

[screenshot]

Here is the same tweet when using search:

[screenshot]

This seems to be a compatibility mode link. If it is, then there has to be an extended mode link.

Should this link be updated to point to a tweet's full text (if possible)?

Make it easier and document how to update source list

Ideally, updating the list of sources (domains) should be as easy as updating the file and/or the configuration file and restarting the system. But recently we learned that it is much more complicated, and there is no documentation about how to do it.

If possible, we should make it as easy as described above. Otherwise there should be a command to update the list, and this has to be documented.

In the future we expect the list to change often, possibly automatically, therefore the update system should be compatible with that.

@shaochengcheng is this possible?

Site config from the production version

First of all, thank you heaps for making this project open-source. I spent literally the whole morning crawling through the files, and the quality of the code is very impressive.

I got really interested in the "site_tags" features, and would like to know if:

  1. Is there any automated way to add these tags from the tables you mentioned? I've found the open-source lists, but it would be helpful if that's already done.

  2. If not, is there any chance you could provide me with the site config used in the version currently deployed at https://hoaxy.iuni.iu.edu/?

For the project I have in mind I'd probably have to add some scripts to automatically expand the list of sources to query from. I'd be very glad to contribute what I add if you deem it valuable for the tool.

Thanks again,
Manoel

Determine cause of UnhandledPromiseRejectionWarning

When Cron job does the following command:

Cron <truthy@burns> source activate hoaxy-be-py3 && hoaxy --console-log-level=critical crawl --parse-article --limit=10000

we get the error:

(node:10396) UnhandledPromiseRejectionWarning: Error: ESOCKETTIMEDOUT at ClientRequest.<anonymous> (/nfs/nfs7/home/truthy/node_modules/postman-request/request.js:1025:19) at Object.onceWrapper (events.js:277:13) at ClientRequest.emit (events.js:189:13) at TLSSocket.emitRequestTimeout (_http_client.js:662:40) at Object.onceWrapper (events.js:277:13) at TLSSocket.emit (events.js:189:13) at TLSSocket.Socket._onTimeout (net.js:440:8) at ontimeout (timers.js:436:11) at tryOnTimeout (timers.js:300:5) at listOnTimeout (timers.js:263:5) at Timer.processTimers (timers.js:223:10) (node:10396) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1) (node:10396) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Test and revise article parsing pipeline

Test (and also analyze the hoaxy backend log) to measure how often one or the other parser fails/succeeds at getting all required fields. Note that if the content is indeed empty, that should not be interpreted as a parser error. We may need to revise the pipeline based on these results, e.g., switch the order, use both parsers and merge fields, or add another parser...

Fix article extraction pipeline

We should move to a better package for text extraction from HTML. The issues are the following:

  1. The current API is failing on some sites (e.g. The Onion).
  2. We also need to store only the text because it will take less space.

To solve the problem of multiple versions, we should keep only the first version of an article.

  • We need to identify a good package for doing extraction and test a few candidates on the HTML of our sites. UPDATE: We decided to use two packages: Goose3 and Dragnet.
  • We will create a cascading system that first uses the local packages (see above) and, only if they both fail, sends a request to the external Mercury Postlight API.
  • Since Goose3 is Python3 code, we also need to port Hoaxy to Python3, see #20.
  • After extraction, we include the text into Lucene, and then we can set the HTML to NULL to save space. (SEE INSTRUCTIONS FROM @shaochengcheng )

@ZacMonroe could help with research of a package for HTML text extraction and with testing it.

GET Articles (Hoaxy API): strange behavior on "date_published" query filter

Hello there,
I found some strange behavior with the Lucene index on "date_published" (using the Hoaxy API on RapidAPI).
My goal is to retrieve all articles collected by Hoaxy at the highest granularity (hours or minutes). I noticed that:

  1. the term range filter has problems within the same day (so no hope of filtering on different hours of the same day)
  2. the simple query has a problem with the "T" (I had to use a "?").

Maybe I misunderstood the Lucene query syntax, but I was able to find a way to retrieve the desired articles using something like:
- date_published:2019-03-13?00* 74 results/entries
- date_published:2019-03-13?01* 22 results/entries
- date_published:2019-03-13?02* 50 results/entries
- etc…

Francesco

TWO NEGATIVE EXAMPLES:
[screenshots]

TWO POSITIVE EXAMPLES:
[screenshots]

Hoaxy backend install issue

Hi, I'm trying to install the hoaxy backend on my Ubuntu server, but pylucene doesn't install. When I run make test, my JVM throws an exception. See below:

Installed /tmp/pylucene/pylucene-4.10.1-1/build/test/lucene-4.10.1-py2.7-linux-x86_64.egg
Processing dependencies for lucene==4.10.1
Finished processing dependencies for lucene==4.10.1
find test -name 'test_*.py' | PYTHONPATH=/tmp/pylucene/pylucene-4.10.1-1/build/test xargs -t -n 1 /usr/local/bin/python
/usr/local/bin/python test/test_BooleanQuery.py

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00002b3c1c1495e8, pid=7399, tid=47537106197888

JRE version: OpenJDK Runtime Environment (7.0_171-b02) (build 1.7.0_171-b02)

Java VM: OpenJDK 64-Bit Server VM (24.171-b02 mixed mode linux-amd64 compressed oops)

Derivative: IcedTea 2.6.13

Distribution: Ubuntu 14.04 LTS, package 7u171-2.6.13-0ubuntu0.14.04.2

Problematic frame:

V [libjvm.so+0x61b5e8]

Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

An error report file with more information is saved as:

/tmp/pylucene/pylucene-4.10.1-1/hs_err_pid7399.log

If you would like to submit a bug report, please include

instructions on how to reproduce the bug and visit:

http://icedtea.classpath.org/bugzilla

xargs: /usr/local/bin/python: terminated by signal 6
make: *** [test] Error 125
root@blackbird:/tmp/pylucene/pylucene-4.10.1-1# python --version
Python 2.7.12

Does anyone know a solution for this issue?

Regards,
Márcio Silva

Move Hoaxy logs

Update the config file so that the logs are stored on /l/cnets (GFS) rather than the truthy user home directory, and move the logs that are currently there.

GET Articles queries only return 100 results (Hoaxy API)

When querying the API, getting articles only returns up to 100 results (either the 100 most recent or the 100 most relevant).

I suggest adding a query param to specify how many results are required.

This discussion was opened in #28, but should be considered as a separate issue. The inability to obtain all articles corresponding to a query is extremely limiting to API users.

Lucene query string parser option `use_lucene_syntax`

How should we provide the default behavior when parsing the query string with Lucene, specifically the use_lucene_syntax option in the API?

I am thinking of providing three values for this field:

  • False: do not use Lucene syntax
  • True: use Lucene syntax
  • None: infer it from the query string, i.e. try to parse it as Lucene syntax, and if an error occurs, parse it without Lucene syntax.

And the default could be set to None.
