jaryaman / propnews
Making the news proportionate to global priorities
License: GNU Affero General Public License v3.0
error_log.txt into an SQL table
.gitignore so that when you pull the repo you don't have to discard the old log
We're currently hardcoded for BBC news. We should generalise so that the bot can take a user-defined list of news sites.
We are limited by page_limit_per_request=10 and results_per_page=100. Is it possible that, as we increase the list of news sources we search, we hit the limit that NewsAPI can return in a single call, and therefore need to query NewsAPI more frequently?
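The arithmetic behind that question can be sketched as follows. This is only an illustration: it assumes that page_limit_per_request is the maximum number of pages fetched per query and results_per_page the number of articles per page, and the helper names are hypothetical rather than taken from the repo.

```python
import math

# Values quoted in the note above.
page_limit_per_request = 10
results_per_page = 100


def max_articles_per_query(page_limit, per_page):
    """Upper bound on articles one paged query can return (assumed meaning)."""
    return page_limit * per_page


def queries_needed(total_results, page_limit, per_page):
    """Number of paged queries needed to cover total_results articles."""
    cap = max_articles_per_query(page_limit, per_page)
    return math.ceil(total_results / cap)
```

Under these assumptions a single query covers at most 1000 articles, so a source list that yields more than that within one polling interval would need either more frequent queries or several narrower ones.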
It seems to me that get_full_content.get_bbc_content() doesn't return the full content of the corresponding BBC article: the text isn't precisely the same. Is this a concern?
score_articles.score_article should search the article content for (large?) numbers and for words like double, triple, halve, third, and percent, and should require that the article pass this test in order to receive a non-zero score.
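A minimal sketch of such a gate, using only the standard library. The function names, the min_number threshold of 1000, and the choice to accept an article when either a keyword or a large number is present are all assumptions made for illustration; none of them is the existing behaviour of score_article.

```python
import re

# Word stems taken from the note above; \w* also catches "doubled", "thirds", etc.
WORD_PATTERN = re.compile(r"\b(?:double|triple|halve|third|percent)\w*", re.IGNORECASE)
# Digits with optional thousands separators and an optional decimal part.
NUMBER_PATTERN = re.compile(r"\d[\d,]*(?:\.\d+)?")


def has_quantitative_content(text, min_number=1000):
    """Return True if text has a quantity word or a number >= min_number."""
    if WORD_PATTERN.search(text):
        return True
    for match in NUMBER_PATTERN.finditer(text):
        value = float(match.group().replace(",", ""))
        if value >= min_number:
            return True
    return False


def gated_score(raw_score, text):
    """Keep raw_score only if the article passes the quantitative test."""
    return raw_score if has_quantitative_content(text) else 0.0
```

With this threshold, four-digit years such as 2019 also count as large numbers, so the cut-off would likely need tuning against real articles.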
Twitter doesn't allow you to tweet the same thing twice. Currently, if main.py attempts to post a duplicate tweet, we write to error_log.txt, but really we ought to keep sampling articles until something new comes up.
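One way to keep sampling until something new comes up is sketched below. DuplicateTweetError and the post callable are hypothetical stand-ins for whatever the Twitter client actually raises and exposes, and the uniform shuffle is an assumption; the real bot may prefer sampling weighted by article score.

```python
import random


class DuplicateTweetError(Exception):
    """Hypothetical stand-in for the client's duplicate-status error."""


def tweet_first_new(candidates, post, rng=random):
    """Try candidates in random order until one is posted.

    Returns the posted text, or None if every candidate was a duplicate.
    """
    pool = list(candidates)
    rng.shuffle(pool)  # sample without replacement
    for text in pool:
        try:
            post(text)
        except DuplicateTweetError:
            continue  # already tweeted, try the next candidate
        return text
    return None
```

Returning None when the pool is exhausted leaves the caller free to log to error_log.txt only in that case.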
Using Doc2Vec as a substitute for search strings in assigning scores to articles
The security of the API key we are currently using is compromised. We should make another.
get_articles.get_url_content() never appears to return any URL content. I performed the following experiment: in get_articles.py, I added the lines

    content = get_url_content(url_path, url)
    if len(content) > 0:
        pass

and set a breakpoint at pass. The breakpoint was never hit after searching more than 100 articles. This doesn't seem right.
We are currently using the URL as the primary key in news.db, but integer primary keys are more efficient. Consider hashing the URL?
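A rough sketch of the hashing idea, assuming SQLite via the standard sqlite3 module. The table name articles, its columns, and the helper names are hypothetical and do not reflect the actual schema of news.db.

```python
import hashlib
import sqlite3


def url_to_key(url):
    """Map a URL to a stable non-negative 63-bit integer key."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    # SQLite integers are signed 64-bit, so drop one bit to stay non-negative.
    return int.from_bytes(digest[:8], "big") >> 1


def make_db():
    """Create an in-memory database with a hypothetical articles table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE)"
    )
    return conn


def insert_url(conn, url):
    """Insert a URL; return True if it was new, False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO articles (id, url) VALUES (?, ?)",
        (url_to_key(url), url),
    )
    return cur.rowcount == 1
```

Python's built-in hash() is deliberately avoided because it is salted per process for strings, so keys would not be stable between runs. The UNIQUE constraint on url still creates a text index, so whether this is really faster than a URL primary key would need measuring; letting SQLite assign an auto-incrementing id is a simpler alternative.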
Integrate get_new_articles into main, where we use some time window, e.g. 24 hours, to draw articles from the database to tweet.
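The time-window draw might look something like the sketch below. It assumes a hypothetical articles table with a fetched_at column holding Unix timestamps in seconds; the real news.db schema may differ.

```python
import sqlite3
import time


def recent_urls(conn, window_hours=24, now=None):
    """Return URLs stored within the last window_hours, newest first."""
    if now is None:
        now = time.time()
    cutoff = now - window_hours * 3600
    rows = conn.execute(
        "SELECT url FROM articles WHERE fetched_at >= ? ORDER BY fetched_at DESC",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

Passing now explicitly keeps the function easy to test and lets main use a single consistent timestamp per run.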
Whilst debugging the code, we don't necessarily want to call the API heavily. There is currently an argument dbg_mode for tweeting.tweet_news. This should be exposed as a command-line flag when running main.py:

    $ ipython
    $ run main.py -d
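A minimal sketch of how the -d flag could be parsed with the standard argparse module. The long option name --debug and the parse_args helper are assumptions; only the -d flag and the dbg_mode argument come from the note above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for main.py (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Run the propnews bot.")
    parser.add_argument(
        "-d",
        "--debug",
        action="store_true",
        help="debug mode: avoid calling the APIs heavily",
    )
    return parser.parse_args(argv)
```

In main.py the parsed value would then be forwarded as dbg_mode=parse_args().debug when calling tweeting.tweet_news, whose other arguments are not shown here.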
Some files, like run.sh and ranked_articles.csv, don't seem to do anything. Do some spring cleaning.
It would be good to start accumulating a database of URLs. Up to now I've been deleting news.db when convenient. This is perhaps bad practice.
We should collect, e.g., 10 news articles per topic to use with some supervised NLP methods.
We are currently taking all news from a vendor, but certain categories are (more or less) guaranteed to be irrelevant, e.g. sport. We can filter out a lot of noise by avoiding certain categories at the level of API calls.