gonzaloulla / unlp-dbd-newsler


Newsler - News crawler from Websites and Twitter - DB Design - MS in Software Engineering - UNLP

License: MIT License

Python 74.26% Batchfile 0.61% Dockerfile 6.66% Shell 18.47%

unlp-dbd-newsler's Issues

As a Developer, I want to export Kibana dashboards to JSON so the end user can import and use them

Is your feature request related to a problem? Please describe.
I have no idea how to visualize our data.

Describe the solution you'd like
I would like to have a Kibana Dashboard stored as a JSON file in our GitHub repo.

Describe alternatives you've considered
None yet
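
A minimal sketch of how the export could be scripted, assuming a Kibana 7.x instance that exposes the saved objects export endpoint (`POST /api/saved_objects/_export`). The function name and the base URL are hypothetical. Note that this endpoint returns newline-delimited JSON, so the stored file would be NDJSON rather than a single JSON document.

```python
import json
import urllib.request


def build_export_request(kibana_url: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks Kibana to export all dashboards."""
    # includeReferencesDeep also pulls in the visualizations and index
    # patterns that the dashboards depend on.
    body = json.dumps({"type": "dashboard", "includeReferencesDeep": True})
    return urllib.request.Request(
        url=kibana_url.rstrip("/") + "/api/saved_objects/_export",
        data=body.encode("utf-8"),
        # Kibana rejects API writes that lack the kbn-xsrf header.
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(build_export_request("http://localhost:5601"))` and writing the response body to a file in the repo would produce something that the Kibana import screen can read back.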

As a DevOps, I want to use a wait-for mechanism so my docker-compose services start gracefully

Is your feature request related to a problem? Please describe.
Starting the entire stack at once might make your computer/vm crash.

Describe the solution you'd like
Since the docker-compose depends_on attribute only controls the order in which containers are started and does not wait for a service to be ready, a wait-for-it.sh script should be used.

Kibana and Logstash should wait for Elasticsearch
Twitter and News Crawlers should wait for Logstash
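
For the Python crawlers, the same idea can be sketched without a shell script. This is an illustration only, not the project's implementation: the function name `wait_for` and its parameters are hypothetical, and an open TCP port only shows that something is listening, not that the service is fully initialized.

```python
import socket
import time


def wait_for(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while the port is still closed.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A crawler entry point could call something like `wait_for("logstash", 5044)` before starting; the host name and port here are assumptions about the compose file, not values taken from it.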

As a User, I want to have a Data Dictionary so I can analyze Newsler outcomes

Describe the solution
3 tables in README.md

The first one containing each Newsler component and its data attribute prefix (news_ for news-crawler and tweet_ for twitter-crawler)
News Crawler: a list of attributes and description of data provided by news-crawler
Twitter Crawler: a list of attributes and description of data provided by twitter-crawler

Describe alternatives you've considered
A Google doc file, an Excel spreadsheet, a Json file

As a DevOps, I want to use supervisord so my services are always up

Is your feature request related to a problem? Please describe.
YES!!
After a long time running, when my laptop becomes idle, it shuts down all network data transmissions and, thus, twitter-crawler fails.

Whenever one of the crawlers fails, the Python process is stopped and never restarted.

Describe the solution you'd like
Use supervisord to keep each process up and running all the time
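
To make the requested behavior concrete, here is a rough sketch of the restart loop that a process supervisor performs. It is not supervisord itself and does not use its configuration format; `keep_alive`, `max_restarts`, and `backoff` are hypothetical names chosen for illustration.

```python
import subprocess
import sys
import time
from typing import Sequence


def keep_alive(cmd: Sequence[str], max_restarts: int = 3, backoff: float = 1.0) -> int:
    """Run cmd and restart it whenever it exits with a non-zero code.

    Returns how many times the process was started in total.
    """
    starts = 0
    while True:
        starts += 1
        code = subprocess.call(list(cmd))
        # Stop on a clean exit or once the restart budget is used up.
        if code == 0 or starts > max_restarts:
            return starts
        time.sleep(backoff)
```

For example, `keep_alive([sys.executable, "twitter_crawler.py"])` would restart a crashing crawler up to three times; the script name is an assumption. A real supervisor adds logging, signal handling, and unlimited restarts, which is why the issue proposes supervisord rather than a hand-written loop.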

As a Developer, I want to document Newsler arch & features so I can share it with stakeholders

Value of the US for each role

As a Developer, I want to document Newsler arch & features so I can share it with stakeholders.

As a Student, I want to pass DB Design.

As a Professor, I want to have proper evidence of the work that's been done.

As a User, I don't know all the Newsler prerequisites, features and configurations (e.g., Kibana index patterns) needed to use it

Describe the solution you'd like

A clear and concise User Manual (check alternatives below) detailing the following items:

Architecture
Progress / Issues / Technologies used / Problems along the road & solutions
Future work (Kafka)
Docker-compose commands to get everything up & running
Kibana index patterns
Data Dictionary (maybe the link to the README.md)
Dashboards
- How to import them
- How to use them
Available Visualizations
Example of Queries

MVP (acceptance criteria)

The final report for DB Design is completed.

Alternatives

Put everything in the README.md file
Use Sphinx and .rst files (maybe it's way too much)
Create something in Google Docs (meh)
Use GitHub Wikis

As a Developer, I want to make a PoC using MongoDB so I can analyze the impact of changing DB

Solution

Replace Elasticsearch with MongoDB

Epic
Check epic #22

[EPIC] As a Developer I want to decouple Newsler from ELK Stack so I can change the DB seamlessly

Is your feature request related to a problem? Please describe.

As a DB Design requirement, we should encapsulate and decouple from the internal ELK technical details.
The current approach of writing a json-per-line object to a file stored in a docker volume might/should be deprecated

Describe the solution you'd like

Not sure yet, some decoupled framework to abstract data models and decouple CRUD operations to any persistence layer.

Describe alternatives you've considered

Use Newsler "AS IS" right now, using different Logstash output plugins
A common framework/library (set of Python modules) shared by both news-crawler and twitter-crawler components to abstract data models and decouple CRUD operations
Option 1 plus a Kafka input/output mechanism (maybe a couple of new components and dependencies?)
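
The second alternative (a shared library that hides the persistence layer) could look roughly like the sketch below. All class and function names are hypothetical, as are the `news_title` and `news_url` attributes, which only borrow the `news_` prefix used by news-crawler. The JSON-lines backend mirrors the current json-per-line file approach; an Elasticsearch or MongoDB backend would implement the same two methods.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterator, List


class DocumentStore(ABC):
    """Persistence interface the crawlers depend on instead of a concrete DB."""

    @abstractmethod
    def save(self, doc: Dict) -> None: ...

    @abstractmethod
    def find_all(self) -> Iterator[Dict]: ...


class InMemoryStore(DocumentStore):
    def __init__(self) -> None:
        self._docs: List[Dict] = []

    def save(self, doc: Dict) -> None:
        self._docs.append(dict(doc))

    def find_all(self) -> Iterator[Dict]:
        return iter(list(self._docs))


class JsonLinesStore(DocumentStore):
    """Appends one JSON object per line to a file."""

    def __init__(self, path: str) -> None:
        self._path = Path(path)

    def save(self, doc: Dict) -> None:
        with self._path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(doc) + "\n")

    def find_all(self) -> Iterator[Dict]:
        if not self._path.exists():
            return
        with self._path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def store_headline(store: DocumentStore, title: str, url: str) -> None:
    # Crawler code only sees the interface, so the backend can be swapped.
    store.save({"news_title": title, "news_url": url})
```

With this shape, changing the database means writing one new `DocumentStore` subclass and leaving the crawler code untouched.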

Additional context
I need help, a lot.

As a User, I want to parameterize a Kibana dashboard so I can get better insights from Newsler

Is your feature request related to a problem? Please describe.

How can I parameterize an existing Kibana dashboard to filter/dive into specific results?

Describe the solution you'd like

For example, as a User I would like to define some parameters in order to:

Get metrics for a specific news website (number of tweets, tweet info)
Get metrics for a specific news headline/keyword
Get metrics for a specific tweet ID (tweet_id_str)
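
Outside of Kibana, the three parameters above map naturally onto an Elasticsearch bool query. The sketch below only builds the query body. `tweet_id_str` comes from this issue, while `news_website` and `news_title` are assumed field names that would need to be checked against the Data Dictionary.

```python
from typing import Dict, List, Optional


def build_filter_query(
    website: Optional[str] = None,
    keyword: Optional[str] = None,
    tweet_id: Optional[str] = None,
) -> Dict:
    """Build an Elasticsearch query body from the optional dashboard parameters."""
    filters: List[Dict] = []
    must: List[Dict] = []
    if website:
        filters.append({"term": {"news_website": website}})   # exact match
    if tweet_id:
        filters.append({"term": {"tweet_id_str": tweet_id}})  # exact match
    if keyword:
        must.append({"match": {"news_title": keyword}})        # analyzed text match
    if not filters and not must:
        return {"query": {"match_all": {}}}
    bool_query: Dict = {}
    if must:
        bool_query["must"] = must
    if filters:
        bool_query["filter"] = filters
    return {"query": {"bool": bool_query}}
```

Exact identifiers go into `filter` because they need no relevance scoring, while the headline keyword goes into `must` so that matching documents are ranked.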

As a Data Scientist, I want to assess my Sentiment Analysis so its performance can be assured

Describe the solution you'd like
Jupyter notebooks assessing Sentiment Analysis performance.

As a Developer, I want to make a PoC using Kafka so I can analyze using it to decouple Newsler

Solution

Kafka and Zookeeper docker containers
Something that writes to and reads from Kafka topics
New branch with the PoC or the explanation of the effort involved in this design

Epic
Check epic #22

As a User, I want to access a more complete Dashboard so I can get better insights from Newsler

Inputs that may be useful:

#21
That long WhatsApp message @pablofrias sent with his ideas about what he would like to see in Newsler; ask him if you can't find it in the WhatsApp group

Describe the solution you'd like

Add Tag Cloud visualizations
Add Metric visualizations
Add Controls
Add Filters

As a DevOps, I want to collect logs and metrics from my services so I can monitor them using the ELK stack

Describe the solution you'd like
Use the current filebeat+logstash+elasticsearch+kibana infrastructure to collect logs and metrics from each service. New Logstash pipelines can be used to achieve this.

For tweets that are retweets from another, tweet_likes is always zero

Describe the bug
If a tweet is a retweet of another tweet, tweet_likes is always zero. That is, the number of likes is not saved.

To Reproduce
Steps to reproduce the behavior:

Inspect the data that we have saved.
Compare it with the data shown on Twitter.

Expected behavior
Twitter: https://i.imgur.com/elsozs9.png
Kibana: For tweet_id = 1251658883650203648, tweet_likes must be 140

Current behavior
Twitter: https://i.imgur.com/elsozs9.png
Kibana: https://i.imgur.com/UMFQjOs.png
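
One plausible cause, assuming the crawler reads Twitter API v1.1 status JSON: the retweet wrapper carries a `favorite_count` of 0, and the likes live on the nested `retweeted_status` object. The sketch below shows that fallback on plain dictionaries; `effective_likes` is a hypothetical helper, not code from twitter-crawler.

```python
from typing import Dict


def effective_likes(status: Dict) -> int:
    """Return the like count, preferring the original tweet when this is a retweet."""
    original = status.get("retweeted_status")
    # For a retweet, read the count from the original tweet, not the wrapper.
    source = original if original else status
    return int(source.get("favorite_count") or 0)
```

If the assumption holds, saving `effective_likes(status)` into tweet_likes would make the stored value match what Twitter displays for the retweeted tweet.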

CNN and FoxNews links are broken

Describe the bug
Some of the links generated by the CNN and FoxNews crawler are broken

To Reproduce
Steps to reproduce the behavior:

Run the CNN or FoxNews crawler and check the links.
Some links are not under the /world URL but under other sections, for instance europe or us. By default, /world is added to the URL anyway.

Expected behavior
Links are not broken and the URL is honored

Current behavior
/world is added to all the links
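
A possible fix, sketched under the assumption that the spiders collect `href` values from the /world listing page: resolve each link against the page URL with the standard library's `urljoin` instead of prefixing /world by hand. The helper name and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a scraped href against the page it was found on."""
    # Root-relative paths such as /europe/... replace the page's path,
    # and absolute URLs are returned unchanged.
    return urljoin(page_url, href)
```

With this, a link such as `/europe/some-article` found on `https://example.com/world` resolves to `https://example.com/europe/some-article`, so the section in the original URL is honored.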

As a User, I want to perform free text searches so I can analyze the correlation between News and Tweets

Inputs:

Elasticsearch mapping
Possible data migration or data loss
Programmatic changes, not UI-driven ones that depend on the volume of data in Kibana

Describe the solution you'd like

Modify index template
Add "analyzer" : "stop" to perform free text searches
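
As an illustration, the template body could be generated programmatically as below. This assumes the Elasticsearch 7.x legacy index template format (typeless mappings); the index pattern and field names passed in are hypothetical. `stop` is one of Elasticsearch's built-in analyzers.

```python
from typing import Dict, Iterable


def build_index_template(index_pattern: str, text_fields: Iterable[str]) -> Dict:
    """Build an index template body that maps the given fields as analyzed text."""
    return {
        "index_patterns": [index_pattern],
        "mappings": {
            "properties": {
                # The built-in "stop" analyzer lowercases and drops stop words.
                field: {"type": "text", "analyzer": "stop"}
                for field in text_fields
            }
        },
    }
```

A template only applies to indices created after it is installed, and the analyzer of an existing field cannot be changed in place, so already indexed documents would need to be reindexed.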

The polarity of many tweets is nil

Describe the bug
Out of 1082 tweets, 443 of them have nil polarity

To Reproduce

Check the variable tweet_sentiment_polarity for saved tweets

Expected behavior
Tweets should not have a nil value in their tweet_sentiment_polarity variable

Current behavior
https://i.imgur.com/zXzohR8.png
https://i.imgur.com/89iH9Hw.png
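
To track this bug over time, the share of affected tweets can be computed from exported documents. The helper below is a sketch that assumes each tweet is a dictionary with an optional `tweet_sentiment_polarity` key; the function name is hypothetical.

```python
from typing import Dict, Iterable


def nil_polarity_ratio(tweets: Iterable[Dict]) -> float:
    """Return the fraction of tweets whose polarity is missing or None."""
    docs = list(tweets)
    if not docs:
        return 0.0
    missing = sum(1 for doc in docs if doc.get("tweet_sentiment_polarity") is None)
    return missing / len(docs)
```

For the numbers reported here, 443 out of 1082 tweets is roughly 41% of the sample.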

As a Developer, I want to remove deprecated spiders so I have a unified news crawling system

Is your feature request related to a problem? Please describe.
No

Describe the solution you'd like
Only one way to run web spiders: right now this'd be websites.py, which applies the Template Method pattern
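
For readers unfamiliar with the pattern, here is a generic sketch of Template Method applied to spiders. It does not reproduce websites.py; the class names, hook names, and the regex-based link extraction are all hypothetical.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class NewsSpider(ABC):
    """Base class: crawl() fixes the algorithm, subclasses fill in the steps."""

    def crawl(self, html: str) -> List[str]:
        # Template method: the sequence of steps is the same for every site.
        links = self.extract_links(html)
        return [self.normalize(link) for link in links if self.accept(link)]

    @abstractmethod
    def extract_links(self, html: str) -> List[str]:
        """Site-specific step that every spider must implement."""

    def accept(self, link: str) -> bool:
        return True  # optional hook, overridden when a site needs filtering

    def normalize(self, link: str) -> str:
        return link.strip()  # optional hook with a sensible default


class HrefSpider(NewsSpider):
    def extract_links(self, html: str) -> List[str]:
        return re.findall(r'href="([^"]+)"', html)

    def accept(self, link: str) -> bool:
        return not link.startswith("#")  # skip in-page anchors
```

Each website then becomes a small subclass that overrides only the steps that differ, which is what makes a single entry point for all spiders practical.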
