gonzaloulla / unlp-dbd-newsler

Newsler - News crawler from Websites and Twitter - DB Design - MS in Software Engineering - UNLP

License: MIT License

Python 74.26% Batchfile 0.61% Dockerfile 6.66% Shell 18.47%

unlp-dbd-newsler's People

Contributors

delavegamatias, gonzaloulla, julirios, pablofrias

unlp-dbd-newsler's Issues

As a DevOps, I want to use a wait-for mechanism so my docker-compose services start gracefully

Is your feature request related to a problem? Please describe.
Starting the entire stack at once might make your computer/VM crash.

Describe the solution you'd like
Since the docker-compose depends_on attribute only controls the order in which containers are started and does not wait for a service to be ready, a wait-for-it.sh script should be used.

  • Kibana and Logstash should wait for Elasticsearch
  • Twitter and News Crawlers should wait-for-logstash
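
Because the crawlers are written in Python, the same wait-for idea could also live inside a crawler's entry point instead of a shell script. The following is a minimal sketch under that assumption, not the project's implementation; the function name `wait_for` and its parameters are hypothetical.

```python
import socket
import time


def wait_for(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: it only checks that the port accepts connections,
    which is the same signal wait-for-it.sh relies on.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while the port is still closed.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

A crawler could call `wait_for(host, port)` with the host name and port of its Logstash service before opening its output, and exit with an error when it returns False. An open port does not guarantee that the service has finished initializing, so a health-check request may still be needed on top of this.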

As a User, I want to have a Data Dictionary so I can analyze Newsler outcomes

Describe the solution
Three tables in README.md:

  1. Prefixes: each Newsler component and its data attribute prefix (news_ for news-crawler and tweet_ for twitter-crawler)

  2. News Crawler: a list of attributes and description of data provided by news-crawler

  3. Twitter Crawler: a list of attributes and description of data provided by twitter-crawler

Describe alternatives you've considered
A Google Docs file, an Excel spreadsheet, or a JSON file

As a DevOps, I want to use supervisord so my services are always up

Is your feature request related to a problem? Please describe.
YES!!
After running for a long time, my laptop goes idle and shuts down all network traffic, so twitter-crawler fails.

Whenever one of the crawlers fails, its Python process stops and is never restarted.

Describe the solution you'd like
Use supervisord to keep each process up and running all the time

As a Developer, I want to document Newsler arch & features so I can share it with stakeholders

Value of the US for each role

As a Developer, I want to document Newsler arch & features so I can share it with stakeholders.

As a Student, I want to pass DB Design.

As a Professor, I want to have proper evidence of the work that's been done.

As a User, I don't know all the Newsler prerequisites, features and configurations (e.g., Kibana index patterns) needed to use it

Describe the solution you'd like

A clear and concise User Manual (check alternatives below) detailing the following items:

  • Architecture
  • Progress / Issues / Technologies used / Problems along the road & solutions
  • Future work (Kafka)
  • Docker-compose commands to get everything up & running
  • Kibana index patterns
  • Data Dictionary (maybe the link to the README.md)
  • Dashboards
    • How to import them
    • How to use them
  • Available Visualizations
  • Example queries

MVP (acceptance criteria)

  • The final report for DB Design is completed.

Alternatives

  • Put everything in the README.md file
  • Use Sphinx and .rst files (maybe it's way too much)
  • Create something in Google Docs (meh)
  • Use GitHub Wikis

[EPIC] As a Developer I want to decouple Newsler from ELK Stack so I can change the DB seamlessly

Is your feature request related to a problem? Please describe.

  • As a DB Design requirement, we should encapsulate and decouple from the internal ELK technical details.
  • The current approach of writing one JSON object per line to a file stored in a Docker volume might/should be deprecated

Describe the solution you'd like

Not sure yet. Some framework that abstracts the data models and decouples CRUD operations from any particular persistence layer.

Describe alternatives you've considered

  1. Use Newsler "AS IS" right now, using different Logstash output plugins

  2. A common framework/library (set of Python modules) shared by both news-crawler and twitter-crawler components to abstract data models and decouple CRUD operations

  3. Option 1 plus a Kafka input/output mechanism (maybe a couple of new components and dependencies?)
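
Alternative 2 could be sketched as a small storage interface that both crawlers depend on. This is only an illustration under assumptions: `DocumentStore`, `JsonLinesStore`, `InMemoryStore` and `publish` are hypothetical names, and the JSON-lines backend merely mirrors the json-per-line approach described above.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any, Dict, Iterable, List


class DocumentStore(ABC):
    """Hypothetical persistence interface shared by both crawlers."""

    @abstractmethod
    def save(self, document: Dict[str, Any]) -> None:
        """Persist one crawled document."""


class JsonLinesStore(DocumentStore):
    """Appends one JSON object per line to a file (the current approach)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def save(self, document: Dict[str, Any]) -> None:
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(document, ensure_ascii=False) + "\n")


class InMemoryStore(DocumentStore):
    """Keeps documents in a list; handy for tests or as a stand-in backend."""

    def __init__(self) -> None:
        self.documents: List[Dict[str, Any]] = []

    def save(self, document: Dict[str, Any]) -> None:
        self.documents.append(dict(document))


def publish(store: DocumentStore, documents: Iterable[Dict[str, Any]]) -> int:
    """Crawler-side code: knows only the interface, not the backend."""
    count = 0
    for document in documents:
        store.save(document)
        count += 1
    return count
```

With this shape, switching from the file-plus-Logstash pipeline to another database would mean adding one more `DocumentStore` subclass, while the crawler code that calls `publish` stays unchanged.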

Additional context
I need help, a lot.

As a User, I want to parameterize a Kibana dashboard so I can get better insights from Newsler

Is your feature request related to a problem? Please describe.

How can I parameterize an existing Kibana dashboard to filter or drill into specific results?

Describe the solution you'd like

For example, as a User I would like to define some parameters in order to:

  • Get metrics for a specific news website (number of tweets, tweet info)
  • Get metrics for a specific news headline/keyword
  • Get metrics for a specific tweet ID (tweet_id_str)

For tweets that are retweets of another tweet, tweet_likes is always zero

Describe the bug
If a tweet is a retweet of another tweet, tweet_likes is always zero. That is, the number of likes is not saved.

To Reproduce
Steps to reproduce the behavior:

  1. Inspect the data that we have saved.
  2. Compare it with the data registered in Twitter.

Expected behavior
Twitter: https://i.imgur.com/elsozs9.png
Kibana: For tweet_id = 1251658883650203648, tweet_likes must be 140

Current behavior
Twitter: https://i.imgur.com/elsozs9.png
Kibana: https://i.imgur.com/UMFQjOs.png
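
One possible cause, stated here as an assumption because the crawler code is not shown: in Twitter API v1.1 style payloads, a retweet carries the original tweet under `retweeted_status`, and the likes belong to that embedded object rather than to the retweet itself. A minimal sketch of reading the count under that assumption (the helper name `extract_likes` is hypothetical):

```python
from typing import Any, Dict


def extract_likes(status: Dict[str, Any]) -> int:
    """Return the like count to store as tweet_likes.

    Assumes a Twitter API v1.1 style status dict: for a retweet, the
    original tweet is embedded under "retweeted_status" and its
    "favorite_count" is used; otherwise the status's own count is used.
    """
    original = status.get("retweeted_status")
    source = original if isinstance(original, dict) else status
    return int(source.get("favorite_count") or 0)
```

Whether the retweet should report the original tweet's likes or keep its own count is a data-modelling decision that the Data Dictionary would need to record.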

CNN and FoxNews links are broken

Describe the bug
Some of the links generated by the CNN and FoxNews crawlers are broken

To Reproduce
Steps to reproduce the behavior:

Run the CNN or FoxNews crawler and check the links.
Some links are not under the /world URL but under, for instance, /europe or /us. By default, /world is added to the URL anyway.

Expected behavior
Links are not broken and the original URL path is honored

Current behavior
/world is added to all the links
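
One way to honor the original path is to resolve each extracted href against the page URL instead of concatenating it onto a fixed /world prefix. The sketch below assumes the crawler has the section page URL and the raw href at hand; `resolve_link` and the example.com URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve an href found on page_url into an absolute URL.

    urljoin keeps absolute URLs as they are, replaces the whole path for
    root-relative hrefs such as "/europe/...", and only appends to the
    page path for truly relative hrefs.
    """
    return urljoin(page_url, href)


# Hypothetical section page and hrefs, for illustration only.
section = "https://news.example.com/world/"
print(resolve_link(section, "/europe/some-story"))
print(resolve_link(section, "asia/another-story"))
```

With this approach a root-relative link such as /europe/some-story is no longer forced under /world, while links that really are relative to the section page still resolve inside it.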
