energy-news-roundup's People

Contributors

derekeder, herrerarobert, suragnuthulapaty

energy-news-roundup's Issues

assign states to each entry

A key objective of this project is to split out the news entries by state, and eventually by city.

To do this, let's make use of the locationtagger library: https://pypi.org/project/locationtagger/ and check against the descriptions and link URLs.

If the state isn't named directly, we can create a mapping of publications to states/cities and assign them based on that.
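
The fallback described above could be sketched roughly as follows. This is only an illustration: `PUBLICATION_TO_STATE`, `assign_state`, and the publication names are hypothetical, and `tagged_states` stands in for whatever list of state names the tagging step produces (locationtagger itself is not called here).

```python
# Hypothetical mapping of publication names to states; the entries are
# placeholders and would need to be filled in from the scraped data.
PUBLICATION_TO_STATE = {
    "Example Gazette": "Illinois",
    "Sample Tribune": "Ohio",
}


def assign_state(tagged_states, publication):
    """Return a state for an entry, or None if nothing can be assigned.

    tagged_states: state names already found in the description or URL
                   (assumed to come from the tagging step).
    publication:   publication name parsed from the entry, or None.
    """
    if tagged_states:
        # Prefer a state that was named directly in the entry.
        return tagged_states[0]
    # Otherwise fall back to the publication mapping.
    return PUBLICATION_TO_STATE.get(publication)
```

For example, `assign_state([], "Example Gazette")` falls back to the mapping, while `assign_state(["Wyoming"], "Example Gazette")` keeps the directly named state.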

clean up categories

With the basic scrape completed in #1 (nice work @SuragNuthulapaty and @HerreraRobert!), the next step is to review the output data and see if there are any changes we should make to clean it up. I created a Google Sheet of the scraped digest items here, as well as a summary table with counts for each category: https://docs.google.com/spreadsheets/d/1n_ruhDMz0g-b2yfhak1m9fq7Xr_0SY5rqPeamJqPHho/edit#gid=1843617235

From this, it looks like we're pulling a few redundant / inconsistently named categories, such as OIL AND GAS vs. OIL & GAS: https://docs.google.com/spreadsheets/d/1n_ruhDMz0g-b2yfhak1m9fq7Xr_0SY5rqPeamJqPHho/edit#gid=934319479

We should try to clean up / merge these together. A few ideas for that:

  • convert all text to uppercase and strip out all non-letter characters
  • create a list of categories to ignore, like YOUR AD HERE and YOUR MESSAGE HERE
  • for the ALSO and MEANWHILE articles, maybe associate them with the previously found category?
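
A rough sketch of how these three ideas might fit together is below. The function names are hypothetical. Two details are assumptions beyond the list above: `&` is rewritten to `AND` before stripping (otherwise OIL & GAS and OIL AND GAS would still differ), and spaces are kept so the category names stay readable.

```python
import re

# Categories to drop entirely (from the list above).
IGNORED_CATEGORIES = {"YOUR AD HERE", "YOUR MESSAGE HERE"}
# Labels that inherit the previously found category.
CARRY_FORWARD = {"ALSO", "MEANWHILE"}


def normalize_category(raw):
    """Uppercase the label and strip everything except letters and spaces."""
    text = raw.upper().replace("&", " AND ")  # assumption: treat & as AND
    text = re.sub(r"[^A-Z ]", "", text)
    return " ".join(text.split())  # collapse repeated whitespace


def clean_categories(raw_categories):
    """Return one cleaned category per input label (None for ignored ones)."""
    cleaned = []
    previous = None
    for raw in raw_categories:
        category = normalize_category(raw)
        if category in IGNORED_CATEGORIES:
            cleaned.append(None)
            continue
        if category in CARRY_FORWARD:
            category = previous  # reuse the last real category
        else:
            previous = category
        cleaned.append(category)
    return cleaned
```

With this sketch, both `"Oil & Gas:"` and `"OIL AND GAS"` normalize to the same label, and an `ALSO` entry following a `SOLAR` entry is filed under `SOLAR`.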

Blurbs Contain bullet points

Some of the data processing doesn't work, and some blurbs still contain bullet points. This needs to be fixed. For now, throw an error if we find one.
Screenshot 2022-11-08 at 9 16 27 PM
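
A minimal sketch of such a check, assuming blurbs are plain strings. The name `check_no_bullets` and the exact set of bullet characters are assumptions, not the project's actual code.

```python
# Bullet-like characters to look for; this set is an assumption and may
# need extending to match what the digests actually contain.
BULLET_CHARS = "•‣◦▪"


def check_no_bullets(blurb):
    """Return the blurb unchanged, or raise if a bullet character remains."""
    for char in BULLET_CHARS:
        if char in blurb:
            raise ValueError(
                f"Blurb still contains bullet character {char!r}: {blurb[:60]!r}"
            )
    return blurb
```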

add naive check for presence of state string

Some entries that locationtagger misses seem pretty obvious to me:

  • New Mexico’s state land office brought in more than $2 billion in oil and gas royalty revenue over the last 12 months, a record high.
  • Wyoming conservationists urge state and federal regulators to tax and further restrict oil and gas facilities’ methane emissions.

Let's add a string comparison check against a list of states and also tag based on that.
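
One way this naive check might look is sketched below. `find_states` is a hypothetical name. The match is case-sensitive, uses word boundaries, and tries longer names first so that "West Virginia" is not also reported as "Virginia". Ambiguous names such as "Washington" or "New York" (state vs. city) are not handled.

```python
import re

US_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

# Longest names first, so "West Virginia" wins over "Virginia".
_STATE_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(US_STATES, key=len, reverse=True))
    + r")\b"
)


def find_states(text):
    """Return the state names that appear in text, in order, without repeats."""
    found = []
    for match in _STATE_PATTERN.finditer(text):
        state = match.group(1)
        if state not in found:
            found.append(state)
    return found
```

Applied to the two example blurbs above, this returns `["New Mexico"]` and `["Wyoming"]` respectively.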

Pre process link

Take the links to the digests and pre-process them so that:
<b> -> <strong>
</b> -> </strong>
<i> -> <em>
</i> -> </em>

This makes the markup easier to work through.
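
A minimal sketch of this substitution, assuming the digest HTML is available as a string. `normalize_tags` is a hypothetical name, and the regex approach is just one option; it leaves tags such as `<br>` or `<img>` alone and keeps any attributes on the rewritten tags.

```python
import re

# Old tag name -> replacement tag name.
_TAG_MAP = {"b": "strong", "i": "em"}

# Matches <b>, </b>, <i>, </i>, optionally with attributes, but not
# longer tag names such as <br>, <body>, or <img>.
_TAG_PATTERN = re.compile(r"<(/?)(b|i)(\s[^>]*)?>", re.IGNORECASE)


def normalize_tags(html):
    """Rewrite <b>/<i> tags to <strong>/<em>, keeping any attributes."""
    def _swap(match):
        slash = match.group(1)
        name = match.group(2).lower()
        attrs = match.group(3) or ""
        return f"<{slash}{_TAG_MAP[name]}{attrs}>"

    return _TAG_PATTERN.sub(_swap, html)
```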

Make Unittests

Make unit tests for text parsing, specifically for cleaning up the publication name.
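
As a starting point, the tests might be laid out like the sketch below. `clean_publication_name` here is a hypothetical stand-in, not the project's real function, and the publication names are made up. The real tests would import the project's own cleaning function instead.

```python
import unittest


def clean_publication_name(raw):
    """Hypothetical stand-in: trim whitespace and surrounding parentheses."""
    return " ".join(raw.strip().strip("()").split())


class CleanPublicationNameTest(unittest.TestCase):
    def test_strips_parentheses(self):
        self.assertEqual(clean_publication_name("(Example Gazette)"), "Example Gazette")

    def test_collapses_whitespace(self):
        self.assertEqual(clean_publication_name("  Example   Gazette "), "Example Gazette")

    def test_does_not_truncate_words(self):
        # Leading and trailing letters of the name must survive cleaning.
        self.assertEqual(clean_publication_name("(Public Power Daily)"), "Public Power Daily")
```

Saved in a test module, these can be run with `python -m unittest`.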

Fix publication parsing

Move the logic that parses the publication tag into its own method to make it easier to debug.
Screenshot 2022-11-08 at 8 40 15 PM

This will also help us avoid accidentally truncating words.
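
A sketch of what a dedicated parsing method could look like. It assumes the publication appears in parentheses at the end of each blurb, which this issue does not confirm. `split_publication` and the sample publication name are hypothetical.

```python
import re

# Assumption: the publication is the last parenthesized group in the blurb.
_PUBLICATION_PATTERN = re.compile(r"\s*\(([^()]+)\)\s*$")


def split_publication(blurb):
    """Split a blurb into (text, publication); publication is None if absent."""
    match = _PUBLICATION_PATTERN.search(blurb)
    if match is None:
        return blurb.strip(), None
    # Slice at the match boundary so no characters of the text are dropped.
    return blurb[: match.start()].strip(), match.group(1).strip()
```

Capturing the whole parenthesized group and slicing at the match boundary, rather than stripping characters from the ends, avoids cutting into the blurb text or the publication name.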
