keithcu / linuxreport
Customizable Linux news site based on Python / Flask
Home Page: https://covidreport.keithcu.com/
License: GNU Lesser General Public License v3.0
This request hung:
https://www.google.com/alerts/feeds/12151242449143161443/16985802477674969984
That means that any page request waiting on that feed never returns either.
It seems to be working again now, but it would be nice to have some socket timeout logic of 5 seconds or so at most.
Also, I could have some logic to temporarily cache an empty feedparser result so that anyone who is waiting can use that value instead of blocking the whole site.
would be a nice feature. I just need to grab some reasonable colors.
https://pythonhosted.org/feedparser/http-etag.html
To use this you have to save off some data. It also means that instead of expiring the whole feed to trigger a refresh, you keep it around and use a mechanism other than cache expiration to decide when to refetch the data.
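The saved-off data is just the per-feed etag and modified values. A sketch of the idea, with `parse` standing in for feedparser.parse (which accepts etag= and modified= keywords and sets result.status to 304 when the server reports the feed unchanged) and `feed_state` as a hypothetical in-memory store that would be persisted alongside the cache in practice:

```python
feed_state = {}  # per-URL etag/modified, persisted in practice

def conditional_fetch(url, parse):
    """Issue a conditional GET; return None when the feed is unchanged
    so the caller keeps serving the copy it already has."""
    prev = feed_state.get(url, {})
    result = parse(url, etag=prev.get("etag"), modified=prev.get("modified"))
    if getattr(result, "status", None) == 304:
        return None  # unchanged: no need to reparse or recache
    # Save the new validators for the next conditional request.
    feed_state[url] = {
        "etag": getattr(result, "etag", None),
        "modified": getattr(result, "modified", None),
    }
    return result
```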
On startup, or when multiple feeds have expired, it will sequentially fetch the RSS feeds. That's a little slow when there are 9 or more to fetch, and some sites take 1.5 seconds. Note that most users won't have this problem because it's only 1 request per hour or so. Also, there is a little jitter to spread out the requests so it's unusual to have more than a few fetches.
It would be faster to switch to multi-process or multi-threading to allow multiple fetches to happen at the same time.
It would be simplest to use multi-process, but that could mean that each of the ~10 Python engines that respond to Apache requests would probably have their own pool of 2-5 processes sitting around.
It could also be sped up by creating multiple threads, which Python supports.
Ideally it would be done in an async way. You could queue up 2-9 requests asynchronously with one thread, which would mostly sit idle for 0.5 to 1.5 seconds waiting for each response.
I think creating a pool of Python threads is the best solution here because they are lighter weight than Linux threads and obviously processes, and the logic is very simple.
Because this uses a file system cache, a solution using either processes or threads should work.
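The thread-pool idea fits in a few lines with the standard library. This is a sketch, not the site's code: `fetch` stands in for the real per-feed fetch-and-cache function.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, workers=5):
    """Fetch several feeds concurrently. Each worker spends most of its
    time blocked on the network, so a small pool covers 9+ feeds."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))
```

With 9 feeds at up to 1.5 seconds each, 5 workers cuts the worst case from ~13 seconds sequential to roughly two rounds of fetches.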
The mobile order isn't the same as the desktop order. One way to fix it would be to not create 3 columns on mobile. Whether it's a mobile request can be determined in CSS, but here it's needed in the Python.
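On the Python side, a rough check of the User-Agent header is usually enough; in Flask that header is available as request.headers.get("User-Agent", ""). The token list below is conventional, not anything from the project:

```python
# Common substrings that mobile browsers put in their User-Agent strings.
MOBILE_TOKENS = ("Mobi", "Android", "iPhone", "iPad")

def is_mobile(user_agent):
    return any(token in user_agent for token in MOBILE_TOKENS)

def column_count(user_agent):
    # A single column on mobile preserves the desktop feed order.
    return 1 if is_mobile(user_agent) else 3
```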
Currently the system fetches every hour, or every 6 hours (for sites that usually update just once per day). It does this 24/7.
It should be possible to apply some machine learning per feed to have the system figure out when the site usually updates, and then only make requests around then. This could be done manually (by keeping track of a week's worth of updates), or by applying some machine learning algorithms. It would be great if it could keep learning over time.
This would also be better for the sites that update once per day, because it could try to catch them soon after they are usually posted, rather than up to 6 hours later.
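The manual version of this learning could be as simple as counting which hour of the day new posts tend to appear, then fetching only around that hour. A hypothetical sketch (all names are illustrative):

```python
from collections import Counter

class UpdateTracker:
    """Per-feed learner: record the hour (0-23) at which each new post
    is first seen, then concentrate fetches around the usual hour."""

    def __init__(self):
        self.hours = Counter()

    def record(self, hour):
        self.hours[hour] += 1  # keeps learning over time

    def likely_hour(self):
        if not self.hours:
            return None  # no history yet
        return self.hours.most_common(1)[0][0]

    def should_fetch(self, hour, window=1):
        likely = self.likely_hour()
        if likely is None:
            return True  # fall back to the current hourly schedule
        diff = abs(hour - likely)
        return min(diff, 24 - diff) <= window  # wraps around midnight
```

A real version would still want an occasional off-window fetch so a feed that changes its schedule isn't missed forever.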
I'm sure there are more URLs worth adding to the page. People can already add them themselves, but it's a bit of work to dig up the RSS feed.
It would be nice to have some extra ones that aren't necessarily shown by default, but can be easily chosen without having to track down the RSS URL.
When a feed has expired, if two people come to the website at the same time, it will possibly fetch for both of them. Implement some logic so that, when the feed has expired:
Check for URL + "FETCH" in the cache
if it doesn't exist, then create an entry containing this PID / TID
Then check to make sure it's our PID / TID
If it already exists and it's not our PID / TID, sleep for 100 ms and keep checking till cache entry disappears
If it is, then fetch the RSS feed, add to cache, and then delete the URL + FETCH cache entry
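The steps above can be sketched as follows. This is a simplified illustration with `cache` as an in-memory stand-in for the filesystem cache and `fetch` for the real RSS fetch; note that the check-then-set here is not atomic, so a cache backend with an atomic set-if-absent operation would close the remaining small race.

```python
import os
import threading
import time

cache = {}  # in-memory stand-in for the site's filesystem cache

def my_id():
    return f"{os.getpid()}/{threading.get_ident()}"  # PID / TID

def fetch_once(url, fetch, poll=0.1):
    key = url + "FETCH"
    if key not in cache:
        cache[key] = my_id()         # try to claim the fetch
    if cache.get(key) == my_id():    # confirm our claim actually won
        try:
            cache[url] = fetch(url)  # we own it: fetch and cache the feed
        finally:
            del cache[key]           # release, waking the pollers below
    else:
        while key in cache:          # someone else is fetching
            time.sleep(poll)         # sleep 100 ms and re-check
    return cache.get(url)
```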
The website works with SSL, but it has hard-coded "http:" in various places for images, which need to be fixed.
Is it worth using bootstrap or some custom CSS to look a little prettier?
The jitter is a nice feature to spread out the requests slightly. That way, when someone shows up an hour after the server started, they only trigger one fetch, as the other hourly feeds expire over the next 5 minutes.
However, it only works for short page cache lengths. Right now, with a page cache of 10 minutes, the next user after it expires often has to wait for multiple fetches. So either shorten the page cache to around 1 minute, or take out the jitter.
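For reference, the jitter described above amounts to staggering each feed's expiry by a random few minutes; names here are illustrative, not from the project:

```python
import random

HOUR = 3600

def expiry_with_jitter(base=HOUR, spread=300):
    """Spread each feed's expiry over up to 5 extra minutes so one page
    view rarely has to refresh more than a feed or two at once."""
    return base + random.uniform(0, spread)
```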