docnow / diffengine
track changes to the news, where news is anything with an RSS feed
License: MIT License
Looking at diffengine.log I noticed the following error:
2017-01-19 15:25:06,606 - root - ERROR - unexpected archive.org response for https://web.archive.org/save/http://www.presseportal.de/pm/58964/3539208: name 'url' is not defined
Opening the same URL in a browser just loads archive.org fine and it returns the saved URL.
Not sure if this is just a temporary error due to connection speed or similar issues, or a bug with my PhantomJS install?
It might be useful to be able to configure a feed with a CSS selector specifying which element to extract text from with readability. For example, the Washington Post currently uses
<article itemprop="articleBody">...</article>
To enclose the text of the article using https://schema.org/NewsArticle microdata. Perhaps the config could look like:
- name: Washington Post - Politics
  url: http://feeds.washingtonpost.com/rss/politics
  css_selector: article[itemprop="articleBody"]
  twitter:
    access_token: foo
    access_token_secret: bar
I guess the downside to this is that sites change, so unless you are watching it you may not notice when their markup changes, and your diffengine instance would quietly stop working.
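A rough sketch of what the selector-based extraction could look like. The `extract_article_body` helper is hypothetical, and ElementTree's limited XPath stands in for a real CSS selector library (lxml + cssselect); the fallback covers the markup-drift problem mentioned above:

```python
import xml.etree.ElementTree as ET

def extract_article_body(html, xpath):
    # Sketch only: isolate the element the feed's selector points at
    # before handing the text to readability. Real CSS selector support
    # would need lxml + cssselect.
    root = ET.fromstring(html)
    match = root.find(xpath)
    if match is None:
        return html  # markup changed; fall back to the whole page
    return ET.tostring(match, encoding="unicode")

html = ('<html><body><article itemprop="articleBody">'
        '<p>Story.</p></article></body></html>')
print(extract_article_body(html, './/article[@itemprop="articleBody"]'))
```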
I think it would be a nice feature to be able to ignore articles with specific words in the URL or in the title.
I have one feed where articles with a specific word in the URL are always changed but contain only copyright information (https://twitter.com/ueberschrieben/status/864926843224432641)
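A minimal sketch of such a filter, assuming an invented `ignore_words` list per feed in config.yaml:

```python
def should_ignore(entry_url, entry_title, ignore_words):
    # Hypothetical filter: skip an entry when any configured word
    # appears in its URL or title (case-insensitive).
    haystack = (entry_url + " " + entry_title).lower()
    return any(word.lower() in haystack for word in ignore_words)

print(should_ignore(
    "https://example.com/news/copyright-notice",
    "Copyright information",
    ["copyright"],
))  # True
```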
Looking at the Twitter account's "Apps" settings this is actually correct: it reads "Permissions: read-only", however I just provided the tokens and secret as requested and opened the Twitter authorize URL that returned the pin.
I haven't played with Twitter OAuth for ages and I can't seem to find a way to change the permissions afterwards using the Twitter UI, so what would be the proper way to grant read and write to my app?
EDIT: I simply clicked "Regenerate My Access Token and Token Secret" and Twitter then magically made it a read-and-write token instead of a read-only one.
I think there is a typo in __init__.py: at line 259 it says self.save_url
and it should be just save_url
Since diffs are tweeted and not particular versions, I'm wondering why we are storing the tweet id for the diff on EntryVersion instead of Diff? I think it would be cleaner to store it on the Diff, right?
For example: https://twitter.com/search?f=tweets&q=fox_diff%20Former%20President%20Bush%20intensive%20care&src=typd
All the Archive URLs are the same. However, in the logs I see both:
checking http://feedproxy.google.com/~r/foxnews/politics/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
checking http://feedproxy.google.com/~r/foxnews/most-popular/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
Which I'm guessing is how these duplicates are getting tweeted.
Maybe we need to do some URL de-referencing/canonicalization before storing/checking URLs from feeds? If I curl -I those feedproxy URLs I get a 301 response with a semi-canonical URL in the Location header (it would need to have parameters stripped).
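The parameter-stripping half of the canonicalization could look like this sketch (stdlib only; resolving the feedproxy 301 itself would additionally need an HTTP request, e.g. requests.head(url, allow_redirects=True)):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    # Strip the query string and fragment so the same article seen via
    # different feeds maps to a single stored URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonicalize("http://www.foxnews.com/politics/story.html?utm_source=feedburner"))
```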
I finished installing using AWS cloud9 and got this:
Fetching initial set of entries.
Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
**I downgraded selenium (I saw this might help and I'm really new to this stuff). Then I got this:**
Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
TypeError: __init__() got an unexpected keyword argument 'options'
Please could someone help? Thanks!!
The longer diffengine runs the more urls it needs to check, and the more time it takes to take a full pass through them. The assumption I've had so far based on watching news websites is that the older a page gets the less likely it is to change its content.
There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio or s. If s is greater than a given value (currently .2) it is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:
hotness = current time - entry creation time
staleness = current time - time last checked
s = staleness / hotness
stale? = s >= .2
So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:
1200 / 10800 = .11 (not stale)
Or if the entry is 3 hours old and it was last checked 1 hour ago:
3600 / 10800 = .33 (stale)
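The staleness check described above can be sketched as a small function (names and the 0.2 threshold taken from the description; the datetime plumbing is illustrative):

```python
from datetime import datetime, timedelta

def is_stale(created, checked, now, threshold=0.2):
    # The staleness ratio: time since the last check divided by the
    # entry's age. Older entries need a longer gap between checks
    # before they are deemed stale again.
    staleness = (now - checked).total_seconds()
    hotness = (now - created).total_seconds()
    return (staleness / hotness) >= threshold

now = datetime(2017, 1, 1, 12, 0)
print(is_stale(now - timedelta(hours=3), now - timedelta(minutes=20), now))  # False
print(is_stale(now - timedelta(hours=3), now - timedelta(hours=1), now))     # True
```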
The idea is that things get checked less often as they get older, but the problem that I haven't really verified yet is that I think it can still result in thresholds over which lots of checks need to happen. So periodically diffengine will spend a lot of time checking URLs as they cross over that threshold.
I was wondering if it might make sense to take a more probabilistic approach where URLs are checked more often when they are new and less often as they get older, using some sort of probability sampling. For example, when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradient of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time it should take for a complete run?
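The gradient idea above could be sketched as a linear interpolation; all the numbers here are the illustrative ones from the text, not a proposal for the actual defaults:

```python
import random

def check_probability(age_seconds, max_age=30 * 24 * 3600,
                      p_new=0.8, p_old=0.5):
    # Linearly interpolate the chance of checking an entry from p_new
    # (brand new) down to p_old (a month old or more).
    frac = min(age_seconds / max_age, 1.0)
    return p_new + (p_old - p_new) * frac

def should_check(age_seconds):
    return random.random() < check_probability(age_seconds)

print(check_probability(0))               # 0.8
print(check_probability(30 * 24 * 3600))  # 0.5
```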
It takes about a second to check an entry, and after running against the Washington Post, the Guardian and Breitbart for a week I have 1531 URLs to check. If there were no backing off at all this would be 25 minutes of runtime, and it would only get worse. That would mean new entries are not monitored closely enough, and it would unduly burden the webservers being checked with tons of requests.
I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!
I've noticed that the SavePageNow service gives out occasional 503 Service Unavailable errors. diffengine should guard against that and retry then log the failure.
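A generic retry wrapper could handle this; a sketch with a stand-in response object (the real code would pass a function that performs the SavePageNow request):

```python
import time

def retry(fn, attempts=3, delay=1.0, retry_statuses=(503,)):
    # Call fn(), retrying on 503 Service Unavailable; after the last
    # attempt the caller logs the persistent failure.
    for i in range(attempts):
        resp = fn()
        if resp.status_code not in retry_statuses:
            return resp
        if i < attempts - 1:
            time.sleep(delay)
    return resp

class FakeResponse:  # stand-in for a requests.Response in this sketch
    def __init__(self, status_code):
        self.status_code = status_code

responses = [FakeResponse(503), FakeResponse(200)]
print(retry(lambda: responses.pop(0), delay=0).status_code)  # 200
```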
The current test for staleness doesn't seem to be smart enough. After running diffengine for over a month it is taking it 8 hours to check Breitbart, The Guardian and The Washington Post. I think it needs to be smarter about what to do with the backlog of sites. Perhaps randomly sampling from them?
I have accounts set up, and a couple of them have many URLs for RSS feeds. I'm not sure I have everything set up right. So, we should probably document the best way to set up multiple accounts, and an account that has multiple RSS feeds in it.
Happy to do this work.
I have each account set up in its own home directory:
/home/nruest/.torontosun
/home/nruest/.diffengine
/home/nruest/.globemail
/home/nruest/.canadaland
/home/nruest/.cbc
Toronto Sun has multiple RSS feeds, and I have config.yaml set up like so:
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/photos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/videos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/sunshine-girl/rss.xml
...
phantomjs: phantomjs
twitter:
  consumer_key: SOMETHING
  consumer_secret: SOMETHING
Franc suggests that it would be useful to track changes in images as well. At the moment only textual changes are noted, but it could be possible to notice a substantial change in the images used in the body of an article.
I tried to set up diffengine yesterday and after a few teething issues, including installation failing due to the "--process-dependency-links" error, it seemed to run fine when I rolled back to Pip 18.
However, it does not tweet.
I do have 3 directories in my diffs folder but I noticed that differences were not tweeted. Looking through the diffengine log, there are three error messages that presumably relate to this: "WARNING - not tweeting without archive urls"
Not sure whether other errors in the diffengine.log are related:
"ERROR - unable to get archive id from None"
"ERROR - unexpected archive.org response for https://web.archive.org/save/https://www.blahblah.com"
Any ideas?
@ruebot noticed a series of odd updates like this which led to the discovery that readability returns very little content sometimes. For example:
import requests
import readability
html = requests.get("https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html").content
doc = readability.Document(html)
print(doc.summary())
returns (at the moment):
<html><body><div><div class="article__subheadline" data-reactid="93"><p data-reactid="94">The 15-year-old was remanded into secure accommodation on Wednesday and was also charged with possession of an offensive weapon. </p></div></div></body></html>
Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
I'm setting up a tracker for http://visir.is (an Icelandic news site).
I've noticed that when changes are made to headlines, their system creates new URLs.
The urls are made up of these elements:
http://visir.is/g/<ARTICLE_ID>/< HEADLINE >
To view the article, the < HEADLINE > part is redundant.
To get around it I made some changes to allow for a regex to be applied to each URL from the RSS feed. See here:
This makes the URL checked http://visir.is/g/<ARTICLE_ID>, so subsequent changes to the headline are picked up and not stored as a new article.
I'm not sure introducing a config variable is appropriate for the project, but at least my solution is there, if anyone needs it.
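The rewrite boils down to a configured pattern/replacement pair applied with re.sub; a sketch (the config key names and the example article id are hypothetical):

```python
import re

# Hypothetical per-feed config values: a pattern and replacement applied
# to every URL pulled from the RSS feed before it is stored or checked.
url_pattern = r"(https?://visir\.is/g/\d+)/.*"
url_replacement = r"\1"

url = "http://visir.is/g/2017170139485/some-changing-headline"
print(re.sub(url_pattern, url_replacement, url))  # http://visir.is/g/2017170139485
```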
It might be nice for users to be able to put an array of strings or regexes in config.yaml
that can be used to normalize content before diffing.
For example, I could put 'Scroll down for video' in for deletion for dailymail_diff, or with regexes globemail_diff might be able to remove stock price changes.
Related to #10, there might be a tradeoff for where to put such an array in the YAML hierarchy. Putting it as a top-level key would mean less repetition for people using one config per news source; putting it as a key under each feed would allow people using one config for multiple news sources to have different rules for each.
See also: #14
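The normalization step could be applied like this sketch; the `deletions` and `patterns` config keys are invented names for the literal-string and regex rules described above:

```python
import re

def normalize(text, deletions=(), patterns=()):
    # Apply per-feed cleanup rules from config.yaml before diffing:
    # literal strings are removed outright, regexes via re.sub.
    for s in deletions:
        text = text.replace(s, "")
    for pat in patterns:
        text = re.sub(pat, "", text)
    return text

print(normalize("Scroll down for video Stocks rose 1.2% today",
                deletions=["Scroll down for video "],
                patterns=[r"[+-]?\d+\.\d+%"]))
```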
Noticed on 2017-01-27 for cnn_diff:
2017-01-25 10:16:43,128 - root - INFO - shutting down: new=13 checked=495 skipped=1691 elapsed=0:16:41.543544
2017-01-25 10:30:02,301 - root - INFO - starting up with home=/Users/ryan/source/diffengine/cnn_diff
2017-01-25 10:30:02,317 - root - INFO - fetching feed: http://rss.cnn.com/rss/cnn_topstories.rss
2017-01-25 10:30:03,048 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/KiO2MctO3eI/index.html
2017-01-25 10:30:03,240 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/iyOB4KTWINU/index.html
2017-01-25 10:30:03,413 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/q2hFlt0ZpK0/index.html
and bbc_diff:
2017-01-26 05:52:10,719 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983
2017-01-26 05:53:29,932 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml?edition=int
2017-01-26 05:53:38,509 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/system/latest_published_content/rss.xml
2017-01-26 05:53:38,673 - root - INFO - shutting down: new=6 checked=226 skipped=1851 elapsed=0:08:36.948907
2017-01-26 06:00:01,536 - root - INFO - starting up with home=/Users/ryan/source/diffengine/bbc_diff
2017-01-26 06:00:01,545 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml
2017-01-26 06:00:01,674 - root - INFO - found new entry: http://www.bbc.co.uk/news/science-environment-38755229
2017-01-26 06:00:01,717 - root - INFO - found new entry: http://www.bbc.co.uk/news/business-38748296
2017-01-26 06:00:01,765 - root - INFO - found new entry: http://www.bbc.co.uk/news/magazine-38722929
2017-01-26 06:01:41,721 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983
Both had long-running processes but didn't appear to be logging or doing anything new. I tried using dtruss to trace syscalls in the running processes before killing them, but no syscalls were being made (using dtruss on a successfully-running diffengine instance produces a lot of output).
Hi @edsu, I'm working on the first PR, related to the envyaml package integration, as the first step of the set of features I've been adding in my own fork.
I know you created the thread branch, but I'm sure this way will be much easier, as we can discuss every feature separately and avoid conflicts, since it's one big file.
Anyway, one of the first things I'm thinking about is this: what do you think about adopting a common Python formatter as a condition for collaborators (myself in this case)?
That way every code addition complies with the same coding style. This can be done automatically by installing the code formatter on each collaborator's own computer. E.g. I've installed Black, which formats the code when I save a file, without my having to worry about that kind of stuff.
The catch is that it modifies the entire file the first time it runs. So this would be the very first PR to merge into the master branch, if you agree.
Since the move to Firefox for screenshotting it seems that the image can sometimes not include the diffed text. For example https://twitter.com/whitehouse_diff/status/1252696316381143041
Perhaps there is a timing difference, or the JavaScript that adjusts the page isn't working as it once was?
It looks like changes in whitespace in the readability text are showing up as diffs. Here's an example.
Rather than doing a simple equality check, perhaps whitespace should be stripped somehow? Or we could calculate a diff each time?
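A minimal way to make the comparison whitespace-insensitive (a sketch, not the project's code): collapse all runs of whitespace before the equality check.

```python
def normalized(text):
    # Collapse every run of whitespace to a single space so
    # formatting-only changes don't register as diffs.
    return " ".join(text.split())

print(normalized("Same  story\n\ntext") == normalized("Same story text"))  # True
```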
Because of pypa/pip#4187, --process-dependency-links is not supported anymore and installation fails. Use pip3 install --upgrade pip==18.0.0 to be able to install it.
I caught this via an email from cron. It looks like some better handling of this type of error is needed?
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 233, in archive
    resp = requests.get(save_url, headers={"User-Agent": UA})
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 111, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 9, in <module>
    load_entry_point('diffengine==0.0.27', 'console_scripts', 'diffengine')()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 459, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 173, in get_latest
    new.archive()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 242, in archive
    save_url, resp.headers, e
UnboundLocalError: local variable 'resp' referenced before assignment
It's probably important to put in some kind of tweet throttling so that global changes in content on a website don't trigger a rash of tweets. I think it's probably ok to generate diffs for this content, but excessive tweeting can get your account blocked.
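A sliding-window throttle would be one way to cap the tweet rate while still generating diffs; a sketch with invented limits:

```python
from collections import deque

class TweetThrottle:
    # Allow at most max_tweets per window_seconds; diffs beyond that
    # are still stored, just not tweeted. Numbers are illustrative.
    def __init__(self, max_tweets=10, window_seconds=3600):
        self.max_tweets = max_tweets
        self.window = window_seconds
        self.sent = deque()

    def allow(self, now):
        while self.sent and now - self.sent[0] > self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_tweets:
            self.sent.append(now)
            return True
        return False

throttle = TweetThrottle(max_tweets=2, window_seconds=60)
print([throttle.allow(t) for t in (0, 1, 2, 70)])  # [True, True, False, True]
```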
Hi,
could you tell me how I can modify / customize the tweeted images?
I first thought I would be able to do so with ./diffengine/diff.html, but it seems like the output hasn't changed.
Thanks in advance!
I see there's logging in https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py, how is it turned on?
The utility is running without any evident errors, but with no updates yet so I'd like to debug to check everything's in order.
I'm just now seeing that if something that was previously available becomes 404 Not Found diffengine logs it, but doesn't tweet it. Ideally I think it should tweet it right? Or at least it should be configurable to tweet it. I noticed this because I've been watching the White House website, and a large number of posts from 2017 went missing during the Drupal -> WordPress switch.
The Wayback Machine now has a diff view for comparing two versions of a page. For example:
https://web.archive.org/web/diff/20200204071550/20190921140148/https://www.anotheracronym.org/about/
When tweeting a URL it would be better to link to this diff view rather than expecting users to compare the two versions in separate tabs.
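Building the link is just string assembly from the two snapshot timestamps and the page URL, matching the shape of the example above:

```python
def wayback_diff_url(old_timestamp, new_timestamp, url):
    # Link to the Wayback Machine's two-snapshot diff view.
    return ("https://web.archive.org/web/diff/"
            f"{old_timestamp}/{new_timestamp}/{url}")

print(wayback_diff_url("20190921140148", "20200204071550",
                       "https://www.anotheracronym.org/about/"))
```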
Hi.
I am sorry to contact you here but I am new and I did not find a better way to reach you.
I am looking for a tool that allow me to automatically monitor webpages. Examples of use: track the price of an item and/or check item availability.
Features that I would like to have:
Does your tool allow this?
My intention would be to use this stuff in a raspberry, as it is cheap solution and it has a low power consumption. Is it in your opinion suitable for this?
Thanks
I have a few websites I want to keep an eye on, like Amazon or a few real estate builder websites.
Can i remove the twitter section?
The next version of diffengine will require some database modifications for existing installs. peewee supports migrations. I think we have an example of one in __init__.py, but maybe we should pull these out into a separate module?
I can test the migrations on a v0.2.7 database that I have.
Hi,
at first thanks for your efforts and the project!
I got the script running and tweeting but would like to tweet only changes in headlines. Is there a setting to compare only changes in the headline?
Thanks and kind regards
Tibor
Internet Archive's Save Page Now functionality seems to have some logic to return a previous snapshot if it has one that is 5 minutes or so old. The time of the snapshot is made available in the X-Archive-Orig-Date header. It ought to be possible to parse this and compare it against the current time to see if the snapshot was current.
I'm not quite sure what to do when it isn't current though... I guess it could at least be logged? Alternatively it could decide not to tweet so that this doesn't happen. Notice how the new version doesn't have the new change?
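Parsing the header and comparing against the current time could look like this sketch (the helper name and the 5-minute cutoff are illustrative; the header value is in RFC 2822 date format, which email.utils can parse):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def snapshot_is_current(headers, now=None, max_age_seconds=300):
    # If SavePageNow handed back a reused snapshot, X-Archive-Orig-Date
    # tells us when it was actually taken.
    orig = headers.get("X-Archive-Orig-Date")
    if orig is None:
        return True  # no evidence of snapshot reuse
    snap_time = parsedate_to_datetime(orig)
    now = now or datetime.now(timezone.utc)
    return (now - snap_time).total_seconds() <= max_age_seconds

headers = {"X-Archive-Orig-Date": "Wed, 01 Feb 2017 12:00:00 GMT"}
now = datetime(2017, 2, 1, 12, 10, tzinfo=timezone.utc)
print(snapshot_is_current(headers, now=now))  # False: snapshot is 10 minutes old
```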
Traceback (most recent call last):
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 460, in main
version = entry.get_latest()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 175, in get_latest
diff.generate()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 271, in generate
self._generate_diff_images()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 297, in _generate_diff_images
self.browser = webdriver.PhantomJS(phantomjs)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 52, in __init__
self.service.start()
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 102, in start
raise WebDriverException("Can not connect to the Service %s" % self.path)
selenium.common.exceptions.WebDriverException: Message: Can not connect to the Service phantomjs
https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L297
Maybe we should catch this and retry after waiting? Not sure exactly what's causing it (and if retrying would help or not).
Last line in log: 2017-10-18 13:18:57,133 - root - INFO - checking https://gateway.itstgate.com/WebLink2/WebLink.aspx
Traceback:
Traceback (most recent call last):
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 483, in main
version = entry.get_latest()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 156, in get_latest
title = doc.title()
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 137, in title
return get_title(self._html(True))
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 108, in _html
self.html = self._parse(self.input)
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 117, in _parse
doc, self.encoding = build_doc(input)
File "/usr/local/lib/python3.6/site-packages/readability/htmls.py", line 21, in build_doc
doc = lxml.html.document_fromstring(decoded_page.encode('utf-8', 'replace'), parser=utf8_parser)
File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
It's probably a good idea to use web.archive.org directly instead of pragma as a middle man for adding a URL to Internet Archive? The relevant code can be found here.
I think it would be good to track and tweet only articles when there was a minimum or maximum of changes:
Also a “heatmap” feature would be interesting: not only looking for the amount of changes in the whole article but in the part where the changes have been made (paragraph, heading)
This way it would also be easier to generate smaller captures like mentioned in #34 because we would know which parts of the article have the most relevant changes.
UnicodeEncodeError: 'ascii' codec can't encode character '\u279c' in position 280: ordinal not in range(128)
Call stack:
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 464, in main
tweet_diff(version.diff, f['twitter'])
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 420, in tweet_diff
logging.info("tweeted %s", status)
Message: 'tweeted %s'
Arguments: ('Trump wants good relationship with Russia, May says sanctions should stay | Reuters https://wayback.archive.org/web/20170127111722/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews \u279c https://wayback.archive.org/web/20170127193013/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews',)
Should we explicitly call status.encode('utf-8') before logging? http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20
I've also now set LC_ALL='en_US.utf8' in my crontab as suggested by another answer there to see if that fixes it as well.
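Another option, assuming the ascii codec comes from the log file handler inheriting the cron locale, is to give the handler an explicit UTF-8 encoding; a self-contained sketch:

```python
import logging
import os
import tempfile

# Sketch: open the log file with an explicit UTF-8 encoding so
# characters like the \u279c arrow in tweet text can't trip the ascii
# codec, regardless of the crontab locale.
path = os.path.join(tempfile.mkdtemp(), "diffengine.log")
handler = logging.FileHandler(path, encoding="utf-8")
logger = logging.getLogger("diffengine-demo")
logger.addHandler(handler)
logger.warning("tweeted %s", "old \u279c new")
handler.close()
print(open(path, encoding="utf-8").read().strip())  # tweeted old ➜ new
```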
pip3 install --process-dependency-links diffengine fails on Ubuntu 17.04 with the following message:
Collecting htmldiff==0.2 (from diffengine)
  Could not find a version that satisfies the requirement htmldiff==0.2 (from diffengine) (from versions: 0.1)
No matching distribution found for htmldiff==0.2 (from diffengine)
It looks like Diff.tweeted is no longer set. It should be set when a tweet for a diff has been sent. Also I think we can remove Diff.blogged now.
Rather than having the user always select the location for their profile directory perhaps $HOME/.diffengine could be the default, and it could be overridden with a --profile command line option?
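The proposed CLI could be sketched with argparse (option name and default as described above; nothing here reflects the current code):

```python
import argparse
import os

# Default to $HOME/.diffengine, overridable with --profile.
parser = argparse.ArgumentParser(prog="diffengine")
parser.add_argument(
    "--profile",
    default=os.path.join(os.path.expanduser("~"), ".diffengine"),
    help="location of the diffengine home directory")

print(parser.parse_args(["--profile", "/srv/diffengine/cnn"]).profile)
print(parser.parse_args([]).profile.endswith(".diffengine"))  # True
```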
I think the NewsAPI would be a nice addition to RSS feeds.
There are some pros:
Cons:
However, I think NewsAPI would make diffengine more reliable and give us further options to style the output (with image layouts and author name).
The developer is also thinking about adding an RSS / Atom feature. Maybe a collaboration would be great for both projects?
It would be useful to record when a page disappears completely.
I think images may need to be resized, sometimes they fail like this:
2017-01-18 07:39:22,299 - root - ERROR - unable to tweet: [{'message': 'Image dimensions must be >= 4x4 and <= 8192x8192', 'code': 324}]
I have hundreds of mails on my server with this warning:
/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
(The PhantomJS repo is being archived: ariya/phantomjs#15344. There may be a fork soon: ariya/phantomjs#15345.)
Ideally diffengine should switch from PhantomJS to headless Chrome (eg.) or Firefox (or the fork), but it'd be good to silence this specific warning in the meantime.
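Silencing just this warning in the meantime could be done with a targeted filter (a sketch; `message` is matched as a regex against the start of the warning text, so other UserWarnings still get through):

```python
import warnings

# Suppress only the PhantomJS deprecation UserWarning until the move to
# headless Chrome/Firefox lands.
warnings.filterwarnings(
    "ignore",
    message="Selenium support for PhantomJS has been deprecated",
    category=UserWarning,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.warn(
        "Selenium support for PhantomJS has been deprecated, please use "
        "headless versions of Chrome or Firefox instead", UserWarning)
print(len(caught))  # 0: the warning was filtered out
```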
... is not getting installed by pip install diffengine. It ought to be possible to update setup.py to handle this.
It would be useful for diffengine to establish a lock before running in order to prevent a long running cron job from interfering with a newly started one.
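On Unix this could be an flock-based lock file; a sketch (the lock path and helper name are invented; fcntl is Unix-only):

```python
import fcntl
import os
import tempfile

def acquire_lock(path):
    # Take an exclusive, non-blocking flock; if another diffengine run
    # already holds it, OSError is raised and we can bail out instead
    # of interfering with the long-running cron job.
    fh = open(path, "w")
    fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fh  # keep the handle open for the life of the process

path = os.path.join(tempfile.mkdtemp(), "diffengine.lock")
lock = acquire_lock(path)
try:
    acquire_lock(path)  # a second run would fail here
except OSError:
    print("already running")
```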
At the moment the project always captures the whole article, even if only a few words in a specific section have changed.
I think it would be great to capture only the paragraph that has changed. This way we would not copy the whole article, and it would be easier to read on Twitter as well.