docnow / diffengine
track changes to the news, where news is anything with an RSS feed
License: MIT License
Looking at diffengine.log I noticed the following error:
2017-01-19 15:25:06,606 - root - ERROR - unexpected archive.org response for https://web.archive.org/save/http://www.presseportal.de/pm/58964/3539208: name 'url' is not defined
Opening the same URL in a browser just loads archive.org fine and it returns the saved URL.
Not sure if this is just a temporary error due to connection speed or similar issues, or a bug with my PhantomJS install?
It might be useful to be able to configure a feed with a CSS selector specifying which element to extract text from with readability. For example, the Washington Post currently uses
<article itemprop="articleBody">...</article>
To enclose the text of the article using https://schema.org/NewsArticle microdata. Perhaps the config could look like:
- name: Washington Post - Politics
  url: http://feeds.washingtonpost.com/rss/politics
  css_selector: article[itemprop="articleBody"]
  twitter:
    access_token: foo
    access_token_secret: bar
I guess the downside to this is that sites change, so unless you are watching it you may not notice when their markup changes, and your diffengine instance would quietly stop working.
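A rough sketch of what the selector-based extraction could look like. The `extract_article_body` helper is hypothetical, and ElementTree's limited XPath stands in for a real CSS selector library (lxml + cssselect); the fallback covers the markup-drift problem mentioned above:

```python
import xml.etree.ElementTree as ET

def extract_article_body(html, xpath):
    # Sketch only: isolate the element the feed's selector points at
    # before handing the text to readability. Real CSS selector support
    # would need lxml + cssselect.
    root = ET.fromstring(html)
    match = root.find(xpath)
    if match is None:
        return html  # markup changed; fall back to the whole page
    return ET.tostring(match, encoding="unicode")

html = ('<html><body><article itemprop="articleBody">'
        '<p>Story.</p></article></body></html>')
print(extract_article_body(html, './/article[@itemprop="articleBody"]'))
```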
I think it would be a nice feature to be able to ignore articles with specific words in the URL or in the title.
I have one feed where articles with a specific word in the URL are always changed but contain only copyright information (https://twitter.com/ueberschrieben/status/864926843224432641)
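A minimal sketch of such a filter, assuming an invented `ignore_words` list per feed in config.yaml:

```python
def should_ignore(entry_url, entry_title, ignore_words):
    # Hypothetical filter: skip an entry when any configured word
    # appears in its URL or title (case-insensitive).
    haystack = (entry_url + " " + entry_title).lower()
    return any(word.lower() in haystack for word in ignore_words)

print(should_ignore(
    "https://example.com/news/copyright-notice",
    "Copyright information",
    ["copyright"],
))  # True
```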
Looking at the Twitter account's "Apps" settings this is actually correct: it reads "Permissions: read-only", however I just provided the tokens and secret as requested and opened the Twitter authorize URL that returned the pin.
I haven't played with Twitter OAuth for ages and I can't seem to find a way to change the permissions afterwards using the Twitter UI, so what would be the proper way to grant read and write to my app?
EDIT: I simply clicked "Regenerate My Access Token and Token Secret" and Twitter then magically made it a read-and-write token instead of a read-only one.
I think there is a typo in __init__.py: at line 259 it says self.save_url
and it should be just save_url
Since diffs are tweeted and not particular versions, I'm wondering why we are storing the tweet id for the diff on EntryVersion instead of Diff? I think it would be cleaner to store it on the Diff, right?
For example: https://twitter.com/search?f=tweets&q=fox_diff%20Former%20President%20Bush%20intensive%20care&src=typd
All the Archive URLs are the same. However, in the logs I see both:
checking http://feedproxy.google.com/~r/foxnews/politics/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
checking http://feedproxy.google.com/~r/foxnews/most-popular/~3/_eSUNInb95I/bush-41-cant-make-inauguration-tells-trump-sitting-outside-could-put-me-six-feet-under.html
Which I'm guessing is how these duplicates are getting tweeted.
Maybe we need to do some URL de-referencing/canonicalization before storing/checking URLs from feeds? If I curl -I those feedproxy URLs I get a 301 response with a semi-canonical URL in the Location header (it would need to have parameters stripped).
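The parameter-stripping half of the canonicalization could look like this sketch (stdlib only; resolving the feedproxy 301 itself would additionally need an HTTP request, e.g. requests.head(url, allow_redirects=True)):

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    # Strip the query string and fragment so the same article seen via
    # different feeds maps to a single stored URL.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

print(canonicalize("http://www.foxnews.com/politics/story.html?utm_source=feedburner"))
```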
I finished installing using AWS cloud9 and got this:
Fetching initial set of entries.
Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/firefox/webdriver.py", line 174, in __init__
    keep_alive=True)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: Unable to find a matching set of capabilities
**I downgraded selenium (I saw this might help and I'm really new to this stuff). Then I got this:**
Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 473, in main
    init(home)
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 463, in init
    setup_browser()
  File "/usr/local/lib/python3.6/dist-packages/diffengine/__init__.py", line 421, in setup_browser
    browser = webdriver.Firefox(options=opts)
TypeError: __init__() got an unexpected keyword argument 'options'
Please could someone help? Thanks!!
The longer diffengine runs the more urls it needs to check, and the more time it takes to take a full pass through them. The assumption I've had so far based on watching news websites is that the older a page gets the less likely it is to change its content.
There is a method on the Entry object that calculates whether an entry is stale or not. It uses what I call a staleness ratio or s. If s is greater than a given value (currently .2) it is deemed stale. I've thought about making this magic number configurable per feed. Here's how it works:
hotness = current time - entry creation time
staleness = current time - time last checked
s = staleness / hotness
stale? = s >= .2
So if an entry is 3 hours (10800 seconds) old and it was last checked 20 minutes (1200 seconds) ago, the calculation is:
1200 / 10800 = .11 (not stale)
Or if the entry is 3 hours old and it was last checked 1 hour ago:
3600 / 10800 = .33 (stale)
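The staleness check described above can be sketched as a small function (names and the 0.2 threshold taken from the description; the datetime plumbing is illustrative):

```python
from datetime import datetime, timedelta

def is_stale(created, checked, now, threshold=0.2):
    # The staleness ratio: time since the last check divided by the
    # entry's age. Older entries need a longer gap between checks
    # before they are deemed stale again.
    staleness = (now - checked).total_seconds()
    hotness = (now - created).total_seconds()
    return (staleness / hotness) >= threshold

now = datetime(2017, 1, 1, 12, 0)
print(is_stale(now - timedelta(hours=3), now - timedelta(minutes=20), now))  # False
print(is_stale(now - timedelta(hours=3), now - timedelta(hours=1), now))     # True
```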
The idea is that things get checked less often as they get older, but the problem that I haven't really verified yet is that I think it can still result in thresholds over which lots of checks need to happen. So periodically diffengine will spend a lot of time checking URLs as they cross over that threshold.
I was wondering if it might make sense to take a more probabilistic approach where URLs are checked more often when they are new and less often as they get older, using some sort of probability sampling. For example, when an entry is new it is checked 80% of the time, and as it gets to be old, say a month old, it is checked only 50% of the time. So a gradient of some kind like that? Or maybe it should also factor in the total number of entries that need to be checked, and the desired time it should take for a complete run?
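The gradient idea above could be sketched as a linear interpolation; all the numbers here are the illustrative ones from the text, not a proposal for the actual defaults:

```python
import random

def check_probability(age_seconds, max_age=30 * 24 * 3600,
                      p_new=0.8, p_old=0.5):
    # Linearly interpolate the chance of checking an entry from p_new
    # (brand new) down to p_old (a month old or more).
    frac = min(age_seconds / max_age, 1.0)
    return p_new + (p_old - p_new) * frac

def should_check(age_seconds):
    return random.random() < check_probability(age_seconds)

print(check_probability(0))               # 0.8
print(check_probability(30 * 24 * 3600))  # 0.5
```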
It takes about a second to check an entry, and after running against the Washington Post, the Guardian and Breitbart for a week I have 1531 URLs to check. If there were no backing off at all this would be 25 minutes of runtime, and it would only get worse. That would mean new entries are not monitored closely enough, and it would unduly burden the webservers being checked with tons of requests.
I suspect this problem may have been solved elsewhere before, so if you have ideas or pointers they would be appreciated!
I've noticed that the SavePageNow service gives out occasional 503 Service Unavailable errors. diffengine should guard against that and retry then log the failure.
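A generic retry wrapper could handle this; a sketch with a stand-in response object (the real code would pass a function that performs the SavePageNow request):

```python
import time

def retry(fn, attempts=3, delay=1.0, retry_statuses=(503,)):
    # Call fn(), retrying on 503 Service Unavailable; after the last
    # attempt the caller logs the persistent failure.
    for i in range(attempts):
        resp = fn()
        if resp.status_code not in retry_statuses:
            return resp
        if i < attempts - 1:
            time.sleep(delay)
    return resp

class FakeResponse:  # stand-in for a requests.Response in this sketch
    def __init__(self, status_code):
        self.status_code = status_code

responses = [FakeResponse(503), FakeResponse(200)]
print(retry(lambda: responses.pop(0), delay=0).status_code)  # 200
```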
The current test for staleness doesn't seem to be smart enough. After running diffengine for over a month it is taking it 8 hours to check Breitbart, The Guardian and The Washington Post. I think it needs to be smarter about what to do with the backlog of sites. Perhaps randomly sampling from them?
I have accounts set up, and a couple of them have many URLs for RSS feeds. I'm not sure I have everything set up right. So, we should probably document the best way to set up multiple accounts, and an account that has multiple RSS feeds in it.
Happy to do this work.
I have each account set up in its own home directory:
/home/nruest/.torontosun
/home/nruest/.diffengine
/home/nruest/.globemail
/home/nruest/.canadaland
/home/nruest/.cbc
Toronto Sun has multiple RSS feeds, and I have config.yaml set up like so:
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/photos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/videos/rss.xml
- name: Top Home stories
  twitter:
    access_token: SOMETHING
    access_token_secret: SOMETHING
  url: http://www.torontosun.com/sunshine-girl/rss.xml
...
phantomjs: phantomjs
twitter:
  consumer_key: SOMETHING
  consumer_secret: SOMETHING
Franc suggests that it would be useful to track changes in images as well. At the moment only textual changes are noted, but it could be possible to notice a substantial change in the images used in the body of an article.
I tried to set up diffengine yesterday and after a few teething issues, including installation failing due to the "--process-dependency-links" error, it seemed to run fine when I rolled back to Pip 18.
However, it does not tweet.
I do have 3 directories in my diffs folder but I noticed that differences were not tweeted. Looking through the diffengine log, there are three error messages that presumably relate to this: "WARNING - not tweeting without archive urls"
Not sure whether other errors in the diffengine.log are related:
"ERROR - unable to get archive id from None"
"ERROR - unexpected archive.org response for https://web.archive.org/save/https://www.blahblah.com"
Any ideas?
@ruebot noticed a series of odd updates like this which led to the discovery that readability returns very little content sometimes. For example:
import requests
import readability
html = requests.get("https://www.thestar.com/news/world/2017/01/11/uk-teen-charged-with-murder-of-7-year-old-girl.html").content
doc = readability.Document(html)
print(doc.summary())
returns (at the moment):
<html><body><div><div class="article__subheadline" data-reactid="93"><p data-reactid="94">The 15-year-old was remanded into secure accommodation on Wednesday and was also charged with possession of an offensive weapon. </p></div></div></body></html>
Perhaps there should be a configurable threshold below which the content will be ignored or at least not tweeted? Could readability be tuned in this case to return content that is more appropriate like the text of the AP press release?
I'm setting up a tracker for http://visir.is (an Icelandic news site).
I've noticed that when changes are made to headlines, their system creates new URLs.
The urls are made up of these elements:
http://visir.is/g/<ARTICLE_ID>/< HEADLINE >
To view the article, the < HEADLINE > part is redundant.
To get around it I made some changes to allow for a regex to be applied to each URL from the RSS feed. See here:
This makes the URL checked http://visir.is/g/<ARTICLE_ID>, so subsequent changes to the headline are picked up and not stored as a new article.
I'm not sure introducing a config variable is appropriate for the project, but at least my solution is there, if anyone needs it.
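The rewrite boils down to a configured pattern/replacement pair applied with re.sub; a sketch (the config key names and the example article id are hypothetical):

```python
import re

# Hypothetical per-feed config values: a pattern and replacement applied
# to every URL pulled from the RSS feed before it is stored or checked.
url_pattern = r"(https?://visir\.is/g/\d+)/.*"
url_replacement = r"\1"

url = "http://visir.is/g/2017170139485/some-changing-headline"
print(re.sub(url_pattern, url_replacement, url))  # http://visir.is/g/2017170139485
```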
It might be nice for users to be able to put an array of strings or regexes in config.yaml
that can be used to normalize content before diffing.
For example, I could put 'Scroll down for video' in for deletion for dailymail_diff, or with regexes globemail_diff might be able to remove stock price changes.
Related to #10, there might be a tradeoff for where to put such an array in the YAML hierarchy. Putting it as a top-level key would mean less repetition for people using one config per news source; putting it as a key under each feed would allow people using one config for multiple news sources to have different rules for each.
See also: #14
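The normalization step could be applied like this sketch; the `deletions` and `patterns` config keys are invented names for the literal-string and regex rules described above:

```python
import re

def normalize(text, deletions=(), patterns=()):
    # Apply per-feed cleanup rules from config.yaml before diffing:
    # literal strings are removed outright, regexes via re.sub.
    for s in deletions:
        text = text.replace(s, "")
    for pat in patterns:
        text = re.sub(pat, "", text)
    return text

print(normalize("Scroll down for video Stocks rose 1.2% today",
                deletions=["Scroll down for video "],
                patterns=[r"[+-]?\d+\.\d+%"]))
```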
Noticed on 2017-01-27 for cnn_diff:
2017-01-25 10:16:43,128 - root - INFO - shutting down: new=13 checked=495 skipped=1691 elapsed=0:16:41.543544
2017-01-25 10:30:02,301 - root - INFO - starting up with home=/Users/ryan/source/diffengine/cnn_diff
2017-01-25 10:30:02,317 - root - INFO - fetching feed: http://rss.cnn.com/rss/cnn_topstories.rss
2017-01-25 10:30:03,048 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/KiO2MctO3eI/index.html
2017-01-25 10:30:03,240 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/iyOB4KTWINU/index.html
2017-01-25 10:30:03,413 - root - INFO - found new entry: http://rss.cnn.com/~r/rss/cnn_topstories/~3/q2hFlt0ZpK0/index.html
and bbc_diff:
2017-01-26 05:52:10,719 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983
2017-01-26 05:53:29,932 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml?edition=int
2017-01-26 05:53:38,509 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/system/latest_published_content/rss.xml
2017-01-26 05:53:38,673 - root - INFO - shutting down: new=6 checked=226 skipped=1851 elapsed=0:08:36.948907
2017-01-26 06:00:01,536 - root - INFO - starting up with home=/Users/ryan/source/diffengine/bbc_diff
2017-01-26 06:00:01,545 - root - INFO - fetching feed: http://feeds.bbci.co.uk/news/rss.xml
2017-01-26 06:00:01,674 - root - INFO - found new entry: http://www.bbc.co.uk/news/science-environment-38755229
2017-01-26 06:00:01,717 - root - INFO - found new entry: http://www.bbc.co.uk/news/business-38748296
2017-01-26 06:00:01,765 - root - INFO - found new entry: http://www.bbc.co.uk/news/magazine-38722929
2017-01-26 06:01:41,721 - root - WARNING - Got 404 when fetching http://www.bbc.co.uk/news/world-us-canada-38702983
Both had long-running processes but didn't appear to be logging or doing anything new. I tried using dtruss to trace syscalls in the running processes before killing them, but no syscalls were being made (using dtruss on a successfully-running diffengine instance produces a lot of output).
Hi @edsu, I'm working on the first PR, related to the envyaml package integration, as the first step of the set of features I've been adding in my own fork.
I know you created the thread branch, but I'm sure this way will be much easier, as we can discuss every feature separately and avoid conflicts, since it's one big file.
Anyway, one of the first things I'm thinking about is this: what do you think about adopting a common Python formatter as a condition for collaborators (myself in this case)?
That way every code addition complies with the same coding style. This can be done automatically by installing the code formatter on each collaborator's own computer. E.g. I've installed Black, which formats the code when I save a file, without my having to worry about that kind of stuff.
The catch is that it modifies the entire file the first time it runs. So this would be the very first PR to merge into the master branch, if you agree.
Since the move to Firefox for screenshotting it seems that the image can sometimes not include the diffed text. For example https://twitter.com/whitehouse_diff/status/1252696316381143041
Perhaps there is a timing difference, or the JavaScript that adjusts the page isn't working as it once was?
It looks like changes in whitespace in the readability text are showing up as diffs. Here's an example.
Rather than doing a simple equality check, perhaps whitespace should be stripped somehow? Or we could calculate a diff each time?
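A minimal way to make the comparison whitespace-insensitive (a sketch, not the project's code): collapse all runs of whitespace before the equality check.

```python
def normalized(text):
    # Collapse every run of whitespace to a single space so
    # formatting-only changes don't register as diffs.
    return " ".join(text.split())

print(normalized("Same  story\n\ntext") == normalized("Same story text"))  # True
```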
Because of pypa/pip#4187, --process-dependency-links is not supported anymore and installation fails. Use pip3 install --upgrade pip==18.0.0 to be able to install it.
I caught this via an email from cron. It looks like some better handling of this type of error is needed?
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 233, in archive
    resp = requests.get(save_url, headers={"User-Agent": UA})
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 630, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 111, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/diffengine", line 9, in <module>
    load_entry_point('diffengine==0.0.27', 'console_scripts', 'diffengine')()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 459, in main
    version = entry.get_latest()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 173, in get_latest
    new.archive()
  File "/usr/local/lib/python3.5/dist-packages/diffengine/__init__.py", line 242, in archive
    save_url, resp.headers, e
UnboundLocalError: local variable 'resp' referenced before assignment
It's probably important to put in some kind of tweet throttling so that global changes in content on a website don't trigger a rash of tweets. I think it's probably ok to generate diffs for this content, but excessive tweeting can get your account blocked.
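A sliding-window throttle would be one way to cap the tweet rate while still generating diffs; a sketch with invented limits:

```python
from collections import deque

class TweetThrottle:
    # Allow at most max_tweets per window_seconds; diffs beyond that
    # are still stored, just not tweeted. Numbers are illustrative.
    def __init__(self, max_tweets=10, window_seconds=3600):
        self.max_tweets = max_tweets
        self.window = window_seconds
        self.sent = deque()

    def allow(self, now):
        while self.sent and now - self.sent[0] > self.window:
            self.sent.popleft()
        if len(self.sent) < self.max_tweets:
            self.sent.append(now)
            return True
        return False

throttle = TweetThrottle(max_tweets=2, window_seconds=60)
print([throttle.allow(t) for t in (0, 1, 2, 70)])  # [True, True, False, True]
```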
Hi,
could you tell me how I can modify / customize the tweeted images?
I first thought I would be able to do so with ./diffengine/diff.html, but it seems like the output hasn't changed.
Thanks in advance!
I see there's logging in https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py, how is it turned on?
The utility is running without any evident errors, but with no updates yet so I'd like to debug to check everything's in order.
I'm just now seeing that if something that was previously available becomes 404 Not Found diffengine logs it, but doesn't tweet it. Ideally I think it should tweet it right? Or at least it should be configurable to tweet it. I noticed this because I've been watching the White House website, and a large number of posts from 2017 went missing during the Drupal -> WordPress switch.
The Wayback Machine now has a diff view for comparing two versions of a page. For example:
https://web.archive.org/web/diff/20200204071550/20190921140148/https://www.anotheracronym.org/about/
When tweeting a URL it would be better to link to this diff view rather than expecting users to compare the two versions in separate tabs.
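Building the link is just string assembly from the two snapshot timestamps and the page URL, matching the shape of the example above:

```python
def wayback_diff_url(old_timestamp, new_timestamp, url):
    # Link to the Wayback Machine's two-snapshot diff view.
    return ("https://web.archive.org/web/diff/"
            f"{old_timestamp}/{new_timestamp}/{url}")

print(wayback_diff_url("20190921140148", "20200204071550",
                       "https://www.anotheracronym.org/about/"))
```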
Hi.
I am sorry to contact you here but I am new and I did not find a better way to reach you.
I am looking for a tool that allow me to automatically monitor webpages. Examples of use: track the price of an item and/or check item availability.
Features that I would like to have:
Does your tool allow this?
My intention would be to use this stuff in a raspberry, as it is cheap solution and it has a low power consumption. Is it in your opinion suitable for this?
Thanks
I have a few websites I want to keep an eye on, like Amazon or a few real estate builder websites.
Can i remove the twitter section?
The next version of diffengine will require some database modifications for existing installs. peewee supports migrations. I think we have an example of one in __init__.py, but maybe we should pull these out into a separate module?
I can test the migrations on a v0.2.7 database that I have.
Hi,
at first thanks for your efforts and the project!
I got the script running and tweeting but would like to tweet only changes in headlines. Is there a setting to compare only changes in the headline?
Thanks and kind regards
Tibor
Internet Archive's Save Page Now functionality seems to have some logic to return a previous snapshot if it has one that is 5 minutes or so old. The time of the snapshot is made available in the X-Archive-Orig-Date header. It ought to be possible to parse this and compare it against the current time to see if the snapshot was current.
I'm not quite sure what to do when it isn't current though... I guess it could at least be logged? Alternatively it could decide not to tweet so that this doesn't happen. Notice how the new version doesn't have the new change?
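Parsing the header and comparing against the current time could look like this sketch (the helper name and the 5-minute cutoff are illustrative; the header value is in RFC 2822 date format, which email.utils can parse):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def snapshot_is_current(headers, now=None, max_age_seconds=300):
    # If SavePageNow handed back a reused snapshot, X-Archive-Orig-Date
    # tells us when it was actually taken.
    orig = headers.get("X-Archive-Orig-Date")
    if orig is None:
        return True  # no evidence of snapshot reuse
    snap_time = parsedate_to_datetime(orig)
    now = now or datetime.now(timezone.utc)
    return (now - snap_time).total_seconds() <= max_age_seconds

headers = {"X-Archive-Orig-Date": "Wed, 01 Feb 2017 12:00:00 GMT"}
now = datetime(2017, 2, 1, 12, 10, tzinfo=timezone.utc)
print(snapshot_is_current(headers, now=now))  # False: snapshot is 10 minutes old
```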
Traceback (most recent call last):
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 460, in main
version = entry.get_latest()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 175, in get_latest
diff.generate()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 271, in generate
self._generate_diff_images()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 297, in _generate_diff_images
self.browser = webdriver.PhantomJS(phantomjs)
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/phantomjs/webdriver.py", line 52, in __init__
self.service.start()
File "/usr/local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 102, in start
raise WebDriverException("Can not connect to the Service %s" % self.path)
selenium.common.exceptions.WebDriverException: Message: Can not connect to the Service phantomjs
https://github.com/DocNow/diffengine/blob/master/diffengine/__init__.py#L297
Maybe we should catch this and retry after waiting? Not sure exactly what's causing it (and if retrying would help or not).
Last line in log: 2017-10-18 13:18:57,133 - root - INFO - checking https://gateway.itstgate.com/WebLink2/WebLink.aspx
Traceback:
Traceback (most recent call last):
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 483, in main
version = entry.get_latest()
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 156, in get_latest
title = doc.title()
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 137, in title
return get_title(self._html(True))
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 108, in _html
self.html = self._parse(self.input)
File "/usr/local/lib/python3.6/site-packages/readability/readability.py", line 117, in _parse
doc, self.encoding = build_doc(input)
File "/usr/local/lib/python3.6/site-packages/readability/htmls.py", line 21, in build_doc
doc = lxml.html.document_fromstring(decoded_page.encode('utf-8', 'replace'), parser=utf8_parser)
File "/usr/local/lib/python3.6/site-packages/lxml/html/__init__.py", line 765, in document_fromstring
"Document is empty")
lxml.etree.ParserError: Document is empty
It's probably a good idea to use web.archive.org directly instead of pragma as a middle man for adding a URL to Internet Archive? The relevant code can be found here.
I think it would be good to track and tweet only articles when there was a minimum or maximum of changes:
Also a “heatmap” feature would be interesting: not only looking for the amount of changes in the whole article but in the part where the changes have been made (paragraph, heading)
This way it would also be easier to generate smaller captures like mentioned in #34 because we would know which parts of the article have the most relevant changes.
UnicodeEncodeError: 'ascii' codec can't encode character '\u279c' in position 280: ordinal not in range(128)
Call stack:
File "/usr/local/bin/diffengine", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 464, in main
tweet_diff(version.diff, f['twitter'])
File "/usr/local/lib/python3.6/site-packages/diffengine/__init__.py", line 420, in tweet_diff
logging.info("tweeted %s", status)
Message: 'tweeted %s'
Arguments: ('Trump wants good relationship with Russia, May says sanctions should stay | Reuters https://wayback.archive.org/web/20170127111722/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews \u279c https://wayback.archive.org/web/20170127193013/http://www.reuters.com/article/us-usa-trump-britain-idUSKBN15B104?feedType=RSS&feedName=politicsNews',)
Should we explicitly call status.encode('utf-8') before logging? http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20
I've also now set LC_ALL='en_US.utf8' in my crontab as suggested by another answer there to see if that fixes it as well.
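Another option, assuming the ascii codec comes from the log file handler inheriting the cron locale, is to give the handler an explicit UTF-8 encoding; a self-contained sketch:

```python
import logging
import os
import tempfile

# Sketch: open the log file with an explicit UTF-8 encoding so
# characters like the \u279c arrow in tweet text can't trip the ascii
# codec, regardless of the crontab locale.
path = os.path.join(tempfile.mkdtemp(), "diffengine.log")
handler = logging.FileHandler(path, encoding="utf-8")
logger = logging.getLogger("diffengine-demo")
logger.addHandler(handler)
logger.warning("tweeted %s", "old \u279c new")
handler.close()
print(open(path, encoding="utf-8").read().strip())  # tweeted old ➜ new
```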
pip3 install --process-dependency-links diffengine fails on Ubuntu 17.04 with the following message:
Collecting htmldiff==0.2 (from diffengine)
  Could not find a version that satisfies the requirement htmldiff==0.2 (from diffengine) (from versions: 0.1)
No matching distribution found for htmldiff==0.2 (from diffengine)
It looks like Diff.tweeted is no longer set. It should be set when a tweet for a diff has been sent. Also I think we can remove Diff.blogged now.
Rather than having the user always select the location for their profile directory perhaps $HOME/.diffengine could be the default, and it could be overridden with a --profile command line option?
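The proposed CLI could be sketched with argparse (option name and default as described above; nothing here reflects the current code):

```python
import argparse
import os

# Default to $HOME/.diffengine, overridable with --profile.
parser = argparse.ArgumentParser(prog="diffengine")
parser.add_argument(
    "--profile",
    default=os.path.join(os.path.expanduser("~"), ".diffengine"),
    help="location of the diffengine home directory")

print(parser.parse_args(["--profile", "/srv/diffengine/cnn"]).profile)
print(parser.parse_args([]).profile.endswith(".diffengine"))  # True
```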
I think the NewsAPI would be a nice addition to RSS feeds.
There are some pros:
Cons:
However, I think NewsAPI would make diffengine more reliable and give us further options to style the output (with image layouts and author name).
The developer is also thinking about adding an RSS / Atom feature. Maybe a collaboration would be great for both projects?
It would be useful to record when a page disappears completely.
I think images may need to be resized, sometimes they fail like this:
2017-01-18 07:39:22,299 - root - ERROR - unable to tweet: [{'message': 'Image dimensions must be >= 4x4 and <= 8192x8192', 'code': 324}]
I have hundreds of mails on my server with this warning:
/usr/local/lib/python3.6/dist-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
(The PhantomJS repo is being archived: ariya/phantomjs#15344. There may be a fork soon: ariya/phantomjs#15345.)
Ideally diffengine should switch from PhantomJS to headless Chrome (eg.) or Firefox (or the fork), but it'd be good to silence this specific warning in the meantime.
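Silencing just this warning in the meantime could be done with a targeted filter (a sketch; `message` is matched as a regex against the start of the warning text, so other UserWarnings still get through):

```python
import warnings

# Suppress only the PhantomJS deprecation UserWarning until the move to
# headless Chrome/Firefox lands.
warnings.filterwarnings(
    "ignore",
    message="Selenium support for PhantomJS has been deprecated",
    category=UserWarning,
)

with warnings.catch_warnings(record=True) as caught:
    warnings.warn(
        "Selenium support for PhantomJS has been deprecated, please use "
        "headless versions of Chrome or Firefox instead", UserWarning)
print(len(caught))  # 0: the warning was filtered out
```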
... is not getting installed by pip install diffengine. It ought to be possible to update setup.py to handle this.
It would be useful for diffengine to establish a lock before running in order to prevent a long running cron job from interfering with a newly started one.
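On Unix this could be an flock-based lock file; a sketch (the lock path and helper name are invented; fcntl is Unix-only):

```python
import fcntl
import os
import tempfile

def acquire_lock(path):
    # Take an exclusive, non-blocking flock; if another diffengine run
    # already holds it, OSError is raised and we can bail out instead
    # of interfering with the long-running cron job.
    fh = open(path, "w")
    fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fh  # keep the handle open for the life of the process

path = os.path.join(tempfile.mkdtemp(), "diffengine.lock")
lock = acquire_lock(path)
try:
    acquire_lock(path)  # a second run would fail here
except OSError:
    print("already running")
```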
At the moment the project always captures the whole article, even if only a few words in a specific section have changed.
I think it would be great to capture only the paragraph that has changed. This way we would not copy the whole article, and it would be easier to read on Twitter as well.