pictuga / morss

Get full text RSS feeds

Home Page: https://morss.it/

License: GNU Affero General Public License v3.0

Python 90.48% HTML 1.03% XSLT 7.40% Dockerfile 0.35% Shell 0.74%

article full-text python rss

morss's Issues

Facebook Connection fails

When trying to connect with Facebook, the following error appears:

Can't Load URL: The domain of this URL isn't included in the app's domains. To be able to load this URL, add all domains and subdomains of your app to the App Domains field in your app settings.

Internal Server Error

I'm getting this new error while running in Docker (built from git).
URL: www.maketecheasier.com/feed
The problem occurs only when self-hosting; it works fine with morss.it.

File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 196, in cgi_file_handler
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 132, in cgi_app
    url, rss = FeedFetch(url, options)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 275, in FeedFetch
    raise MorssException('Error downloading feed')
morss.morss.MorssException: Error downloading feed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 134, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 175, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 266, in cgi_encode
    out = app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 260, in cgi_error_handler
    log('ERROR: %s' % repr(e), force=True)
TypeError: log() got an unexpected keyword argument 'force'
[2021-01-03 15:29:41 +0000] [8] [ERROR] Error handling request /favicon.ico
Traceback (most recent call last):
  File "/usr/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1255, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1301, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1250, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1010, in _send_output
    self.send(msg)
  File "/usr/lib/python3.8/http/client.py", line 950, in send
    self.connect()
  File "/usr/lib/python3.8/http/client.py", line 921, in connect
    self.sock = self._create_connection(
  File "/usr/lib/python3.8/socket.py", line 787, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/lib/python3.8/socket.py", line 918, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name does not resolve
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 272, in FeedFetch
    req = crawler.adv_get(url=url, follow=('rss' if not options.items else None), delay=delay, timeout=TIMEOUT * 2)
  File "/usr/lib/python3.8/site-packages/morss/crawler.py", line 92, in adv_get
    con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1379, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name does not resolve>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 252, in cgi_error_handler
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 246, in cgi_dispatcher
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 196, in cgi_file_handler
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 132, in cgi_app
    url, rss = FeedFetch(url, options)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 275, in FeedFetch
    raise MorssException('Error downloading feed')
morss.morss.MorssException: Error downloading feed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 134, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 175, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 266, in cgi_encode
    out = app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 156, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 260, in cgi_error_handler
    log('ERROR: %s' % repr(e), force=True)
TypeError: log() got an unexpected keyword argument 'force'
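
The final TypeError says that the error handler calls log(..., force=True) while the log() that is actually installed has no such parameter. One plausible cause is that modules from two different revisions ended up in the same build. A minimal sketch of a logging helper that accepts the keyword is shown below; the function body and the DEBUG flag are hypothetical stand-ins, not the actual morss code.

```python
import sys

DEBUG = False  # hypothetical module-level switch, e.g. set from an env variable


def log(message, force=False, stream=None):
    """Write `message` to `stream` when debugging is on or `force` is set.

    Returns True when something was written, which keeps the helper testable.
    """
    if not (DEBUG or force):
        return False
    stream = stream if stream is not None else sys.stderr
    stream.write('%s\n' % message)
    return True


# The call from the traceback now works, even with DEBUG disabled.
log('ERROR: %s' % repr(ValueError('demo')), force=True)
```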

Web doesn't work

First I received a 50x error. Now the landing page is displayed, but it can't connect to any feed. Are you working on getting morss.it up and running again?

AttributeError: module 'morss' has no attribute 'main'

Hello,
I tried to install morss on an up-to-date Arch Linux system.

> pip install git+https://git.pictuga.com/pictuga/morss.git@master
Collecting git+https://git.pictuga.com/pictuga/morss.git@master
  Cloning https://git.pictuga.com/pictuga/morss.git (to revision master) to /tmp/pip-req-build-wgrp5hhd
  Running command git clone -q https://git.pictuga.com/pictuga/morss.git /tmp/pip-req-build-wgrp5hhd
Requirement already satisfied (use --upgrade to upgrade): morss==0.0.0 from git+https://git.pictuga.com/pictuga/morss.git@master in /usr/lib/python3.8/site-packages
Requirement already satisfied: lxml in /usr/lib/python3.8/site-packages (from morss==0.0.0) (4.5.2)
Requirement already satisfied: bs4 in /usr/lib/python3.8/site-packages (from morss==0.0.0) (0.0.1)
Requirement already satisfied: python-dateutil in /usr/lib/python3.8/site-packages (from morss==0.0.0) (2.8.1)
Requirement already satisfied: chardet in /usr/lib/python3.8/site-packages (from morss==0.0.0) (3.0.4)
Requirement already satisfied: pymysql in /usr/lib/python3.8/site-packages (from morss==0.0.0) (0.10.0)
Requirement already satisfied: beautifulsoup4 in /usr/lib/python3.8/site-packages (from bs4->morss==0.0.0) (4.9.1)
Requirement already satisfied: six>=1.5 in /usr/lib/python3.8/site-packages (from python-dateutil->morss==0.0.0) (1.15.0)
Requirement already satisfied: soupsieve>1.2 in /usr/lib/python3.8/site-packages (from beautifulsoup4->bs4->morss==0.0.0) (2.0.1)
Building wheels for collected packages: morss
  Building wheel for morss (setup.py) ... done
  Created wheel for morss: filename=morss-0.0.0-py3-none-any.whl size=62552 sha256=6075ad834cfcecdea16f668925aa5f1725db3c1ec27dfd2c67bb740af00426e5
  Stored in directory: /tmp/pip-ephem-wheel-cache-1iodvrzu/wheels/fa/e1/35/7dc2cbdfdaa5b83a83c5ed461628da31febcef62e43ab29823
Successfully built morss

But when I try to use it:

morss --help
Traceback (most recent call last):
  File "/usr/bin/morss", line 33, in <module>
    sys.exit(load_entry_point('morss==0.0.0', 'console_scripts', 'morss')())
  File "/usr/bin/morss", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.8/importlib/metadata.py", line 79, in load
    return functools.reduce(getattr, attrs, module)
AttributeError: module 'morss' has no attribute 'main'

Can you help me?
Thank you very much
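
The last frame of the traceback shows how the generated launcher resolves the entry point: it walks an attribute path with functools.reduce(getattr, attrs, module). The sketch below reproduces that lookup on a stand-in module object (fake_module and resolve are illustrative names) to show that the error only means the morss package being imported has no top-level main attribute at that moment. Since pip reported "Requirement already satisfied", an older installed copy may be what the launcher is importing, but that is an assumption.

```python
import functools
import types


def resolve(module, attr_path):
    """Follow a dotted attribute path on a module, like the launcher does."""
    attrs = [part for part in attr_path.split('.') if part]
    return functools.reduce(getattr, attrs, module)


fake_module = types.ModuleType('morss')  # stand-in for the installed package

try:
    resolve(fake_module, 'main')
except AttributeError as error:
    message = str(error)  # same kind of message as in the traceback above

# Once the module defines the attribute, the lookup succeeds.
fake_module.main = lambda: 0
entry_point = resolve(fake_module, 'main')
```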

ERROR: You must not use 8-bit bytestrings

Hi,

I've tried to install morss but have hit upon an issue when running the test scenario:

MyMachine$ python -m debug http://www/bbc/co/uk/bla/bla.xml
ERROR: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

Stack Overflow suggested adding ".text_factory = str", which I did in the init function of SQLiteCache in crawlers.py, and that appears to have solved my problem. Later comments there suggest this isn't the right solution and that the data should be converted to binary instead.

I'm not sure what's right, but thought I'd flag it up to you. I've not had a chance to use this properly yet, but looks very useful - thanks for sharing it.
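
For reference, the "convert to binary" alternative can be sketched as follows: the raw bytes are bound as a BLOB through sqlite3.Binary instead of being passed as a byte string for a text column. The table layout and URL are made up for the example and are not taken from the morss cache code.

```python
import sqlite3

con = sqlite3.connect(':memory:')
# Hypothetical two-column cache table; `body` is declared as a BLOB.
con.execute('CREATE TABLE data (url TEXT PRIMARY KEY, body BLOB)')

raw = b'caf\xe9 \xff'  # bytes that are not valid UTF-8

# sqlite3.Binary marks the value as binary data, so no text decoding happens.
con.execute('INSERT INTO data VALUES (?, ?)',
            ('http://example.com/feed', sqlite3.Binary(raw)))

row = con.execute('SELECT body FROM data').fetchone()
stored = bytes(row[0])
```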

Cheers,
Iain.

Seems to be a typo

Hi there,

In feeds.py there is an undefined variable "f" around line 320. It seems to be a remnant of some renaming. Here is the whole function:

    def __set__(self, instance, value):
        feedlist = self.__get__(instance)
        [x.remove() for x in [x for x in f.items]]  # f is not defined
        [feedlist.append(x) for x in value]
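
One possible reading, assuming "f" was meant to refer to the list returned by __get__, is sketched below with stand-in classes (FakeFeedList and FakeItem are hypothetical and only mimic the remove/append behaviour the function relies on). The copy made with list() matters: removing items while iterating over the same list would skip elements.

```python
class FakeItem:
    """Stand-in for a feed item that can detach itself from its list."""

    def __init__(self, owner, value):
        self.owner = owner
        self.value = value

    def remove(self):
        self.owner.items.remove(self)


class FakeFeedList:
    """Stand-in for the object returned by the descriptor's __get__."""

    def __init__(self):
        self.items = []

    def append(self, value):
        self.items.append(FakeItem(self, value))

    def __iter__(self):
        return iter(self.items)


def replace_items(feedlist, values):
    # Iterate over a copy so that remove() does not disturb the loop.
    for item in list(feedlist):
        item.remove()
    for value in values:
        feedlist.append(value)
```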

HTTPS feeds can't be processed

Just reporting that this feed is not working:
https://blog.sysaid.com/feed?post_type=sysaid_blog

https://morss.it/https://blog.sysaid.com/feed?post_type=sysaid_blog

Not working on Android Police

I'm trying to use Morss with Android Police, but I get an error saying it couldn't load the feed.

https://www.androidpolice.com/feed

[CRITICAL] WORKER TIMEOUT

Hi,
I managed to get some results by fetching rss via morss.it. It takes a long time but it's working.
The url used:
https://morss.it/:items=||*[class=dataList-cell]||a/https://platinmods.com/advancedsearch/advancedsearch-results?type=thread&keywords=&posted_by=&search_forums[]=156
The problem is that it doesn't work with my own server. After 30 seconds I get:

[2021-01-13 10:50:38 +0000] [30] [INFO] Booting worker with pid: 30
[2021-01-13 10:51:08 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:27)
[2021-01-13 10:51:08 +0000] [27] [INFO] Worker exiting (pid: 27)

Even if I set TIMEOUT to more than 30 seconds, the error still occurs every 30 seconds.

What are the morss.it server parameters?

Here is my docker-compose.yml:

version: "2"
services:
  morss:
    build: /home/xxxxxx/sources/morss
    ports:
      - '9090:8080'
    restart: unless-stopped
    environment:
      - MAX_ITEM=100
      - LIM_TIME=-1
      - MAX_TIME=-1
      - LIM_ITEM=100
      - TIMEOUT=40
      - IGNORE_SSL=1
      - DEBUG=1

Another thing: even with DEBUG=1, my Docker logs don't show any additional information.
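
The 30-second interval matches gunicorn's default worker timeout, which is a separate setting from the TIMEOUT variable passed to the container. Assuming the container starts gunicorn the way the project's Dockerfile does (gunicorn --bind 0.0.0.0:8080 -w 4 --preload morss), one way to raise the limit is a gunicorn configuration file, which is itself a Python file. The values are illustrative; the file would have to be passed with -c, or the equivalent --timeout flag added to the command line.

```python
# gunicorn.conf.py (illustrative values, adjust to the deployment)

bind = '0.0.0.0:8080'   # same address as in the Dockerfile's CMD
workers = 4             # same worker count as `-w 4`
preload_app = True      # equivalent of `--preload`
timeout = 120           # seconds before a silent worker is killed (default: 30)
```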

Error Downloading Feed

I started having issues with a feed (www.accountingtoday.com/feed?rss=true) through morss.it; it results in Invalid SSL Certificate errors. Can you please help?

installation impossible

Hi,

For a good month and a half now I have been trying to install morss (I don't know Python at all, which doesn't help...) without success.

Whatever method I use, I always end up with a "nonexistent app" and/or an import problem.

The latest example (tonight) was following, for the umpteenth time, the tutorial at https://blog.ronsonchan.com/setting-up-morss-full-text-rss-expander-on-debian-wheezy/ (which you highlighted on Twitter).

When I get to this step:
4 - Set up uWSGI
c. Test to see if it works, run uwsgi --http-socket :8080 --wsgi-file morss.py --callable cgi_wrapper
d. Access http://:8080 and you should get the morss default page.

In the terminal, I get:

*** Starting uWSGI 2.0.11.1 (32bit) on [Tue Sep 22 20:23:14 2015] ***
compiled with version: 4.6.3 on 21 September 2015 21:44:50
os: Linux-3.18.11+ #781 PREEMPT Tue Apr 21 18:02:18 BST 2015
nodename: raspberrypi
machine: armv6l
clock source: unix
detected number of CPU cores: 1
current working directory: /usr/share/nginx/www/morss
detected binary path: /usr/share/nginx/www/morss/morss_venv/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** WARNING: you are running uWSGI without its master process manager ***
your processes number limit is 3416
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to TCP address :8080 fd 3
Python version: 2.7.3 (default, Mar 18 2014, 05:13:23)  [GCC 4.6.3]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x1f8d780
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 64256 bytes (62 KB) for 1 cores
*** Operational MODE: single process ***
Traceback (most recent call last):
  File "morss.py", line 15, in <module>
    from . import feeds
ValueError: Attempted relative import in non-package
unable to load app 0 (mountpoint='') (callable not found or import error)
*** no app loaded. going in full dynamic mode ***
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI worker 1 (and the only) (pid: 10738, cores: 1)

In my browser:

Internal Server Error

Then (after the request in the browser) in my terminal:

--- no python application found, check your startup logs for errors ---
[pid: 10738|app: -1|req: -1/3] 192.168.0.254 () {30 vars in 503 bytes} [Tue Sep 22 20:25:15 2015] GET /favicon.ico => generated 21 bytes in 1 msecs (HTTP/1.1 500) 2 headers in 83 bytes (0 switches on core 0)

To begin with, I have trouble understanding the error on line 15: feeds.py is indeed in the same directory as morss.py.

What's wrong?
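
The error on line 15 is not about the location of feeds.py. A relative import such as "from . import feeds" only works when the file is loaded as part of a package, whereas --wsgi-file morss.py loads it as a standalone file. The sketch below builds a throwaway package (demo_pkg is a made-up name) and runs the same file both ways. It targets Python 3, where the message reads "attempted relative import with no known parent package" rather than the Python 2.7 wording in the log above.

```python
import os
import subprocess
import sys
import tempfile

FILES = {
    '__init__.py': '',
    'feeds.py': "NAME = 'feeds'\n",
    'main.py': 'from . import feeds\nprint(feeds.NAME)\n',
}


def run_both_ways():
    """Run main.py once as a plain script and once as a package module."""
    with tempfile.TemporaryDirectory() as root:
        pkg = os.path.join(root, 'demo_pkg')
        os.mkdir(pkg)
        for name, body in FILES.items():
            with open(os.path.join(pkg, name), 'w') as handle:
                handle.write(body)

        # Loaded as a file: the module has no parent package, the import fails.
        as_script = subprocess.run(
            [sys.executable, os.path.join(pkg, 'main.py')],
            cwd=root, capture_output=True, text=True)

        # Loaded through the package: the relative import resolves.
        as_module = subprocess.run(
            [sys.executable, '-m', 'demo_pkg.main'],
            cwd=root, capture_output=True, text=True)
        return as_script, as_module


as_script, as_module = run_both_ways()
```

The same reasoning suggests pointing the server at the package (a module path) instead of at the file, though the exact uWSGI options for that are not covered by this report.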

image for docker compose?

Hello,

I have an existing Docker Compose setup to which I want to add morss as another service.
Is there an image available on either Docker Hub or GitHub?
If I use git.pictuga.com/pictuga/morss.git as is, it obviously gives an error.

dependency missing

When I download and build the Docker container, I see that package 'wheel' is not installed. How can I fix it?

Building morss
Step 1/5 : FROM alpine:latest
---> a24bb4013296
Step 2/5 : RUN apk add --no-cache python3 py3-lxml py3-gunicorn py3-pip git
---> Using cache
---> 5af6122bab90
Step 3/5 : ADD . /app
---> a670f3716eb1
Step 4/5 : RUN pip3 install /app
---> Running in 6b4f859dfe1a
Processing /app
Requirement already satisfied: lxml in /usr/lib/python3.8/site-packages (from morss==0.0.0) (4.5.1)
Collecting bs4
Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting python-dateutil
Downloading python_dateutil-2.8.1-py2.py3-none-any.whl (227 kB)
Requirement already satisfied: chardet in /usr/lib/python3.8/site-packages (from morss==0.0.0) (3.0.4)
Collecting pymysql
Downloading PyMySQL-0.10.1-py2.py3-none-any.whl (47 kB)
Collecting beautifulsoup4
Downloading beautifulsoup4-4.9.3-py3-none-any.whl (115 kB)
Requirement already satisfied: six>=1.5 in /usr/lib/python3.8/site-packages (from python-dateutil->morss==0.0.0) (1.15.0)
Collecting soupsieve>1.2; python_version >= "3.0"
Downloading soupsieve-2.0.1-py3-none-any.whl (32 kB)
Using legacy setup.py install for morss, since package 'wheel' is not installed.
Using legacy setup.py install for bs4, since package 'wheel' is not installed.
Installing collected packages: soupsieve, beautifulsoup4, bs4, python-dateutil, pymysql, morss
Running setup.py install for bs4: started
Running setup.py install for bs4: finished with status 'done'
Running setup.py install for morss: started
Running setup.py install for morss: finished with status 'done'
Successfully installed beautifulsoup4-4.9.3 bs4-0.0.1 morss-0.0.0 pymysql-0.10.1 python-dateutil-2.8.1 soupsieve-2.0.1
Removing intermediate container 6b4f859dfe1a
---> fd192327b049
Step 5/5 : CMD gunicorn --bind 0.0.0.0:8080 -w 4 --preload morss
---> Running in e2194b9749eb
Removing intermediate container e2194b9749eb
---> 724ac9294c12

Successfully built 724ac9294c12

No photo on feed

Hello,
Some feeds come through without photos,
for example: http://www.sudouest.fr/pyrenees-atlantiques/bayonne/rss.xml

MAX_TIME and LIM_TIME

Hi!

I have some feeds that take a long time to fetch (like Google Trends). My ttrss instance has a 15s timeout, and I wanted morss to stop trying to fetch a little before that.

So I tried to set MAX_TIME and LIM_TIME, but I didn't succeed. I don't know the unit (I suppose it is seconds), but I tried 10 for both.

I used environment variables for Docker:

      environment:
        MAX_TIME: 10
        LIM_TIME: 10

Am I doing something wrong?

Favicon not working on docker

Hi,
I can't manage to add a favicon to my morss site. I added favicon.ico in each morss folder to be sure and added this line to index.html header:

<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico" sizes="32x32" />

I tried other things without success.
When I click the favicon link in the page source I get this:

<!-- The above is a description of an error in a Python program, formatted
     for a Web browser because the 'cgitb' module was enabled.  In case you
     are not reading this in a Web browser, here is the original traceback:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 294, in FeedFetch
    req = crawler.adv_get(url=url, follow=('rss' if not options.items else None), delay=delay, timeout=TIMEOUT * 2)
  File "/usr/lib/python3.8/site-packages/morss/crawler.py", line 68, in adv_get
    con = custom_handler(*args, **kwargs).open(url, timeout=timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1379, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.8/urllib/request.py", line 1319, in do_open
    h = http_class(host, timeout=req.timeout, **http_conn_args)
  File "/usr/lib/python3.8/http/client.py", line 835, in __init__
    self._validate_host(self.host)
  File "/usr/lib/python3.8/http/client.py", line 1208, in _validate_host
    raise InvalidURL(f"URL can't contain control characters. {host!r} "
http.client.InvalidURL: URL can't contain control characters. "{% static '" (found at least ' ')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 639, in cgi_error_handler
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 533, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 633, in cgi_dispatcher
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 533, in app_wrap
    return func(environ, start_response, app)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 583, in cgi_file_handler
    return app(environ, start_response)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 509, in cgi_app
    url, rss = FeedFetch(url, options)
  File "/usr/lib/python3.8/site-packages/morss/morss.py", line 297, in FeedFetch
    raise MorssException('Error downloading feed')
morss.morss.MorssException: Error downloading feed

-->

Timezone not correct or missing when using JSON output format

Hi guys!

I have noticed that the time-zone information in JSON output is missing.

RSS/XML output:
<published>2021-04-19T16:30:00+02:00</published>

JSON output:
"time":"2021-04-19T16:30:00Z",

So, the +02:00 offset is missing!

It doesn't matter whether you use the browser or the CLI version ...
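
The difference can be reproduced with the standard library alone. The sketch below uses the timestamp from the report and shows two consistent ways to serialize it (keep the offset, or convert to UTC before appending "Z"), next to the inconsistent form where the local time is printed with a "Z" suffix. It only illustrates the symptom and makes no claim about where this happens inside morss.

```python
from datetime import datetime, timedelta, timezone

# The timestamp from the report: 16:30 local time at UTC+02:00.
local = datetime(2021, 4, 19, 16, 30, tzinfo=timezone(timedelta(hours=2)))

# Option 1: keep the offset, as the RSS/XML output does.
with_offset = local.isoformat()

# Option 2: convert to UTC first, then the "Z" suffix is truthful.
as_utc = local.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')

# What the JSON output looks like: local clock time labelled as UTC.
mislabelled = local.strftime('%Y-%m-%dT%H:%M:%SZ')
```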

broken compatibility with python3

Hey, there is a bug when trying to run morss with python 3, namely, executing python3 main.py "$FEED" I get ERROR: '%' must be followed by '%' or '(', found: '%a, %d %b %Y %H:%M:%S %Z\n%a, %d %b %Y %H:%M:%S %Z\n%Y-%m-%dT%H:%M:%SZ\n%Y-%m-%dT%H:%M:%SZ'.

It seems to be related to the ConfigParser; this might be a related issue: CGATOxford/CGATPipelines#335
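
The message comes from the default value interpolation of Python 3's configparser, which treats "%" as special. A minimal reproduction is sketched below; the section and option names are invented for the example, and only the strftime pattern is taken from the error message. Passing interpolation=None (or escaping each "%" as "%%" in the file) avoids the error.

```python
import configparser

# Hypothetical config snippet holding a strftime pattern.
INI_TEXT = "[rss]\ntimeformat = %a, %d %b %Y %H:%M:%S %Z\n"

# Default parser: reading succeeds, but fetching the value interpolates it.
strict = configparser.ConfigParser()
strict.read_string(INI_TEXT)
try:
    strict.get('rss', 'timeformat')
    error_message = None
except configparser.InterpolationSyntaxError as error:
    error_message = str(error)

# Parser with interpolation disabled: the value is returned untouched.
raw = configparser.ConfigParser(interpolation=None)
raw.read_string(INI_TEXT)
timeformat = raw.get('rss', 'timeformat')
```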

Feed error report

http://morss.it/https://www.aworkoutroutine.com/feed/
http://morss.it/https://bayesianbodybuilding.com/feed/

Memory leak issues

Hi!

I have a big issue with morss: the app behaves as if it has a memory leak.

The only way to handle the problem is to restart the container, which is not very handy :).

I use the container version of morss, building it from https://git.pictuga.com/pictuga/morss.git, without passing any argument (so on gunicorn). I rebuilt it yesterday without any improvement.

To handle this problem, I tried the following, without any improvement:

Use a sqlite cache
Use a mariadb cache
Set some environment variables (MAX_ITEM=20)

I had some problems connecting morss to mariadb, but I think it "works" because I see rows in the data tables:

I had to remove some mariadb options to keep morss from crashing (data length error at start)
I get errors showing that connections are not closed correctly, but without any error in the morss output: "[Warning] Aborted connection 12619 to db: 'XXX' user: 'YYY' host: 'ZZZ' (Got an error reading communication packets)"

I didn't fully understand how some environment variables work (MAX_ITEM & LIM_ITEM); maybe I need to use them?

Thanks for help :).

Articles are not well detected/matched

Hello,

I have 2 websites that provide RSS links:
https://secouchermoinsbete.fr/feeds.atom
https://consomac.fr/rss/consomac.xml

But morss.it doesn't give the full text. Could you check?
Thank you

Provided web UI is too basic

Hi,

I've installed morss using Docker, and when I type this URL: https://www.fcbarcelona.com/en/football/first-team/news I get these errors:

(but it works if I enter a valid RSS feed URL)

Re/Code not parsing?

Just getting empty story bodies when trying to parse Re/Code: http://morss.it/recode.net/feed/

Side note: awesome app, I just found it today and it is working great on several sites for me!

Hacker news full text with comments

The Morss feed for hacker news is really great, however there's no easy way to access the comments.

I know Morss is more of a general rss tool, but any chance you'd consider adding features to combine the full text of hacker news posts and the comments?

invalid syntax

@commit: 5c2151f
Running as CLI, you get this error

root@jolokia:~/git/morss# python2.7 -m morss http://www.***.it/rss/homepage/rss2.0.xml
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/usr/lib/python2.7/runpy.py", line 109, in _get_module_details
    return _get_module_details(pkg_main_name)
  File "/usr/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/usr/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/usr/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/usr/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "morss/__init__.py", line 1, in <module>
    from .morss import *
  File "morss/morss.py", line 202
    'R': ':', 'S': 'www.', , 'T': '#', 'U': '$', 'V': '~', 'W': '!',
                           ^
SyntaxError: invalid syntax

Massimo

Feed is dropping items that are not the same domain

I'm trying to use morss with Lifehacker (https://lifehacker.com/rss), and it works great in combined mode, except all items from domains other than lifehacker.com are stripped out from the feed.

For instance, items with a link to https://lifehacker.com and https://vitals.lifehacker.com are kept, but items linking to https://kinjadeals.theinventory.com or https://theinventory.com are dropped.

Any ideas on how to avoid this?

Embed pictures into fulltext feed?

There is a small problem when using the full-text content for some sites: when we read an article in an RSS reader, the server may block access to the images inside the article. The server checks the request headers to see whether the request comes from a user who is browsing their website. This certainly helps them keep annoying web spiders away, but it is not friendly to us RSS users.
So I'm wondering whether we could embed the pictures from the article into the full-text feed?
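
One way to embed an image directly in the feed text is a base64 data URI placed in the src attribute of the img tag. The helper below is a hypothetical sketch of that encoding step only: fetching the image and rewriting the HTML are left out, and support for data URIs varies between RSS readers.

```python
import base64


def to_data_uri(image_bytes, mime_type='image/png'):
    """Encode raw image bytes as a data URI usable in an <img src="...">."""
    encoded = base64.b64encode(image_bytes).decode('ascii')
    return 'data:%s;base64,%s' % (mime_type, encoded)


# Placeholder bytes stand in for a downloaded image.
example = to_data_uri(b'abc', 'image/png')
```

Note that embedded images make the feed considerably larger, since base64 adds roughly a third to the size of each picture.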

[Feature Request] Link to embed YouTube videos from YouTube RSS

Hi there. I've been using this tool for some time, amazing work, thank you so much.
A few days back, after a lot of frustrations with the YouTube for iOS application I decided to remove all my subscriptions and turn that into RSS feeds.

But since YouTube feeds are crap, I tried using moRSS, but I get empty articles (although with the link to the original video). I think it would be great to put an embed link to the video in the article, so that, for example, some iOS RSS readers would use the native player.

The YouTube feeds come from this base URL:
http://gdata.youtube.com/feeds/base/users/[USERNAME]/uploads?client=ytapi-youtube-rss-redirect&alt=rss&orderby=updated&v=2

And the embed base URL is:
https://www.youtube.com/embed/[VIDEO_ID]
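
The mapping from an item link to the embed URL can be sketched as below, assuming the feed items link to pages of the form https://www.youtube.com/watch?v=[VIDEO_ID] (an assumption, since the report does not show an item). The function name is illustrative.

```python
from urllib.parse import parse_qs, urlparse

EMBED_BASE = 'https://www.youtube.com/embed/'


def embed_url(watch_url):
    """Return the embed URL for a watch link, or None if no video id is found."""
    query = parse_qs(urlparse(watch_url).query)
    video_ids = query.get('v')
    if not video_ids:
        return None
    return EMBED_BASE + video_ids[0]
```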

I will try to hack as soon as possible with your code, but I'm a beginner in Python, and only worked with Python 3 so there are some things that I will have to learn.
Thanks

Running from CLI: relative import

I'm a complete newbie and I'm really sorry if this is a very simple problem to solve, but I'd be grateful for any guidance. I'm trying to get a local RSS to CSV downloader. This is what happens when I run morss from the terminal (Ubuntu 18.10):

python -m morss debug http://feeds.bbci.co.uk/news/rss.xml

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/home/david/programas/morss/morss/morss.py", line 14, in <module>
    from . import feeds
ValueError: Attempted relative import in non-package

All the solutions I find seem to suggest changing the script itself, not anything about the environment or the command.

Thanks!

feature request: forward website captcha (e.g. for fb pages)

I tried getting an rss feed of a public facebook page. (fb.com/pageUsername/posts)

But Facebook requires me to prove that I'm human.
It would be great if Facebook accepted the moRSS interface as a non-robot.
One solution would be to let moRSS access the page through a dummy Facebook profile, as archive.today also does.

[meta] todo-list

List of small bugs / things yet to be implemented

morss/morss.py

limit number of lxml fromstring/tostring calls (slow)
move iTunes spaghetti code outside the main code
only download feed with full Opener (html friendly) if using feedify (mode =xpath)
accept application/xhtml+xml as rss mimetype (kinda wrong but still the case sometimes...)

morss/readabilite.py

make relative url absolute
use custom xpath rule
use config file for site-by-site xpath rules
improve link density thingy

morss/crawler.py

Remove the from_304 crap
double check exec order
append HTTPRedirectHandler by hand (to force exec order)
put cache handler first but w/o the handler_order
put EncodingHandler before HTTPEquiv
check if 301 caching has to be done earlier

morss/feedify.py

do something about FB graph API 2.0 (drop it...?)

Error loading feeds

I'm trying your online tool and it has problems with:

1. http://www.sabercurioso.es/feed/ (Link provided is an HTML page, which doesn't link to a feed)
2. https://www.libretro.com/index.php/feed/ (Error downloading feed)
4. http://feeds.gawker.com/esgizmodo/full

Edit: Removed
3. https://blog.google/rss/ <-- I see that they are now full text, so it doesn't need to go through a full-text RSS tool

Tried to use class without success

Hello, thank you very much for your tool, it is a rare pearl in the web as it is today.

I'm looking for a job, so I want to gather the different sources in my RSS reader so that I don't have to check every website all the time. There are 3 French websites that I have been unable to add:

Apec : https://www.apec.fr/candidat/recherche-emploi.html/emploi?motsCles=data%20scientist&lieux=711&sortsType=DATE
Here, according to Firefox tool's Inspect Element, the interesting class is container-result (the cardboard list)
Pôle Emploi : https://candidat.pole-emploi.fr/offres/recherche?lieux=11R&motsCles=data+scientist&offresPartenaires=false&range=0-19&rayon=10&tri=0
Here, the interesting class is result-list list-unstyled
Welcome to the jungle : https://www.welcometothejungle.com/fr/jobs?query=data%20scientist&page=1&aroundQuery=%C3%8Ele-de-France%2C%20France&refinementList%5Boffice.state%5D%5B0%5D=Ile-de-France&refinementList%5Boffice.country_code%5D%5B0%5D=FR
Here, the interesting class is qjc24y-1 jhGOKF

For each URL, I tried to customize the classes in morss.it but it doesn't work. What am I doing wrong?

Thank you for your precious help!

CLI version doesn't grab all articles, unlike web version

Hello, and sorry if it's a mistake on my part, but when trying to make a rss feed for https://shonumi.github.io/articles.html it only ends up grabbing the first article. I'm using the following command morss --items "//*[class=inner_text_large]" https://shonumi.github.io/articles.html

Using the website and selecting that element selects 5 articles. For some reason the cli version is stopping at the first one.

My version is current as I installed morss today.

SQLite3 Syntax Error

Hi Pictuga.
I know, this is an issue request. But first I wanna thank you for this amazing project! For one of my current projects, it works like a charm and has everything I could have asked for.

Now I have one problem though:
I'm using morss in a Flask web app running on Apache WSGI. If I try using a SQLite database for caching, I receive the following error:

File "/venv/lib/python3.6/site-packages/morss/crawler.py", line 616, in __setitem__ self.con.execute('INSERT INTO data VALUES (?,?,?,?,?,?) ON CONFLICT(url) DO UPDATE SET code=?, msg=?, headers=?, data=?, timestamp=?', (url,) + value + value) sqlite3.OperationalError: near "ON": syntax error
I'm aware that this error is most definitely not caused by morss itself. In fact, I can run the same code from the same venv's BASH and it will work.
Still, I can't get it to work from my flask app (always getting this very same error).

What I have tried:

Temporarily chmod -R 777 the directory containing the .db (after chown'ing it to apache ^^)
Updating the venv to the latest SQLite version
Verifying that all the feeds in question do indeed work
(As I said) running the same code from Bash SUCCESSFULLY

Hope you can help. Would be much appreciated :)
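
The ON CONFLICT ... DO UPDATE clause was added in SQLite 3.24.0, so a syntax error near "ON" usually means the process is linked against an older SQLite library. It is possible (an assumption, not something confirmed in this report) that the Python process under Apache loads a different SQLite library than the shell does. A minimal check that could be run from inside the web app, using only the standard library and a hypothetical helper named supports_upsert:

```python
import sqlite3


def supports_upsert():
    """Return True if the linked SQLite library accepts ON CONFLICT ... DO UPDATE."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE data (url TEXT PRIMARY KEY, code TEXT)")
        con.execute(
            "INSERT INTO data VALUES (?, ?) "
            "ON CONFLICT(url) DO UPDATE SET code=?",
            ("http://example.com", "200", "200"),
        )
        return True
    except sqlite3.OperationalError:
        # Older SQLite libraries reject the upsert clause as a syntax error.
        return False
    finally:
        con.close()


# sqlite_version is the version of the SQLite C library actually loaded,
# which can differ between the shell and the web server process.
print("SQLite library version:", sqlite3.sqlite_version)
print("Upsert supported:", supports_upsert())
```

Comparing the printed version in the shell and in the Flask app would show whether the two environments really use the same library.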

detect whether the full content is already in rss feed

morss will also try to figure out whether the full content is already in place (for those websites which understood the whole point of RSS feeds). However this detection is very simple, and only works if the actual content is put in the "content" section in the feed and not in the "summary" section.

Many RSS feeds put the full content in the "summary" or "description" section. I wonder whether there is a better way to detect it.

Simply setting a content-length threshold does not work, because many feeds have short items.
How about is_fulltext = num_images > 0 or content_length > 2000?
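
A minimal sketch of that heuristic, assuming the item body is available as an HTML string. The class and function names are hypothetical and this is not how morss currently detects full text; it only spells out the proposed rule using the standard library.

```python
from html.parser import HTMLParser


class ContentStats(HTMLParser):
    """Collects the number of <img> tags and the amount of visible text."""

    def __init__(self):
        super().__init__()
        self.num_images = 0
        self.text_length = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.num_images += 1

    def handle_data(self, data):
        self.text_length += len(data.strip())


def is_fulltext(html, min_length=2000):
    """Proposed rule: any image, or more than min_length characters of text."""
    stats = ContentStats()
    stats.feed(html)
    stats.close()  # flush any text still buffered by the parser
    return stats.num_images > 0 or stats.text_length > min_length


print(is_fulltext("<p>Just a teaser...</p>"))
print(is_fulltext('<p>Short, but illustrated <img src="a.png"></p>'))
```

The 2000-character threshold comes from the suggestion above; a summary that embeds a thumbnail would still be classified as full text, so the image test may need refining.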

crawler.py fails on badly formatted (or empty) html pages

https://consortiumnews.com

Can't bypass feed autodetection on webpages to create custom feed

morss is a great tool for the many pages that lack an RSS feed.

But for webpages based on WordPress, it does not parse the sub-page I actually linked (e.g. a category, a tag, or the result page of a search query). Instead it just takes the front page of the domain. It also does not load a preview in which I could pick the CSS elements of the sub-page I linked.

readability

Hi, I am interested in this project. Previously I was looking at something similar (https://bitbucket.org/fivefilters/full-text-rss) but I was glad to discover morss since I prefer Python over PHP.

I noticed that some feeds are more difficult than others, for example on this one:
http://www.internazionale.it/sitemaps/rss.xml
morss does not really extract the content:
http://test.morss.it/www.internazionale.it/sitemaps/rss.xml

What works well is Firefox reading mode. After some research I found that it is based on the old readability.com javascript code, which is now here: https://github.com/mozilla/readability; it can be run standalone from node.js.

Chrome has similar functionality in testing, and the source for that is here: https://github.com/chromium/dom-distiller; this one seems more complex to run as it depends on Java...

What is the best way to incorporate more sophisticated content-extraction algorithms in morss?

And what about customizing the extraction rule on a site-by-site basis? full-text-rss above has a repository of site-specific extraction rules: https://github.com/fivefilters/ftr-site-config
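
To make the site-by-site idea concrete, here is a rough sketch of a rule lookup keyed by host name. Everything in it is hypothetical (the SITE_RULES table, the rule_for helper, the example hosts and XPath expressions); it is neither the ftr-site-config format nor an existing morss feature.

```python
from urllib.parse import urlparse

# Hypothetical per-site rules: host name -> XPath of the article body.
SITE_RULES = {
    "example.com": "//div[@id='article-body']",
    "news.example.org": "//article",
}


def rule_for(url, rules=SITE_RULES):
    """Return the site-specific XPath for a URL, or None if there is none."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # treat www.example.com like example.com
    return rules.get(host)


print(rule_for("https://www.example.com/story/1"))
print(rule_for("http://unknown.test/page"))
```

A None result would mean "no specific rule, fall back to the generic extraction algorithm".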

raggajungle.biz

morss.it says it cannot load this feed: https://www.raggajungle.biz/category/free-downloads/feed

cannot import name 'cred'

After today's update, morss stopped with the following error.
I am running it in Docker.

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/lib/python3.8/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/usr/lib/python3.8/site-packages/morss/__init__.py", line 3, in <module>
    from .wsgi import application
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 20, in <module>
    from . import cred

ImportError: cannot import name 'cred' from partially initialized module 'morss' (most likely due to a circular import) (/usr/lib/python3.8/site-packages/morss/__init__.py)
[2020-08-24 15:12:28 +0000] [8] [INFO] Worker exiting (pid: 8)
[2020-08-24 15:12:28 +0000] [9] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/lib/python3.8/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/usr/lib/python3.8/site-packages/morss/__init__.py", line 3, in <module>
    from .wsgi import application
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 20, in <module>
    from . import cred

ImportError: cannot import name 'cred' from partially initialized module 'morss' (most likely due to a circular import) (/usr/lib/python3.8/site-packages/morss/__init__.py)

[2020-08-24 15:12:28 +0000] [9] [INFO] Worker exiting (pid: 9)
/usr/lib/python3.8/os.py:1023: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  return io.open(fd, *args, **kwargs)
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 210, in run
    self.sleep()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 360, in sleep
    ready = select.select([self.PIPE[0]], [], [], 1.0)
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 245, in handle_chld
    self.reap_workers()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 525, in reap_workers
    raise HaltServer(reason, self.WORKER_BOOT_ERROR)
gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/gunicorn", line 11, in <module>
    load_entry_point('gunicorn==19.9.0', 'console_scripts', 'gunicorn')()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 61, in run
    WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 223, in run
    super(Application, self).run()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 72, in run
    Arbiter(self).run()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 232, in run
    self.halt(reason=inst.reason, exit_status=inst.exit_status)
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 345, in halt
    self.stop()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 393, in stop
    time.sleep(0.1)
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 245, in handle_chld
    self.reap_workers()
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 525, in reap_workers
    raise HaltServer(reason, self.WORKER_BOOT_ERROR)
gunicorn.errors.HaltServer: <HaltServer 'Worker failed to boot.' 3>
[2020-08-24 15:12:28 +0000] [10] [ERROR] Exception in worker process
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/gunicorn/arbiter.py", line 583, in spawn_worker
    worker.init_process()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 129, in init_process
    self.load_wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/workers/base.py", line 138, in load_wsgi
    self.wsgi = self.app.wsgi()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/base.py", line 67, in wsgi
    self.callable = self.load()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 52, in load
    return self.load_wsgiapp()
  File "/usr/lib/python3.8/site-packages/gunicorn/app/wsgiapp.py", line 41, in load_wsgiapp
    return util.import_app(self.app_uri)
  File "/usr/lib/python3.8/site-packages/gunicorn/util.py", line 350, in import_app
    __import__(module)
  File "/usr/lib/python3.8/site-packages/morss/__init__.py", line 3, in <module>
    from .wsgi import application
  File "/usr/lib/python3.8/site-packages/morss/wsgi.py", line 20, in <module>
    from . import cred
ImportError: cannot import name 'cred' from partially initialized module 'morss' (most likely due to a circular import) (/usr/lib/python3.8/site-packages/morss/__init__.py)
[2020-08-24 15:12:28 +0000] [10] [INFO] Worker exiting (pid: 10)

Error downloading feed (Invalid SSL Certificate)

This feed used to work until a couple of days ago when it started returning the error "Error downloading feed (Invalid SSL Certificate)". Without parsing it through morss, the feed loads fine in Firefox.
I guess this is not related to morss but is this certificate check mandatory?

http://morss.it/https://authoritynutrition.com/feed

Thanks for the great project.
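
For context on what making the check optional would involve, here is a generic standard-library sketch of a TLS context with certificate verification switched on or off. The make_context helper and its verify parameter are hypothetical; this does not describe morss's own options or behaviour.

```python
import ssl


def make_context(verify=True):
    """Build a TLS context; verify=False skips certificate validation."""
    ctx = ssl.create_default_context()
    if not verify:
        # check_hostname must be disabled before verify_mode can be CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


strict = make_context()
lenient = make_context(verify=False)
print(strict.verify_mode, lenient.verify_mode)
```

Skipping verification removes the protection against a tampered or intercepted connection, so it is usually better kept as an explicit opt-in than as a default.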

Class picker's xpath sometimes matches duplicates

When I ask morss to create an RSS feed from a sub-page linked in the menu of a Wix-based (wix.com) website, the entry titles come out wrong.

This is how the title of an entry should look:
"Fakten über Pandemie-Impfstoffe"

This is how it actually looks at the moment:
Team DKDE Apr 13 2 Min. Fakten über Pandemie-Impfstoffe Fakt 1.1 "Thiomersal in Impfstoffen Die Pandemieimpfstoffe werden 5 μg bzw. 25 μg Thiomersal (entsprechend 2,5 μg bzw. 12,4 μg Quecksilber) pro Dosis enthalt... 0 Ansichten Kommentar verfassen

The morss class picker is a great tool, and I loved using it:
https://morss.it/:items=%7C%7C*%5Bclass=_2eqGx%5D/https://www.dkde.online/blog

http://morss.it/ returning 403 error

I was trying out your test site, but it returns a "Forbidden" error.

Couldn't load feed

Couldn't load feed https://blog.path.net/.
Please try again later, or report on GitHub.

curl https://blog.path.net/
<html>
<head><title>307 Temporary Redirect</title></head>
<body>
<center><h1>307 Temporary Redirect</h1></center>
<hr><center>openresty</center>
</body>
</html>

It looks like that site has anti-bot protection.

Simplest test give back an error

Hi!
I'd love to try your library, but the simplest example gives me an error. I am trying to use morss as a library with this code:

import morss
xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')

but I get this error:

Traceback (most recent call last):
  File "a.py", line 3, in <module>
    xml_string = morss.process('http://feeds.bbci.co.uk/news/rss.xml')
AttributeError: 'module' object has no attribute 'process'

Can you help me?
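
An AttributeError like this can appear when Python imports a different module than the one expected, for example an older installation or a local file or directory that happens to share the package's name. Whether that is the cause here is not established. A small diagnostic sketch, with a hypothetical helper named describe_module, shows where a module was loaded from and whether it exposes a given attribute:

```python
import importlib


def describe_module(name, attribute):
    """Return (file the module was loaded from, whether it has the attribute)."""
    module = importlib.import_module(name)
    location = getattr(module, "__file__", None)
    return location, hasattr(module, attribute)


# Demonstrated on a standard-library module so the snippet runs anywhere;
# for the report above the call would be describe_module("morss", "process").
location, found = describe_module("json", "dumps")
print(location, found)
```

If the reported location is not the installed morss package, the import is being shadowed.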

ERROR: Link provided is not a valid feed

I am just starting to use morss, and many RSS feeds return "ERROR: Link provided is not a valid feed":

python -m morss debug http://www.megabolsa.com/feed
'random page'
u'text/html'
ERROR: Link provided is not a valid feed

python -m morss debug http://rss.elconfidencial.com/espana/
'random page'
'text/xml, charset=UTF-8'
ERROR: Link provided is not a valid feed

python -m morss debug http://rss.elconfidencial.com/mundo/
'random page'
'text/xml, charset=UTF-8'
ERROR: Link provided is not a valid feed

They seem pretty valid to me and work ok in several RSS readers

Any quick tip about the source of the problem before diving deep into the code?

Thanks for your time
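
One detail in the debug output may be relevant: two of the feeds report the content type as 'text/xml, charset=UTF-8', with a comma where a semicolon normally separates the media type from its parameters. If the header is compared strictly against known feed types, such a value would not match; whether that is what happens inside morss is an assumption. A tolerant way to extract the bare media type could look like this (mime_type is a hypothetical helper):

```python
import re


def mime_type(content_type):
    """Return the bare media type, tolerating ',' used instead of ';'."""
    return re.split(r"[;,]", content_type, maxsplit=1)[0].strip().lower()


print(mime_type("text/xml, charset=UTF-8"))
print(mime_type("application/rss+xml; charset=utf-8"))
```

The first feed in the list reports text/html, so a fix of this kind would not explain that case.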

Trying to add item_date xpath

Hi!
I managed to obtain a nice RSS feed from a search results page, which is a list of files. The problem is that each file has an "added" date that I can't manage to include. I don't know anything about XPath...

Here is the date xpath of the 8th file (tr[8])

:item_time=||html|body|div[9]|main|div|div|section[3]|div|table|tbody|tr[8]|td[5]|div/

How can I write this so that each item has its own date?
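
The absolute path above ends in tr[8], so it always points at the eighth row. The usual XPath approach is to select the rows as items and then evaluate a shorter path relative to each row. The sketch below shows the idea with Python's standard library on a made-up table; the markup and the item_dates helper are hypothetical, and it assumes (without confirmation here) that morss evaluates the item_time rule relative to each item.

```python
import xml.etree.ElementTree as ET

# Made-up table: the date sits in a <div> inside the fifth cell of each row.
table = ET.fromstring("""
<table><tbody>
  <tr><td>a.zip</td><td/><td/><td/><td><div>2020-01-01</div></td></tr>
  <tr><td>b.zip</td><td/><td/><td/><td><div>2020-02-15</div></td></tr>
</tbody></table>
""".strip())


def item_dates(root):
    """For every row (item), read the date via a path relative to that row."""
    dates = []
    for row in root.iter("tr"):
        cell = row.find("td[5]/div")  # relative to the row, no row index
        dates.append(cell.text if cell is not None else None)
    return dates


print(item_dates(table))
```

Under that assumption, the date rule would keep only the part of the path that comes after the row (the fifth cell and its div) instead of the full path from html down to tr[8].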

morss only returns few articles from rss url

Hi,

I am trying to run this URL through morss: http://www.lequipe.fr/Xml/actu_rss.xml
morss.it only returns 9 articles, although the feed contains many more.

Is there a limit on the number of returned items?
