dhvcc / rss-parser


Typed Python RSS parsing module built using xmltodict and pydantic

Home Page: https://dhvcc.github.io/rss-parser/

License: GNU General Public License v3.0

Python 100.00%

rss rss-parser rss-feed-parser rss-feed-scraper typed-python pydantic bs4 python python3 python-3

rss-parser's Introduction

Hello

I'm Alexey, a software developer experienced in full stack web development. Linux enjoyer, Python/JavaScript developer.


rss-parser's People

Contributors

Stargazers

Watchers

Forkers

elreydetoda bbarikl zocker1999net fitmtfk

rss-parser's Issues

Too much data under one exception catch block

Hi.

Nice parser, glad I found that.

I have an idea to improve the code. While parsing my RSS feed, I noticed that the enclosure attribute is None even though the tag is actually present in the tree.

Looking at the code, I saw that the user-defined attributes are placed inside an exception-catching block here.

rss-parser/rss_parser/_parser.py

Line 124 in a13e8fa

try:

I think it's not a good idea to group different tags under one block, because it doesn't work unless all of the tags are present. In my case there is no itunes:image tag, but there is an enclosure.

At a minimum, I suggest moving the processing of the enclosure tag into a separate if statement before the try block, like this:

# Handle enclosure on its own so a missing itunes:image cannot hide it.
if getattr(item, "enclosure", None) is not None:
    item_dict.update(
        {
            "enclosure": {
                "content": "",
                "attrs": {
                    "url": item.enclosure["url"],
                    "length": item.enclosure["length"],
                    "type": item.enclosure["type"],
                },
            }
        }
    )

This is a generic solution because I don't know what the item actually is. As I understand the specification, the enclosure tag is optional, but if it is present, it must contain those three attributes. The presence of fields in the itunes tag seems to be completely user-defined, though.
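The idea of handling each optional tag independently can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than the parser's own internals, and the function name extract_optional is hypothetical, so treat it as an illustration of the approach and not as the library's code.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by the itunes: prefix in podcast feeds.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"


def extract_optional(item: ET.Element) -> dict:
    """Collect optional tags one by one so a missing tag cannot hide another."""
    result = {"enclosure": None, "itunes": None}

    enclosure = item.find("enclosure")
    if enclosure is not None:
        result["enclosure"] = {
            "url": enclosure.get("url"),
            "length": enclosure.get("length"),
            "type": enclosure.get("type"),
        }

    image = item.find(f"{{{ITUNES_NS}}}image")
    if image is not None:
        result["itunes"] = {"href": image.get("href")}

    return result


# An item with an enclosure but no itunes:image, as in the report above.
item = ET.fromstring(
    "<item><title>Episode</title>"
    '<enclosure url="https://example.com/a.mp3" length="123" type="audio/mpeg"/>'
    "</item>"
)
print(extract_optional(item))
```

With this structure the enclosure is still reported when itunes:image is absent, because each tag has its own presence check.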

lxml requirements too strict

Installing rss-parser on Windows uninstalls lxml 4.6.3 and reinstalls 4.5.2 (which leads to codepage errors on my Python 3.9, a rollback to 4.6.3, and an aborted rss-parser install).
Changing the requirement in the wheel to 4.6.3 fixes the problem locally.

Atom Feed Support

NotImplementedError: ATOM feed is not currently supported

Is this currently being worked on?

I understand #1 was opened a while back, but is closed at the moment.

I can create a PR for this, but I am curious where this stands at the moment or if this repo is even active?

guid attribute is not parsed

I have noticed that the guid attribute is not parsed by the parser. I need it because most of the time it contains a unique identifier. For example:

<guid isPermaLink="false">/node/18150</guid>

It would be good if that feature were added to the library. Then I will depend on this library for my projects :)
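Until the library exposes guid, one possible workaround is to read it directly from the XML. The sketch below uses xml.etree.ElementTree from the standard library; parse_guid is a hypothetical helper, not part of rss-parser.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def parse_guid(item: ET.Element) -> Optional[dict]:
    """Return the guid text and its isPermaLink flag, or None if there is no guid."""
    guid = item.find("guid")
    if guid is None:
        return None
    # RSS 2.0 treats isPermaLink as true when the attribute is absent.
    is_permalink = guid.get("isPermaLink", "true").lower() == "true"
    return {"value": (guid.text or "").strip(), "is_permalink": is_permalink}


item = ET.fromstring('<item><guid isPermaLink="false">/node/18150</guid></item>')
print(parse_guid(item))
```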

Crash when there is no description_soup

Hia,

I have been playing around with your rss_parser and I noticed that it crashes when an RSS feed does not have a description element in an item.

Like with this feed: https://www.dutchcowboys.nl/feed/rss.

Is this intentional?

Kind regards,

Erik
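One way to tolerate a missing description is to look up each child element with a default value instead of assuming it exists. This is a minimal sketch with xml.etree.ElementTree; item_summary is a hypothetical name and the code does not reflect how rss-parser is implemented.

```python
import xml.etree.ElementTree as ET


def item_summary(item: ET.Element) -> dict:
    """Read title and description, falling back to an empty string when absent."""
    return {
        "title": item.findtext("title", default=""),
        "description": item.findtext("description", default=""),
    }


# An item without a description element does not raise.
item = ET.fromstring("<item><title>No description here</title></item>")
print(item_summary(item))
```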

Does rss-parser support parsing multiple sources?

Hello. I'm currently building my own RSS parser project with multiple sources (URLs). Does your library support parsing multiple sources?
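Whether or not the library offers this directly, multiple sources can be handled by calling a parser once per feed and collecting the results. The sketch below assumes the feed XML has already been downloaded into strings; parse_many and titles are hypothetical helpers, and titles stands in for whatever per-feed parsing call is actually used.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple


def titles(xml_text: str) -> List[str]:
    """Stand-in parser: return the title of every item in one feed."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


def parse_many(
    sources: Dict[str, str],
    parse: Callable[[str], List[str]] = titles,
) -> Tuple[Dict[str, List[str]], Dict[str, str]]:
    """Parse each source separately so one broken feed does not stop the rest."""
    results: Dict[str, List[str]] = {}
    errors: Dict[str, str] = {}
    for name, xml_text in sources.items():
        try:
            results[name] = parse(xml_text)
        except ET.ParseError as exc:
            errors[name] = str(exc)
    return results, errors


feeds = {
    "good": "<rss><channel><item><title>One</title></item></channel></rss>",
    "broken": "<rss><channel>",
}
results, errors = parse_many(feeds)
print(results, list(errors))
```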

No support for ATOM feed

Tried to parse xml from: https://stackoverflow.com/feeds/tag/jupyter and module failed on trying to get version from RSS.
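A caller can check the root element before handing the document to an RSS-only parser. The following sketch uses xml.etree.ElementTree; detect_feed_type is a hypothetical helper and only distinguishes an rss root from an Atom feed root.

```python
import xml.etree.ElementTree as ET

# Namespace that Atom feeds declare on their root <feed> element.
ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_type(xml_text: str) -> str:
    """Classify a document as 'rss', 'atom', or 'unknown' by its root tag."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "rss"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    return "unknown"


print(detect_feed_type('<rss version="2.0"><channel/></rss>'))
print(detect_feed_type('<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'))
```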

Release rss-parser 0.2.4

It's been a while since rss-parser was considered "feature-complete" and the checks for edge cases in non-conventional RSS were made. Is it possible to have a release in the near future?

Error when parsing rss files

Crash when trying to parse resources like investing.com

Traceback (most recent call last):
  File "/rss_news.py", line 121, in __get_investing_news
    entities = parser.parse()
  File "venv/lib/python3.10/site-packages/rss_parser/_parser.py", line 89, in parse
    "version": main_soup.rss.get("version"),
AttributeError: 'NoneType' object has no attribute 'get'

"https://ru.investing.com/rss/news_1.rss",
"https://ru.investing.com/rss/news_477.rss",
"https://ru.investing.com/rss/news_11.rss",
"https://ru.investing.com/rss/news_95.rss",
"https://ru.investing.com/rss/news_14.rss",
"https://ru.investing.com/rss/news_285.rss",
"https://ru.investing.com/rss/news_25.rss"

python3.10/site-packages/rss_parser/_parser.py", line 88, in parse
    "title": main_soup.title.text,
AttributeError: 'NoneType' object has no attribute 'text'

Link in the text

If the text contains a link wrapped around a keyword, the link disappears and only the keyword remains. Can I do something about it?
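If the raw HTML of the description is available, one possible approach is to convert it to text yourself and keep each link's target next to its keyword. This sketch uses html.parser from the standard library; LinkKeepingText and text_with_links are hypothetical names, and the "keyword (url)" output format is just one choice.

```python
from html.parser import HTMLParser


class LinkKeepingText(HTMLParser):
    """Strip tags but append each link's href after its anchor text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href:
                self.parts.append(f" ({self._href})")
            self._href = None

    def handle_data(self, data):
        self.parts.append(data)


def text_with_links(html: str) -> str:
    parser = LinkKeepingText()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(text_with_links('Read the <a href="https://example.com/doc">docs</a> first.'))
```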

Lib still broken for Python 3.8+

After 699c3aa the following error appears:
TypeError: <class 'rss_parser.models.types.only_list.OnlyList'> is not a generic class

from rss_parser import Parser throws error

Issue

The most basic example code I was running raised ImportError: cannot import name 'GenericModel' from 'pydantic.generics'
I simplified it down to from rss_parser import Parser - what am I doing wrong here?

Installing

A boring install of rss-parser, went fine.

(venv) (base) ck@DESKTOP-G0NAUG1:~$ pip install rss-parser
Collecting rss-parser
  Using cached rss_parser-1.0.0-py3-none-any.whl (24 kB)
Collecting xmltodict<0.14.0,>=0.13.0
  Using cached xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Collecting pydantic>=1.6.1
  Using cached pydantic-2.0.2-py3-none-any.whl (359 kB)
Collecting pytest<8.0.0,>=7.1.2
  Using cached pytest-7.4.0-py3-none-any.whl (323 kB)
Collecting annotated-types>=0.4.0
  Using cached annotated_types-0.5.0-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.6.1
  Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting pydantic-core==2.1.2
  Using cached pydantic_core-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting iniconfig
  Using cached iniconfig-2.0.0-py3-none-any.whl (5.9 kB)
Collecting tomli>=1.0.0
  Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting exceptiongroup>=1.0.0rc8
  Using cached exceptiongroup-1.1.2-py3-none-any.whl (14 kB)
Collecting pluggy<2.0,>=0.12
  Using cached pluggy-1.2.0-py3-none-any.whl (17 kB)
Collecting packaging
  Using cached packaging-23.1-py3-none-any.whl (48 kB)
Installing collected packages: xmltodict, typing-extensions, tomli, pluggy, packaging, iniconfig, exceptiongroup, pytest, pydantic-core, annotated-types, pydantic, rss-parser
Successfully installed annotated-types-0.5.0 exceptiongroup-1.1.2 iniconfig-2.0.0 packaging-23.1 pluggy-1.2.0 pydantic-2.0.2 pydantic-core-2.1.2 pytest-7.4.0 rss-parser-1.0.0 tomli-2.0.1 typing-extensions-4.7.1 xmltodict-0.13.0
WARNING: You are using pip version 22.0.4; however, version 23.1.2 is available.
You should consider upgrading via the '/home/ck/venv/bin/python -m pip install --upgrade pip' command.

Running

Running the most basic example

(venv) (base) ck@DESKTOP-G0NAUG1:~$ python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rss_parser import Parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/__init__.py", line 1, in <module>
    from ._parser import Parser
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/_parser.py", line 6, in <module>
    from rss_parser.models.rss import RSS
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/rss.py", line 6, in <module>
    from rss_parser.models.channel import Channel
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/channel.py", line 6, in <module>
    from rss_parser.models.image import Image
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/image.py", line 4, in <module>
    from rss_parser.models.types.tag import Tag
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/__init__.py", line 2, in <module>
    from rss_parser.models.types.tag import Tag
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/tag.py", line 9, in <module>
    from pydantic.generics import GenericModel
ImportError: cannot import name 'GenericModel' from 'pydantic.generics' (/home/ck/venv/lib/python3.8/site-packages/pydantic/generics.py)
>>>
>>>

Invalid tags parsing

Hello. I have encountered a serious error in the library, which prevents some RSS sources from being parsed correctly. For example, here the tags contain two entries and the parser throws an error when trying to read the elements.

channel -> content -> item -> 9 -> content -> link -> content str type expected (type=type_error.str)

channel -> content -> item -> 0 -> content -> enclosure -> content str type expected (type=type_error.str)

Can you fix it? Thank you!

Parsed return value does not populate enclosures

Hi! I'm trying to parse the following RSS, and need to get out the enclosures from each item. (Would you expect this to work? Or is rss-parser still WIP?)

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/rss.xml</link><description></description><language>en</language><lastBuildDate>Wed, 29 Mar 2023 04:13:56 GMT</lastBuildDate><generator>https://getnikola.com/</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Tesla4 lg</title><link>https://example.com/galleries/demo/tesla4_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate></item><item><title>Tesla conducts lg</title><link>https://example.com/galleries/demo/tesla_conducts_lg.webp</link><enclosure url="https://example.com/galleries/demo/tesla_conducts_lg.webp" length="9620" type="image/webp"></enclosure><guid isPermaLink="false">galleries/demo/tesla_conducts_lg.webp</guid><pubDate>Wed, 01 Jan 2014 00:02:00 GMT</pubDate></item><item><title>Tesla lightning1 lg</title><link>https://example.com/galleries/demo/tesla_lightning1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning1_lg.jpg" length="41123" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:03:00 GMT</pubDate></item><item><title>Tesla lightning2 lg</title><link>https://example.com/galleries/demo/tesla_lightning2_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning2_lg.jpg" length="36994" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning2_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:04:00 GMT</pubDate></item><item><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/tesla_tower1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_tower1_lg.jpg" length="18105" 
type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_tower1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:05:00 GMT</pubDate></item></channel></rss>

Each item in the RSS has an enclosure, for example the first item, prettified:

<item>
	<title>Tesla4 lg</title>
	<link>https://example.com/galleries/demo/tesla4_lg.jpg</link>
	<enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"/>
	<guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid>
	<pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate>
</item>

I call parsed = rss_parser.Parser(xml=content).parse().

The return value looks good, but the enclosures are not populated. Calling 'vars' on it, it looks like:

{'title': 'Tesla tower1 lg', 'version': '2.0', 'language': 'en', 'description': '', 'feed': [FeedItem(title='Tesla4 lg', link='https://example.com/galleries/demo/tesla4_lg.jpg', publish_date='Wed, 01 Jan 2014 00:01:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla conducts lg', link='https://example.com/galleries/demo/tesla_conducts_lg.webp', publish_date='Wed, 01 Jan 2014 00:02:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning1 lg', link='https://example.com/galleries/demo/tesla_lightning1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:03:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning2 lg', link='https://example.com/galleries/demo/tesla_lightning2_lg.jpg', publish_date='Wed, 01 Jan 2014 00:04:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla tower1 lg', link='https://example.com/galleries/demo/tesla_tower1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:05:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={})]}

The returned first item looks like:

FeedItem(
    title='Tesla conducts lg',
    link='https://example.com/galleries/demo/tesla_conducts_lg.webp',
    publish_date='Wed, 01 Jan 2014 00:02:00 GMT',
    category='',
    description='',
    description_links=[],
    description_images=[],
    enclosure=None,
    itunes=None,
    other={}
)

The enclosure isn't populated, even though the enclosure looks valid to me in the first item RSS input.

Digging into the code, I suspect maybe an exception is getting swallowed in parse(), from the code that adds enclosures and itunes to the returned fields:

            try:
                # Add user-defined entries
                item_dict.update({"other": {}})
                for entrie in entries:
                    value = self.get_text(item, entrie)
                    value = re.sub(f"</?{entrie}>", "", value)
                    item_dict["other"].update({entrie: value})

                item_dict.update(
                    {
                        "enclosure": {
                            "content": "",
                            "attrs": {
                                "url": item.enclosure["url"],
                                "length": item.enclosure["length"],
                                "type": item.enclosure["type"],
                            },
                        },
                        "itunes": {
                            "content": "",
                            "attrs": {
                                "href": self.check_none(
                                    item.find("itunes:image"), # -> None
                                    main_soup.find("itunes:image"), # 
                                    "href",
                                    "href",
                                )
                            },
                        },
                    }
                )
            except (TypeError, KeyError, AttributeError):  # <--- exception swallowed here, maybe?
                pass

Sure enough, commenting out that try...except..., to make any raised exception visible, gives:

/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:127: in parse
    for entrie in entries:
/usr/lib/python3.10/typing.py:312: in inner
    return func(*args, **kwds)
/usr/lib/python3.10/typing.py:1143: in __getitem__
    params = tuple(_type_check(p, msg) for p in params)
/usr/lib/python3.10/typing.py:1143: in <genexpr>
    params = tuple(_type_check(p, msg) for p in params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arg = 0, msg = 'Parameters to generic types must be types.', is_argument = True, module = None

    def _type_check(arg, msg, is_argument=True, module=None, *, allow_special_forms=False):
        ...
        if not callable(arg):
>           raise TypeError(f"{msg} Got {arg!r:.100}.")
E           TypeError: Parameters to generic types must be types. Got 0.

This seems to be a result of me not passing a value for the 'entries' parameter to parse:

    def parse(self, entries: Optional[List[str]] = List) -> RSSFeed:

So 'entries' is 'List', which raises the above error if iterated over:

>>> from typing import List
>>> list(List)
Traceback (most recent call last):
...
    raise TypeError(f"{msg} Got {arg!r:.100}.")
TypeError: Parameters to generic types must be types. Got 0.
>>>
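The usual way to express "an optional list that defaults to empty" is to default to None and substitute an empty list inside the function. The sketch below shows that pattern with a hypothetical function, collect_entries; it is not the library's parse method.

```python
from typing import List, Optional


def collect_entries(entries: Optional[List[str]] = None) -> List[str]:
    """Return the requested entry names, treating a missing argument as no entries."""
    if entries is None:
        # A fresh list per call; the typing.List class itself is not iterable data.
        entries = []
    return [entry.strip() for entry in entries]


print(collect_entries())
print(collect_entries([" author ", "guid"]))
```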

I don't want any custom attributes from the RSS, so I try again with an explicit empty list of custom attributes:

    parsed = rss_parser.Parser(xml=content).parse([])

Now we get a different exception (yay, progress! :-)

    def test_gallery_rss(build, output_dir):
        ...
>       parsed = rss_parser.Parser(xml=content).parse([])

tests/integration/test_demo_build.py:48: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:145: in parse
    "href": self.check_none(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = None, default = None, item_dict = 'href', default_dict = 'href'

    @staticmethod
    def check_none(
        item: object,
        default: str,
        item_dict: Optional[str] = None,
        default_dict: Optional[str] = None,
    ) -> Any:
        ...
            if default_dict:
>               return default[default_dict]
E               TypeError: 'NoneType' object is not subscriptable

/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:57: TypeError

So, parse is calling check_none with:

default = None
default_dict = "href"

and we try to return default[default_dict] (ie indexing None with a string), which raises.

Even if parse() conformed to the typing of check_none() parameters, and passed default as a string, we'd still be trying to index a string with a string, which would also raise.

I don't understand what this is meant to be doing. Why does check_none have parameters which are called "xxx_dict", but are typed as strings? Do you have any hints about what I'm doing wrong? Should I keep trying with this? Thanks for any advice.
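The intent appears to be "take an attribute from the item, else from the channel, else nothing". Assuming that reading is right, a None-safe version could look like the sketch below. first_attr is a hypothetical helper that works on plain mappings of attributes (for example, a tag's attribute dictionary); it is not the library's check_none.

```python
from typing import Mapping, Optional


def first_attr(name: str, *candidates: Optional[Mapping[str, str]]) -> Optional[str]:
    """Return name from the first candidate that exists and has it, else None."""
    for candidate in candidates:
        if candidate is not None and name in candidate:
            return candidate[name]
    return None


# Neither the item nor the channel has an image: no exception, just None.
print(first_attr("href", None, None))
# The item has no image, so the channel-level one is used.
print(first_attr("href", None, {"href": "https://example.com/cover.jpg"}))
```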

Support for pydantic V2

There should be support for pydantic V2, which is now out.
Also, I think there should be a graceful fallback for pydantic versions lower than the ones this library currently supports, because the only part of the code that's failing is the magical "Tag" class, which is constructed dynamically. Ideally the library would avoid it, or support a fallback instead of raising exceptions.
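A fallback of this kind usually starts by detecting which major version of pydantic is installed. The sketch below does only that detection step using importlib.metadata from the standard library; major_version and installed_major are hypothetical helpers, and how the library would branch on the result is left open.

```python
from importlib import metadata
from typing import Optional


def major_version(version: str) -> int:
    """Extract the major component from a version string such as '2.0.2'."""
    return int(version.split(".", 1)[0])


def installed_major(package: str = "pydantic") -> Optional[int]:
    """Return the installed major version of a package, or None if it is missing."""
    try:
        return major_version(metadata.version(package))
    except metadata.PackageNotFoundError:
        return None


print(major_version("2.0.2"))
print(installed_major())
```

Code that needs different imports for V1 and V2 could branch on installed_major() instead of failing at import time.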

Security Vulnerability found by Dependabot

So, I was including your project in one of mine (as you already know from #4), and the format I use to manage all the dependencies (pipenv) is one that @github's @dependabot recognizes and scans for vulnerabilities.

So, when I pushed my code to GitHub, I got a notification that they had found a vulnerability in the code for this line. This was the notification:

CVE Links:

So, I plan to create a PR and upgrade it to the minimum version specified in the notice, 4.6.3.

If you don't want to accept the PR, since you made the change for #3 (7311b2d), it is possible for people to override the current version of lxml by adding lxml = ">=4.6.3" to their Pipfile (as indicated in the picture above), but they will have to upgrade to the new version specified in #3 (comment).

Pager

Would be nice to have a -p/--pager argument to output results with pager
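A rough sketch of what such an option could look like, using argparse and pydoc.pager from the standard library. The program name, the help text, and the build_parser and emit functions are assumptions made for illustration; the project's actual command-line interface may be organized differently.

```python
import argparse
import pydoc


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical CLI parser with a -p/--pager switch."""
    parser = argparse.ArgumentParser(prog="rss_parser")
    parser.add_argument(
        "-p",
        "--pager",
        action="store_true",
        help="show the output through a pager instead of printing it",
    )
    return parser


def emit(text: str, use_pager: bool) -> None:
    """Print the text directly, or hand it to the system pager when requested."""
    if use_pager:
        pydoc.pager(text)
    else:
        print(text)


args = build_parser().parse_args(["--pager"])
print(args.pager)
```

In a real entry point, emit(output, args.pager) would be called with the rendered feed text.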