dhvcc / rss-parser

typed python RSS parsing module built using xmltodict and pydantic

Home Page: https://dhvcc.github.io/rss-parser/

License: GNU General Public License v3.0

Python 100.00%
rss rss-parser rss-feed-parser rss-feed-scraper typed-python pydantic bs4 python python3 python-3

rss-parser's Introduction

Hello

I'm Alexey, a software developer experienced in full stack web development.
Linux enjoyer, Python/JavaScript developer.

Setup

Machine: Asus ROG Flow x13 GV302XV
Shell: ZSH
Editor: JetBrains / VSCode / NeoVim
Terminal: Windows Terminal / Kitty
Config: Repo / Desktop

Tech stack

Backend

Python Django FastAPI GraphQL PostgreSQL MySQL

Frontend

TypeScript Next.js React Redux MaterialUI

Deployment and Automation

Docker GNU Make NGINX Ansible AWS GitHub Actions

rss-parser's People

Contributors

bbarikl, ddkasa, dependabot[bot], dhvcc, elreydetoda, renovate-bot, zocker1999net

rss-parser's Issues

Too much data under one exception catch block

Hi.

Nice parser, glad I found that.

I have an idea for improving the code. While parsing my RSS feed, I noticed that the enclosure attribute is None, even though the tag is actually present in the tree.

Looking at the code, I saw that it is handled, along with the user-defined attributes, inside a single try/except block here.

I don't think it's a good idea to handle different tags in one block, because it only works when all of the tags are present. In my case there is no itunes:image tag, but there is an enclosure.

At a minimum, I suggest moving the processing of the enclosure tag into a separate if statement before the try block, like this:

# getattr with a default copes with both a missing attribute and a None value
enclosure = getattr(item, "enclosure", None)
if enclosure is not None:
    item_dict.update(
        {
            "enclosure": {
                "content": "",
                "attrs": {
                    "url": enclosure["url"],
                    "length": enclosure["length"],
                    "type": enclosure["type"],
                },
            }
        }
    )

This is only a general sketch, because I don't know what item actually is. As I understand the specification, the enclosure tag is optional, but if it is present it must contain those three attributes. Which fields are present in the itunes tag seems to be entirely up to the feed author, though.
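To illustrate the suggestion, here is a minimal sketch that reads each optional tag independently, so a missing itunes:image cannot hide a valid enclosure. It uses the standard library's xml.etree.ElementTree rather than the library's own internals; the function name extract_optional_tags and the sample item are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by iTunes podcast tags.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"


def extract_optional_tags(item):
    """Read each optional tag on its own (hypothetical helper)."""
    result = {"enclosure": None, "itunes": None}

    enclosure = item.find("enclosure")
    if enclosure is not None:
        result["enclosure"] = {
            "content": "",
            "attrs": {
                "url": enclosure.get("url"),
                "length": enclosure.get("length"),
                "type": enclosure.get("type"),
            },
        }

    image = item.find(f"{{{ITUNES_NS}}}image")
    if image is not None:
        result["itunes"] = {"content": "", "attrs": {"href": image.get("href")}}

    return result


# Sample item with an enclosure but no itunes:image tag.
item = ET.fromstring(
    "<item><title>Episode</title>"
    '<enclosure url="https://example.com/a.mp3" length="123" type="audio/mpeg"/>'
    "</item>"
)
tags = extract_optional_tags(item)
print(tags["enclosure"]["attrs"]["url"])
print(tags["itunes"])
```

Because each tag has its own check, no exception handler is needed for the "tag is absent" case at all.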

lxml requirements too strict

Installing rss-parser on Windows uninstalls lxml 4.6.3 and reinstalls 4.5.2. On my Python 3.9 this leads to codepage errors, a rollback to 4.6.3, and an aborted rss-parser install.
Changing the requirement in the wheel to 4.6.3 fixes the problem locally.

Atom Feed Support

NotImplementedError: ATOM feed is not currently supported

Is this currently being worked on?

I understand #1 was opened a while back, but is closed at the moment.

I can create a PR for this, but I am curious where this stands at the moment or if this repo is even active?

guid attribute is not parsed

Hi

I have spotted that the guid element is not parsed by the parser. I need it because most of the time it contains a unique identifier. For example:

<guid isPermaLink="false">/node/18150</guid>

It would be good if that feature were added to the library. Then I could depend on this library for my projects :)
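As a rough illustration of the request, the sketch below reads a guid element and its isPermaLink attribute with the standard library's xml.etree.ElementTree. It is not the library's API; parse_guid and the returned dictionary shape are assumptions for this example. The RSS 2.0 specification treats isPermaLink as optional with a default of true, which the sketch follows.

```python
import xml.etree.ElementTree as ET


def parse_guid(item):
    """Return the guid text and permalink flag, or None if the tag is absent."""
    guid = item.find("guid")
    if guid is None:
        return None
    # isPermaLink is optional in RSS 2.0 and defaults to "true".
    is_permalink = guid.get("isPermaLink", "true").lower() == "true"
    return {"content": (guid.text or "").strip(), "is_permalink": is_permalink}


item = ET.fromstring('<item><guid isPermaLink="false">/node/18150</guid></item>')
guid = parse_guid(item)
print(guid)
```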

No support for ATOM feed

Tried to parse the XML from https://stackoverflow.com/feeds/tag/jupyter, and the module failed while trying to get the version from the RSS.
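One possible way to tell the two formats apart before parsing is to look at the root element. The following is a minimal sketch using the standard library, not a description of how rss-parser works; detect_feed_kind and the two sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace of Atom 1.0 documents.
ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_kind(xml_text):
    """Classify a document as 'rss', 'atom' or 'unknown' by its root tag."""
    root = ET.fromstring(xml_text)
    if root.tag == "rss":
        return "rss"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    return "unknown"


rss_doc = '<rss version="2.0"><channel><title>t</title></channel></rss>'
atom_doc = '<feed xmlns="http://www.w3.org/2005/Atom"><title>t</title></feed>'
print(detect_feed_kind(rss_doc), detect_feed_kind(atom_doc))
```

A check like this would let a caller report an unsupported Atom feed explicitly instead of failing while reading the RSS version.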

Release rss-parser 0.2.4

It's been a while since rss-parser has been considered "feature-complete" and the checks for edge cases in non-conventional RSS have been made. Is it possible to have a release in the near future?

Error when parsing rss files

The parser crashes when trying to parse feeds from resources like investing.com

Traceback (most recent call last):
  File "/rss_news.py", line 121, in __get_investing_news
    entities = parser.parse()
  File "venv/lib/python3.10/site-packages/rss_parser/_parser.py", line 89, in parse
    "version": main_soup.rss.get("version"),
AttributeError: 'NoneType' object has no attribute 'get'

"https://ru.investing.com/rss/news_1.rss",
"https://ru.investing.com/rss/news_477.rss",
"https://ru.investing.com/rss/news_11.rss",
"https://ru.investing.com/rss/news_95.rss",
"https://ru.investing.com/rss/news_14.rss",
"https://ru.investing.com/rss/news_285.rss",
"https://ru.investing.com/rss/news_25.rss"

python3.10/site-packages/rss_parser/_parser.py", line 88, in parse
    "title": main_soup.title.text,
AttributeError: 'NoneType' object has no attribute 'text'
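Both tracebacks come from accessing an attribute on a lookup that returned None. Why these particular feeds produce that result is not established here. As a sketch of a more defensive approach, the code below checks each lookup and raises a descriptive error or falls back to a default. It uses xml.etree.ElementTree from the standard library; read_feed_header is a hypothetical name, not part of rss-parser.

```python
import xml.etree.ElementTree as ET


def read_feed_header(xml_text):
    """Read title and version, failing with a clear message instead of AttributeError."""
    root = ET.fromstring(xml_text)
    if root.tag != "rss":
        raise ValueError(f"expected <rss> root element, got <{root.tag}>")
    channel = root.find("channel")
    if channel is None:
        raise ValueError("<rss> element has no <channel> child")
    return {
        # findtext returns the default when <title> is missing.
        "title": channel.findtext("title", default=""),
        "version": root.get("version"),
    }


header = read_feed_header('<rss version="2.0"><channel></channel></rss>')
print(header)
```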

Link in the text

If the text contains a link wrapped around a keyword, the link disappears and only the keyword remains. Can I do something about it?
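One workaround is to post-process the raw HTML of the description yourself. The sketch below uses the standard library's html.parser to keep each link target next to its keyword. It assumes you have access to the unstripped HTML; LinkKeepingTextExtractor and text_with_links are names made up for this example.

```python
from html.parser import HTMLParser


class LinkKeepingTextExtractor(HTMLParser):
    """Collect text and append each link's href after its anchor text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._href_stack = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href_stack.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            href = self._href_stack.pop()
            if href:
                self.parts.append(f" ({href})")

    def handle_data(self, data):
        self.parts.append(data)


def text_with_links(html):
    extractor = LinkKeepingTextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any buffered text
    return "".join(extractor.parts)


text = text_with_links('Read the <a href="https://example.com/doc">documentation</a> first.')
print(text)  # Read the documentation (https://example.com/doc) first.
```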

from rss_parser import Parser throws error

Issue

The most basic example code I was running raised ImportError: cannot import name 'GenericModel' from 'pydantic.generics'.
I simplified it down to from rss_parser import Parser. What am I doing wrong here?

Installing

A boring install of rss-parser; it went fine.

(venv) (base) ck@DESKTOP-G0NAUG1:~$ pip install rss-parser
Collecting rss-parser
  Using cached rss_parser-1.0.0-py3-none-any.whl (24 kB)
Collecting xmltodict<0.14.0,>=0.13.0
  Using cached xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Collecting pydantic>=1.6.1
  Using cached pydantic-2.0.2-py3-none-any.whl (359 kB)
Collecting pytest<8.0.0,>=7.1.2
  Using cached pytest-7.4.0-py3-none-any.whl (323 kB)
Collecting annotated-types>=0.4.0
  Using cached annotated_types-0.5.0-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.6.1
  Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting pydantic-core==2.1.2
  Using cached pydantic_core-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting iniconfig
  Using cached iniconfig-2.0.0-py3-none-any.whl (5.9 kB)
Collecting tomli>=1.0.0
  Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting exceptiongroup>=1.0.0rc8
  Using cached exceptiongroup-1.1.2-py3-none-any.whl (14 kB)
Collecting pluggy<2.0,>=0.12
  Using cached pluggy-1.2.0-py3-none-any.whl (17 kB)
Collecting packaging
  Using cached packaging-23.1-py3-none-any.whl (48 kB)
Installing collected packages: xmltodict, typing-extensions, tomli, pluggy, packaging, iniconfig, exceptiongroup, pytest, pydantic-core, annotated-types, pydantic, rss-parser
Successfully installed annotated-types-0.5.0 exceptiongroup-1.1.2 iniconfig-2.0.0 packaging-23.1 pluggy-1.2.0 pydantic-2.0.2 pydantic-core-2.1.2 pytest-7.4.0 rss-parser-1.0.0 tomli-2.0.1 typing-extensions-4.7.1 xmltodict-0.13.0
WARNING: You are using pip version 22.0.4; however, version 23.1.2 is available.
You should consider upgrading via the '/home/ck/venv/bin/python -m pip install --upgrade pip' command.

Running

Running the most basic example

(venv) (base) ck@DESKTOP-G0NAUG1:~$ python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rss_parser import Parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/__init__.py", line 1, in <module>
    from ._parser import Parser
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/_parser.py", line 6, in <module>
    from rss_parser.models.rss import RSS
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/rss.py", line 6, in <module>
    from rss_parser.models.channel import Channel
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/channel.py", line 6, in <module>
    from rss_parser.models.image import Image
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/image.py", line 4, in <module>
    from rss_parser.models.types.tag import Tag
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/__init__.py", line 2, in <module>
    from rss_parser.models.types.tag import Tag
  File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/tag.py", line 9, in <module>
    from pydantic.generics import GenericModel
ImportError: cannot import name 'GenericModel' from 'pydantic.generics' (/home/ck/venv/lib/python3.8/site-packages/pydantic/generics.py)
>>>
>>>
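The install log shows pydantic 2.0.2 being selected for the requirement pydantic>=1.6.1, and the failing import is in pydantic.generics. Assuming rss-parser 1.0.0 targets the pydantic 1.x API, a quick check of the installed major version can confirm the mismatch. The helper names below are made up for this sketch, and pinning pydantic below 2 is only one possible workaround.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '2.0.2'."""
    return int(version_string.split(".")[0])


def installed_pydantic_major():
    """Return the installed pydantic major version, or None if it is absent."""
    try:
        return major_version(metadata.version("pydantic"))
    except metadata.PackageNotFoundError:
        return None


major = installed_pydantic_major()
if major is None:
    print("pydantic is not installed")
elif major >= 2:
    print('pydantic 2.x detected; pip install "pydantic<2" is one possible workaround')
else:
    print("pydantic 1.x detected")
```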

Invalid tags parsing

Hello. I have encountered a serious error in the library that prevents some RSS sources from being parsed correctly. For example, here the tags contain two entries, and the parser throws an error when trying to read the elements.

channel -> content -> item -> 9 -> content -> link -> content
  str type expected (type=type_error.str)

channel -> content -> item -> 0 -> content -> enclosure -> content
  str type expected (type=type_error.str)

Can you fix it? Thank you!
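xmltodict represents a tag that occurs more than once under the same parent as a list, which would explain a "str type expected" error on a field typed as a single string. As a possible stopgap, the parsed dictionary could be normalized before validation. This is a sketch on plain dictionaries; first_if_list and the sample item are assumptions, and keeping only the first entry discards the others.

```python
def first_if_list(value):
    """Collapse a repeated tag (parsed as a list) to its first entry."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


# Shape of an item whose <link> tag appears twice in the source XML.
item = {"title": "Post", "link": ["https://example.com/a", "https://example.com/b"]}
normalized = {key: first_if_list(value) for key, value in item.items()}
print(normalized)
```

A lossless alternative would be to type such fields as lists in the models, at the cost of a different return shape.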

Parsed return value does not populate enclosures

Hi! I'm trying to parse the following RSS, and need to get out the enclosures from each item. (Would you expect this to work? Or is rss-parser still WIP?)

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/rss.xml</link><description></description><language>en</language><lastBuildDate>Wed, 29 Mar 2023 04:13:56 GMT</lastBuildDate><generator>https://getnikola.com/</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Tesla4 lg</title><link>https://example.com/galleries/demo/tesla4_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate></item><item><title>Tesla conducts lg</title><link>https://example.com/galleries/demo/tesla_conducts_lg.webp</link><enclosure url="https://example.com/galleries/demo/tesla_conducts_lg.webp" length="9620" type="image/webp"></enclosure><guid isPermaLink="false">galleries/demo/tesla_conducts_lg.webp</guid><pubDate>Wed, 01 Jan 2014 00:02:00 GMT</pubDate></item><item><title>Tesla lightning1 lg</title><link>https://example.com/galleries/demo/tesla_lightning1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning1_lg.jpg" length="41123" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:03:00 GMT</pubDate></item><item><title>Tesla lightning2 lg</title><link>https://example.com/galleries/demo/tesla_lightning2_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning2_lg.jpg" length="36994" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning2_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:04:00 GMT</pubDate></item><item><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/tesla_tower1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_tower1_lg.jpg" length="18105" 
type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_tower1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:05:00 GMT</pubDate></item></channel></rss>

Each item in the RSS has an enclosure, for example the first item, prettified:

<item>
	<title>Tesla4 lg</title>
	<link>https://example.com/galleries/demo/tesla4_lg.jpg</link>
	<enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"/>
	<guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid>
	<pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate>
</item>

I call parsed = rss_parser.Parser(xml=content).parse().

The return value looks good, but the enclosures are not populated. Calling 'vars' on it, it looks like:

{'title': 'Tesla tower1 lg', 'version': '2.0', 'language': 'en', 'description': '', 'feed': [FeedItem(title='Tesla4 lg', link='https://example.com/galleries/demo/tesla4_lg.jpg', publish_date='Wed, 01 Jan 2014 00:01:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla conducts lg', link='https://example.com/galleries/demo/tesla_conducts_lg.webp', publish_date='Wed, 01 Jan 2014 00:02:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning1 lg', link='https://example.com/galleries/demo/tesla_lightning1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:03:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning2 lg', link='https://example.com/galleries/demo/tesla_lightning2_lg.jpg', publish_date='Wed, 01 Jan 2014 00:04:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla tower1 lg', link='https://example.com/galleries/demo/tesla_tower1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:05:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={})]}

The returned first item looks like:

FeedItem(
    title='Tesla conducts lg',
    link='https://example.com/galleries/demo/tesla_conducts_lg.webp',
    publish_date='Wed, 01 Jan 2014 00:02:00 GMT',
    category='',
    description='',
    description_links=[],
    description_images=[],
    enclosure=None,
    itunes=None,
    other={}
)

The enclosure isn't populated, even though the enclosure looks valid to me in the first item RSS input.

Digging into the code, I suspect maybe an exception is getting swallowed in parse(), from the code that adds enclosures and itunes to the returned fields:

            try:
                # Add user-defined entries
                item_dict.update({"other": {}})
                for entrie in entries:
                    value = self.get_text(item, entrie)
                    value = re.sub(f"</?{entrie}>", "", value)
                    item_dict["other"].update({entrie: value})

                item_dict.update(
                    {
                        "enclosure": {
                            "content": "",
                            "attrs": {
                                "url": item.enclosure["url"],
                                "length": item.enclosure["length"],
                                "type": item.enclosure["type"],
                            },
                        },
                        "itunes": {
                            "content": "",
                            "attrs": {
                                "href": self.check_none(
                                    item.find("itunes:image"), # -> None
                                    main_soup.find("itunes:image"), # 
                                    "href",
                                    "href",
                                )
                            },
                        },
                    }
                )
            except (TypeError, KeyError, AttributeError):  # <--- exception swallowed here, maybe?
                pass

Sure enough, commenting out that try...except..., to make any raised exception visible, gives:

/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:127: in parse
    for entrie in entries:
/usr/lib/python3.10/typing.py:312: in inner
    return func(*args, **kwds)
/usr/lib/python3.10/typing.py:1143: in __getitem__
    params = tuple(_type_check(p, msg) for p in params)
/usr/lib/python3.10/typing.py:1143: in <genexpr>
    params = tuple(_type_check(p, msg) for p in params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arg = 0, msg = 'Parameters to generic types must be types.', is_argument = True, module = None

    def _type_check(arg, msg, is_argument=True, module=None, *, allow_special_forms=False):
        ...
        if not callable(arg):
>           raise TypeError(f"{msg} Got {arg!r:.100}.")
E           TypeError: Parameters to generic types must be types. Got 0.

This seems to be a result of me not passing a value for the 'entries' parameter to parse:

    def parse(self, entries: Optional[List[str]] = List) -> RSSFeed:

So 'entries' is 'List', which raises the above error if iterated over:

>>> from typing import List
>>> list(List)
Traceback (most recent call last):
...
    raise TypeError(f"{msg} Got {arg!r:.100}.")
TypeError: Parameters to generic types must be types. Got 0.
>>> 
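The conventional pattern for an optional list parameter is a None default that is replaced by an empty list inside the function. The sketch below shows that pattern on a stand-in function; this parse is a hypothetical free function for illustration, not the library's method.

```python
from typing import List, Optional


def parse(entries: Optional[List[str]] = None) -> List[str]:
    """Stand-in for Parser.parse that only shows the default handling."""
    if entries is None:
        entries = []  # a fresh list per call, safe to iterate
    return [entry for entry in entries]


print(parse())          # []
print(parse(["guid"]))  # ['guid']
```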

I don't want any custom attributes from the RSS, so I try again with an explicit empty list of custom attributes:

    parsed = rss_parser.Parser(xml=content).parse([])

Now we get a different exception (yay, progress! :-)

    def test_gallery_rss(build, output_dir):
        ...
>       parsed = rss_parser.Parser(xml=content).parse([])

tests/integration/test_demo_build.py:48: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:145: in parse
    "href": self.check_none(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

item = None, default = None, item_dict = 'href', default_dict = 'href'

    @staticmethod
    def check_none(
        item: object,
        default: str,
        item_dict: Optional[str] = None,
        default_dict: Optional[str] = None,
    ) -> Any:
        ...
            if default_dict:
>               return default[default_dict]
E               TypeError: 'NoneType' object is not subscriptable

/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:57: TypeError

So, parse is calling check_none with:

  • default = None
  • default_dict = "href"

and we try to return default[default_dict] (i.e. indexing None with a string), which raises.

Even if parse() conformed to the typing of check_none() parameters, and passed default as a string, we'd still be trying to index a string with a string, which would also raise.

I don't understand what this is meant to be doing. Why does check_none have parameters which are called "xxx_dict", but are typed as strings? Do you have any hints about what I'm doing wrong? Should I keep trying with this? Thanks for any advice.
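For comparison, here is a sketch of a helper that tolerates None for both the item and the fallback. It is not the library's check_none; first_attr is a hypothetical name, and it assumes each candidate is either None or exposes a dict-style .get(key) method.

```python
from typing import Any, Optional


def first_attr(key: str, *candidates: Optional[Any]) -> Any:
    """Return the first non-None value of key among the candidates, else None."""
    for candidate in candidates:
        if candidate is None:
            continue
        value = candidate.get(key)
        if value is not None:
            return value
    return None


# Neither the item nor the channel has an itunes:image tag.
print(first_attr("href", None, None))
# Only the channel-level fallback has one.
print(first_attr("href", None, {"href": "https://example.com/cover.jpg"}))
```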

Support for pydantic V2

Pydantic V2 is now out, and the library should support it.
Also, I think there should be a graceful fallback for pydantic versions lower than the ones this library currently supports, because the only part of the code that fails is the magical "Tag" class, which is constructed dynamically. Ideally the library would either not use it or provide a fallback instead of raising exceptions.
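One way such a fallback could look is an import that tries more than one location. The sketch below assumes pydantic 2.x keeps shipping its pydantic.v1 compatibility namespace and that older 1.x releases expose GenericModel in pydantic.generics; load_generic_model is a made-up helper, and it returns None when pydantic is not installed at all.

```python
def load_generic_model():
    """Return a GenericModel class from whichever pydantic layout is installed."""
    try:
        # Compatibility namespace shipped with pydantic 2.x.
        from pydantic.v1.generics import GenericModel
    except ImportError:
        try:
            # Original location in pydantic 1.x.
            from pydantic.generics import GenericModel
        except ImportError:
            return None
    return GenericModel


generic_model = load_generic_model()
print("GenericModel available:", generic_model is not None)
```

This only keeps the 1.x-style API importable; a native port to the V2 API would be a separate change.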

Security Vulnerability found by Dependabot

So, I was including your project in one of mine (as you already know from #4), and I manage all my dependencies with pipenv, a format that @github's @dependabot recognizes and scans for vulnerabilities.

So, when I pushed my code to GitHub, I got a notification that a vulnerability had been found in the code for this line. This was the notification:

[image: Dependabot vulnerability notification]

CVE Links:

So, I plan to create a PR that upgrades it to the minimum version specified in the notice, 4.6.3.

If you don't want to accept the PR: since you made the change for #3 (7311b2d), people can override the current version of lxml by adding lxml = ">=4.6.3" to their Pipfile (as indicated in the picture above), but they will have to upgrade to the new version specified in #3 (comment).

Pager

It would be nice to have a -p/--pager argument to show the output in a pager.
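A minimal sketch of what such a flag could look like, using argparse and pydoc.pager from the standard library. The program name, build_arg_parser and show are assumptions for this example and do not describe an existing command-line interface.

```python
import argparse
import pydoc


def build_arg_parser():
    """Build a parser with a hypothetical -p/--pager switch."""
    parser = argparse.ArgumentParser(prog="rss_parser")
    parser.add_argument(
        "-p", "--pager", action="store_true", help="show the output in a pager"
    )
    return parser


def show(text, use_pager):
    """Send text to the system pager when requested, else print it."""
    if use_pager:
        pydoc.pager(text)
    else:
        print(text)


args = build_arg_parser().parse_args([])
show("feed title: example", args.pager)
```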
