I'm Alexey, a software developer experienced in full stack web development.
Linux enjoyer, Python/JavaScript developer.
typed python RSS parsing module built using xmltodict and pydantic
Home Page: https://dhvcc.github.io/rss-parser/
License: GNU General Public License v3.0
Hi.
Nice parser, glad I found it.
I have an idea to improve the code. While parsing my RSS feed, I noticed that the enclosure attribute is None, even though it is actually present in the tree.
Looking at the code, I saw that the user-defined attributes are processed inside the same try/except block here:
rss-parser/rss_parser/_parser.py
Line 124 in a13e8fa
I don't think it's a good idea to handle different tags in one block, because it doesn't work unless all of the tags are present. In my case there is no itunes:image tag, but enclosure is.
At a minimum, I suggest moving the processing of the enclosure tag into a separate if statement before the try/except block, like this:
if hasattr(item, "enclosure"):
    item_dict.update(
        {
            "enclosure": {
                "content": "",
                "attrs": {
                    "url": item.enclosure["url"],
                    "length": item.enclosure["length"],
                    "type": item.enclosure["type"],
                },
            }
        }
    )
It is a general solution, because I don't know what the item actually is. As I understand from the specification, the enclosure tag is optional, but if it is present, it must contain those three fields. The fields present in the itunes tag seem to be completely user-defined, though.
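The suggestion above (handle each optional tag on its own, so that a missing itunes:image does not discard a valid enclosure) can be sketched as follows. This is only an illustration using the standard library's xml.etree.ElementTree, not the library's actual code; the function name parse_optional_tags and the sample item are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI commonly used for itunes:* tags in podcast feeds.
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"

# Hypothetical sample: an item with an enclosure but no itunes:image.
ITEM_XML = f"""
<item xmlns:itunes="{ITUNES_NS}">
  <title>Episode 1</title>
  <enclosure url="https://example.com/ep1.mp3" length="1234" type="audio/mpeg"/>
</item>
"""


def parse_optional_tags(item):
    """Collect enclosure and itunes:image independently of each other."""
    result = {"enclosure": None, "itunes": None}

    # Each optional tag gets its own check, so one missing tag
    # cannot prevent the other from being recorded.
    enclosure = item.find("enclosure")
    if enclosure is not None:
        result["enclosure"] = {
            "url": enclosure.get("url"),
            "length": enclosure.get("length"),
            "type": enclosure.get("type"),
        }

    image = item.find(f"{{{ITUNES_NS}}}image")
    if image is not None:
        result["itunes"] = {"href": image.get("href")}

    return result


item = ET.fromstring(ITEM_XML)
parsed = parse_optional_tags(item)
print(parsed)
```

With this layout the enclosure is kept even though the item has no itunes:image tag.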
Installing rss-parser on Windows uninstalls lxml 4.6.3 and reinstalls 4.5.2, which leads to codepage errors on my Python 3.9, a rollback to 4.6.3, and an aborted rss-parser install.
Changing the requirement in the wheel to 4.6.3 fixes the problem locally.
NotImplementedError: ATOM feed is not currently supported
Is this currently being worked on?
I understand #1 was opened a while back, but is closed at the moment.
I can create a PR for this, but I am curious where this stands at the moment or if this repo is even active?
Hi
I have spotted that the guid attribute is not parsed by the parser. I need it because most of the time it contains a unique identifier. For example:
<guid isPermaLink="false">/node/18150</guid>
It would be good if that feature were added to the library. Then I will depend on this library for my projects :)
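As a rough sketch of what reading the guid could look like, here is a standalone example based on xml.etree.ElementTree. It does not use rss-parser's API; read_guid and the sample item are hypothetical names for illustration. The default for a missing isPermaLink attribute follows the RSS 2.0 specification, which treats it as true.

```python
import xml.etree.ElementTree as ET

# Hypothetical item built around the guid from the report above.
ITEM_XML = """
<item>
  <title>Example</title>
  <guid isPermaLink="false">/node/18150</guid>
</item>
"""


def read_guid(item):
    """Return the guid text and its isPermaLink flag, or None if absent."""
    guid = item.find("guid")
    if guid is None:
        return None
    # RSS 2.0 treats a missing isPermaLink attribute as "true".
    is_permalink = guid.get("isPermaLink", "true").lower() == "true"
    return {"value": (guid.text or "").strip(), "is_permalink": is_permalink}


guid_info = read_guid(ET.fromstring(ITEM_XML))
print(guid_info)
```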
Hia,
I have been playing around with your rss_parser and I noticed that when an RSS feed does not have a description element in an item, it crashes.
Like with this feed: https://www.dutchcowboys.nl/feed/rss.
Is this intentional?
Kind regards,
Erik
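One way to tolerate items without a description is to read every child element with a default value. The sketch below shows the idea with xml.etree.ElementTree; it is not how rss-parser is implemented, and item_summary is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical item that has no <description> element.
ITEM_XML = """
<item>
  <title>No description here</title>
  <link>https://example.com/post</link>
</item>
"""


def item_summary(item):
    """Read common item fields, falling back to "" when one is missing."""
    return {
        "title": item.findtext("title", default=""),
        "link": item.findtext("link", default=""),
        "description": item.findtext("description", default=""),
    }


summary = item_summary(ET.fromstring(ITEM_XML))
print(summary)
```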
Hello. I'm currently building my own RSS parser project with MULTIPLE sources (URLs). Does your library support parsing multiple sources?
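A parser that takes one XML document at a time can still serve several sources by looping over them. The sketch below assumes a caller-supplied fetch function (hypothetical) that returns the XML text for a URL, and uses xml.etree.ElementTree as a stand-in for the real parser; the feed URLs and contents are made up.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List


def titles_from_sources(
    urls: Iterable[str], fetch: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Parse each source separately and collect item titles per URL."""
    results: Dict[str, List[str]] = {}
    for url in urls:
        root = ET.fromstring(fetch(url))
        results[url] = [
            item.findtext("title", default="") for item in root.iter("item")
        ]
    return results


# Stand-in for real HTTP requests: a dict of URL -> XML text.
FAKE_FEEDS = {
    "https://example.com/a.xml": (
        '<rss version="2.0"><channel>'
        "<item><title>A1</title></item><item><title>A2</title></item>"
        "</channel></rss>"
    ),
    "https://example.com/b.xml": (
        '<rss version="2.0"><channel>'
        "<item><title>B1</title></item>"
        "</channel></rss>"
    ),
}

titles = titles_from_sources(FAKE_FEEDS, FAKE_FEEDS.__getitem__)
print(titles)
```

In a real project, fetch would be replaced by an HTTP client call, and the ElementTree step by one parser instance per document.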
Tried to parse XML from https://stackoverflow.com/feeds/tag/jupyter
and the module failed while trying to get the version from the RSS.
It's been a while since rss-parser has been considered "feature-complete" and the checks for edge cases in non-conventional RSS have been made. Is it possible to have a release in the near future?
Crash when trying to parse resources like investing.com:
Traceback (most recent call last):
  File "/rss_news.py", line 121, in __get_investing_news
    entities = parser.parse()
  File "venv/lib/python3.10/site-packages/rss_parser/_parser.py", line 89, in parse
    "version": main_soup.rss.get("version"),
AttributeError: 'NoneType' object has no attribute 'get'
"https://ru.investing.com/rss/news_1.rss",
"https://ru.investing.com/rss/news_477.rss",
"https://ru.investing.com/rss/news_11.rss",
"https://ru.investing.com/rss/news_95.rss",
"https://ru.investing.com/rss/news_14.rss",
"https://ru.investing.com/rss/news_285.rss",
"https://ru.investing.com/rss/news_25.rss"
python3.10/site-packages/rss_parser/_parser.py", line 88, in parse
"title": main_soup.title.text,
AttributeError: 'NoneType' object has no attribute 'text'
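Both tracebacks show the parser dereferencing an element that was not found (main_soup.rss and main_soup.title are None). One possible cause, not confirmed by the report, is that the response is not an RSS document at all (for example an Atom feed or an HTML page). A minimal pre-check on the root element, sketched with xml.etree.ElementTree and a hypothetical detect_feed_kind helper, could turn that into a clear message:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def detect_feed_kind(xml_text):
    """Classify a document by its root element before parsing it further."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return "not-xml"
    if root.tag == "rss":
        return "rss"
    if root.tag == f"{{{ATOM_NS}}}feed":
        return "atom"
    return "unknown"


print(detect_feed_kind('<rss version="2.0"><channel/></rss>'))
print(detect_feed_kind(f'<feed xmlns="{ATOM_NS}"/>'))
print(detect_feed_kind("<html><body>blocked</body></html>"))
```

A caller could raise a descriptive error for anything other than "rss" instead of failing later with an AttributeError.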
If there is a link with a keyword as its text, the link disappears and only the keyword remains. Can I do something about it?
After 699c3aa the following error appears:
TypeError: <class 'rss_parser.models.types.only_list.OnlyList'> is not a generic class
The most basic example code I was running returned an ImportError: cannot import name 'GenericModel' from 'pydantic.generics'
I simplified it down to from rss_parser import Parser - what am I doing wrong here?
A boring install of rss-parser went fine.
(venv) (base) ck@DESKTOP-G0NAUG1:~$ pip install rss-parser
Collecting rss-parser
Using cached rss_parser-1.0.0-py3-none-any.whl (24 kB)
Collecting xmltodict<0.14.0,>=0.13.0
Using cached xmltodict-0.13.0-py2.py3-none-any.whl (10.0 kB)
Collecting pydantic>=1.6.1
Using cached pydantic-2.0.2-py3-none-any.whl (359 kB)
Collecting pytest<8.0.0,>=7.1.2
Using cached pytest-7.4.0-py3-none-any.whl (323 kB)
Collecting annotated-types>=0.4.0
Using cached annotated_types-0.5.0-py3-none-any.whl (11 kB)
Collecting typing-extensions>=4.6.1
Using cached typing_extensions-4.7.1-py3-none-any.whl (33 kB)
Collecting pydantic-core==2.1.2
Using cached pydantic_core-2.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting iniconfig
Using cached iniconfig-2.0.0-py3-none-any.whl (5.9 kB)
Collecting tomli>=1.0.0
Using cached tomli-2.0.1-py3-none-any.whl (12 kB)
Collecting exceptiongroup>=1.0.0rc8
Using cached exceptiongroup-1.1.2-py3-none-any.whl (14 kB)
Collecting pluggy<2.0,>=0.12
Using cached pluggy-1.2.0-py3-none-any.whl (17 kB)
Collecting packaging
Using cached packaging-23.1-py3-none-any.whl (48 kB)
Installing collected packages: xmltodict, typing-extensions, tomli, pluggy, packaging, iniconfig, exceptiongroup, pytest, pydantic-core, annotated-types, pydantic, rss-parser
Successfully installed annotated-types-0.5.0 exceptiongroup-1.1.2 iniconfig-2.0.0 packaging-23.1 pluggy-1.2.0 pydantic-2.0.2 pydantic-core-2.1.2 pytest-7.4.0 rss-parser-1.0.0 tomli-2.0.1 typing-extensions-4.7.1 xmltodict-0.13.0
WARNING: You are using pip version 22.0.4; however, version 23.1.2 is available.
You should consider upgrading via the '/home/ck/venv/bin/python -m pip install --upgrade pip' command.
Running the most basic example
(venv) (base) ck@DESKTOP-G0NAUG1:~$ python
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rss_parser import Parser
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/__init__.py", line 1, in <module>
from ._parser import Parser
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/_parser.py", line 6, in <module>
from rss_parser.models.rss import RSS
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/rss.py", line 6, in <module>
from rss_parser.models.channel import Channel
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/channel.py", line 6, in <module>
from rss_parser.models.image import Image
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/image.py", line 4, in <module>
from rss_parser.models.types.tag import Tag
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/__init__.py", line 2, in <module>
from rss_parser.models.types.tag import Tag
File "/home/ck/venv/lib/python3.8/site-packages/rss_parser/models/types/tag.py", line 9, in <module>
from pydantic.generics import GenericModel
ImportError: cannot import name 'GenericModel' from 'pydantic.generics' (/home/ck/venv/lib/python3.8/site-packages/pydantic/generics.py)
>>>
>>>
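The install log above shows pip resolving the requirement pydantic>=1.6.1 to pydantic 2.0.2, and the import then fails at from pydantic.generics import GenericModel. A likely workaround, stated here as an assumption rather than a confirmed fix, is to pin pydantic below 2 (for example pip install "pydantic<2"). The sketch below only checks which major version is installed; both helper names are hypothetical.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string."""
    return int(version.split(".")[0])


def installed_pydantic_major():
    """Return the installed pydantic major version, or None if not installed."""
    try:
        return major_version(metadata.version("pydantic"))
    except metadata.PackageNotFoundError:
        return None


major = installed_pydantic_major()
if major is None:
    print("pydantic is not installed")
elif major >= 2:
    print("pydantic 2.x detected; rss-parser 1.0.0 imports GenericModel, which fails here")
else:
    print("pydantic 1.x detected")
```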
Hello. I have encountered a serious error in the library that prevents some RSS sources from being parsed correctly. For example, here the tags contain two entries and the parser throws an error when trying to read the elements.
channel -> content -> item -> 9 -> content -> link -> content str type expected (type=type_error.str)
channel -> content -> item -> 0 -> content -> enclosure -> content str type expected (type=type_error.str)
Can you fix it? Thank you!
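The validation errors above say a str was expected, which is consistent with a tag appearing more than once: xmltodict turns repeated sibling tags into a list instead of a single value. Whether that is the exact cause here is an assumption. A small normalisation step, sketched with hypothetical helpers as_list and first_text, shows one way to accept both shapes:

```python
def as_list(value):
    """Wrap a single value in a list; pass lists through; map None to []."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def first_text(value, default=""):
    """Return the first usable text from a value that may be repeated.

    Entries may be plain strings or xmltodict-style dicts, where the
    element text is stored under the "#text" key.
    """
    for entry in as_list(value):
        if isinstance(entry, str):
            return entry
        if isinstance(entry, dict) and "#text" in entry:
            return entry["#text"]
    return default


print(first_text(["https://example.com/a", "https://example.com/b"]))
print(first_text("https://example.com/only"))
```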
Hi! I'm trying to parse the following RSS, and need to get out the enclosures from each item. (Would you expect this to work? Or is rss-parser still WIP?)
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/rss.xml</link><description></description><language>en</language><lastBuildDate>Wed, 29 Mar 2023 04:13:56 GMT</lastBuildDate><generator>https://getnikola.com/</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Tesla4 lg</title><link>https://example.com/galleries/demo/tesla4_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate></item><item><title>Tesla conducts lg</title><link>https://example.com/galleries/demo/tesla_conducts_lg.webp</link><enclosure url="https://example.com/galleries/demo/tesla_conducts_lg.webp" length="9620" type="image/webp"></enclosure><guid isPermaLink="false">galleries/demo/tesla_conducts_lg.webp</guid><pubDate>Wed, 01 Jan 2014 00:02:00 GMT</pubDate></item><item><title>Tesla lightning1 lg</title><link>https://example.com/galleries/demo/tesla_lightning1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning1_lg.jpg" length="41123" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:03:00 GMT</pubDate></item><item><title>Tesla lightning2 lg</title><link>https://example.com/galleries/demo/tesla_lightning2_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_lightning2_lg.jpg" length="36994" type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_lightning2_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:04:00 GMT</pubDate></item><item><title>Tesla tower1 lg</title><link>https://example.com/galleries/demo/tesla_tower1_lg.jpg</link><enclosure url="https://example.com/galleries/demo/tesla_tower1_lg.jpg" length="18105" 
type="image/jpeg"></enclosure><guid isPermaLink="false">galleries/demo/tesla_tower1_lg.jpg</guid><pubDate>Wed, 01 Jan 2014 00:05:00 GMT</pubDate></item></channel></rss>
Each item in the RSS has an enclosure, for example the first item, prettified:
<item>
<title>Tesla4 lg</title>
<link>https://example.com/galleries/demo/tesla4_lg.jpg</link>
<enclosure url="https://example.com/galleries/demo/tesla4_lg.jpg" length="30200" type="image/jpeg"/>
<guid isPermaLink="false">galleries/demo/tesla4_lg.jpg</guid>
<pubDate>Wed, 01 Jan 2014 00:01:00 GMT</pubDate>
</item>
I call parsed = rss_parser.Parser(xml=content).parse().
The return value looks good, but the enclosures are not populated. Calling 'vars' on it, it looks like:
{'title': 'Tesla tower1 lg', 'version': '2.0', 'language': 'en', 'description': '', 'feed': [FeedItem(title='Tesla4 lg', link='https://example.com/galleries/demo/tesla4_lg.jpg', publish_date='Wed, 01 Jan 2014 00:01:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla conducts lg', link='https://example.com/galleries/demo/tesla_conducts_lg.webp', publish_date='Wed, 01 Jan 2014 00:02:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning1 lg', link='https://example.com/galleries/demo/tesla_lightning1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:03:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla lightning2 lg', link='https://example.com/galleries/demo/tesla_lightning2_lg.jpg', publish_date='Wed, 01 Jan 2014 00:04:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={}), FeedItem(title='Tesla tower1 lg', link='https://example.com/galleries/demo/tesla_tower1_lg.jpg', publish_date='Wed, 01 Jan 2014 00:05:00 GMT', category='', description='', description_links=[], description_images=[], enclosure=None, itunes=None, other={})]}
The returned first item looks like:
FeedItem(
    title='Tesla conducts lg',
    link='https://example.com/galleries/demo/tesla_conducts_lg.webp',
    publish_date='Wed, 01 Jan 2014 00:02:00 GMT',
    category='',
    description='',
    description_links=[],
    description_images=[],
    enclosure=None,
    itunes=None,
    other={}
)
The enclosure isn't populated, even though the enclosure looks valid to me in the first item RSS input.
Digging into the code, I suspect maybe an exception is getting swallowed in parse(), from the code that adds enclosures and itunes to the returned fields:
try:
    # Add user-defined entries
    item_dict.update({"other": {}})
    for entrie in entries:
        value = self.get_text(item, entrie)
        value = re.sub(f"</?{entrie}>", "", value)
        item_dict["other"].update({entrie: value})
    item_dict.update(
        {
            "enclosure": {
                "content": "",
                "attrs": {
                    "url": item.enclosure["url"],
                    "length": item.enclosure["length"],
                    "type": item.enclosure["type"],
                },
            },
            "itunes": {
                "content": "",
                "attrs": {
                    "href": self.check_none(
                        item.find("itunes:image"),  # -> None
                        main_soup.find("itunes:image"),  #
                        "href",
                        "href",
                    )
                },
            },
        }
    )
except (TypeError, KeyError, AttributeError):  # <--- exception swallowed here, maybe?
    pass
Sure enough, commenting out that try...except..., to make any raised exception visible, gives:
/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:127: in parse
for entrie in entries:
/usr/lib/python3.10/typing.py:312: in inner
return func(*args, **kwds)
/usr/lib/python3.10/typing.py:1143: in __getitem__
params = tuple(_type_check(p, msg) for p in params)
/usr/lib/python3.10/typing.py:1143: in <genexpr>
params = tuple(_type_check(p, msg) for p in params)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = 0, msg = 'Parameters to generic types must be types.', is_argument = True, module = None
def _type_check(arg, msg, is_argument=True, module=None, *, allow_special_forms=False):
...
if not callable(arg):
> raise TypeError(f"{msg} Got {arg!r:.100}.")
E TypeError: Parameters to generic types must be types. Got 0.
This seems to be a result of me not passing a value for the 'entries' parameter to parse:
def parse(self, entries: Optional[List[str]] = List) -> RSSFeed:
So 'entries' is 'List', which raises the above error if iterated over:
>>> from typing import List
>>> list(List)
Traceback (most recent call last):
...
raise TypeError(f"{msg} Got {arg!r:.100}.")
TypeError: Parameters to generic types must be types. Got 0.
>>>
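The usual way to avoid this kind of default is to use None as the sentinel and create the empty list inside the function. The sketch below is a simplified, hypothetical stand-in for parse (it only returns the cleaned entry names), not the library's signature:

```python
from typing import List, Optional


def parse(entries: Optional[List[str]] = None) -> List[str]:
    """Accept an optional list of entry names; default to no entries."""
    # None means "nothing requested", so iteration below is always safe.
    entries = [] if entries is None else list(entries)
    return [entry.strip() for entry in entries]


print(parse())
print(parse([" guid ", "author"]))
```

Calling parse() with no argument then iterates over an empty list instead of the typing.List object.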
I don't want any custom attributes from the RSS, so I try again with an explicit empty list of custom attributes:
parsed = rss_parser.Parser(xml=content).parse([])
Now we get a different exception (yay, progress! :-)
def test_gallery_rss(build, output_dir):
...
> parsed = rss_parser.Parser(xml=content).parse([])
tests/integration/test_demo_build.py:48:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:145: in parse
"href": self.check_none(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
item = None, default = None, item_dict = 'href', default_dict = 'href'
@staticmethod
def check_none(
item: object,
default: str,
item_dict: Optional[str] = None,
default_dict: Optional[str] = None,
) -> Any:
...
if default_dict:
> return default[default_dict]
E TypeError: 'NoneType' object is not subscriptable
/home/jhartley/.virtualenvs/nikola/lib/python3.10/site-packages/rss_parser/_parser.py:57: TypeError
So, parse is calling check_none with item=None, default=None, item_dict='href' and default_dict='href' (the values shown in the traceback above), and we try to return default[default_dict] (i.e. indexing None with a string), which raises.
Even if parse() conformed to the typing of check_none() parameters, and passed default as a string, we'd still be trying to index a string with a string, which would also raise.
I don't understand what this is meant to be doing. Why does check_none have parameters which are called "xxx_dict", but are typed as strings? Do you have any hints about what I'm doing wrong? Should I keep trying with this? Thanks for any advice.
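For comparison, here is a guess at what such a helper might be intended to do: look up an attribute on the item, fall back to a feed-level element, and return None when neither has it. This is a hypothetical sketch (the name attr_or_default is made up), not the library's check_none; it works with any object that offers a dict-style get method.

```python
from typing import Any, Mapping, Optional


def attr_or_default(
    item: Optional[Mapping[str, Any]],
    fallback: Optional[Mapping[str, Any]],
    name: str,
) -> Optional[Any]:
    """Return item[name] if available, else fallback[name], else None."""
    for candidate in (item, fallback):
        if candidate is None:
            continue
        value = candidate.get(name)
        if value is not None:
            return value
    return None


# Plain dicts stand in for parsed tags in this illustration.
print(attr_or_default(None, None, "href"))
print(attr_or_default(None, {"href": "https://example.com/cover.jpg"}, "href"))
```

With both arguments set to None, as in the traceback above, this returns None instead of raising.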
There should be support for pydantic V2 which is now out
Also, I think there should be a graceful fallback for pydantic versions lower than the ones this library currently supports, because the only part of the code that's failing is the magical "Tag" class, which is constructed dynamically. Ideally, either avoid it or support a fallback instead of raising exceptions.
So, I was including your project in one of mine (as you already know from #4), and the format I use to manage all the dependencies (pipenv) is one that @github's @dependabot recognizes and scans for vulnerabilities.
So, whenever I pushed my code to GitHub, I got a notification that they had found a vulnerability in the code for this line. This was the notification:
CVE Links:
So, I plan to create a PR and upgrade it to the minimum version specified in the notice, 4.6.3.
If you don't want to accept the PR: since you made the change for #3 (7311b2d), it is possible for people to override the current version of lxml by adding lxml = ">=4.6.3" to their Pipfile (as indicated in the picture above), but they will have to upgrade to the new version specified in #3 (comment).
It would be nice to have a -p/--pager argument to output results with a pager.
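A minimal sketch of such a flag, using argparse and pydoc.pager from the standard library. How the project's command-line entry point is actually structured is not shown here, so build_arg_parser and emit are hypothetical names:

```python
import argparse
import pydoc


def build_arg_parser():
    """Create a parser with a -p/--pager switch."""
    parser = argparse.ArgumentParser(description="Print parsed feed output.")
    parser.add_argument(
        "-p",
        "--pager",
        action="store_true",
        help="show the output through a pager",
    )
    return parser


def emit(text, use_pager=False):
    """Send text to a pager when requested, otherwise print it."""
    if use_pager:
        pydoc.pager(text)
    else:
        print(text)


# Parse an explicit argument list so the example is deterministic.
args = build_arg_parser().parse_args(["--pager"])
print(args.pager)
```

A command-line entry point would call emit(output, use_pager=args.pager) after parsing the real arguments.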