
robertmyles / tidyrss

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds

Home Page: https://robertmyles.github.io/tidyRSS/

License: Other

R 99.94% HTML 0.06%
r rss tidyverse atom-feed jsonfeed rss-parser rss-feed atom-feed-parser json-feed

tidyrss's Introduction

tidyRSS


tidyRSS is a package for extracting data from RSS feeds, including Atom feeds and JSON feeds. For geo-type feeds, see the section on changes in version 2 below, or jump directly to tidygeoRSS, which is designed for that purpose.

It is easy to use as it only has one function, tidyfeed(), which takes five arguments:

  • the URL of the feed;
  • a logical flag for whether you want the feed returned as a single tibble or as a list containing two tibbles;
  • a logical flag for whether you want HTML tags removed from columns in the data frame;
  • a config list that is passed on to httr::GET();
  • and parse_dates, a logical flag which will attempt to parse dates if TRUE (see below).

If parse_dates is TRUE, tidyfeed() will attempt to parse dates using the anytime package. Note that this removes some lower-level control that you may wish to retain over how dates are parsed. See this issue for an example.
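For illustration, here is a call that spells out all five arguments. The clean_tags and parse_dates names appear elsewhere in this document; the list argument name and the defaults shown are assumptions, so check ?tidyfeed for the authoritative signature.

```r
library(tidyRSS)

feed <- tidyfeed(
  "http://journal.r-project.org/rss.atom",  # the URL of the feed
  list = FALSE,        # TRUE returns a list of two tibbles instead of one tibble
  clean_tags = TRUE,   # strip HTML tags from columns
  config = list(),     # passed on to httr::GET()
  parse_dates = TRUE   # parse dates with the anytime package
)
```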

Installation

It can be installed directly from CRAN with:

install.packages("tidyRSS")

The development version can be installed from GitHub with the remotes package:

remotes::install_github("robertmyles/tidyrss")

Usage

Here is how you can get the contents of the R Journal:

library(tidyRSS)

tidyfeed("http://journal.r-project.org/rss.atom")

Changes in version 2.0.0

The biggest change in version 2 is that tidyRSS no longer attempts to parse geo-type feeds into sf tibbles. This functionality has been moved to tidygeoRSS.

Issues

XML feeds can be finicky things. If you find one that doesn’t work with tidyfeed(), feel free to create an issue with the URL of the feed you are trying; pull requests are welcome if you’d like to try to fix it yourself. For older RSS feeds, some fields will almost never be ‘clean’, that is, they will contain things like newlines (\n) or extra quote marks. Cleaning these in a generic way is more or less impossible, so I suggest you use stringr, strex and/or tools from base R such as gsub() to clean them. This will mainly affect the item_description column of a parsed RSS feed; it will not often affect Atom feeds and should never be a problem with JSON feeds.
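As an example, a minimal cleaning pass with base R alone (desc here is a stand-in for the item_description column of a parsed feed):

```r
# Stand-in for feed$item_description from a messy RSS feed
desc <- c("First item\n", "He said \"hello\"  ")

clean_desc <- gsub("\n", " ", desc)       # replace newlines with spaces
clean_desc <- gsub("\"", "", clean_desc)  # drop stray quote marks
clean_desc <- trimws(clean_desc)          # trim leading/trailing whitespace
clean_desc
# -> "First item" "He said hello"
```

stringr equivalents (str_replace_all(), str_squish()) do the same job with a more consistent interface.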

Related

There are two other related packages that I’m aware of: feedeR and rss. In comparison to feedeR, tidyRSS returns more information from the RSS feed (if it exists), and development on rss seems to have stopped some time ago.

Other

For the schemas used to develop the parsers in this package, see:

I’ve implemented most of the items in the schemas above. The following are not yet implemented:

Atom meta info:

  • contributor, generator, logo, subtitle

RSS meta info:

  • cloud
  • image
  • textInput
  • skipHours
  • skipDays

tidyrss's People

Contributors

arf9999, chainsawriot, grimbough, isomorphisms, jonocarroll, jsta, robertmyles, seakintruth


tidyrss's Issues

Error in eval(lhs, parent, parent) : object 'rss' not found

I was receiving the message Error in eval(lhs, parent, parent) : object 'rss' not found when running the following: pitcher_list <- tidyRSS::tidyfeed("https://www.pitcherlist.com/feed").

Seeing other open issues, I removed the tidyRSS package and installed with devtools from GitHub instead of CRAN, running devtools::install_github("RobertMyles/tidyRSS"). The same error remained after the reinstall.

Tidyfeed() RSS Import Error

Hi - I'm trying to extract data from an RSS page using tidyfeed(); however, I'm running into what appears to be an HTTP status error. I tried the examples in the R documentation, which worked as expected.

Below are the simple commands I'm trying to run and the error message I keep receiving:
[screenshots of the commands and the error message omitted]

I saw a video posted on YouTube prior to the 2.0.5 release that worked, so I'm hoping you might be able to help figure out why this keeps failing.

Thanks for the help!

item_description not clean (contains html tags)

Hi Robert, thanks again for your great package!

My issue: some data frames created by the tidyfeed() function contain HTML tags in the item_description column. See for example:
taz_df <- tidyfeed("http://www.taz.de/!p4615;rss/")
bild_df <- tidyfeed("https://www.bild.de/rssfeeds/vw-politik/vw-politik-16728980,view=rss2.bild.xml")
taz_df$item_description
bild_df$item_description

I found a function on Stack Overflow which can clean the HTML tags away (I'm not sure if this is the best solution):

cleanfun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}
cleanfun(taz_df$item_description)  # HTML tags are gone

Source: https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r

Is there maybe a possibility to implement this (or any other solution) in tidyRSS? I could do the cleaning separately, but perhaps you have the time to implement it in tidyRSS directly?

thank you very much!
Moritz

Error: Tibble columns must have compatible sizes.

Hi Robert,

many thanks for this very helpful package! I read that you welcome users to submit feeds which didn't work with tidyRSS.

I am currently struggling to get the feed below working. (I might be mistaken, but maybe it's related to description xml:space="preserve"?).
Here's the link to the page offering the rss feed.

Many thanks!

library(tidyRSS)

rss_link <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/ME/ME_00055/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&GP=XXVII&ITYP=ME&INR=55&SUCH=&listeId=142&FBEZ=FP_142"
tidyRSS::tidyfeed(rss_link)
#> GET request successful. Parsing...
#> Error: Tibble columns must have compatible sizes.
#> * Size 6623: Existing data.
#> * Size 19869: Column `item_description`.
#> i Only values of size one are recycled.

Created on 2020-09-23 by the reprex package (v0.3.0)

Error: object 'rss' not found

Hi. This line of code worked until I've reinstalled the package, where it fails as follows:

tidyRSS::tidyfeed("https://easystats.github.io/blog/categories/r/index.xml")
#> Error in eval(lhs, parent, parent): object 'rss' not found

Created on 2019-05-19 by the reprex package (v0.3.0)

Is that expected? Thanks for your package!

tidyfeed fails when description contains HTML comments

Reprex

tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")

Explanation

Some CMS include in the RSS item description field the HTML comment tag <!-- more --> to delineate content above/below the fold.

<description>On April 28, 2021, NASA&#039;s Parker Solar Probe reached the sun&#039;s extended solar atmosphere, known as the corona, and spent five hours there. The spacecraft is the first to enter the outer boundaries of our sun. <!-- more --></description>

This causes tidyRSS:::rss_parse to return two entries for description, resulting in a mismatch in the number of rows the function attempts to create via tibble.
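A possible workaround until this is fixed (not part of tidyRSS, just a sketch): fetch the feed yourself, strip HTML comments from the raw XML, and parse the cleaned text.

```r
library(httr)
library(xml2)

url <- "https://www.sciencedaily.com/rss/top/science.xml"
raw <- content(GET(url), as = "text", encoding = "UTF-8")
# Remove HTML comments such as <!-- more --> before parsing;
# (?s) lets . match newlines in case a comment spans lines
raw <- gsub("(?s)<!--.*?-->", "", raw, perl = TRUE)
doc <- read_xml(raw)
```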


See

item_description = map(res_entry, "description", .default = def) %>%

Example feed URL: https://www.sciencedaily.com/rss/top/science.xml

> tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
GET request successful. Parsing...

Error: Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/tibble_error_incompatible_size>
Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Backtrace:
 1. tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
 2. tidyRSS:::rss_parse(response, list, clean_tags, parse_dates)
 3. tibble::tibble(...)
 4. tibble:::tibble_quos(xs, .rows, .name_repair)
 5. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])

Observed Under

version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.0.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2021-12-16
 rstudio  2022.02.0-daily+324 Prairie Trillium (desktop)
 pandoc   NA

 tibble        3.1.6   2021-11-07 [2] CRAN (R 4.1.0)
 tidyRSS     * 2.0.4   2021-10-07 [2] CRAN (R 4.1.0)

Fix tidyselect `where()`warning

In addition: Warning message:
Predicate functions must be wrapped in `where()`.

  # Bad
  data %>% select(is.character)

  # Good
  data %>% select(where(is.character))

ℹ Please update your code.

get content

If possible, get back the full text of blog posts etc. Currently there's a whole load of <img> tags and the like to parse out.

support jsonfeed

https://jsonfeed.org/

IMO it likely makes more sense to support this in one "RSS" pkg than in a separate one. I can help with it as well, but didn't want to start a new one without posting this here first.

Searching For Multiple Terms

I am interested in using this to search Google News articles on specific topics. I had the following questions about this:

Suppose I wanted to search for "apple" and "covid" - I used the following code:

keyword <- "https://news.google.com/rss/search?q=apple&q=covid&hl=en-IN&gl=IN&ceid=IN:en"

# From the package vignette

google_news <- tidyfeed(
    keyword,
    clean_tags = TRUE,
    parse_dates = TRUE
)

  1. I am not exactly sure how we are supposed to search for multiple terms. Do you use "&", "|", or do you just write both terms together (e.g. "applecovid")?

  2. Is there a way to restrict the dates between which the search will be performed?

  3. I have a feeling that "IN" stands for "India" - if I want to change this to "Canada", I think I need to replace "IN" with "CAN"? Is this correct?

Your Help is Greatly Appreciated,
Thanks,
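A hedged sketch of an answer (the Google News RSS search syntax is not documented by tidyRSS, so the conventions below are assumptions: multiple space-separated terms are treated as AND and must be URL-encoded, and the two-letter country code for Canada is "CA", not "CAN"):

```r
library(tidyRSS)

query <- URLencode("apple covid")  # spaces become %20
url <- paste0(
  "https://news.google.com/rss/search?q=", query,
  "&hl=en-CA&gl=CA&ceid=CA:en"
)
google_news <- tidyfeed(url, clean_tags = TRUE, parse_dates = TRUE)
```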

Package throws error when loading a JSON RSS feed

Replication steps:

library(tidyRSS)
rss <- tidyfeed("https://itunes.apple.com/us/rss/customerreviews/id=1289313118/sortby=mostrecent/json")

Result:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Start tag expected, '<' not found [4]

entry_link is same for all items

This might be an issue with the feed itself, but the entry_link is the same for every entry (despite title and content being accurate):

tidyfeed('http://feeds.feedburner.com/GDBcode') %>%
  count(entry_link)

1 http://developers.googleblog.com/feeds/6345491852200823648/comments/default    25

# The same is true for these other two feeds
tidyfeed('http://feeds.feedburner.com/blogspot/gJZg') %>% count(entry_link)
tidyfeed('http://feeds.feedburner.com/GoogleOpenSourceBlog') %>% count(entry_link)

yet I have other feedburner sites that don't have this problem:

tidyfeed('http://feeds.feedburner.com/ProfessorRobJHyndman') %>% count(item_link)
tidyfeed("http://feeds.feedburner.com/kdnuggets-data-mining-analytics") %>% count(item_link)

Side question: why does the Google feed use "entry_" rather than "item_"?

Thanks for your help!

Feed handling, why RCurl dep and github install link

Congrats on the CRAN submission!

However…

Neither of these worked (the first 2 things I tried):

tidyfeed("https://rud.is/b/feed/")

tidyfeed("https://rweekly.org/atom.xml")

Those are totally valid feeds that work in every RSS reader but not in your pkg.

xml2 already Suggests curl and httr and most folks who are likely to use this pkg have the tidyverse installed. RCurl adds a somewhat needless dep onto systems.

And:

devtools::install_github("https://github.com/RobertMyles/tidyrss")

should be

devtools::install_github("RobertMyles/tidyrss")

Error on loading library (Windows, RStudio)

Hi,
I wanted to try your package, but it doesn't even load. I am not sure what kind of info you need, so I'll just start with the basics:

Error Message:

> library(tidyRSS)
Error: package or namespace load failed for ‘tidyRSS’:
 .onLoad failed in loadNamespace() for 'sf', details:
  call: get(genname, envir = envir)
  error: object 'group_map' not found
> 

My system details:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.2                         
year           2018                        
month          12                          
day            20                          
svn rev        75870                       
language       R                           
version.string R version 3.5.2 (2018-12-20)
nickname       Eggshell Igloo  

When I installed the package via install.packages("tidyRSS"), it worked normally:

install.packages("tidyRSS")
Installing package into ‘C:/Users/mrwun/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
also installing the dependencies ‘e1071’, ‘praise’, ‘classInt’, ‘units’, ‘testthat’, ‘sf’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/e1071_1.7-0.1.zip'
Content type 'application/zip' length 1015780 bytes (991 KB)
downloaded 991 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/praise_1.0.0.zip'
Content type 'application/zip' length 19474 bytes (19 KB)
downloaded 19 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/classInt_0.3-1.zip'
Content type 'application/zip' length 79676 bytes (77 KB)
downloaded 77 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/units_0.6-2.zip'
Content type 'application/zip' length 1707573 bytes (1.6 MB)
downloaded 1.6 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/testthat_2.0.1.zip'
Content type 'application/zip' length 1724084 bytes (1.6 MB)
downloaded 1.6 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/sf_0.7-3.zip'
Content type 'application/zip' length 39223400 bytes (37.4 MB)
downloaded 37.4 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/tidyRSS_1.2.7.zip'
Content type 'application/zip' length 115056 bytes (112 KB)
downloaded 112 KB

package ‘e1071’ successfully unpacked and MD5 sums checked
package ‘praise’ successfully unpacked and MD5 sums checked
package ‘classInt’ successfully unpacked and MD5 sums checked
package ‘units’ successfully unpacked and MD5 sums checked
package ‘testthat’ successfully unpacked and MD5 sums checked
package ‘sf’ successfully unpacked and MD5 sums checked
package ‘tidyRSS’ successfully unpacked and MD5 sums checked

List of blogs

Hi Rob, for a separate project I've been putting together a list of data science and R related RSS feeds. Just letting you know about it in case it is useful for testing of tidyRSS. Please feel free to use the list however you like.

https://github.com/alastairrushworth/rssfeeds

Cheers!

Alastair

tidyRSS fails to parse feeds: "xmlXPathEval: evaluation failed"

Hi @RobertMyles

Thanks for the amazing tidyRSS package, I find it very useful indeed! Thought I'd get in touch to file a quick issue as I've noticed that quite a number of feeds don't parse correctly.

For example:

# tested with v1.2.11
library(tidyRSS)
tidyfeed("http://abigailsee.com/feed.xml")

Returns the error:

Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = 1) : 
  xmlXPathEval: evaluation failed

I think the feed is ok, and it seems like tidyfeed gathers the feed ok, but something goes awry in the parsing somewhere? I noticed this issue with several other feeds that I've copied below:

feed_vec <- 
  c("http://abigailsee.com/feed.xml",
    "https://adamgoodkind.com/feed.xml",
    "http://adomingues.github.io/feed.xml",
    "http://aebou.rbind.io/index.xml",
    "http://agrarianresearch.org/blog/?feed=rss2",
    "http://akosm.netlify.com/index.xml",
    "http://alburez.me/feed.xml",
    "http://alexmorley.me/feed.xml",
    "https://alexwhan.com/index.xml",
    "http://allthingsr.blogspot.com/feeds/posts/default?alt=rss",
    "http://allthiswasfield.blogspot.com/feeds/posts/default?alt=rss",
    "http://almostrandom.netlify.com/index.xml",
    "http://altran-data-analytics.netlify.com/index.xml",
    "https://www.amitkohli.com/index.xml",
    "http://analisisydecision.es/feed/",
    "http://andysouth.github.io/feed.xml",
    "http://annakrystalli.me/index.xml",
    "http://annarborrusergroup.github.io/feed.xml",
    "http://anotherblogaboutr.blogspot.com/feeds/posts/default?alt=rss",
    "http://anpefi.eu/index.xml",
    "https://fishandwhistle.net/index.xml",
    "https://www.ardata.fr/index.xml",
    "http://arnab.org/blog/atom.xml",
    "http://arunatma.blogspot.com/feeds/posts/default?alt=rss",
    "http://asbcllc.com/feed.xml",
    "http://ashiklom.github.io/feed.xml",
    "http://aurielfournier.github.io/feed.xml",
    "http://austinwehrwein.com/index.xml")

I'm working on a side project at the moment that involves about 3K RSS feeds, which I'm happy to share once I've tidied up a bit, it might be helpful with identifying other edge cases - I know how finicky RSS feeds can be! I'm also happy to help with this issue if you can point me in the right direction!

Thanks,

Alastair

item_description does not load - newspapers rss feeds

Hi @RobertMyles
I'm trying to download RSS feed data from European newspapers. The tidyfeed function downloads most of the data (title, link, etc.), but unfortunately it does not download the item_description (in the case of newspapers, this is a short description of the article's content).

I tried this for three different newspapers, but the item_description never shows up:
gua_world <- tidyfeed("https://www.theguardian.com/world/rss")
lmd <- tidyfeed("https://www.lemonde.fr/rss/une.xml")
so_meldung <- tidyfeed("http://www.spiegel.de/schlagzeilen/tops/index.rss")

I looked at your code, but my R skills are not good enough to find the problem. Do you maybe know a solution to this?

(I first thought it might be a problem related to bad RSS formatting on the newspapers' websites, but since it does not work with three of Europe's biggest newspapers, the problem might be elsewhere.)

Best,
Moritz

Problem reading feeds

Hi,

I have used this package before and it worked for some feeds. I say some because feeds are finicky. However, the package suddenly stopped working. For example, I cannot get access to feeds such as https://www.federalreserve.gov/feeds/press_orders.xml, https://public.govdelivery.com/topics/USFDIC_11/feed.rss, or https://www.sec.gov/rss/investor/alerts.

The code I use is very simple:
library(tidyRSS)
tidyfeed("https://www.federalreserve.gov/feeds/press_all.xml", result = "all")

And here is the error I receive:
Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) :
This page does not appear to be a suitable feed. Have you checked that you entered the url correctly?
If you are certain that this is a valid rss feed, please file an issue at: https://github.com/RobertMyles/tidyRSS/issues. Please note that the feed may also be undergoing maintenance.

The site is not down since I can access the feed from Python or using other R packages.

I would appreciate if you can look into this.

Thank you!

Rlang issue

I had to go back a version because I kept getting this error; it does not occur in 2.0.4.


sa = tidyfeed('https://seekingalpha.com/author/schiffgold.xml')
GET request successful. Parsing...

Error in `.f()`:
! Predicate functions must return a single `TRUE` or `FALSE`, not a missing value
Run `rlang::last_error()` to see where the error occurred.

Error loading library

After installation I try to load the library but it fails:

Error: package or namespace load failed for ‘tidyRSS’:
.onLoad failed in loadNamespace() for 'sf', details:
call: get(genname, envir = envir)
error: object 'group_map' not found

What is going wrong?

Some RSS feeds return 403 error - here is a possible fix

I came across some RSS feeds that produce an error of this sort when using tidyRSS

Response [https://www.naumburger-tageblatt.de/feed/nachrichten/index.rss]
Date: 2019-08-29 12:36
Status: 403
Content-Type: text/html
Size: 94 B

403 Forbidden

Request forbidden by administrative rules.

Drawing on the feedeR package, which didn't produce this issue, I've implemented the following in the tidyfeed function. Maybe not the most elegant, but it seems to work.

doc <- try(httr::GET(feed, config), silent = TRUE)
if (grepl("403", doc$status_code)) {
  XMLFILE <- tempfile(fileext = "-index.xml")
  options(HTTPUserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
  download.file(url = clean.url(feed), XMLFILE, quiet = TRUE)  # clean.url() is from feedeR
  doc <- xml2::read_xml(XMLFILE)
} else if (grepl("json", doc$headers$`content-type`)) {  # backticks needed: the hyphen would otherwise parse as subtraction
  result <- json_parse(feed)
  return(result)
} else {
  doc <- doc %>% xml2::read_xml()
}
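An alternative sketch that stays within httr: many servers that return 403 to R's default user agent accept a browser-like one, and httr::user_agent() sets the header per request without touching global options. The URL is the feed from the report above.

```r
library(httr)
library(xml2)

ua <- user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
resp <- GET("https://www.naumburger-tageblatt.de/feed/nachrichten/index.rss", ua)
stop_for_status(resp)                       # error out on 4xx/5xx
doc <- read_xml(content(resp, as = "raw"))  # parse the XML payload
```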

Tibble parsing error (Error: Tibble columns must have consistent lengths, only values of length one are recycled)

Hi Rob!

The new version of tidyRSS is great :)

I noticed that some feeds that parsed with a previous tidyRSS version are now failing. I've attached a single example here; the error seems to occur somewhere in the parsing of the feed into a tibble.

This using the most up-to-date version:

devtools::install_github('RobertMyles/tidyRSS')
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

GET request successful. Parsing...

Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 25: Columns `entry_title`, `entry_url`, `entry_last_updated`, `entry_content`, `entry_published`
* Length 26: Column `entry_author`
Run `rlang::last_error()` to see where the error occurred.

Using a slightly older version (I think this commit was in January):

devtools::install_github('RobertMyles/tidyRSS', ref = '35bcbb7e15be1c0edc1ca07cc33de64923a55a32')
# RESTART R FIRST
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

# A tibble: 25 x 8
   feed_title feed_link feed_author feed_last_updat… item_title item_date_updated  
   <chr>      <chr>     <chr>       <chr>            <chr>      <dttm>             
 1 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Setting u… 2016-02-24 00:21:50
 2 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Machine… 2016-02-23 21:35:30
 3 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Guess… 2016-02-16 12:58:40
 4 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Official … 2016-02-14 22:24:35
 5 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Five … 2016-02-08 20:14:09
 6 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Controv… 2015-08-05 15:51:26
 7 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Performan… 2015-07-17 18:31:18
 8 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… An exampl… 2015-07-14 16:55:36
 9 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Fastest w… 2015-07-13 17:03:02
10 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The R Con… 2015-07-02 14:15:57
# … with 15 more rows, and 2 more variables: item_link <chr>, item_content <chr>

Thanks,

Alastair

Not getting full RSS tree

Hi there, I'm trying to get this:

result <- tidyfeed(feed = "https://www.fandango.com/rss/moviesnearme_90025.rss")

But I only get the movie theatre locations, not the movies playing at that location. Any suggestions?

Error for feeds with no item

In this example, the RSS is valid but contains no items, and it raises an uninformative error.
For this case, I would expect an empty tibble.

require(tidyRSS)
#> Loading required package: tidyRSS
tidyfeed("https://www.kn-online.de/arc/outboundfeeds/rss/tags_slug/kiel-restaurants/")
#> GET request successful. Parsing...
#> Error in df[[get("listcol")]][[i]]: subscript out of bounds

Created on 2022-08-12 by the reprex package (v2.0.1)
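Until that is handled in the package, a defensive wrapper (not tidyRSS API, just a sketch) can supply the empty tibble the reporter expects:

```r
library(tidyRSS)
library(tibble)

# Fall back to an empty tibble when a valid-but-empty feed makes tidyfeed() error
safe_tidyfeed <- function(url) {
  tryCatch(tidyfeed(url), error = function(e) tibble())
}
```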

Rethink testing strategy

The current method of testing for tidyRSS produces non-deterministic results because a different feed is used each time the tests are run. This makes test failures spurious and hard to reproduce when testing on CRAN or elsewhere.

test_that("tidyfeed returns a data_frame", {
  data("feeds")
  rss <- sample(feeds$feeds, 1)
  expect_is(tidyfeed(rss), "tbl_df")
})

This is made doubly bad because currently two of the feeds are no longer valid, so tests will fail if these examples happen to occur.

tidyRSS::tidyfeed(tidyRSS::feeds$feeds[[12]])
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseEntityRef: no name [68]
tidyRSS::tidyfeed(tidyRSS::feeds$feeds[[28]])
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : Premature end of data in tag lastBuildDate line 8 [77]

Created on 2018-01-24 by the reprex package (v0.1.1.9000).

In general, relying on external network resources for tests is problematic; consider instead including a few (truncated) feeds in your package that could be used for testing.

If you do want to continue using external resources, please use testthat::skip_on_cran() to disable these tests on CRAN.
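A sketch of that strategy with testthat (the fixture path is hypothetical, and if tidyfeed() cannot read local files the fixture would need to be served locally or the HTTP layer mocked):

```r
library(testthat)
library(tidyRSS)

test_that("tidyfeed parses a bundled fixture", {
  # hypothetical fixture shipped in inst/extdata
  feed_file <- system.file("extdata", "example_rss.xml", package = "tidyRSS")
  expect_s3_class(tidyfeed(feed_file), "tbl_df")
})

test_that("a live feed still parses", {
  skip_on_cran()  # never hit the network on CRAN
  expect_s3_class(tidyfeed("http://journal.r-project.org/rss.atom"), "tbl_df")
})
```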

broken RSS: datatau.com

Thanks for the awesome package, Robert!! I'm loving it.

I tried this feed:

http://www.datatau.com/rss

Got this error:

> df = tidyfeed('https://www.datatau.com/rss/')
GET request successful. Parsing...

Error in tidyfeed("https://www.datatau.com/rss/") : 
  Error in feed parse; please check URL.

  If you're certain that this is a valid rss feed,
  please file an issue at https://github.com/RobertMyles/tidyRSS/issues.
  Please note that the feed may also be undergoing maintenance.

here's my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.7.9.2 aRxiv_0.5.19      kableExtra_1.3.1  forcats_0.5.0    
 [5] stringr_1.4.0     purrr_0.3.4       readr_1.4.0       tidyr_1.1.2      
 [9] tibble_3.0.4      ggplot2_3.3.2     tidyverse_1.3.0   dplyr_1.0.2      
[13] DT_0.16           tidyRSS_2.0.3    

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.19          haven_2.3.1        colorspace_2.0-0  
 [5] vctrs_0.3.4        generics_0.1.0     viridisLite_0.3.0  htmltools_0.5.0   
 [9] yaml_2.2.1         rlang_0.4.8        pillar_1.4.6       withr_2.3.0       
[13] glue_1.4.2         DBI_1.1.0          dbplyr_2.0.0       modelr_0.1.8      
[17] readxl_1.3.1       lifecycle_0.2.0    munsell_0.5.0      anytime_0.3.9     
[21] gtable_0.3.0       cellranger_1.1.0   rvest_0.3.6        htmlwidgets_1.5.2 
[25] evaluate_0.14      knitr_1.30         curl_4.3           fansi_0.4.1       
[29] broom_0.7.2        Rcpp_1.0.5         renv_0.12.2        backports_1.2.0   
[33] scales_1.1.1       install.load_1.2.3 checkmate_2.0.0    webshot_0.5.2     
[37] jsonlite_1.7.1     fs_1.5.0           fastmatch_1.1-0    hms_0.5.3         
[41] digest_0.6.27      stringi_1.5.3      grid_4.0.3         cli_2.1.0         
[45] tools_4.0.3        magrittr_1.5       crayon_1.3.4       pkgconfig_2.0.3   
[49] ellipsis_0.3.1     xml2_1.3.2         reprex_0.3.0       assertthat_0.2.1  
[53] rmarkdown_2.5      httr_1.4.2         rstudioapi_0.13    R6_2.5.0          
[57] compiler_4.0.3    

installation on Linux

I like your package very much - thanks for sharing it!

I tried to use it on a Linux machine as well, which meant installing it there with install.packages("tidyRSS").

I had to add several Linux libraries for this:
sudo apt-get install libudunits2-dev
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install gdal-bin
sudo apt-get install libgdal-dev

However, even with all this stuff, it did not work:
configure: GDAL: 1.10.1
checking GDAL version >= 2.0.0... no
configure: error: sf is not compatible with GDAL versions below 2.0.0

Actually, I do not understand why an RSS feed package depends on a "software library for reading and writing raster and vector geospatial data formats".

In the end I gave up, at least on the Linux side.
Is there any way to make it less complex?
