
robertmyles / tidyrss

An R package for extracting 'tidy' data frames from RSS, Atom and JSON feeds

Home Page: https://robertmyles.github.io/tidyRSS/

License: Other

R 99.94% HTML 0.06%
r rss tidyverse atom-feed jsonfeed rss-parser rss-feed atom-feed-parser json-feed

tidyrss's Introduction

tidyRSS


tidyRSS is a package for extracting data from RSS feeds, including Atom feeds and JSON feeds. For geo-type feeds, see the section on changes in version 2 below, or jump directly to tidygeoRSS, which is designed for that purpose.

It is easy to use as it only has one function, tidyfeed(), which takes five arguments:

  • the URL of the feed;
  • a logical flag for whether you want the feed returned as a single tibble or as a list containing two tibbles;
  • a logical flag for whether you want HTML tags removed from columns in the data frame;
  • a config list that is passed on to httr::GET();
  • and parse_dates, a logical flag which will attempt to parse dates if TRUE (see below).

If parse_dates is TRUE, tidyfeed() will attempt to parse dates using the anytime package. Note that this removes some lower-level control that you may wish to retain over how dates are parsed. See this issue for an example.
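For illustration, here is a call that spells out all five arguments. The clean_tags and parse_dates names appear elsewhere in this document; the list argument name and the defaults shown are assumptions, so check ?tidyfeed for the authoritative signature.

```r
library(tidyRSS)

feed <- tidyfeed(
  "http://journal.r-project.org/rss.atom",  # the URL of the feed
  list = FALSE,        # TRUE returns a list of two tibbles instead of one tibble
  clean_tags = TRUE,   # strip HTML tags from columns
  config = list(),     # passed on to httr::GET()
  parse_dates = TRUE   # parse dates with the anytime package
)
```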

Installation

It can be installed directly from CRAN with:

install.packages("tidyRSS")

The development version can be installed from GitHub with the remotes package:

remotes::install_github("robertmyles/tidyrss")

Usage

Here is how you can get the contents of the R Journal:

library(tidyRSS)

tidyfeed("http://journal.r-project.org/rss.atom")

Changes in version 2.0.0

The biggest change in version 2 is that tidyRSS no longer attempts to parse geo-type feeds into sf tibbles. This functionality has been moved to tidygeoRSS.

Issues

XML feeds can be finicky things. If you find one that doesn’t work with tidyfeed(), feel free to create an issue with the URL of the feed you are trying; pull requests are welcome if you’d like to try to fix it yourself. For older RSS feeds, some fields will almost never be ‘clean’, that is, they will contain things like newlines (\n) or extra quote marks. Cleaning these in a generic way is more or less impossible, so I suggest you use stringr, strex and/or tools from base R such as gsub() to clean them. This will mainly affect the item_description column of a parsed RSS feed; it will not often affect Atom feeds and should never be a problem with JSON feeds.
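As an example, a minimal cleaning pass with base R alone (desc here is a stand-in for the item_description column of a parsed feed):

```r
# Stand-in for feed$item_description from a messy RSS feed
desc <- c("First item\n", "He said \"hello\"  ")

clean_desc <- gsub("\n", " ", desc)       # replace newlines with spaces
clean_desc <- gsub("\"", "", clean_desc)  # drop stray quote marks
clean_desc <- trimws(clean_desc)          # trim leading/trailing whitespace
clean_desc
# -> "First item" "He said hello"
```

stringr equivalents (str_replace_all(), str_squish()) do the same job with a more consistent interface.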

Related

There are two other related packages that I’m aware of: feedeR and rss. In comparison to feedeR, tidyRSS returns more information from the RSS feed (if it exists), and development on rss seems to have stopped some time ago.

Other

For the schemas used to develop the parsers in this package, see:

I’ve implemented most of the items in the schemas above. The following are not yet implemented:

Atom meta info:

  • contributor, generator, logo, subtitle

RSS meta info:

  • cloud
  • image
  • textInput
  • skipHours
  • skipDays

tidyrss's People

Contributors

arf9999, chainsawriot, grimbough, isomorphisms, jonocarroll, jsta, robertmyles, seakintruth


tidyrss's Issues

Error in eval(lhs, parent, parent) : object 'rss' not found

I was receiving the message Error in eval(lhs, parent, parent) : object 'rss' not found when running the following: pitcher_list <- tidyRSS::tidyfeed("https://www.pitcherlist.com/feed").

Seeing other open issues, I removed the tidyRSS package and installed with devtools from GitHub instead of CRAN, running devtools::install_github("RobertMyles/tidyRSS"). The same error remained after the reinstall.

Tidyfeed() RSS Import Error

Hi - I'm trying to extract data from an RSS page using tidyfeed(); however, I'm running into what appears to be an HTTP status error. I tried the examples in the R documentation, which worked as expected.

Below are the simple commands I'm trying to run and the error message I keep receiving:
[screenshots of the commands and the error message omitted]

I saw a video posted on YouTube prior to the 2.0.5 release that worked, so I'm hoping you might be able to help figure out why this keeps failing.

Thanks for the help!

item_description not clean (contains html tags)

Hi Robert, thanks again for your great package!

My issue: some data frames created by the tidyfeed() function contain HTML tags in the item_description column. See for example:
taz_df <- tidyfeed("http://www.taz.de/!p4615;rss/")
bild_df <- tidyfeed("https://www.bild.de/rssfeeds/vw-politik/vw-politik-16728980,view=rss2.bild.xml")
taz_df$item_description
bild_df$item_description

I found a function on Stack Overflow which can clean the HTML tags away (I'm not sure if this is the best solution):

cleanfun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}
cleanfun(taz_df$item_description)  # HTML tags are gone

Source: https://stackoverflow.com/questions/17227294/removing-html-tags-from-a-string-in-r

Is there maybe a possibility to implement this (or any other solution) in tidyRSS? I could do the cleaning separately, but perhaps you have the time to implement it in tidyRSS directly?

thank you very much!
Moritz

Error: Tibble columns must have compatible sizes.

Hi Robert,

many thanks for this very helpful package! I read that you welcome users to submit feeds which didn't work with tidyRSS.

I am currently struggling to get the feed below working. (I might be mistaken, but maybe it's related to description xml:space="preserve"?).
Here's the link to the page offering the rss feed.

Many thanks!

library(tidyRSS)

rss_link <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/ME/ME_00055/filter.psp?view=RSS&jsMode=&xdocumentUri=&filterJq=&view=&GP=XXVII&ITYP=ME&INR=55&SUCH=&listeId=142&FBEZ=FP_142"
tidyRSS::tidyfeed(rss_link)
#> GET request successful. Parsing...
#> Error: Tibble columns must have compatible sizes.
#> * Size 6623: Existing data.
#> * Size 19869: Column `item_description`.
#> i Only values of size one are recycled.

Created on 2020-09-23 by the reprex package (v0.3.0)

Error: object 'rss' not found

Hi. This line of code worked until I've reinstalled the package, where it fails as follows:

tidyRSS::tidyfeed("https://easystats.github.io/blog/categories/r/index.xml")
#> Error in eval(lhs, parent, parent): object 'rss' not found

Created on 2019-05-19 by the reprex package (v0.3.0)

Is that expected? Thanks for your package!

tidyfeed fails when description contains HTML comments

Reprex

tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")

Explanation

Some CMS include in the RSS item description field the HTML comment tag <!-- more --> to delineate content above/below the fold.

<description>On April 28, 2021, NASA&#039;s Parker Solar Probe reached the sun&#039;s extended solar atmosphere, known as the corona, and spent five hours there. The spacecraft is the first to enter the outer boundaries of our sun. <!-- more --></description>

This causes tidyRSS:::rss_parse to return two entries for description, resulting in a mismatch in the number of rows the function attempts to create via tibble.
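A possible workaround until this is fixed (not part of tidyRSS, just a sketch): fetch the feed yourself, strip HTML comments from the raw XML, and parse the cleaned text.

```r
library(httr)
library(xml2)

url <- "https://www.sciencedaily.com/rss/top/science.xml"
raw <- content(GET(url), as = "text", encoding = "UTF-8")
# Remove HTML comments such as <!-- more --> before parsing;
# (?s) lets . match newlines in case a comment spans lines
raw <- gsub("(?s)<!--.*?-->", "", raw, perl = TRUE)
doc <- read_xml(raw)
```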


See

item_description = map(res_entry, "description", .default = def) %>%

Example feed URL: https://www.sciencedaily.com/rss/top/science.xml

> tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
GET request successful. Parsing...

Error: Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Run `rlang::last_error()` to see where the error occurred.

> rlang::last_error()
<error/tibble_error_incompatible_size>
Tibble columns must have compatible sizes.
* Size 60: Existing data.
* Size 120: Column `item_description`.
ℹ Only values of size one are recycled.
Backtrace:
 1. tidyRSS::tidyfeed("https://www.sciencedaily.com/rss/top/science.xml")
 2. tidyRSS:::rss_parse(response, list, clean_tags, parse_dates)
 3. tibble::tibble(...)
 4. tibble:::tibble_quos(xs, .rows, .name_repair)
 5. tibble:::vectbl_recycle_rows(res, first_size, j, given_col_names[[j]])

Observed Under

version  R version 4.1.2 (2021-11-01)
 os       macOS Monterey 12.0.1
 system   x86_64, darwin17.0
 ui       RStudio
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2021-12-16
 rstudio  2022.02.0-daily+324 Prairie Trillium (desktop)
 pandoc   NA

 tibble        3.1.6   2021-11-07 [2] CRAN (R 4.1.0)
 tidyRSS     * 2.0.4   2021-10-07 [2] CRAN (R 4.1.0)

Fix tidyselect `where()`warning

In addition: Warning message:
Predicate functions must be wrapped in `where()`.

  # Bad
  data %>% select(is.character)

  # Good
  data %>% select(where(is.character))

ℹ Please update your code.

get content

If possible, get back the full text of blog posts etc. Currently there's a whole load of <img> tags and the like to parse out.

support jsonfeed

https://jsonfeed.org/

IMO it likely makes more sense to support this in one "RSS" pkg than in a separate one. I can help with it as well, but didn't want to start a new one without posting this here first.

Searching For Multiple Terms

I am interested in using this to search Google News articles on specific topics. I had the following questions about this:

Suppose I wanted to search for "apple" and "covid" - I used the following code:

keyword <- "https://news.google.com/rss/search?q=apple&q=covid&hl=en-IN&gl=IN&ceid=IN:en"

# From the package vignette

google_news <- tidyfeed(
    keyword,
    clean_tags = TRUE,
    parse_dates = TRUE
)

  1. I am not exactly sure how we are supposed to search for multiple terms. Do you use "&", "|", or do you just write both terms together (e.g. "applecovid")?

  2. Is there a way to restrict the dates between which the search will be performed?

  3. I have a feeling that "IN" stands for "India" - if I want to change this to "Canada", I think I need to replace "IN" with "CAN"? Is this correct?

Your Help is Greatly Appreciated,
Thanks,
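A hedged sketch of an answer (the Google News RSS search syntax is not documented by tidyRSS, so the conventions below are assumptions: multiple space-separated terms are treated as AND and must be URL-encoded, and the two-letter country code for Canada is "CA", not "CAN"):

```r
library(tidyRSS)

query <- URLencode("apple covid")  # spaces become %20
url <- paste0(
  "https://news.google.com/rss/search?q=", query,
  "&hl=en-CA&gl=CA&ceid=CA:en"
)
google_news <- tidyfeed(url, clean_tags = TRUE, parse_dates = TRUE)
```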

Package throws error when loading a JSON RSS feed

Replication steps:

library(tidyRSS)
rss <- tidyfeed("https://itunes.apple.com/us/rss/customerreviews/id=1289313118/sortby=mostrecent/json")

Result:

Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html,  : 
  Start tag expected, '<' not found [4]

entry_link is same for all items

This might be an issue with the feed itself, but the entry_link is the same for every entry (despite title and content being accurate):

tidyfeed('http://feeds.feedburner.com/GDBcode') %>%
  count(entry_link)

1 http://developers.googleblog.com/feeds/6345491852200823648/comments/default    25

# The same is true for these other two feeds
tidyfeed('http://feeds.feedburner.com/blogspot/gJZg') %>% count(entry_link)
tidyfeed('http://feeds.feedburner.com/GoogleOpenSourceBlog') %>% count(entry_link)

yet I have other feedburner sites that don't have this problem:

tidyfeed('http://feeds.feedburner.com/ProfessorRobJHyndman') %>% count(item_link)
tidyfeed("http://feeds.feedburner.com/kdnuggets-data-mining-analytics") %>% count(item_link)

Side question: why does the Google feed use "entry_" rather than "item_"?

Thanks for your help!

Feed handling, why RCurl dep and github install link

Congrats on the CRAN submission!

However…

Neither of these worked (the first 2 things I tried):

tidyfeed("https://rud.is/b/feed/")

tidyfeed("https://rweekly.org/atom.xml")

Those are totally valid feeds that work in every RSS reader but not in your pkg.

xml2 already Suggests curl and httr and most folks who are likely to use this pkg have the tidyverse installed. RCurl adds a somewhat needless dep onto systems.

And:

devtools::install_github("https://github.com/RobertMyles/tidyrss")

should be

devtools::install_github("RobertMyles/tidyrss")

Error on loading library (Windows, RStudio)

Hi,
I wanted to try your package, but it doesn't even load. I am not sure what kind of info you need, so I'll just start with the basics:

Error Message:

> library(tidyRSS)
Error: package or namespace load failed for ‘tidyRSS’:
 .onLoad failed in loadNamespace() for 'sf', details:
  call: get(genname, envir = envir)
  error: object 'group_map' not found
> 

My system details:

platform       x86_64-w64-mingw32          
arch           x86_64                      
os             mingw32                     
system         x86_64, mingw32             
status                                     
major          3                           
minor          5.2                         
year           2018                        
month          12                          
day            20                          
svn rev        75870                       
language       R                           
version.string R version 3.5.2 (2018-12-20)
nickname       Eggshell Igloo  

When I installed the package via install.packages("tidyRSS"), it worked normally:

install.packages("tidyRSS")
Installing package into ‘C:/Users/mrwun/Documents/R/win-library/3.5’
(as ‘lib’ is unspecified)
also installing the dependencies ‘e1071’, ‘praise’, ‘classInt’, ‘units’, ‘testthat’, ‘sf’

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/e1071_1.7-0.1.zip'
Content type 'application/zip' length 1015780 bytes (991 KB)
downloaded 991 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/praise_1.0.0.zip'
Content type 'application/zip' length 19474 bytes (19 KB)
downloaded 19 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/classInt_0.3-1.zip'
Content type 'application/zip' length 79676 bytes (77 KB)
downloaded 77 KB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/units_0.6-2.zip'
Content type 'application/zip' length 1707573 bytes (1.6 MB)
downloaded 1.6 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/testthat_2.0.1.zip'
Content type 'application/zip' length 1724084 bytes (1.6 MB)
downloaded 1.6 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/sf_0.7-3.zip'
Content type 'application/zip' length 39223400 bytes (37.4 MB)
downloaded 37.4 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.5/tidyRSS_1.2.7.zip'
Content type 'application/zip' length 115056 bytes (112 KB)
downloaded 112 KB

package ‘e1071’ successfully unpacked and MD5 sums checked
package ‘praise’ successfully unpacked and MD5 sums checked
package ‘classInt’ successfully unpacked and MD5 sums checked
package ‘units’ successfully unpacked and MD5 sums checked
package ‘testthat’ successfully unpacked and MD5 sums checked
package ‘sf’ successfully unpacked and MD5 sums checked
package ‘tidyRSS’ successfully unpacked and MD5 sums checked

List of blogs

Hi Rob, for a separate project I've been putting together a list of data science and R related RSS feeds. Just letting you know about it in case it is useful for testing of tidyRSS. Please feel free to use the list however you like.

https://github.com/alastairrushworth/rssfeeds

Cheers!

Alastair

tidyRSS fails to parse feeds: "xmlXPathEval: evaluation failed"

Hi @RobertMyles

Thanks for the amazing tidyRSS package, I find it very useful indeed! Thought I'd get in touch to file a quick issue as I've noticed that quite a number of feeds don't parse correctly.

For example:

# tested with v1.2.11
library(tidyRSS)
tidyfeed("http://abigailsee.com/feed.xml")

Returns the error:

Error in xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = 1) : 
  xmlXPathEval: evaluation failed

I think the feed is ok, and it seems like tidyfeed gathers the feed ok, but something goes awry in the parsing somewhere? I noticed this issue with several other feeds that I've copied below:

feed_vec <- 
  c("http://abigailsee.com/feed.xml",
    "https://adamgoodkind.com/feed.xml",
    "http://adomingues.github.io/feed.xml",
    "http://aebou.rbind.io/index.xml",
    "http://agrarianresearch.org/blog/?feed=rss2",
    "http://akosm.netlify.com/index.xml",
    "http://alburez.me/feed.xml",
    "http://alexmorley.me/feed.xml",
    "https://alexwhan.com/index.xml",
    "http://allthingsr.blogspot.com/feeds/posts/default?alt=rss",
    "http://allthiswasfield.blogspot.com/feeds/posts/default?alt=rss",
    "http://almostrandom.netlify.com/index.xml",
    "http://altran-data-analytics.netlify.com/index.xml",
    "https://www.amitkohli.com/index.xml",
    "http://analisisydecision.es/feed/",
    "http://andysouth.github.io/feed.xml",
    "http://annakrystalli.me/index.xml",
    "http://annarborrusergroup.github.io/feed.xml",
    "http://anotherblogaboutr.blogspot.com/feeds/posts/default?alt=rss",
    "http://anpefi.eu/index.xml",
    "https://fishandwhistle.net/index.xml",
    "https://www.ardata.fr/index.xml",
    "http://arnab.org/blog/atom.xml",
    "http://arunatma.blogspot.com/feeds/posts/default?alt=rss",
    "http://asbcllc.com/feed.xml",
    "http://ashiklom.github.io/feed.xml",
    "http://aurielfournier.github.io/feed.xml",
    "http://austinwehrwein.com/index.xml")

I'm working on a side project at the moment that involves about 3K RSS feeds, which I'm happy to share once I've tidied up a bit, it might be helpful with identifying other edge cases - I know how finicky RSS feeds can be! I'm also happy to help with this issue if you can point me in the right direction!

Thanks,

Alastair

item_description does not load - newspapers rss feeds

Hi @RobertMyles
I'm trying to download RSS feed data from European newspapers. The tidyfeed function downloads most of the data (title, link, etc.), but unfortunately it does not download the item_description (in the case of newspapers, this is a short description of the article's content).

I tried this for three different newspapers, but the item_description never shows up:
gua_world <- tidyfeed("https://www.theguardian.com/world/rss")
lmd <- tidyfeed("https://www.lemonde.fr/rss/une.xml")
so_meldung <- tidyfeed("http://www.spiegel.de/schlagzeilen/tops/index.rss")

I looked at your code, but my R skills are not good enough to find the problem. Do you maybe know a solution to this?

(I first thought it might be a problem related to bad RSS formatting on the newspapers' websites, but since it does not work with three of Europe's biggest newspapers, the problem might be elsewhere.)

Best,
Moritz

Problem reading feeds

Hi,

I have used this package before and it worked for some feeds. I say some because feeds are finicky. However, the package suddenly stopped working. For example, I cannot get access to feeds such as https://www.federalreserve.gov/feeds/press_orders.xml, https://public.govdelivery.com/topics/USFDIC_11/feed.rss, or https://www.sec.gov/rss/investor/alerts.

The code I use is very simple:
library(tidyRSS)
tidyfeed("https://www.federalreserve.gov/feeds/press_all.xml", result = "all")

And here is the error I receive:
Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) :
This page does not appear to be a suitable feed. Have you checked that you entered the url correctly?
If you are certain that this is a valid rss feed, please file an issue at: https://github.com/RobertMyles/tidyRSS/issues. Please note that the feed may also be undergoing maintenance.

The site is not down since I can access the feed from Python or using other R packages.

I would appreciate if you can look into this.

Thank you!

Rlang issue

I had to go back a version because I kept getting this error; it does not occur in 2.0.4.


sa = tidyfeed('https://seekingalpha.com/author/schiffgold.xml')
GET request successful. Parsing...

Error in `.f()`:
! Predicate functions must return a single `TRUE` or `FALSE`, not a missing value
Run `rlang::last_error()` to see where the error occurred.

Error loading library

After installation I try to load the library but it fails:

Error: package or namespace load failed for ‘tidyRSS’:
.onLoad failed in loadNamespace() for 'sf', details:
call: get(genname, envir = envir)
error: object 'group_map' not found

What is going wrong?

Some RSS feeds return 403 error - here is a possible fix

I came across some RSS feeds that produce an error of this sort when using tidyRSS

Response [https://www.naumburger-tageblatt.de/feed/nachrichten/index.rss]
Date: 2019-08-29 12:36
Status: 403
Content-Type: text/html
Size: 94 B

403 Forbidden

Request forbidden by administrative rules.

Drawing on the feedeR package, which didn't produce this issue, I've implemented the following in the tidyfeed function. Maybe not the most elegant, but it seems to work.

doc <- try(httr::GET(feed, config), silent = TRUE)
if (grepl("403", doc$status_code)) {
  XMLFILE <- tempfile(fileext = "-index.xml")
  options(HTTPUserAgent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
  download.file(url = clean.url(feed), XMLFILE, quiet = TRUE)  # clean.url() is from feedeR
  doc <- xml2::read_xml(XMLFILE)
} else if (grepl("json", doc$headers$`content-type`)) {  # backticks needed: the hyphen would otherwise parse as subtraction
  result <- json_parse(feed)
  return(result)
} else {
  doc <- doc %>% xml2::read_xml()
}
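An alternative sketch that stays within httr: many servers that return 403 to R's default user agent accept a browser-like one, and httr::user_agent() sets the header per request without touching global options. The URL is the feed from the report above.

```r
library(httr)
library(xml2)

ua <- user_agent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36")
resp <- GET("https://www.naumburger-tageblatt.de/feed/nachrichten/index.rss", ua)
stop_for_status(resp)                       # error out on 4xx/5xx
doc <- read_xml(content(resp, as = "raw"))  # parse the XML payload
```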

Tibble parsing error (Error: Tibble columns must have consistent lengths, only values of length one are recycled)

Hi Rob!

The new version of tidyRSS is great :)

I noticed that some feeds that parsed with a previous tidyRSS version are now failing. I've attached a single example here; the error seems to occur somewhere in the parsing of the feed into a tibble.

This using the most up-to-date version:

devtools::install_github('RobertMyles/tidyRSS')
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

GET request successful. Parsing...

Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 25: Columns `entry_title`, `entry_url`, `entry_last_updated`, `entry_content`, `entry_published`
* Length 26: Column `entry_author`
Run `rlang::last_error()` to see where the error occurred.

Using a slightly older version (I think this commit was in January):

devtools::install_github('RobertMyles/tidyRSS', ref = '35bcbb7e15be1c0edc1ca07cc33de64923a55a32')
# RESTART R FIRST
library(tidyRSS)
tidyfeed("http://bigcomputing.blogspot.com/feeds/posts/default")

# A tibble: 25 x 8
   feed_title feed_link feed_author feed_last_updat… item_title item_date_updated  
   <chr>      <chr>     <chr>       <chr>            <chr>      <dttm>             
 1 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Setting u… 2016-02-24 00:21:50
 2 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Machine… 2016-02-23 21:35:30
 3 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Guess… 2016-02-16 12:58:40
 4 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Official … 2016-02-14 22:24:35
 5 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The Five … 2016-02-08 20:14:09
 6 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… A Controv… 2015-08-05 15:51:26
 7 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Performan… 2015-07-17 18:31:18
 8 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… An exampl… 2015-07-14 16:55:36
 9 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… Fastest w… 2015-07-13 17:03:02
10 Big Compu… http://b… nphardhttp… 2020-03-14T04:0… The R Con… 2015-07-02 14:15:57
# … with 15 more rows, and 2 more variables: item_link <chr>, item_content <chr>

Thanks,

Alastair

Not getting full RSS tree

Hi there, I'm trying to get this:

result <- tidyfeed(feed = "https://www.fandango.com/rss/moviesnearme_90025.rss")

But I only get the movie theatre locations, not the movies playing at that location. Any suggestions?

Error for feeds with no item

In this example, the RSS is valid but contains no items, and it raises an uninformative error.
For this case, I would expect an empty tibble.

require(tidyRSS)
#> Loading required package: tidyRSS
tidyfeed("https://www.kn-online.de/arc/outboundfeeds/rss/tags_slug/kiel-restaurants/")
#> GET request successful. Parsing...
#> Error in df[[get("listcol")]][[i]]: subscript out of bounds

Created on 2022-08-12 by the reprex package (v2.0.1)
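Until that is handled in the package, a defensive wrapper (not tidyRSS API, just a sketch) can supply the empty tibble the reporter expects:

```r
library(tidyRSS)
library(tibble)

# Fall back to an empty tibble when a valid-but-empty feed makes tidyfeed() error
safe_tidyfeed <- function(url) {
  tryCatch(tidyfeed(url), error = function(e) tibble())
}
```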

Rethink testing strategy

The current method of testing for tidyRSS produces non-deterministic results because a different feed is used each time the tests are run. This makes test failures spurious and hard to reproduce when testing on CRAN or elsewhere.

test_that("tidyfeed returns a data_frame", {
  data("feeds")
  rss <- sample(feeds$feeds, 1)
  expect_is(tidyfeed(rss), "tbl_df")
})

This is made doubly bad because currently two of the feeds are no longer valid, so tests will fail if these examples happen to occur.

tidyRSS::tidyfeed(tidyRSS::feeds$feeds[[12]])
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : xmlParseEntityRef: no name [68]
tidyRSS::tidyfeed(tidyRSS::feeds$feeds[[28]])
#> Error in doc_parse_raw(x, encoding = encoding, base_url = base_url, as_html = as_html, : Premature end of data in tag lastBuildDate line 8 [77]

Created on 2018-01-24 by the reprex package (v0.1.1.9000).

In general, relying on external network resources for tests is problematic; consider instead including a few (truncated) feeds in your package that could be used for testing.

If you do want to continue using external resources, please use testthat::skip_on_cran() to disable these tests on CRAN.
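A sketch of that strategy with testthat (the fixture path is hypothetical, and if tidyfeed() cannot read local files the fixture would need to be served locally or the HTTP layer mocked):

```r
library(testthat)
library(tidyRSS)

test_that("tidyfeed parses a bundled fixture", {
  # hypothetical fixture shipped in inst/extdata
  feed_file <- system.file("extdata", "example_rss.xml", package = "tidyRSS")
  expect_s3_class(tidyfeed(feed_file), "tbl_df")
})

test_that("a live feed still parses", {
  skip_on_cran()  # never hit the network on CRAN
  expect_s3_class(tidyfeed("http://journal.r-project.org/rss.atom"), "tbl_df")
})
```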

broken RSS: datatau.com

Thanks for the awesome package, Robert!! I'm loving it.

I tried this feed:

http://www.datatau.com/rss

Got this error:

> df = tidyfeed('https://www.datatau.com/rss/')
GET request successful. Parsing...

Error in tidyfeed("https://www.datatau.com/rss/") : 
  Error in feed parse; please check URL.

  If you're certain that this is a valid rss feed,
  please file an issue at https://github.com/RobertMyles/tidyRSS/issues.
  Please note that the feed may also be undergoing maintenance.

here's my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] lubridate_1.7.9.2 aRxiv_0.5.19      kableExtra_1.3.1  forcats_0.5.0    
 [5] stringr_1.4.0     purrr_0.3.4       readr_1.4.0       tidyr_1.1.2      
 [9] tibble_3.0.4      ggplot2_3.3.2     tidyverse_1.3.0   dplyr_1.0.2      
[13] DT_0.16           tidyRSS_2.0.3    

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.0   xfun_0.19          haven_2.3.1        colorspace_2.0-0  
 [5] vctrs_0.3.4        generics_0.1.0     viridisLite_0.3.0  htmltools_0.5.0   
 [9] yaml_2.2.1         rlang_0.4.8        pillar_1.4.6       withr_2.3.0       
[13] glue_1.4.2         DBI_1.1.0          dbplyr_2.0.0       modelr_0.1.8      
[17] readxl_1.3.1       lifecycle_0.2.0    munsell_0.5.0      anytime_0.3.9     
[21] gtable_0.3.0       cellranger_1.1.0   rvest_0.3.6        htmlwidgets_1.5.2 
[25] evaluate_0.14      knitr_1.30         curl_4.3           fansi_0.4.1       
[29] broom_0.7.2        Rcpp_1.0.5         renv_0.12.2        backports_1.2.0   
[33] scales_1.1.1       install.load_1.2.3 checkmate_2.0.0    webshot_0.5.2     
[37] jsonlite_1.7.1     fs_1.5.0           fastmatch_1.1-0    hms_0.5.3         
[41] digest_0.6.27      stringi_1.5.3      grid_4.0.3         cli_2.1.0         
[45] tools_4.0.3        magrittr_1.5       crayon_1.3.4       pkgconfig_2.0.3   
[49] ellipsis_0.3.1     xml2_1.3.2         reprex_0.3.0       assertthat_0.2.1  
[53] rmarkdown_2.5      httr_1.4.2         rstudioapi_0.13    R6_2.5.0          
[57] compiler_4.0.3    

installation on Linux

I like your package very much - thanks for sharing it!

I tried to use it on a Linux machine as well, which meant installing it there with install.packages("tidyRSS").

I had to add several Linux libraries for this:
sudo apt-get install libudunits2-dev
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install libxml2-dev
sudo apt-get install gdal-bin
sudo apt-get install libgdal-dev

However, even with all this stuff, it did not work:
configure: GDAL: 1.10.1
checking GDAL version >= 2.0.0... no
configure: error: sf is not compatible with GDAL versions below 2.0.0

Actually, I do not understand why an RSS feed package depends on a "software library for reading and writing raster and vector geospatial data formats".

In the end I gave up, at least on the Linux side.
Is there any way to make it less complex?
