📰 Build RSS 2.0 feeds from websites (and JSON APIs) with a few CSS selectors.

Home Page: https://html2rss.github.io/components/html2rss

License: MIT License

html2rss's Introduction

This Ruby gem builds RSS 2.0 feeds from a feed config.

The feed config contains the URL to scrape and the CSS selectors for extracting information (like title, URL, ...), and html2rss builds your RSS from it. Extractors and chainable post processors make information extraction, processing and sanitizing a breeze. Scraping JSON responses and setting HTTP request headers are supported, too.

Searching for a ready-to-use app which serves generated feeds via HTTP? Head over to html2rss-web!

To support the development, feel free to sponsor this project on GitHub. Thank you! 💓

Installation

Install: gem install html2rss
Usage: html2rss help

You can also install it as a dependency in your Ruby project:

  1. Add this line to your Gemfile: gem 'html2rss'
  2. Then execute: bundle
  3. In your code: require 'html2rss'

🤩 Like it? Star it! ⭐️

Generating a feed on the CLI

Create a file called my_config_file.yml with this example content:

channel:
  url: https://stackoverflow.com/questions
selectors:
  items:
    selector: "#hot-network-questions > ul > li"
  title:
    selector: a
  link:
    selector: a
    extractor: href

Build the RSS with: html2rss feed ./my_config_file.yml.

Generating a feed with Ruby

Here's a minimal working example within Ruby:

require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss

The feed config and its options

A feed config consists of a channel and a selectors Hash. The contents of both hashes are explained in the chapters below.

Good to know:

  • You'll find extensive example feed configs at spec/*.test.yml.
  • See html2rss-configs for ready-made feed configs!
  • If you've already created feed configs, you're invited to send a PR to html2rss-configs to make your config available to the general public.

Alright, let's move on.

The channel

attribute    required?  type     default         remark
url          required   String
title        optional   String   auto-generated
description  optional   String   auto-generated
ttl          optional   Integer  360             TTL in minutes
time_zone    optional   String   'UTC'           TimeZone name
language     optional   String   'en'            Language code
author       optional   String                   Format: email (Name)
headers      optional   Hash     {}              Set HTTP request headers. See notes below.
json         optional   Boolean  false           Handle JSON response. See notes below.

Dynamic parameters in channel attributes

Sometimes there are structurally equal pages with different URLs. In such a case you can add dynamic parameters to the channel's attributes.

Example of a dynamic id parameter in the channel URLs:

channel:
  url: "http://domainname.tld/whatever/%<id>s.html"

Command line usage example:

bundle exec html2rss feed the_feed_config.yml id=42
See a Ruby example
config = Html2rss::Config.new({ channel: { url: 'http://domainname.tld/whatever/%<id>s.html' } }, {}, { id: 42 })
Html2rss.feed(config)

See the more complex formatting of the sprintf method for formatting options.
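
The %<id>s placeholder uses the named-reference syntax of Ruby's Kernel#format (an alias of sprintf). The following is a minimal sketch of that substitution on its own, outside of html2rss. The helper name expand_url and the second URL are assumptions made for illustration; how html2rss applies the parameters internally may differ.

```ruby
# Hypothetical helper (not part of html2rss): fills named placeholders
# in a URL template using Kernel#format.
def expand_url(template, params)
  format(template, params)
end

url = expand_url('http://domainname.tld/whatever/%<id>s.html', id: 42)
puts url

# Format flags work as usual, e.g. zero-padding an integer to three digits.
padded = expand_url('http://domainname.tld/archive/%<page>03d.html', page: 7)
puts padded
```

Because % starts a format sequence, a literal percent sign in such a template has to be written as %% when it is passed to format.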

The selectors

First, you must give an items selector hash which contains a CSS selector. The selector selects a collection of HTML tags from which the RSS feed items are built. Except for the items selector, all other keys are scoped to each item of the collection.

Then, to build a valid RSS 2.0 item, you need to have at least a title or a description. You can have both.

Having an items and a title selector is already enough to build a simple feed.

Your selectors Hash can contain arbitrarily named selectors, but only a few will make it into the RSS feed (this is due to the RSS 2.0 specification):

RSS 2.0 tag  name in html2rss  remark
title        title
description  description       Supports HTML.
link         link              A URL.
author       author
category     categories        See notes below.
guid         guid              Default title/description. See notes below.
enclosure    enclosure         See notes below.
pubDate      updated           An instance of Time.
comments     comments          A URL.
source       source            Not yet supported.

The selector hash

Every named selector in your selectors hash can have these attributes:

name          value
selector      The CSS selector to select the tag with the information.
extractor     Name of the extractor. See notes below.
post_process  A hash or array of hashes. See notes below.

Using extractors

Extractors help with extracting the information from the selected HTML tag.

  • The default extractor is text, which returns the tag's inner text.
  • The html extractor returns the tag's outer HTML.
  • The href extractor returns a URL from the tag's href attribute and corrects relative ones to absolute ones.
  • The attribute extractor returns the value of that tag's attribute.
  • The static extractor returns the configured static value (it doesn't extract anything).
  • See file list of extractors.

Extractors might need extra attributes on the selector hash. 👉 Read their docs for usage examples.

See a Ruby example
Html2rss.feed(
  channel: {}, selectors: { link: { selector: 'a', extractor: 'href' } }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  link:
    selector: 'a'
    extractor: 'href'
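
To illustrate what "corrects relative ones to absolute ones" means for the href extractor, here is a small sketch using Ruby's standard URI library. The helper name absolute_href and the example URLs are assumptions; html2rss uses its own helper for this, so treat it as an approximation of the behavior rather than the gem's code.

```ruby
require 'uri'

# Hypothetical helper: resolves an href against the channel URL,
# following the usual URL resolution rules (RFC 3986).
def absolute_href(channel_url, href)
  URI.join(channel_url, href).to_s
end

base = 'https://example.com/blog/index.html'

root_relative = absolute_href(base, '/posts/1')                # starts at the host root
path_relative = absolute_href(base, 'posts/2')                 # relative to /blog/
already_absolute = absolute_href(base, 'https://other.example/x')

puts root_relative, path_relative, already_absolute
```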

Using post processors

Extracted information can be further manipulated with post processors.

name              description
gsub              Allows global substitution operations on Strings (Regexp or simple pattern).
html_to_markdown  Converts HTML to Markdown, using reverse_markdown.
markdown_to_html  Converts Markdown to HTML, using kramdown.
parse_time        Parses a String containing a time in a time zone.
parse_uri         Parses a String as URL.
sanitize_html     Strips unsafe and unneeded HTML and adds security-related attributes.
substring         Cuts a part off of a String, starting at a position.
template          Based on a template, it creates a new String filled with other selectors' values.

⚠️ Always make use of the sanitize_html post processor for HTML content. Never trust the internet! ⚠️

👉 Read their docs for usage examples.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    description: {
      selector: '.content', post_process: { name: 'sanitize_html' }
    }
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  description:
    selector: '.content'
    post_process:
      - name: sanitize_html

Chaining post processors

Pass an array to post_process to chain the post processors.

YAML example: build the description from a template String (in Markdown) and convert that Markdown to HTML
channel:
  # ... omitted
selectors:
  # ... omitted
  price:
    selector: '.price'
  description:
    selector: '.section'
    post_process:
      - name: template
        string: |
          # %{self}

          Price: %{price}
      - name: markdown_to_html

Note the use of | for a multi-line String in YAML.
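
Conceptually, a chain feeds the output of one post processor into the next. The sketch below models that with plain lambdas and Enumerable#reduce. The names run_chain, strip and template are hypothetical, and the Markdown-to-HTML step is left out so the example needs no extra gems; it is not how html2rss implements its post processors.

```ruby
# Each step takes the current value plus a context Hash (the values of the
# other selectors) and returns the new value.
strip = ->(value, _context) { value.strip }

template = lambda do |value, context|
  # %{self} refers to the selector's own value, %{price} to another selector.
  format("# %{self}\n\nPrice: %{price}", context.merge(self: value))
end

# Hypothetical runner: applies the processors in the given order.
def run_chain(value, context, processors)
  processors.reduce(value) { |current, processor| processor.call(current, context) }
end

result = run_chain("  Fancy lamp \n", { price: '12 EUR' }, [strip, template])
puts result
```

The order of the array matters: swapping the two steps would strip the already rendered template instead of the extracted value.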

Adding <category> tags to an item

The categories selector takes an array of selector names. Each value of those selectors will become a <category> on the RSS item.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    genre: {
      # ... omitted
      selector: '.genre'
    },
    branch: { selector: '.branch' },
    categories: %i[genre branch]
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch

Custom item GUID

By default, html2rss generates a GUID from the title or description.

If this does not work well, you can choose other attributes from which the GUID is built. The principle is the same as for the categories: pass an array of selector names.

In all cases, the GUID is a SHA1 hash string.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    title: {
      # ... omitted
      selector: 'h1'
    },
    link: { selector: 'a', extractor: 'href' },
    guid: %i[link]
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  title:
    selector: "h1"
  link:
    selector: "a"
    extractor: "href"
  guid:
    - link
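
A rough sketch of deriving a SHA1-based GUID from selected item values, using Ruby's standard Digest library. The helper name guid_for and the way the values are joined are assumptions; the exact input html2rss hashes is not specified here.

```ruby
require 'digest/sha1'

# Hypothetical helper: joins the values of the chosen selectors and
# returns their SHA1 digest as a 40-character hex string.
def guid_for(item, selector_names)
  Digest::SHA1.hexdigest(selector_names.map { |name| item.fetch(name).to_s }.join)
end

item = { title: 'Headline', link: 'https://example.com/posts/1' }

guid = guid_for(item, %i[link])
puts guid
```

Building the GUID from the link keeps it stable when a title is edited later, which is the kind of case where the default may not work well.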

Adding an <enclosure> tag to an item

An enclosure can be any file, e.g. an image, audio or video.

The enclosure selector needs to return a URL of the content to enclose. If the extracted URL is relative, it will be converted to an absolute one using the channel's URL as base.

Since html2rss does no further inspection of the enclosure, its support comes with trade-offs:

  1. The content-type is guessed from the file extension of the URL.
  2. If the content-type guessing fails, it will default to application/octet-stream.
  3. The content-length will always be undetermined and therefore stated as 0 bytes.

Read the RSS 2.0 spec for further information on enclosing content.

See a Ruby example
Html2rss.feed(
  channel: {},
  selectors: {
    enclosure: { selector: 'img', extractor: 'attribute', attribute: 'src' }
  }
)
See a YAML feed config example
channel:
  # ... omitted
selectors:
  # ... omitted
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src"
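
The trade-offs listed above can be pictured with a short sketch of guessing a content type from a URL's file extension. The lookup table and the helper name guess_content_type are illustrative assumptions; html2rss may rely on a fuller MIME database.

```ruby
require 'uri'

# Small illustrative table; a real implementation would know many more types.
MIME_BY_EXTENSION = {
  '.jpg' => 'image/jpeg', '.jpeg' => 'image/jpeg', '.png' => 'image/png',
  '.mp3' => 'audio/mpeg', '.mp4' => 'video/mp4'
}.freeze

# Hypothetical helper: looks only at the path's extension and never
# requests the file, so the guess can be wrong.
def guess_content_type(url)
  extension = File.extname(URI.parse(url).path.to_s).downcase
  MIME_BY_EXTENSION.fetch(extension, 'application/octet-stream')
end

puts guess_content_type('https://example.com/a/cover.PNG?size=large')
puts guess_content_type('https://example.com/download')
```

Since only the URL is inspected, a server that delivers a different format than the extension suggests will lead to a mismatching content type in the feed.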

Scraping and handling JSON responses

Although this gem's name is html2rss, it's possible to scrape and process JSON.

Adding json: true to the channel config will convert the JSON response to XML.

See a Ruby example
Html2rss.feed(
  channel: {
    url: 'https://example.com', json: true
  },
  selectors: {} # ... omitted
)
See a YAML feed config example
channel:
  url: https://example.com
  json: true
selectors:
  # ... omitted
See example of a converted JSON object

This JSON object:

{
  "data": [{ "title": "Headline", "url": "https://example.com" }]
}

converts to:

<object>
  <data>
    <array>
      <object>
        <title>Headline</title>
        <url>https://example.com</url>
      </object>
    </array>
  </data>
</object>

Your items selector would be array > object, the item's link selector would be url.

See example of a converted JSON array

This JSON array:

[{ "title": "Headline", "url": "https://example.com" }]

converts to:

<array>
  <object>
    <title>Headline</title>
    <url>https://example.com</url>
  </object>
</array>

Your items selector would be array > object, the item's link selector would be url.
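
To make the mapping in the two examples above easier to follow, here is a minimal sketch of a converter that produces the same structure: every JSON object becomes an <object> element, every array an <array> element, and keys become child elements. The method name json_to_xml is hypothetical and this is not the gem's converter; keys that are not valid XML names are not handled.

```ruby
require 'json'

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

# Hypothetical converter: recursively wraps Hashes and Arrays,
# and escapes scalar values as XML text.
def json_to_xml(value)
  case value
  when Hash
    children = value.map { |key, child| "<#{key}>#{json_to_xml(child)}</#{key}>" }.join
    "<object>#{children}</object>"
  when Array
    "<array>#{value.map { |child| json_to_xml(child) }.join}</array>"
  else
    value.to_s.gsub(/[&<>]/, XML_ESCAPES)
  end
end

json = '{"data": [{"title": "Headline", "url": "https://example.com"}]}'
xml = json_to_xml(JSON.parse(json))
puts xml
```

The output is the unindented form of the XML shown for the JSON object example, which is why array > object works as the items selector.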

Set any HTTP header in the request

You can add any HTTP headers to the request to the channel URL. Use this to e.g. have Cookie or Authorization information sent or to spoof the User-Agent.

See a Ruby example
Html2rss.feed(
  channel: {
    url: 'https://example.com',
    headers: {
      'User-Agent': 'html2rss-request',
      'X-Something': 'Foobar',
      Authorization: 'Token deadbea7',
      Cookie: 'monster=MeWantCookie'
    }
  },
  selectors: {}
)
See a YAML feed config example
channel:
  url: https://example.com
  headers:
    "User-Agent": "html2rss-request"
    "X-Something": "Foobar"
    "Authorization": "Token deadbea7"
    "Cookie": "monster=MeWantCookie"
selectors:
  # ...

The headers provided by the channel are merged into the global headers.
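
As a sketch of what such a merge looks like in Ruby, assuming that a channel header overrides a global header of the same name (the precedence is an assumption, and the header values are made up):

```ruby
global_headers = { 'User-Agent' => 'Mozilla/5.0', 'Accept' => 'text/html' }
channel_headers = { 'User-Agent' => 'html2rss-request', 'Cookie' => 'monster=MeWantCookie' }

# Hash#merge returns a new Hash; for duplicate keys the argument's value wins,
# so the channel's User-Agent replaces the global one here.
request_headers = global_headers.merge(channel_headers)

puts request_headers
```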

Reverse the order of items

By default, html2rss keeps the order of the collection returned from the items selector. The items selector hash can optionally contain an order attribute. If its value is reverse, the order of items in the RSS will reverse.

See a YAML feed config example
channel:
  # ... omitted
selectors:
  items:
    selector: 'ul > li'
    order: 'reverse'
  # ... omitted

Note that the order of items, according to the RSS 2.0 spec, should not matter to the feed-consuming client.

Usage with a YAML config file

This step is not required to work with this gem. If you're using html2rss-web and want to create your private feed configs, keep on reading!

First, create a YAML file, e.g. feeds.yml. This file will contain your global config and multiple feed configs under the key feeds.

Example:

headers:
  "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
feeds:
  myfeed:
    channel:
    selectors:
  myotherfeed:
    channel:
    selectors:

Your feed configs go below feeds. Everything else is part of the global config.

Find a full example of a feeds.yml at spec/feeds.test.yml.
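
The split between global config and feed configs can be sketched with Ruby's standard YAML library. The inline YAML and variable names below are illustrative only; Html2rss.feed_from_yaml_config does this work for you.

```ruby
require 'yaml'

yaml = <<~YAML
  headers:
    "User-Agent": "html2rss-request"
  feeds:
    myfeed:
      channel:
        url: https://example.com
      selectors:
        items:
          selector: "ul > li"
YAML

config = YAML.safe_load(yaml)

# Everything below the key "feeds" is a feed config ...
feeds = config.fetch('feeds', {})
# ... and everything else belongs to the global config.
global_config = config.reject { |key, _value| key == 'feeds' }

feed_config = feeds.fetch('myfeed')
puts global_config
puts feed_config
```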

Now you can build your feeds like this:

Build feeds in Ruby
require 'html2rss'

myfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myfeed')
myotherfeed = Html2rss.feed_from_yaml_config('feeds.yml', 'myotherfeed')
Build feeds on the command line
$ html2rss feed feeds.yml myfeed
$ html2rss feed feeds.yml myotherfeed

Display the RSS feed nicely in a web browser

To display RSS feeds nicely in a web browser, you can:

  • add a plain old CSS stylesheet, or
  • use XSLT (eXtensible Stylesheet Language Transformations).

A web browser will apply these stylesheets and show the contents as described.

In a CSS stylesheet, you'd use element selectors to apply styles.

If you want to do more, then you need to create an XSLT. XSLT allows you to use an HTML template and to freely design the information of the RSS, including using JavaScript and external resources.

You can add as many stylesheets and types as you like. Just add them to your global configuration.

Ruby: a stylesheet config example
config = Html2rss::Config.new(
  { channel: {}, selectors: {} }, # omitted
  {
    stylesheets: [
      {
        href: '/relative/base/path/to/style.xsl',
        media: :all,
        type: 'text/xsl'
      },
      {
        href: 'http://example.com/rss.css',
        media: :all,
        type: 'text/css'
      }
    ]
  }
)

Html2rss.feed(config)
YAML: a stylesheet config example
stylesheets:
  - href: "/relative/base/path/to/style.xsl"
    media: "all"
    type: "text/xsl"
  - href: "http://example.com/rss.css"
    media: "all"
    type: "text/css"
feeds:
  # ... omitted

Gotchas and tips & tricks

  • Check that the channel URL does not redirect to a mobile page with a different markup structure.
  • Do not rely on your web browser's developer console. html2rss does not execute JavaScript.
  • Fiddling with curl and pup to find the selectors seems efficient (curl URL | pup).
  • CSS selectors are versatile. Here's an overview.

Development

  1. Check out the repository: git clone ... && cd html2rss
  2. Install Ruby >=3.3, if you haven't already.
  3. Run bin/setup to install dependencies.
  4. Run the test suite: bundle exec rspec.
    • To generate test coverage, run COVERAGE=true bundle exec rspec and open coverage/index.html.
  5. For an interactive prompt, you can also run bin/console.

Releasing a new version
  1. git pull
  2. increase version in lib/html2rss/version.rb
  3. bundle
  4. git add Gemfile.lock lib/html2rss/version.rb
  5. VERSION=$(ruby -e 'require "./lib/html2rss/version.rb"; puts Html2rss::VERSION')
  6. git commit -m "chore: release $VERSION"
  7. git tag v$VERSION
  8. standard-changelog -f
  9. git add CHANGELOG.md && git commit --amend
  10. git tag v$VERSION -f
  11. git push && git push --tags

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/html2rss/html2rss.

html2rss's People

Contributors

admksh, dependabot-preview[bot], dependabot[bot], gildesmarais

html2rss's Issues

Trouble installing executable

Sorry, small support request (after much searching for a solution).

gem install --user-install html2rss correctly installs the gem to $HOME/.local/share/gem/ruby/3.0.0/gems. But no shell executable seems to be created anywhere, and certainly not in $HOME/.local/share/gem/ruby/3.0.0/bin (which is in $PATH).

And so no html2rss command is available. I would like to try the CLI.

What am I doing wrong?

feat: short-form config

the amount of typing to create a feed is too high.

instead of

hotthai:
    channel:
      title: 'hotthai'
      url: https://hot-thai-kitchen.com/hot-thai-kitchen-newsletter/
      ttl: 720
    selectors:
      items:
        selector: '.mailpoet_archive li'
      title:
        selector: '.mailpoet_archive_subject > a'
      link:
        selector: '.mailpoet_archive_subject > a'
        extractor: 'href'
        post_process:
          name: 'parse_uri'
      updated:
        selector: '.mailpoet_archive_date'
        post_process:
          name: parse_time

it could become something shorter.

hotthai:
  url: https://hot-thai-kitchen.com/hot-thai-kitchen-newsletter/
  ttl: 720
  items: '.mailpoet_archive li'
  title: '.mailpoet_archive_subject > a'
  link: '.mailpoet_archive_subject > a@href|parse_uri'
  updated: '.mailpoet_archive_date|parse_time'
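
One way to read the proposed short form is "selector@extractor|post_processor". The sketch below expands such a string into the long selector hash. It is purely hypothetical: the method name expand_short_selector and the parsing rules are assumptions derived from the example above, and selectors that themselves contain | or @ would need an escaping rule that is not defined here.

```ruby
# Hypothetical parser for the proposed short form.
def expand_short_selector(short)
  selector_part, *processors = short.split('|').map(&:strip)
  selector, separator, extractor = selector_part.rpartition('@')
  # rpartition returns empty head and separator when no "@" is present.
  selector = selector_part if separator.empty?

  expanded = { selector: selector }
  expanded[:extractor] = extractor unless separator.empty?
  expanded[:post_process] = processors.map { |name| { name: name } } unless processors.empty?
  expanded
end

link = expand_short_selector('.mailpoet_archive_subject > a@href|parse_uri')
updated = expand_short_selector('.mailpoet_archive_date|parse_time')
items = expand_short_selector('.mailpoet_archive li')

puts link, updated, items
```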

deps: upgrade faraday to 2.x

currently pinned. a good amount of time has passed since 2.0, so h2r should be ready.

would be nice to keep supporting 1.x, but i don't think i'd like to carry glue-code for that.

Can't run after installing

I am stuck at the very start. I have some experience with the Linux command line but none with Ruby or Ruby applications. I just installed Ruby on my headless Arch box, then installed the html2rss gem with "gem install html2rss". I also confirmed that the Ruby bin directory is in my PATH, but I can't get html2rss to run. Checking the Ruby bin directory shows no application called html2rss in there.

PS: I was apprehensive about asking what seems to be a very simple question that surely must be answered in several places already, but my Google kung fu suddenly seems very weak: a couple of hours into my research, I still have no answer.

Add CI workflow to simplify the release of a new gem version

The last release on Rubygems is very old (>2 years already).

As a developer I want to release a new gem version often.

=> There should be a release workflow to simplify the process of releasing a new version

Requirements:

  • keep the changelog up to date
  • make use of github's releases page
  • increase the version in lib/html2rss/version.rb
  • tag the version in git
  • build gem and push it to rubygems
  • update/remove relevant section of the readme

Reverse order of items

In my case, the website I want to parse to RSS adds its newest items to the end of the page.
So if I create the RSS feed, the oldest article is at the top and the newest one is at the end.
I already searched for a command to reverse the order, or a way to do it through the CSS selector, but I did not find one.
It would be nice if a reverse parameter could be added to the parse methods.
Thank you in advance.

Support stylesheets

  • The Settings (href, type, media) should be in the global config.
  • support multiple stylesheets (make them an array)
  • Default stylesheet should be: none.

Relevant code:

maker.xml_stylesheets.new_xml_stylesheet do |xss|
  xss.href = "/rss.xsl"
  xss.type = "text/xsl"
  xss.media = "all"
end

Afterwards, add a stylesheet to html2rss-web in the default config:

  • add styles to public/
  • add to config/feeds.yml

Cookies support

Any chance of supporting websites that require a login, possibly by adding a cookie parameter to the configs?

Unable to parse a url via link selector

Hi there, first of all thanks a lot for the great project, it is exactly what I've been looking for.

I am trying to generate a feed but am unable to correctly parse the link. As I've never used Ruby I am unsure if this comes from my config or is a bug.
The selected link is relative and html2rss should normally prepend the base I think.
Here's the config:

will:
  channel:
    url: https://dumpoir.com/v/will_bosi
    title: Will Bosi
    ttl: 120
  selectors:
    items:
      selector: .content__item
    title:
      selector: .content__text
    link:
      selector: .content__item > .content__img-wrap > a
      extractor: href

And here the error I am getting

/usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/utils.rb:21:in `build_absolute_url_from_relative': undefined method `absolute?' for nil (NoMethodError)

      return url if url.absolute?
                       ^^^^^^^^^^
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/item_extractors/href.rb:39:in `get'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/item.rb:58:in `extract'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/item.rb:46:in `method_missing'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder/item.rb:22:in `public_send'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder/item.rb:22:in `block in add'
	from /usr/local/lib/ruby/3.3.0/set.rb:501:in `each_key'
	from /usr/local/lib/ruby/3.3.0/set.rb:501:in `each'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder/item.rb:21:in `add'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder.rb:29:in `block (2 levels) in build'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder.rb:28:in `each'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder.rb:28:in `block in build'
	from /usr/local/lib/ruby/gems/3.3.0/gems/rss-0.3.0/lib/rss/maker/base.rb:439:in `make'
	from /usr/local/lib/ruby/gems/3.3.0/gems/rss-0.3.0/lib/rss/maker/base.rb:403:in `make'
	from /usr/local/lib/ruby/gems/3.3.0/gems/rss-0.3.0/lib/rss/maker.rb:29:in `make'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/rss_builder.rb:20:in `build'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss.rb:65:in `feed'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss.rb:38:in `feed_from_yaml_config'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/lib/html2rss/cli.rb:26:in `feed'
	from /usr/local/bundle/gems/thor-1.3.0/lib/thor/command.rb:28:in `run'
	from /usr/local/bundle/gems/thor-1.3.0/lib/thor/invocation.rb:127:in `invoke_command'
	from /usr/local/bundle/gems/thor-1.3.0/lib/thor.rb:527:in `dispatch'
	from /usr/local/bundle/gems/thor-1.3.0/lib/thor/base.rb:584:in `start'
	from /usr/local/bundle/bundler/gems/html2rss-8e9c589fbd73/exe/html2rss:6:in `<top (required)>'
	from /app/bin/html2rss:27:in `load'
	from /app/bin/html2rss:27:in `<main>'

Any help or pointers would be very welcome! Thanks!

EDIT:
FYI the url to extract looks like this: <a href="/c/2319792744204593923" style="display: block; position: relative;">
Maybe the leading / confuses the parser into thinking it's an absolute path?

feat: support dynamic url, add date

some sites with interesting content have dynamic urls, e.g. some newsletter with their online archive.

let's start with adding support for date formatting options in the channel's url.

overrule content-type enclosure

I'm parsing this site https://deventer.info/agenda/ that serves images with the extension .png or .jpg, but it turns out they're all content-type image/webp, so none of the images are displayed when I load the feed.

I would really like to have an option to overrule the content-type of an enclosure. I guess it's for performance reasons that the actual content-type of a resource isn't checked?

Static extractor broken (at least from YAML)

Use the following example yml file:

channel:
  url: https://google.com
selectors:
  items:
    selector: html
  title:
    extractor: static
    static: "Test string"
  description:
    selector: body

The syntax matches what is documented in the comment of https://github.com/html2rss/html2rss/blob/master/lib/html2rss/item_extractors/static.rb

Running this through html2rss results in the following:

bundler: failed to load command: html2rss (/home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bin/html2rss)
/home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/config/selectors.rb:36:in `initialize': unknown keywords: static (ArgumentError)
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/config/selectors.rb:36:in `new'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/config/selectors.rb:36:in `selector'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/config.rb:55:in `selector_attributes_with_channel'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:55:in `extract'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:46:in `method_missing'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:74:in `title_or_description'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:67:in `valid?'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:126:in `keep_if'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/item.rb:126:in `from_url'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/rss_builder.rb:26:in `block in build'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/rss-0.2.9/lib/rss/maker/base.rb:439:in `make'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/rss-0.2.9/lib/rss/maker/base.rb:403:in `make'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/rss-0.2.9/lib/rss/maker.rb:29:in `make'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/rss_builder.rb:20:in `build'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss.rb:65:in `feed'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss.rb:38:in `feed_from_yaml_config'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/lib/html2rss/cli.rb:26:in `feed'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/thor-1.2.1/lib/thor/command.rb:27:in `run'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/thor-1.2.1/lib/thor/invocation.rb:127:in `invoke_command'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/thor-1.2.1/lib/thor.rb:392:in `dispatch'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/thor-1.2.1/lib/thor/base.rb:485:in `start'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bundler/gems/html2rss-f4feed552def/exe/html2rss:6:in `<top (required)>'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bin/html2rss:25:in `load'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/bin/html2rss:25:in `<top (required)>'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli/exec.rb:58:in `load'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli/exec.rb:58:in `kernel_load'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli/exec.rb:23:in `run'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli.rb:484:in `exec'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli.rb:31:in `dispatch'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/vendor/thor/lib/thor/base.rb:485:in `start'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/cli.rb:25:in `start'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/exe/bundle:48:in `block in <top (required)>'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/lib/bundler/friendly_errors.rb:103:in `with_friendly_errors'
	from /home/arvid/.rbenv/versions/3.1.2/lib/ruby/gems/3.1.0/gems/bundler-2.3.4/exe/bundle:36:in `<top (required)>'
	from /home/arvid/.rbenv/versions/3.1.2/bin/bundle:25:in `load'
	from /home/arvid/.rbenv/versions/3.1.2/bin/bundle:25:in `<main>'

html2rss was installed as per the instructions in https://github.com/html2rss/html2rss-configs

I'm a python & C++ programmer normally, not a ruby dev, so I may be doing something wrong.

Allow item's title to be customised

Having something to format the title, e.g. something like

%author%: %title% %rating%

where each variable has its own (text) selector in the config object.

Remove reverse option

It prevents potential optimizations and the order of items is - according to spec - irrelevant.

  • remove from code
  • remove from docs/readme

Add API documentation

Having an autogenerated API doc would be very cool.

  • add rdoc documentation
  • add doc generation to travis
  • host docs on github pages

Fail to install on ARM server

Using an AWS ARM server running Debian, the following error is produced:

gem install html2rss
Building native extensions. This could take a while...
ERROR: Error installing html2rss:
ERROR: Failed to build gem native extension.

current directory: /var/lib/gems/3.1.0/gems/nokogumbo-2.0.5/ext/nokogumbo

/usr/bin/ruby3.1 -I /usr/lib/ruby/vendor_ruby -r ./siteconf20240630-1015-ncwk3p.rb extconf.rb
checking for whether -I/var/lib/gems/3.1.0/gems/nokogiri-1.16.6-aarch64-linux/ext/nokogiri is accepted as CFLAGS... *** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of necessary
libraries and/or headers. Check the mkmf.log file for more details. You may
need configuration options.

Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/usr/bin/$(RUBY_BASE_NAME)3.1
--with-libxml2
--without-libxml2
/usr/lib/ruby/3.1.0/mkmf.rb:498:in `try_do': The compiler failed to generate an executable file. (RuntimeError)
You have to install development tools first.
	from /usr/lib/ruby/3.1.0/mkmf.rb:624:in `block in try_compile'
	from /usr/lib/ruby/3.1.0/mkmf.rb:571:in `with_werror'
	from /usr/lib/ruby/3.1.0/mkmf.rb:624:in `try_compile'
	from /usr/lib/ruby/3.1.0/mkmf.rb:688:in `try_cflags'
	from /usr/lib/ruby/3.1.0/mkmf.rb:694:in `block (2 levels) in append_cflags'
	from /usr/lib/ruby/3.1.0/mkmf.rb:1007:in `block in checking_for'
	from /usr/lib/ruby/3.1.0/mkmf.rb:362:in `block (2 levels) in postpone'
	from /usr/lib/ruby/3.1.0/mkmf.rb:332:in `open'
	from /usr/lib/ruby/3.1.0/mkmf.rb:362:in `block in postpone'
	from /usr/lib/ruby/3.1.0/mkmf.rb:332:in `open'
	from /usr/lib/ruby/3.1.0/mkmf.rb:358:in `postpone'
	from /usr/lib/ruby/3.1.0/mkmf.rb:1006:in `checking_for'
	from /usr/lib/ruby/3.1.0/mkmf.rb:693:in `block in append_cflags'
	from /usr/lib/ruby/3.1.0/mkmf.rb:692:in `each'
	from /usr/lib/ruby/3.1.0/mkmf.rb:692:in `append_cflags'
	from extconf.rb:76:in `<main>'

To see why this extension failed to compile, please check the mkmf.log which can be found here:

/var/lib/gems/3.1.0/extensions/aarch64-linux/3.1.0/nokogumbo-2.0.5/mkmf.log

extconf failed, exit code 1

Gem files will remain installed in /var/lib/gems/3.1.0/gems/nokogumbo-2.0.5 for inspection.
Results logged to /var/lib/gems/3.1.0/extensions/aarch64-linux/3.1.0/nokogumbo-2.0.5/gem_make.out

The log contains the following:

LD_LIBRARY_PATH=.:/usr/lib/aarch64-linux-gnu "aarch64-linux-gnu-gcc -o conftest -I/usr/in>
checked program was:
/* begin */
1: #include "ruby.h"
2:
3: int main(int argc, char **argv)
4: {
5: return !!argv[argc];
6: }
/* end */
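The message "You have to install development tools first" suggests the build could not run a working C compiler. As a rough diagnostic sketch (one possible check, not a confirmed diagnosis of this report), the following looks up the compiler recorded in `RbConfig` on the `PATH`. The `find_in_path` helper is hypothetical, and missing Ruby headers would need a separate check.

```ruby
require 'rbconfig'

# Hypothetical helper: returns the path of `name` if it is an executable file,
# either given as a path directly or found in one of the PATH directories.
def find_in_path(name, path = ENV.fetch('PATH', ''))
  if name.include?(File::SEPARATOR)
    return File.file?(name) && File.executable?(name) ? name : nil
  end

  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# RbConfig records the compiler command Ruby was built with; take the program
# name only and ignore any flags that follow it.
compiler = RbConfig::CONFIG['CC'].to_s.split.first

if compiler && find_in_path(compiler)
  puts "C compiler found: #{compiler}"
else
  puts "C compiler #{compiler.inspect} not found; native extensions cannot be built"
end
```

If the compiler is reported missing, installing the distribution's compiler toolchain and Ruby development headers is the usual next step. The exact package names depend on the system.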

feat: parse JSON from <script> of a page

Many pages include the interesting information in JSON in a script tag somewhere (e.g. to use it in a SPA).

Since every script tag looks the same and only the content matters, we'd first need to select the correct one of many. From there on, generating the RSS should be the same process.

  • select the correct script tag
  • handle global variable assignment in script-tag
  • support parsing json from <script>

Sometimes it's not JSON but JavaScript, e.g. to assign JS objects to a global variable (Nuxt does that).

JavaScript's JSON.stringify simply ignores non-JSON-able notations when serializing.

JSON.stringify({ "a": "Bbb", "b": function() { alert() }, "c": "d"})
=> '{"a":"Bbb","c":"d"}'

This behaviour is desirable for these cases.
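The first steps could be sketched roughly as follows, using only Ruby's standard library. The `json_from_script` helper and its `required_key:` parameter are hypothetical. The regular-expression scan stands in for a real HTML parser, and only strict JSON is handled, optionally behind a simple `name = ...;` assignment. JavaScript object literals containing functions would still fail to parse.

```ruby
require 'json'

# Hypothetical helper (not part of html2rss): returns the parsed content of the
# first <script> tag whose body is a JSON object containing `required_key`.
def json_from_script(html, required_key:)
  html.scan(%r{<script\b[^>]*>(.*?)</script>}mi).flatten.each do |body|
    # Drop a leading global assignment such as `window.__DATA__ = ` and a
    # trailing semicolon, so that only the JSON literal remains.
    candidate = body.strip.sub(/\A[\w.$]+\s*=\s*/, '').sub(/;\s*\z/, '')

    begin
      data = JSON.parse(candidate)
    rescue JSON::ParserError
      next # not JSON (e.g. ordinary JavaScript), try the next script tag
    end

    return data if data.is_a?(Hash) && data.key?(required_key)
  end
  nil
end

html = <<~HTML
  <script>console.log("hi")</script>
  <script type="application/json">{"other": 1}</script>
  <script>window.__DATA__ = {"items": [{"title": "A"}]};</script>
HTML

p json_from_script(html, required_key: 'items')
```

Selecting by a required key is one way to tell otherwise identical script tags apart. Selecting by position or by an attribute such as `id` would work as well.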

HELP for config file

I have created a feeds.yml file that looks like this:

headers:
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; rv:68.0) Gecko/20100101 Firefox/68.0"
feeds:
  channel:
    url: https://www.horoscope.com/us/horoscopes/general/horoscope-general-daily-today.aspx?sign=10
    language: en
  selectors:
    items:
      selector: "div.main-horoscope"
    description:
      selector: "p"
    link:
      selector: "#src-horo-today"
      extractor: "href"

And I have added that file in the lib folder.

But I also see that I should create a config file that looks like this:

require 'html2rss'

rss =
  Html2rss.feed(
    channel: { url: 'https://stackoverflow.com/questions' },
    selectors: {
      items: { selector: '#hot-network-questions > ul > li' },
      title: { selector: 'a' },
      link: { selector: 'a', extractor: 'href' }
    }
  )

puts rss

Where should I add this file?

So far, my app doesn't work, so I guess I'm doing something wrong... I'm a beginner, so...
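For orientation, the YAML form and the Ruby hash form shown in the README describe the same structure. The sketch below loads the README's YAML example with Ruby's standard `YAML` module and yields the hash that the README passes to `Html2rss.feed`. The constant and method names are hypothetical, and the `Html2rss.feed` call is left as a comment because it needs the gem and network access.

```ruby
require 'yaml'

# The README's example config, kept inline so the sketch is self-contained.
CONFIG_YAML = <<~YAML
  channel:
    url: https://stackoverflow.com/questions
  selectors:
    items:
      selector: "#hot-network-questions > ul > li"
    title:
      selector: a
    link:
      selector: a
      extractor: href
YAML

# Hypothetical helper: parses a YAML feed config into a symbol-keyed hash,
# the shape the README's Ruby example uses.
def load_feed_config(yaml)
  YAML.safe_load(yaml, symbolize_names: true)
end

config = load_feed_config(CONFIG_YAML)
p config

# With the gem installed, the hash could then be passed on as in the README:
#   require 'html2rss'
#   puts Html2rss.feed(config)
```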

Allow custom GUID

Right now the GUID is just an automatically generated hash of each item's title or description, over which the user's feed configuration has no control. It would be great if the GUID could be treated as yet another selector that the user could configure, so that custom behaviors could be defined. For example, keeping the GUID the same if the title changes just slightly, or forcing an update if a selector other than the title or description changes (for example, the "link" selector).
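A rough sketch of the requested behavior, assuming each item is available as a hash of already extracted values. The `custom_guid` helper and the sample item are hypothetical and do not reflect how html2rss computes GUIDs internally.

```ruby
require 'digest'

# Hypothetical helper (not part of html2rss): derives the GUID only from the
# fields the user lists, so changes to any other field leave it untouched.
def custom_guid(item, fields)
  source = fields.map { |field| item.fetch(field).to_s }.join("\n")
  Digest::SHA1.hexdigest(source)
end

item     = { title: 'Release 1.0', link: 'https://example.com/releases/1' }
retitled = item.merge(title: 'Release 1.0 (final)')
moved    = item.merge(link: 'https://example.com/releases/1-final')

# GUID built from the link only: a changed title keeps it, a changed link does not.
puts custom_guid(item, [:link]) == custom_guid(retitled, [:link]) # prints "true"
puts custom_guid(item, [:link]) == custom_guid(moved, [:link])    # prints "false"
```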

Thanks a lot for developing this package!
