edoust / readsharp

This project forked from ceee/readsharp

:rooster: Extract meaningful website contents using a port of NReadability

License: MIT License

C# 99.37% CSS 0.63%

ReadSharp

ReadSharp was previously PocketSharp.Reader and is now hosted without the PocketSharp dependency.

Install ReadSharp using NuGet

Install-Package ReadSharp

What's it all about?

The library extracts the main content of a web page and returns the article as HTML, along with its title, description, favicon and all included images.

The content can be wrapped in a <body> tag and displayed as a readable page with custom CSS (it's up to you!).
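As a rough illustration of that idea, the sketch below wraps an extracted HTML fragment in a minimal document with an inline stylesheet. The class ArticlePage and its Wrap method are hypothetical helpers, not part of ReadSharp. The sketch assumes the article content is a body-level fragment and that the title and content strings come from the Article model described further down.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of ReadSharp.
public static class ArticlePage
{
  // Wraps an extracted article fragment in a minimal HTML document.
  // title:       plain text, HTML-encoded before it is inserted
  // articleHtml: the extracted article markup, inserted unchanged
  // css:         your own stylesheet rules, inserted unchanged
  public static string Wrap(string title, string articleHtml, string css)
  {
    return "<!DOCTYPE html>\n"
      + "<html>\n<head>\n<meta charset=\"utf-8\">\n"
      + "<title>" + WebUtility.HtmlEncode(title) + "</title>\n"
      + "<style>" + css + "</style>\n"
      + "</head>\n<body>\n"
      + articleHtml
      + "\n</body>\n</html>";
  }
}
```

A call such as ArticlePage.Wrap(article.Title, article.Content, "body { max-width: 40em; margin: 0 auto; }") would then produce a page that can be loaded into a web view.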

ReadSharp is based on a custom PCL port of NReadability and SgmlReader, which are included in the solution.

Association with Pocket

This library is a replacement for Pocket's Article View API, which is restricted by usage limits and privacy constraints.

With ReadSharp you won't hit any usage limits, as you are extracting the content directly. And it's open source.


Example

using System;
using ReadSharp;

Reader reader = new Reader();
Article article = null;

try
{
  // download the page and extract the article
  article = await reader.Read(new Uri("http://frontendplay.com/story/4/http-caching-demystified-part-2-implementation"));
}
catch (ReadException exc)
{
  // handle exception
}

Options

HttpOptions

You can pass HttpOptions to the Reader constructor. They apply to all requests:

  • HttpMessageHandler CustomHttpHandler
    Use your own HTTP handler
  • int? RequestTimeout
    Define a custom timeout in seconds, after which requests are cancelled
  • string UserAgent
    Override the user agent, which is passed to the destination server
  • string UserAgentMobile
    Override the mobile user agent, which is passed to the destination server
  • bool UseMobileUserAgent
    There are desktop and mobile default user agents. If this property is enabled, the mobile user agent is used. If you pass a custom user agent, this property is ignored!
  • int MultipageLimit
    The download limit for articles with multiple pages (default: 10)
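A minimal sketch of how these options might be set. It assumes that HttpOptions has a public parameterless constructor and public setters for the properties listed above. The values are arbitrary examples.

```csharp
using ReadSharp;

// Assumption: HttpOptions can be created with an object initializer.
HttpOptions httpOptions = new HttpOptions
{
  RequestTimeout = 30,        // seconds
  UseMobileUserAgent = true,  // ignored if a custom user agent is set
  MultipageLimit = 5          // default is 10
};

// The options apply to every request made by this Reader instance.
Reader reader = new Reader(httpOptions);
```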

ReadOptions

There are also ReadOptions available, which are passed with each individual request:

  • bool HasHeaderTags
    Determines whether the complete HTML document or just the body part is returned
  • bool HasNoHeadline
    Removes <h1> title from the article
  • bool UseDeepLinks
    If enabled, deep links (containing hashes, e.g. href="#article") are not transformed into absolute URIs
  • bool PrettyPrint
    Determines whether the HTML output will be formatted
  • bool PreferHTMLEncoding
    Determines whether to prefer the encoding found in the HTML or the one found in the HTTP Header (default: true)
  • bool MultipageDownload
    Download all pages for articles with multiple pages (default: false)
  • bool ReplaceImagesWithPlaceholders
    If true, replace all img-tags with placeholders
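A short sketch of a request with custom ReadOptions. It assumes that ReadOptions has public setters for the properties above and that Read accepts the options as a second argument. Neither is shown in this README, so check the Reader class before relying on it. The URI is the one from the example above.

```csharp
using System;
using ReadSharp;

Reader reader = new Reader();

// Assumption: ReadOptions can be created with an object initializer.
ReadOptions readOptions = new ReadOptions
{
  HasNoHeadline = true,     // remove the <h1> title from the article
  PrettyPrint = true,       // format the HTML output
  MultipageDownload = true  // download all pages (default: false)
};

try
{
  // Assumption: the options are passed as the second argument of Read.
  Article article = await reader.Read(
    new Uri("http://frontendplay.com/story/4/http-caching-demystified-part-2-implementation"),
    readOptions);
}
catch (ReadException exc)
{
  // handle exception
}
```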

Article Model

The Article contains the following fields:

  • string Title (the title of the page)
  • string Description (description of the page, extracted from meta information)
  • string Content (contains the article)
  • Uri FrontImage (main page image extracted from meta tags like apple-touch-icon and others)
  • Uri Favicon (the favicon of the page)
  • List<ArticleImage> Images (contains all images found in the text)
  • string NextPage (contains the next page URI, if available)

Article Image

  • Uri Uri
  • string Title (extracted from the title attribute)
  • string AlternativeText (extracted from the alt attribute)
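The fields listed above could be consumed roughly as follows. ArticleSummary is a hypothetical helper, not part of ReadSharp. The null checks are defensive assumptions, since this README does not say which fields may be empty.

```csharp
using System;
using ReadSharp;

// Hypothetical helper, not part of ReadSharp.
public static class ArticleSummary
{
  // Prints the metadata of an extracted article and its images.
  public static void Print(Article article)
  {
    Console.WriteLine("Title: " + article.Title);
    Console.WriteLine("Description: " + article.Description);
    Console.WriteLine("Favicon: " + article.Favicon);

    if (article.Images != null)
    {
      foreach (ArticleImage image in article.Images)
      {
        // AlternativeText comes from the alt attribute of the image
        Console.WriteLine(image.Uri + " (" + image.AlternativeText + ")");
      }
    }

    if (article.NextPage != null)
    {
      Console.WriteLine("Next page: " + article.NextPage);
    }
  }
}
```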

Supported platforms

ReadSharp is a Portable Class Library and is therefore compatible with multiple platforms and Universal Apps:

  • .NET >= 4.5 (including WPF)
  • UWP
  • Windows Phone (Silverlight + WinPRT) >= 8
  • Windows Store >= 8
  • Xamarin iOS + Android
  • Support for WP7 and Silverlight was dropped in 6.0. Use ReadSharp < 6.0 if you need to support them

Forked Dependencies

Forks are included in the primary source code.

Contributors

ceee

License

MIT License

readsharp's People

Contributors

ceee, edoust, leegreenwood
