Extract only Text (anglesharp issue, closed)

anglesharp commented on May 3, 2024
Extract only Text

from anglesharp.

Comments (11)

FlorianRappl commented on May 3, 2024

The dom is a document. If you use it on dom.Body or dom.DocumentElement instead, it should actually work.
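
A minimal sketch of this suggestion, assuming a recent AngleSharp release in which HtmlParser lives in AngleSharp.Html.Parser and exposes ParseDocument (older releases, such as the one used in the tests in this thread, call it Parse). The class and method names are hypothetical and only for illustration.

```csharp
using AngleSharp.Html.Parser;

public static class BodyTextExample
{
    // Hypothetical helper: parses the markup as a full document and reads
    // the text from <body>. TextContent on the document node itself is
    // null, so the text is taken from an element instead.
    public static string ExtractBodyText(string html)
    {
        var parser = new HtmlParser();
        // Assumption: a recent AngleSharp release; older ones use parser.Parse(html).
        var document = parser.ParseDocument(html);
        return document.Body?.TextContent ?? string.Empty;
    }
}
```

Called with the markup from the tests in this thread, ExtractBodyText should return the concatenated text of the div, since the parser wraps the fragment in html and body elements.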

jeremycook commented on May 3, 2024

@fpavlic this may not interest you anymore, but you could remove the elements you do not want from the DOM before calling .Text(). So find and remove all <script> and <style> elements first.
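
A rough sketch of that approach, again assuming a recent AngleSharp release (ParseDocument rather than Parse). The helper name is hypothetical, and reading from document.Body instead of DocumentElement is my own choice here so that the page title is not included.

```csharp
using System.Linq;
using AngleSharp.Html.Parser;

public static class VisibleTextExample
{
    // Hypothetical helper: drops <script> and <style> elements from the
    // parsed DOM, then returns the remaining text content of <body>.
    public static string ExtractTextWithoutScriptsAndStyles(string html)
    {
        var document = new HtmlParser().ParseDocument(html);

        // Copy the matches to a list first so that removing nodes
        // cannot interfere with the enumeration.
        foreach (var element in document.QuerySelectorAll("script, style").ToList())
        {
            element.Remove();
        }

        return document.Body?.TextContent ?? string.Empty;
    }
}
```

This only removes what the selector names. Other content that a browser would not display (for example elements hidden via CSS) stays in the result, so extend the selector to whatever you consider irrelevant.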

FlorianRappl commented on May 3, 2024

The Text<T> method does. You get it by using AngleSharp.Extensions - as with all other extension methods. Hope this helps you @ricardobrandao!

FlorianRappl commented on May 3, 2024

The HTML DOM has a property called textContent (this is TextContent in AngleSharp) for node objects. Usually if you use this on e.g. the document root (HTML) element it should give you the whole textual content.

But beware: there might be an unusual amount of spaces and newlines in there, since the parser does not strip them out. That you do not see most of them in rendered content is a feature of the HTML renderer.

Tell me if it fits your needs, otherwise I will implement another method to deal with this (outside of the official specification).
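
If the extra spaces and newlines are a problem, one simple option is to collapse them after extraction. The sketch below uses only the .NET regex library and is not an AngleSharp feature; the helper name is hypothetical. It does not reproduce what a renderer does (for example, it knows nothing about block elements or <pre>), it just squeezes runs of whitespace.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceExample
{
    // Hypothetical helper: replaces every run of whitespace (spaces, tabs,
    // newlines) with a single space and trims both ends of the string.
    public static string CollapseWhitespace(string text)
    {
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, CollapseWhitespace("  Hello,\n\n   world!  ") returns "Hello, world!".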

devmondo commented on May 3, 2024

Thanks for the quick reply. Excellent, it worked, but the problem is that it also gives you CSS and scripts. We should strip them, as we are only interested in the text inside the body.

FlorianRappl commented on May 3, 2024

Hm, this is how real browsers also behave. I will write a method "ToText()" which will give you already stripped content.

devmondo commented on May 3, 2024

thanks a lot :)

ricardobrandao commented on May 3, 2024

@FlorianRappl this was not implemented yet, right? I didn't find anything that extracts the text from an INodeList or even from the IHtmlDocument.

ricardobrandao commented on May 3, 2024

I saw that one, but I wasn't able to make it work then. For instance, this test fails (it returns a null value):

[Fact]
public void StripHtmlTagsTest1()
{
    var input = "<div><span>Hello, world! Some <a href=\"#\">link.</a></span></div>";
    var dom = new HtmlParser().Parse(input);
    Assert.Equal("Hello, world! Some link.", dom.Text());
}

However, I have now found that this one does the trick:

[Fact]
public void StripHtmlTagsTest2()
{
    var input = "<div><span>Hello, world! Some <a href=\"#\">link.</a></span></div>";
    var nodeList = new HtmlParser().ParseFragment(input, null);
    Assert.Equal("Hello, world! Some link.", string.Concat(nodeList.Select(x => x.Text())));
}

Thanks for the quick reply :)

fpavlic commented on May 3, 2024

Is there a way to get just 'visible' content e.g. "DocumentElement.TextContent" without Scripts and Styles?

This is something @devmondo mentioned back in 2013.

Thanks

FlorianRappl commented on May 3, 2024

AngleSharp can't decide which text is relevant for you and which isn't. It will just return what the spec says it should return.
