The ultimate angle brackets parser library parsing HTML5, MathML, SVG and CSS to construct a DOM based on the official W3C specifications.

Home Page: https://anglesharp.github.io

License: MIT License

C# 85.57% HTML 13.57% JavaScript 0.80% PowerShell 0.03% Shell 0.02% TypeScript 0.01% Batchfile 0.01%
anglesharp dom c-sharp html parser library angle-bracket linq hacktoberfest

anglesharp's Introduction

AngleSharp

AngleSharp is a .NET library that gives you the ability to parse angle bracket based hyper-texts like HTML, SVG, and MathML. XML without validation is also supported by the library. An important aspect of AngleSharp is that CSS can also be parsed. The included parser is built upon the official W3C specification. This produces a perfectly portable HTML5 DOM representation of the given source code and ensures compatibility with results in evergreen browsers. Standard DOM features such as querySelector and querySelectorAll are also available for tree traversal.

⚡⚡ Migrating from AngleSharp 0.9 to AngleSharp 0.10 or later (incl. 1.0)? Look at our migration documentation. ⚡⚡

Key Features

  • Portable (using .NET Standard 2.0)
  • Standards-conformant (parses like evergreen browsers)
  • Great performance (outperforms similar parsers in most scenarios)
  • Extensible (extend with your own services)
  • Useful abstractions (type helpers, jQuery like construction)
  • Fully functional DOM (all the lists, iterators, and events you know)
  • Form submission (easily log in everywhere)
  • Navigation (a BrowsingContext is like a browser tab - control it from .NET!).
  • LINQ enhanced (use LINQ with DOM elements, naturally without wrappers)

The advantage over similar libraries like HtmlAgilityPack is that the exposed DOM uses the official W3C specified API, i.e., even things like querySelectorAll are available in AngleSharp. The parser also follows the HTML 5.1 specification, which defines error handling and element correction. The AngleSharp library focuses on standards compliance, interactivity, and extensibility. It therefore gives web developers working with C# all the possibilities they know from using the DOM in any modern browser.

The performance of AngleSharp is quite close to the performance of browsers. Even very large pages can be processed within milliseconds. AngleSharp tries to minimize memory allocations and reuses elements internally to avoid unnecessary object creation.

Simple Demo

The simple example will use the website of Wikipedia for data retrieval.

var config = Configuration.Default.WithDefaultLoader();
var address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(address);
var cellSelector = "tr.vevent td:nth-child(3)";
var cells = document.QuerySelectorAll(cellSelector);
var titles = cells.Select(m => m.TextContent);

Or the same with explicit types:

IConfiguration config = Configuration.Default.WithDefaultLoader();
string address = "https://en.wikipedia.org/wiki/List_of_The_Big_Bang_Theory_episodes";
IBrowsingContext context = BrowsingContext.New(config);
IDocument document = await context.OpenAsync(address);
string cellSelector = "tr.vevent td:nth-child(3)";
IHtmlCollection<IElement> cells = document.QuerySelectorAll(cellSelector);
IEnumerable<string> titles = cells.Select(m => m.TextContent);

The example shows how to:

  • Set up the configuration to support document loading
  • Asynchronously get the document in a new context using the configuration
  • Perform a query to get all cells with the content of interest
  • Use LINQ on the results, since the whole DOM supports LINQ queries

Every collection in AngleSharp supports LINQ statements. AngleSharp also provides many useful extension methods for element collections that cannot be found in the official DOM.

Supported Platforms

AngleSharp has been created as a .NET Standard 2.0 compatible library. This includes, but is not limited to:

  • .NET Core (2.0 and later)
  • .NET Framework (4.6.2 and later)
  • Xamarin.Android (7.0 and 8.0)
  • Xamarin.iOS (10.0 and 10.14)
  • Xamarin.Mac (3.0 and 3.8)
  • Mono (4.6 and 5.4)
  • UWP (10.0 and 10.0.16299)
  • Unity (2018.1)

Documentation

The documentation of AngleSharp is located in the docs folder. More examples, best-practices, and general information can be found there. The documentation also contains a list of frequently asked questions.

More information is also available by following some of the hyper references mentioned in the Wiki. In-depth articles will be published on the CodeProject, with links being placed in the Wiki at GitHub.

Use-Cases

  • Parsing HTML (incl. fragments)
  • Parsing CSS (incl. selectors, declarations, ...)
  • Constructing HTML (e.g., view-engine)
  • Minifying CSS, HTML, ...
  • Querying document elements
  • Crawling information
  • Gathering statistics
  • Web automation
  • Tools with HTML / CSS / ... support
  • Connection to page analytics
  • HTML / DOM unit tests
  • Automated JavaScript interaction
  • Testing other concepts, e.g., script engines
  • ...

Vision

The project aims to bring a solid implementation of the W3C DOM for HTML, SVG, MathML, and CSS to the CLR - all written in C#. The idea is that you can basically do everything with the DOM in C# that you can do in JavaScript (plus, of course, more).

Most parts of the DOM are included, even though some may still miss their (fully specified / correct) implementation. The goal for v1.0 is to have all practically relevant parts implemented according to the official W3C specification (with useful extensions by the WHATWG).

The API is close to the DOM4 specification; however, the naming has been adjusted to comply with .NET conventions. Nevertheless, to make AngleSharp really useful for, e.g., a JavaScript engine, attributes have been placed on the corresponding interfaces (and methods, properties, ...) to indicate the status of the field in the official specification. This allows automatic generation of DOM objects with the official API.

This is a long-term project which will eventually result in a state of the art parser for the most important angle bracket based hyper-texts.

Our hope is to build a community around web parsing and libraries from this project. So far we had great contributions, but that goal was not fully achieved. Want to help? Get in touch with us!

Participating in the Project

If you know some feature that AngleSharp is currently missing, and you are willing to implement the feature, then your contribution is more than welcome! Also if you have a really cool idea - do not be shy, we'd like to hear it.

If you have an idea how to improve the API (or what is missing) then posts / messages are also welcome. For instance there have been ongoing discussions about some styles that have been used by AngleSharp (e.g., HTMLDocument or HtmlDocument) in the past. In the end AngleSharp stopped using HTMLDocument (at least visible outside of the library). Now AngleSharp uses names like IDocument, IHtmlElement and so on. This change would not have been possible without such fruitful discussions.

The project is always searching for additional contributors, even if you do not have any code to contribute but rather an idea for improvement, a bug report, or a correction to the documentation. These are the contributions that keep this project active.

Live discussions can take place in our Gitter chat, which supports using GitHub accounts.

More information is found in the contribution guidelines. All contributors can be found in the CONTRIBUTORS file.

This project has also adopted the code of conduct defined by the Contributor Covenant to clarify expected behavior in our community.

For more information see the .NET Foundation Code of Conduct.

Funding / Support

If you use AngleSharp frequently but do not have the time to support the project by active participation, you may still be interested in ensuring that the AngleSharp project keeps the lights on.

Therefore we created a backing model via Bountysource. Any donation is welcome and much appreciated. We will mostly spend the money on dedicated development time to improve AngleSharp where it needs to be improved, plus invest in the web utility eco-system in .NET (e.g., in JavaScript engines, other parsers, or a renderer for AngleSharp to mention some outstanding projects).

Visit Bountysource for more details.

Development

AngleSharp is written in the most recent version of C# and thus requires Roslyn as a compiler. Using an IDE like Visual Studio 2019+ is recommended on Windows. Alternatively, VSCode (with OmniSharp or another suitable Language Server Protocol implementation) should be the tool of choice on other platforms.

The code tries to be as clean as possible. Notably the following rules are used:

  • Use braces for any conditional / loop body
  • Use the -Async suffixed methods when available
  • Use VIP ("Var If Possible") style (in C++ called AAA: Almost Always Auto) to place types on the right

More important, however, is the proper usage of tests. Any new feature should come with a set of tests to cover the functionality and prevent regression.

Changelog

A very detailed changelog exists. If you are just interested in major releases then have a look at the GitHub releases.

.NET Foundation

This project is supported by the .NET Foundation.

License

AngleSharp is released using the MIT license. For more information see the license file.

anglesharp's People

Contributors

alexanderuv, campersau, daveaglick, denis-ivanov, dsupuran, dv00d00, ericmutta, florianrappl, georgiosd, hellbrick, hyspace, iamcarbon, ivandrofly, jansafronov, jodydonetti, kzrnm, lahma, laurynasr, martinwelsch, matkoch, meziantou, miroslav22, silkfire, simoncropp, suchiman, tbolon, thecloudlesssky, tsu1980, xp44mm, zukarusan

anglesharp's Issues

Implement Helpers

Helpers for URL management, easier DOM modifications, etc. would be nice.

NthChild with formula does not work

Hi,

I found a bug. When using QuerySelectorAll with nth-child and a formula, for example tr:nth-child(n+2), it does not work correctly. It returns all children instead of only those from the second node onward.
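
For reference, the expected selector semantics can be sketched independently of AngleSharp. In CSS, :nth-child(an+b) matches a child at 1-based position p when p = a*n + b for some integer n >= 0, so n+2 should match positions 2, 3, 4, and so on. The class and method names below are hypothetical and only illustrate the rule; this is not AngleSharp's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the An+B rule of :nth-child.
public static class NthChildSketch
{
    // True if the 1-based position equals a*n + b for some integer n >= 0.
    public static bool MatchesNth(int position, int a, int b)
    {
        var diff = position - b;

        if (a == 0)
        {
            return diff == 0;
        }

        // n = diff / a must be a non-negative integer.
        return diff % a == 0 && diff / a >= 0;
    }

    // Lists the 1-based positions among childCount children that match an+b.
    public static IEnumerable<int> MatchingPositions(int childCount, int a, int b)
    {
        return Enumerable.Range(1, childCount).Where(p => MatchesNth(p, a, b));
    }
}
```

With five rows, NthChildSketch.MatchingPositions(5, 1, 2) yields 2, 3, 4, 5, which is what tr:nth-child(n+2) is expected to select.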

Nuances playing around with AngleSharp and Jint

First of all, let me congratulate you for AngleSharp. It's a true .net gem with clear goals, roadmap and implementation.

I was playing around with AngleSharp 0.7 and Jint, trying to get an existing website to run a specific javascript which creates an iframe on the document.
I finally did not make it work, probably because 0.7 or the "master" branch are not js ready yet. Despite that, I am summarizing below the items that troubled me and that I see are not changed for the upcoming release:

  1. The Navigator property of the AnalysisWindow can't be set as there is no setter. This forced me to specifically implement IWindow.Navigator on top of AnalysisWindow just to put some value there.
  2. The ContentWindow of the IHtmlInlineFrameElement can't be set either. This could not be overridden, so I did some monkey workaround in DomFunctionInstance.Call to return the content window of the parent document.
  3. Events are implemented in "master" but I see there is no "new Event()" implementation. According to MDN, every browser but IE has it. I am not sure this is in the standard or not.
  4. Trying "window.foo = 'bla'; console.log(window.foo);" results in javascript error because of the way window is set as the context in JavaScriptEngine.Evaluate. I worked around this by not returning a new DomNodeInstance from "ToJSValue" in DomFunctionInstance.Call if the "_method.Invoke" result is reference equals to "node.Value". Instead I returned node.Value directly.
  5. DomNodeInstance does not have a prototype nor set as extensible nor declares its own Class name. This results in some weird Jint exceptions.
  6. The conversions in AngleSharp.Scripting.Extensions.cs are too type safe. A simple "window.setAttribute('allowfullscreen', true)" instead of 'true' results in type cast exception.
  7. Finally, I did not find a way to raise the readystatechange event, so I gave up!

I know most of these are not there because I was exploring a half-implemented area; however, I felt it would be useful to write them down and let you know. I don't consider these bugs, so I wouldn't be surprised if you closed this ticket altogether as being premature.
Again, thank you for this wonderful project. I don't know the ETA or the milestones of the upcoming release, but I am looking forward to it!

Cannot modify a relative URL in an anchor element

The task is to slightly modify some of the relative URLs in the document.

var document = DocumentBuilder.Html("<a href='project/part2'>text</a>");
var a = (IHtmlAnchorElement)document.Body.FirstElementChild;
a.Href += "/1";
Console.WriteLine(document.Body.InnerHtml);

Expected output: <a href="project/part2/1">text</a>
Actual output: <a href="about:///project/part2/1">text</a>

Is it supposed to be like this? What's the best way to work around this?
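
One possible workaround, assuming the goal is to edit the attribute text as written in the markup: the Href property follows the DOM and returns the URL resolved against the document's base address, whereas GetAttribute("href") returns the raw attribute value. The helper below is a hypothetical sketch (the class and method names are not part of AngleSharp), and the AngleSharp.Dom namespace is the one used by recent releases; older versions such as the one in this report may use a differently cased namespace.

```csharp
using AngleSharp.Dom;

// Hypothetical helper; only GetAttribute and SetAttribute are AngleSharp calls.
public static class HrefSketch
{
    // Appends a suffix to the raw href attribute without resolving it
    // against the document's base URL.
    public static void AppendToRawHref(IElement anchor, string suffix)
    {
        var raw = anchor.GetAttribute("href") ?? string.Empty;
        anchor.SetAttribute("href", raw + suffix);
    }
}
```

Called as HrefSketch.AppendToRawHref(a, "/1") in place of a.Href += "/1", this operates on the relative text only, so no about:/// prefix should be introduced by the helper itself.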

IParentNode.Append adds the nodes twice

Consider this snippet.

var doc = DocumentBuilder.Html("");
var children = new[] {
    doc.CreateElement("span"),
    doc.CreateElement("em")
};
Console.WriteLine(children.Length);
doc.Body.Append(children);
Console.WriteLine(doc.Body.ChildNodes.Length);
Console.WriteLine(doc.Body.InnerHtml);

One might expect that it would print

2
2
<span></span><em></em>

But it actually prints

2
4
<span></span><em></em><span></span><em></em>

Children of cloned element are parent-less

It seems that if you clone an element, remove it from the document, and then re-add it (append it), the element itself will have the correct Parent, but none of its children (or grandchildren) have any parent at all. This seems wrong - unless I am misunderstanding something.

I was attempting to use an element (and its contents) as a template that I could use to stamp out modified copies - but I also need to get at the parents of the children of the cloned elements.

Here is a failing test:
Gist: https://gist.github.com/jglinsek/dab7f950a3054cf97301

public void replace_the_parent_with_a_clone_and_the_children_of_the_cloned_parent_should_have_a_parent()
{
    const string html = @"
<html>
<body>
    <div class='parent'>
        <div class='child'>
        </div>
    </div>
</body>
</html>
";
    var doc = DocumentBuilder.Html(html);
    var originalParent = doc.QuerySelector(".parent");

    //clone the parent
    var clonedParent = originalParent.Clone();
    clonedParent.Parent.ShouldBe(null);

    //remove the original parent
    var grandparent = originalParent.Parent;
    originalParent.Remove();
    originalParent.Parent.ShouldBe(null);
    grandparent.ShouldNotBe(null);

    //replace the original parent with the cloned parent
    grandparent.AppendChild(clonedParent);
    //the clone itself has the correct parent
    clonedParent.Parent.ShouldBe(grandparent);

    //all the children (and grandchildren) of the cloned element have no parent?
    var cloneElement = (IElement) clonedParent;
    cloneElement.FirstChild.Parent.ShouldNotBe(null); //FAILS
}

Preserve exact content when round tripping CSS stylesheets

Deserialization + serialization of a stylesheet should result in exactly the original stylesheet, including all whitespaces. This is an important guarantee in real-world editing - the user's formatting should not be messed up.

  1. If portions of a stylesheet are modified, only the formatting of those portions should be changed.
  2. All content from the source should be represented by something in the DOM. As an example of what I mean see this description of how Roslyn represents .net source code.

A straightforward way of achieving this is to store the original parsed text of any ICssObject. Then when ToCss() is called it can just return the text instead of recomputing it. Recomputing is only required if the ICssObject is modified. (As an optimization, rather than copying the text of the original document into every element, one can keep the entire document as a string and reference sections of the string from within the ICssObject. But this is just a possible implementation detail.)
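
The idea can be sketched roughly as follows. This is only an illustration under stated assumptions: SourcePreservingDeclaration is a hypothetical type, not an AngleSharp class, and it stands in for any ICssObject that remembers the slice of the original stylesheet it was parsed from.

```csharp
using System;

// Hypothetical node: keeps a reference to the original stylesheet text and
// only re-serializes itself after it has been modified.
public sealed class SourcePreservingDeclaration
{
    private readonly string _source; // the entire original stylesheet
    private readonly int _start;     // where this node's text begins
    private readonly int _length;    // how long this node's text is
    private readonly string _name;
    private string _value;
    private bool _modified;

    public SourcePreservingDeclaration(string source, int start, int length, string name, string value)
    {
        _source = source;
        _start = start;
        _length = length;
        _name = name;
        _value = value;
    }

    public string Value
    {
        get { return _value; }
        set { _value = value; _modified = true; } // any change invalidates the original text
    }

    public string ToCss()
    {
        // Untouched nodes return the exact original text, whitespace included.
        return _modified ? _name + ": " + _value : _source.Substring(_start, _length);
    }
}
```

An unmodified declaration parsed from "a{  color :red  }" would return "  color :red  " verbatim, while setting Value switches that one node to normalized output and leaves the rest of the stylesheet alone.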

Wrong value of alpha channel in background property after parsing

AngleSharp returns the wrong alpha channel value when the content specifies a background property in rgba format.
Example:
<table style="border:none;border-collapse:collapse;">
<tr>
<td align="right" valign="top" style="border: 1px solid #000;background:rgba(128,128,128,255);">
Test
</td>
</tr>
</table>
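
For context, CSS defines the alpha component of rgba() as a number between 0 and 1, and values outside that range are clamped, so an alpha of 255 would be expected to behave like 1 (fully opaque). A minimal sketch of that clamping, with hypothetical names that are not part of AngleSharp:

```csharp
using System;

// Hypothetical helper showing how an out-of-range alpha is expected to be treated.
public static class AlphaSketch
{
    // Clamps a CSS alpha value to the valid range [0, 1].
    public static double ClampAlpha(double alpha)
    {
        return Math.Max(0.0, Math.Min(1.0, alpha));
    }

    // Converts a CSS alpha value to an 8-bit channel after clamping.
    public static byte ToByte(double alpha)
    {
        return (byte)Math.Round(ClampAlpha(alpha) * 255.0);
    }
}
```

Under this rule, rgba(128,128,128,255) and rgba(128,128,128,1) describe the same fully opaque color.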

Please support Windows Phone 8.1

I'm looking for an HTML parser for WP8.1. AngleSharp is powerful, but I could not use it with WP8.1. Please support WP8.1. Thanks.

Memory leak

Hello.
Sorry about my English.

I have a job that downloads and parses data from about 30,000 web pages.

In the parse method, I use AngleSharp like this:

public override IEnumerable<DIRTY_SCHEDULE> Fetch(string ctx, string url = "") {
    var dom = DocumentBuilder.Html(ctx);
    //不支持 even
    //var trs = dom.QuerySelectorAll("#accordion2 table tbody tr:even");
    var trs = dom.QuerySelectorAll("#accordion2>.accordion-group>.accordion-heading>table>tbody>tr");
    for (var i = 0; i < trs.Length; i = i + 2) {
        var tr = trs[i];
        var tds = tr.QuerySelectorAll("td");
        var entry = new DIRTY_SCHEDULE {
            CARRIER = tds[0].Text(),
            ROUTE = tds[1].Text().Trim(),
            VESSEL = tds[2].Text().Trim(),
            VOYAGE = tds[3].Text().Trim(),
            ORGIN = tds[4].Text().Trim(),
            ETD = tds[5].Text().Trim().ToDateTime("yyyy-MM-dd", DateTime.Now),
            DEST = tds[6].Text().Trim(),
            ETA = tds[7].Text().Trim().ToDateTime("yyyy-MM-dd", DateTime.Now),
            TT = tds[8].Text().Trim().ToDecimalOrNull(),
            DIRTY_SCHEDULE_TRANSF = this.FetchTransf(trs[i + 1]).ToList(),
            SOURCE = url,
            APP = "Fetcher.Soushipping",
        };

        entry.UNQTAG = entry.GetUNQTag();

        yield return entry;
    }
}

private IEnumerable<DIRTY_SCHEDULE_TRANSF> FetchTransf(IElement tr) {
    var tbls = tr.QuerySelectorAll("table.widget");
    // The first table listed is the place of origin
    for (var i = 1; i < tbls.Length; i++) {
        var rows = tbls[i].QuerySelectorAll("tr");
        if (rows.Length == 3)
            yield return new DIRTY_SCHEDULE_TRANSF {
                VESSEL = rows[0].Text().Trim(),
                AT = rows[1].QuerySelector("td").Text().Trim(), //rows[1].FirstChild.Text().Trim(),
                VOYAGE = rows[2].Text().Trim(),
                SEQ = i - 1
            };
    }
}

It works fine, but memory usage grows very seriously.

You can see it here:
[Screenshot: memory usage with AngleSharp]
At 3:20 it is at 1.8 GB!
What's wrong? Are the IElement instances not disposed?

But with HtmlAgilityPack:
[Screenshot: memory usage with HtmlAgilityPack]
At 3:20 it is only 270 MB.

The code using HtmlAgilityPack:

public override IEnumerable<DIRTY_SCHEDULE> Fetch(string ctx, string url = "") {
    var doc = new HtmlDocument();
    doc.LoadHtml2(ctx);
    var root = doc.DocumentNode;
    var trs = root.QuerySelectorAll("#accordion2>.accordion-group>.accordion-heading>table>tbody>tr")
        .ToList();
    for (var i = 0; i < trs.Count(); i = i + 2) {
        var tr = trs[i];
        var tds = tr.QuerySelectorAll("td").ToList();
        var entry = new DIRTY_SCHEDULE {
            CARRIER = tds[0].InnerText.Clear(),
            ROUTE = tds[1].InnerText.Clear(),
            VESSEL = tds[2].InnerText.Clear(),
            VOYAGE = tds[3].InnerText.Clear(),
            ORGIN = tds[4].InnerText.Clear(),
            ETD = tds[5].InnerText.Clear().ToDateTime("yyyy-MM-dd", DateTime.Now),
            DEST = tds[6].InnerText.Clear(),
            ETA = tds[7].InnerText.Clear().ToDateTime("yyyy-MM-dd", DateTime.Now),
            TT = tds[8].InnerText.Clear().ToDecimalOrNull(),
            DIRTY_SCHEDULE_TRANSF = this.FetchTransf(trs[i + 1]).ToList(),
            SOURCE = url,
            APP = "Fetcher.Soushipping",
        };

        entry.UNQTAG = entry.GetUNQTag();

        yield return entry;
    }
}


private IEnumerable<DIRTY_SCHEDULE_TRANSF> FetchTransf(HtmlNode tr) {
    var tbls = tr.QuerySelectorAll("table.widget").ToList();
    // The first table listed is the place of origin
    for (var i = 1; i < tbls.Count(); i++) {
        var rows = tbls[i].QuerySelectorAll("tr").ToList();
        if (rows.Count == 3)
            yield return new DIRTY_SCHEDULE_TRANSF {
                VESSEL = rows[0].InnerText.Clear(),
                AT = rows[1].QuerySelector("td").InnerText.Clear(), //rows[1].FirstChild.Text().Trim(),
                VOYAGE = rows[2].InnerText.Clear(),
                SEQ = i - 1
            };
    }
}

Error: System.Net.HttpWebRequest.set_UserAgent cannot be used on the current platform.

Hi, I am new to AngleSharp. First of all, I would like to say to the author, Great Job, thanks!

I am using AngleSharp to write a small Windows Store app. I've tried to call DocumentBuilder.HtmlAsync() to get a html document object, but I got a runtime error:

A first chance exception of type 'System.InvalidOperationException' occurred in mscorlib.dll
An exception of type 'System.InvalidOperationException' occurred in mscorlib.dll but was not handled in user code
Additional information: The API 'System.Net.HttpWebRequest.set_UserAgent(System.String)' cannot be used on the current platform. See http://go.microsoft.com/fwlink/?LinkId=248273 for more information.

Here is what I am doing. Am I missing something?

        AngleSharp.Configuration.UseDefaultHttpRequester = true;
        var document = await AngleSharp.DocumentBuilder.HtmlAsync(new Uri("http://stackoverflow.com/questions/20115672/how-does-the-static-type-affect-this-code"));

My developing environment is Windows 8.1 and Visual Studio 2013 express for Windows Store app.
Thanks

Bug with CssBorderProperty

There are some logic errors at lines 119 and 122 in the file AngleSharp/DOM/Css/Properties/Border/CSSBorderProperty.cs, probably caused by copy and paste. Also, border style definitions like "border: 1px outset currentColor" are not well supported.

Table scoping rules

The table element does not automatically close an open paragraph. See #46 for the initial hint.

Rendering?

This is a fantastic project and I can definitely think of more than one place I'm interested in applying it in the not-too-distant future! What are your thoughts on the amount of work to take what you've done and render it to a surface? One application I can see for it is as the basis for user interface rendering for .Net-based games. Currently, to leverage the power of HTML and CSS for a game, you really need to embed something like Awesomium or CEF. Having a completely managed library available that can do all the work of rendering to an offscreen surface and being able to process a stream of input events (either simulated or otherwise) would be really powerful.

Encoding Error

Test code is here:

internal class Program
{
    private const string TestUrl = @"http://www.baidu.com";
    private static readonly Regex TitleRegex = new Regex("<title>(.*?)</title>");

    private static void Main(string[] args)
    {
        var angleSharpTitle = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute)).Title;
        var htmlAgilityPackTitle = new HtmlWeb().Load(TestUrl).DocumentNode.SelectSingleNode("//title").InnerText;
        var webClientTitle = WebClientDownload();
    }

    private static string WebClientDownload()
    {
        return
            TitleRegex.Match((new WebClient() {Encoding = Encoding.UTF8}.DownloadString(TestUrl)))
                .Groups[1].Value;
    }
}

www.baidu.com is a website in China. In the test code, the HtmlAgilityPack and WebClient results are both correct, but the AngleSharp result is not: one character is not displayed correctly.

Two possible bugs of version 0.70

I've used AngleSharp (version 0.50 or older) in a small Windows Store app project for quite a long time (more than half a year), and it has worked great and been stable. But when I updated it to version 0.70 today, I encountered two minor problems.

  1. The DocumentBuilder.Html(Uri uri) method doesn't work in a Windows Store app project. The code is as follows:

    static void Test()
    {
        IDocument doc = DocumentBuilder.Html(new Uri("http://stackoverflow.com/", UriKind.Absolute));
        Debug.WriteLine(doc.ToHtml());
    }
    

The debugger reports:

"A first chance exception of type 'System.InvalidOperationException' occurred in mscorlib.dll

Additional information: The API 'System.Net.HttpWebRequest.set_UserAgent(System.String)' cannot be used on the current platform. See http://go.microsoft.com/fwlink/?LinkId=248273 for more information."

  2. DocumentBuilder.Html(string pageSource) can't correctly parse a page source that uses a special charset. For example:

    static async void Test2()
    {
        HttpClient httpClient = new HttpClient();
        IBuffer response = await httpClient.GetBufferAsync(new Uri("http://item.jd.com/11312278.html"));
    
        string pageSource = Encoding.GetEncoding("gb2312").GetString(response.ToArray(), 0, (int)response.Length - 1);
    
        Debug.WriteLine("original pagesource == "+pageSource); //correct output
    
        IDocument doc = DocumentBuilder.Html(pageSource);
        Debug.WriteLine("parsed html text == " + doc.ToHtml()); //incorrect output(Chinese characters all display as unreadable special characters)
    }
    

Both problems are new; maybe they are due to some minor changes in the new version.

True headless parsing support - CSSStyleRule/CSSRule

In line with your design principle above, we need something like the following, so that there is no dependency on IWindow:

CSSStyleRule/CSSRule
internal override void ComputeStyle(CssPropertyBag style, IWindow window, IElement element)
        {
            if (_selector.Match(element))
                style.ExtendWith(_style, _selector.Specifity);
        }
        internal override void ComputeStyle(CssPropertyBag style, IDocument document, IElement element) 
        {
            if (_selector.Match(element))
                style.ExtendWith(_style, _selector.Specifity);
        }
StyleExtensions

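        /// <summary>
        /// Inherits the unspecified properties from the element's parents.
        /// </summary>
        /// <param name="bag">The bag to modify.</param>
        /// <param name="element">The element that has unresolved properties.</param>
        /// <param name="document">The associated document object.</param>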
        public static void InheritFrom(this CssPropertyBag bag, IElement element, IDocument document)
        {
            var parent = element.ParentElement;

            if (parent != null)
            {
                var styling = document.computeDeclarations(parent); 

                foreach (var property in styling.Declarations)
                {
                    var styleProperty = bag[property.Name];

                    if (styleProperty == null || styleProperty.IsInherited)
                        bag.TryUpdate(property);
                }
            }
        }
internal static CSSStyleDeclaration computeDeclarations(this IDocument document, IElement element)
        {
            var bag = new CssPropertyBag();

            document.RulesIterator((CSSRuleList rules) =>
            {
                rules.ComputeStyle(bag, document, element);
                return true;
            });

            var htmlElement = element as HTMLElement;

            if (htmlElement != null)
                bag.ExtendWith(htmlElement.Style, Priority.Inline);

            bag.InheritFrom(element, document);
            return new CSSStyleDeclaration(bag);
        }
        public static ICssStyleDeclaration ComputeDeclarations(this IDocument document, IElement element)
        {
            return document.computeDeclarations(element);
        }

Some documentation

Hi,

This project looks awesome. Would you kindly add some documentation on usage?

Thanks in advance.

CSSProperty should be named CssDeclaration

According to this section of the css specification the correct term for property/value pairs (including the important flag) is Declaration, not Property. AngleSharp currently names this class CssProperty. This is confusing, especially since CSSStyleDeclaration has this:

List<CSSProperty> _declarations;

Suggest changing the name of CSSProperty to CssDeclaration. Note the use of CamelCase - I prefer Css to CSS, in keeping with .net coding conventions. AngleSharp seems divided on the issue :)

Edit: Looks like this (incorrect) naming is used for all derived declaration types as well, so this might be a bigger change. However, that only makes it all the more needed. Hopefully Visual Studio can make the renaming easy.

DefaultRequester - 'HttpWebResponse' memory leak

Hi
There is an 'HttpWebResponse' memory leak, as the response is not disposed. I think you are holding on to the 'Content' stream. The current implementation does not scale beyond a few external link requests to, say, style sheets.

I modified it as follows:

             DefaultResponse GetResponse()
            {
                if (_response == null)
                    return null;

                var result = new DefaultResponse();
                var headers = _response.Headers.AllKeys.Select(m => new { Key = m, Value = _response.Headers[m] });
                result.Content = new System.IO.MemoryStream(); 
                _response.GetResponseStream().CopyTo(result.Content);
                result.Content.Position = 0;
                result.StatusCode = _response.StatusCode;
                result.Address = new Url(_response.ResponseUri);

                foreach (var header in headers)
                    result.Headers.Add(header.Key, header.Value);

                _response.Dispose();
                _response = null;
                return result;
            }
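
The same pattern, copying the response body into memory and then releasing the underlying resource, can be sketched in isolation. The helper below is hypothetical (ResponseBuffering and BufferAndDispose are not AngleSharp names) and works on any Stream, so it makes no assumption about how DefaultRequester is structured.

```csharp
using System;
using System.IO;

// Hypothetical helper: buffers a stream into memory and disposes the source.
public static class ResponseBuffering
{
    public static MemoryStream BufferAndDispose(Stream source)
    {
        var buffer = new MemoryStream();

        // The using block disposes the source even if CopyTo throws.
        using (source)
        {
            source.CopyTo(buffer);
        }

        // Rewind so that callers can read the content from the beginning.
        buffer.Position = 0;
        return buffer;
    }
}
```

The trade-off is that the whole body is held in memory, which seems acceptable for style sheets and similar resources but may not suit very large downloads.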

Finding all Rules that match a selector - useful addition

namespace AngleSharp.Extensions
{
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using AngleSharp.DOM;
    using AngleSharp.DOM.Css;

    public static class QueryExtensionsEx
    {
        // Returns every rule whose selector text equals the given selector's text,
        // or all rules when no selector is given.
        // Assumption: the generic type arguments were missing from the post as published;
        // ICssRule (here) and ILinkStyle (in RulesIterator) are best guesses.
        public static List<ICssRule> QueryRulesAll(this IDocument document, ISelector selector = null)
        {
            var result = new List<ICssRule>();

            document.RulesIterator((CSSRuleList rules) =>
            {
                foreach (var rule in rules.OfType<ICssRule>())
                {
                    var cssStyleRule = rule as ICssStyleRule;

                    if (selector == null || (cssStyleRule != null && cssStyleRule.Selector.Text == selector.Text))
                        result.Add(rule);
                }

                return true;
            });

            return result;
        }
    }
}

// Calls func with the rule list of every enabled style sheet;
// stops early and returns false as soon as func returns false.
internal static bool RulesIterator(this IDocument document, Func<CSSRuleList, bool> func)
{
    var stylesheets = document.Head.Children
        .OfType<ILinkStyle>()
        .Where(hle => hle.Sheet != null)
        .Select(hle => hle.Sheet)
        .Concat(document.StyleSheets);

    foreach (var stylesheet in stylesheets)
    {
        var sheet = stylesheet as CSSStyleSheet;

        if (sheet != null && !stylesheet.IsDisabled)
        {
            var rules = (CSSRuleList)sheet.Rules;
            var continueIterate = func(rules);

            if (!continueIterate)
                return false;
        }
    }

    return true;
}

CssSelectorConstructor - for pseudoelement - ::selection

Hi
The CSS parser doesn't pick up the pseudo-element '::selection' because of the following. I set the state back to State.Data. Was this your design intention?

Boolean OnPseudoElement(CssToken token)
{
    if (token.Type == CssTokenType.Ident)
    {
        state = State.Data;  // Added. This seems to move the parser along.
        var data = ((CssKeywordToken)token).Data;

        switch (data)
        {
        }
    }

Empty CssText for simple style rule definition with version 0.6.0

Just downloaded version 0.6.0 and found the following issue after a quick try:

For a simple style definition, the CssText property is empty:

string css = @"html{font-family:sans-serif;}";
var stylesheet = AngleSharp.Parser.Css.CssParser.ParseStyleSheet(css);

foreach (var rule in stylesheet.Rules)
{
    Console.WriteLine(rule.CssText);
}

The end result ends up something like this: html {\r\n;;}

It used to work fine in version 0.5.1.

Did I do anything wrong here?

Extract only Text

Hi,
in Html Agility Pack there is a command "HtmlToText()" to strip out HTML tags and return only the raw text. How can we do this here?

Thanks in advance.

V8 Support

Hello,

Do you plan to integrate JavaScript support (V8 JS engine)?
Or will this library expose the DOM only, with no script support?

Thanks

Interfaces for ext. resources

Draft interfaces for optional resources and rendering, defined like the ones already in there for the IoC container or the HTTP requester.

Wrong handling of elements that contain 'lang' attribute

If the input document is encoded in UTF-8 and contains HTML elements with a 'lang' attribute, the text of those elements comes out as mojibake.
To reproduce this bug I just copied text from MS Word to the clipboard and tried to parse the resulting HTML with the AngleSharp library.
Word generates HTML for the copied text as follows:

<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=utf-8>
</head>
<body lang=RU style='tab-interval:35.4pt'>
<!--StartFragment-->
<p class=MsoNormal><span lang=EN-US style='mso-ansi-language:EN-US'>тест</span><o:p></o:p></p>
<!--EndFragment-->
</body>
</html>

To simplify, I removed all the extra meta-information about styles and so on from this sample HTML.
The text inside the span has a lang attribute, and the entire document is already encoded as UTF-8. While traversing the node tree I receive the wrong text inside the span.

SVG and MathML DOM

Start writing classes for various elements in the MathML and SVG DOM.

Parsing third party (unknown) CSS properties

Can the CSS parser parse unknown properties and return them as simple string values to be handled by calling code? From my (very limited) look into AngleSharp, it seems that properties with unknown names are being discarded.

Alternatively, is it possible to pass in a handler for properties that AngleSharp does not know? The idea is the handler parses and instantiates the correct CssProperty subtype for properties it cares about.

CSS bug: Can't set the Display property of Style

Hi, I've used AngleSharp version 0.4 for months, and it has been robust and worked very well the whole time, until today when I tried to change the Style of a table row.

My project crashed at this line:
tr.Style.Display = "none";
where tr is of type HTMLTableRowElement. The error message is as below:

A first chance exception of type 'System.NullReferenceException' occurred in AngleSharp.DLL
Additional information: Object reference not set to an instance of an object.

Probably something is wrong in the internals, because on inspection I can see that the Display property is not null and has the value "inherited".

btw, great job!

XHTMLish output

I need to convert a document to a format that is both HTML and XML at the same time. That is, markup like <select/> is not allowed and should be <select></select> instead, but markup like <img src="..." /> is required. It would be nice if AngleSharp could provide some control over the process of DOM serialization.

Html Agility Pack has something like this. At least, it can output XHTML.

HTML Element constructor access modifiers

Currently, it is not possible to instantiate an HTML element, for example HTMLInputElement, because its constructor is declared with the internal modifier. Sometimes we may need to create dynamic elements and add them to the DOM. Please change the constructors of all the HTML elements so that the elements can be added dynamically. HtmlDocument.AppendChild is of no use in this case.
