
Encoding Error (anglesharp, closed)

anglesharp commented on May 3, 2024
Encoding Error

from anglesharp.

Comments (8)

FlorianRappl commented on May 3, 2024

Thanks, I will look into that!

FlorianRappl commented on May 3, 2024

Which version of AngleSharp did you try? The latest one on NuGet, a specific (maybe older) version, or the latest one available here?

I used the following code:

    const String TestUrl = @"http://www.baidu.com";
    Regex TitleRegex = new Regex("<title>(.*?)</title>");

    var angleSharpTitle = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute)).Title;
    var plainTitle = TitleRegex.Match(new HttpClient().GetStringAsync(TestUrl).Result).Groups[1].Value;

Then I inspected the two titles (as you did, I guess). However, to me both look the same:

image

I also compared the byte content via (e.g.)

    Encoding.Unicode.GetBytes(plainTitle)

Both strings have 18 bytes and the same sequence / values (126,118,166,94,0,78,11,78,12,255,96,79,49,92,229,119,83,144). Can you re-check please? What is the title supposed to be (in bytes)?
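
For anyone who wants to check the byte values quoted above, here is a minimal sketch that decodes them back into code points. The class and method names (`TitleBytes`, `Decode`) are only illustrative and not part of AngleSharp; the sketch assumes the bytes are UTF-16 little-endian, which is what `Encoding.Unicode` produces.

```csharp
using System;
using System.Text;

public static class TitleBytes
{
    // Byte values quoted in the comment above (assumed UTF-16 little-endian,
    // because they came from Encoding.Unicode.GetBytes).
    public static readonly byte[] Reported =
    {
        126, 118, 166, 94, 0, 78, 11, 78, 12, 255,
        96, 79, 49, 92, 229, 119, 83, 144
    };

    // Illustrative helper: turn UTF-16LE bytes back into a string.
    public static string Decode(byte[] utf16LittleEndian)
    {
        return Encoding.Unicode.GetString(utf16LittleEndian);
    }

    public static void Main()
    {
        string title = Decode(Reported);

        // Print code points rather than glyphs, so the result does not
        // depend on the console font or output encoding.
        foreach (char c in title)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

The 18 bytes decode to nine UTF-16 code units, so the third character of the title corresponds to the byte pair 0, 78 (U+4E00).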

I can only compare pictures here. For me, the symbols rendered by VS and by the browser (Opera 26) look the same:

image

Thanks!

h82258652 commented on May 3, 2024

[image: qq 20141128002909]
The third character is not correct.

FlorianRappl commented on May 3, 2024

Again: Version? I can't reproduce it...

h82258652 commented on May 3, 2024

I think I may have found why this error occurs.
It happens when the encoding of the response content differs from the meta charset.
I made an example; here is the link:
https://onedrive.live.com/redir?resid=B406AF403E9168EA!13195&authkey=!AB4w4T2GQOur2XE&ithint=file%2c7z

I ran it in IE11 and Chrome; the title and body are fine there.
But in AngleSharp they are wrong.
PS: I am using AngleSharp version 0.7, published on 2014-11-08.

FlorianRappl commented on May 3, 2024

I also tried the current version on NuGet. Still, I get the same result from the raw HttpClient response and from AngleSharp. The chosen character set is utf-8. What character set do you get? The only explanation I have is the local encoding start value, so the document encoding selected on your machine might be an indicator in this case.

    var document = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute));
    var charset = document.CharacterSet;

Maybe this way we can catch the origin of the bug.

h82258652 commented on May 3, 2024

Today I tried another test.


    class Program
    {
        static void Main(string[] args)
        {
            const String TestUrl = @"http://www.baidu.com/";
            Regex TitleRegex = new Regex("<title>(.*?)</title>");
            var angleSharpTitle = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute), new Configuration()
            {
                Culture = new CultureInfo("en-US")
            }).Title;
            var angleSharpTitle2 = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute), new Configuration()
            {
                Culture = new CultureInfo("zh-CN")
            }).Title;
            var plainTitle = TitleRegex.Match(new HttpClient().GetStringAsync(TestUrl).Result).Groups[1].Value;
        }
    }

en-US gives the correct title and zh-CN gives the wrong one.
On your computer, en-US is the default culture, but on my computer the default is zh-CN.
If we don't tell the DocumentBuilder which culture to use, it uses the computer's local culture. That is why the same code gives different results for you and me.

Also, document.CharacterSet is utf-8, the same as for you.

FlorianRappl commented on May 3, 2024

Alright, that confirms my guess (the only remaining possibility): the part of the stream that has already been consumed is transformed incorrectly when the encoding changes (Windows-1252 to UTF-8 vs. GB18030 to UTF-8). I can reproduce the bug now, thanks. It will be fixed soon.
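
To make the suspected mechanism easier to experiment with, here is a rough sketch; it is not AngleSharp's actual code. It assumes the already-consumed bytes were first decoded with a culture-dependent guess and later "repaired" by re-encoding with that guess and decoding as UTF-8. `ReinterpretDemo` and `Reinterpret` are hypothetical names, and the `RegisterProvider` call assumes .NET Core or .NET 5+, where legacy code pages must be registered explicitly.

```csharp
using System;
using System.Text;

public static class ReinterpretDemo
{
    // Hypothetical helper: decode raw bytes with a first guess, then try to
    // recover the original bytes by re-encoding with that same guess and
    // decoding the result as UTF-8. This only works if the first guess maps
    // every byte sequence to text and back without loss.
    public static string Reinterpret(byte[] raw, Encoding firstGuess)
    {
        string provisional = firstGuess.GetString(raw);
        byte[] recovered = firstGuess.GetBytes(provisional);
        return Encoding.UTF8.GetString(recovered);
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where Windows-1252 and
        // GB18030 are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        const string title = "百度一下，你就知道";
        byte[] raw = Encoding.UTF8.GetBytes(title);

        foreach (string name in new[] { "utf-8", "windows-1252", "GB18030" })
        {
            string result = Reinterpret(raw, Encoding.GetEncoding(name));
            Console.WriteLine("{0}: {1}", name,
                result == title ? "round-trips" : "differs");
        }
    }
}
```

Whether a given first guess survives the round trip depends on how that encoding handles byte sequences that are invalid for it, so the output is worth checking on the machine and runtime in question rather than taking it for granted.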
