Comments (8)
Thanks, I will look into that!
from anglesharp.
Which version of AngleSharp did you try: the latest one on NuGet, a specific (maybe older) version, or the latest source available here?
I used the following code:
const String TestUrl = @"http://www.baidu.com";
Regex TitleRegex = new Regex("<title>(.*?)</title>");
var angleSharpTitle = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute)).Title;
var plainTitle = TitleRegex.Match(new HttpClient().GetStringAsync(TestUrl).Result).Groups[1].Value;
Then I inspected the two titles (as you did, I guess). However, to me both look the same.
I also compared the byte content via (e.g.)
Encoding.Unicode.GetBytes(plainTitle)
Both strings have 18 bytes and the same sequence / values (126,118,166,94,0,78,11,78,12,255,96,79,49,92,229,119,83,144). Can you re-check please? What is the title supposed to be (in bytes)?
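For reference, the byte values quoted above can be turned back into text. This is a minimal sketch, assuming the bytes are UTF-16 little-endian (which is what Encoding.Unicode produces); the class and member names are illustrative and not part of AngleSharp.

```csharp
using System;
using System.Text;

static class TitleBytes
{
    // Byte values copied from the comment above (18 bytes, 9 UTF-16 code units).
    public static readonly byte[] Reported =
    {
        126, 118, 166, 94, 0, 78, 11, 78, 12, 255,
        96, 79, 49, 92, 229, 119, 83, 144
    };

    // Encoding.Unicode is UTF-16 little-endian, the inverse of the GetBytes call above.
    public static string Decode(byte[] bytes)
    {
        return Encoding.Unicode.GetString(bytes);
    }

    static void Main()
    {
        string title = Decode(Reported);

        // The console may not render CJK glyphs, so the code points are printed as well.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(title); // 百度一下，你就知道
        foreach (char c in title)
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();
    }
}
```

Comparing code points rather than rendered glyphs avoids depending on fonts or screenshots when checking which character differs.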
I can only compare pictures here. For me, the symbols rendered by VS and by the browser (Opera 26) look the same.
Thanks!
from anglesharp.
The third character is not correct.
from anglesharp.
Again: Version? I can't reproduce it...
from anglesharp.
I think I may have found why this error occurs.
It happens when the encoding of the response content differs from the meta charset.
I made an example; here is the link:
https://onedrive.live.com/redir?resid=B406AF403E9168EA!13195&authkey=!AB4w4T2GQOur2XE&ithint=file%2c7z
When I run it in IE11 and Chrome, the title and body are fine.
But in AngleSharp they are wrong.
PS: I am using AngleSharp version 0.7, published on 2014-11-08.
from anglesharp.
I also tried the current version on NuGet. Still, I get the same title from the raw HttpClient response and from AngleSharp. The chosen character set is utf-8. What character set do you get? The only explanation I have left is the local encoding start value, so the document encoding that was selected on your machine might be an indicator in this case.
var document = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute));
var charset = document.CharacterSet;
Maybe this way we can catch the origin of the bug.
from anglesharp.
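Today I tried another test.
using System;
using System.Globalization;
using System.Net.Http;
using System.Text.RegularExpressions;
using AngleSharp; // DocumentBuilder and Configuration (AngleSharp 0.7)

class Program
{
    static void Main(string[] args)
    {
        const String TestUrl = @"http://www.baidu.com/";
        Regex TitleRegex = new Regex("<title>(.*?)</title>");
        // Parse with an explicit en-US culture.
        var angleSharpTitle = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute), new Configuration()
        {
            Culture = new CultureInfo("en-US")
        }).Title;
        // Parse the same page with an explicit zh-CN culture.
        var angleSharpTitle2 = DocumentBuilder.Html(new Uri(TestUrl, UriKind.Absolute), new Configuration()
        {
            Culture = new CultureInfo("zh-CN")
        }).Title;
        // Reference title extracted from the raw response.
        var plainTitle = TitleRegex.Match(new HttpClient().GetStringAsync(TestUrl).Result).Groups[1].Value;
    }
}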
With en-US the title is right, and with zh-CN it is wrong.
On your computer en-US is the default culture, but on mine the default is zh-CN.
If we don't tell the DocumentBuilder which culture to use, it uses the computer's local culture. That is why the same code gives different results for you and me.
Also, document.CharacterSet is utf-8, the same as for you.
from anglesharp.
Alright, that means my guess (the only remaining possibility) was right: the already consumed part of the stream might be transformed incorrectly after the encoding changes (Windows-1252 to UTF-8 on my machine vs. GB18030 to UTF-8 on yours). I can reproduce the bug now, thanks. Will be fixed soon.
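One way to picture why the starting encoding could matter is a round trip: decode the UTF-8 bytes with the wrong initial encoding, then encode the resulting text back with that same encoding, and check whether the original bytes survive. The sketch below is only an illustration of that idea, not AngleSharp's actual code. The helper name RoundTrips is hypothetical, and the code assumes .NET 5 or later, where the legacy code pages have to be registered through CodePagesEncodingProvider.

```csharp
using System;
using System.Linq;
using System.Text;

static class RedecodeSketch
{
    // Decodes the bytes with a (possibly wrong) initial guess, encodes the text
    // back with the same guess, and reports whether the original bytes survive.
    public static bool RoundTrips(byte[] original, Encoding guess)
    {
        string misdecoded = guess.GetString(original);
        byte[] recovered = guess.GetBytes(misdecoded);
        return recovered.SequenceEqual(original);
    }

    static void Main()
    {
        // Assumption: running on .NET 5+, where windows-1252 and GB18030
        // are only available after registering the code pages provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The page title from this thread, as the UTF-8 bytes a server would send.
        byte[] utf8 = Encoding.UTF8.GetBytes("百度一下，你就知道");

        foreach (string name in new[] { "windows-1252", "GB18030" })
        {
            bool lossless = RoundTrips(utf8, Encoding.GetEncoding(name));
            Console.WriteLine("{0}: round trip lossless = {1}", name, lossless);
        }
    }
}
```

If a starting encoding does not round-trip, text that was already decoded cannot be repaired by re-encoding it; the raw bytes would have to be kept and decoded again with the new encoding.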
from anglesharp.