jackwfinlay / jsonize Goto Github PK
View Code? Open in Web Editor NEWConvert HTML to JSON.
License: MIT License
Convert HTML to JSON.
License: MIT License
Currently the library doesn't let you choose the method of output. Add handler to JsonizeConfiguration and Jsonize classes to handle this option.
Add the ability to convert the json created back to an html string
JsonizeParser parser = new JsonizeParser();
JsonizeSerializer serializer = new JsonizeSerializer();
Jsonizer jsonizer = new Jsonizer(parser, serializer);
jsonizer.ParseToHtmlAsync(JsonizeNode);
Need to create another version to work with older versions of .NET. Can probably go further back than 4.6.
Documentation needs to be produced to show usage and set up for new users. This needs to be in the README and in the wiki here on Github.
Null values are currently ignored during serialization. Add the option to select whether to show them or not.
Assembly 'Jsonize.Parser' with identity 'Jsonize.Parser, Version=3.1.0.0, Culture=neutral, PublicKeyToken=null' uses 'AngleSharp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=e83494dcdc6d31ea' which has a higher version than referenced assembly 'AngleSharp' with identity 'AngleSharp, Version=0.17.1.0, Culture=neutral, PublicKeyToken=e83494dcdc6d31ea'
When using JsonizeParserConfiguration i am getting this error
Doesnt parse or returns text after children tags in parent tag.
Example of html:
<p> Завдання nj jnjnjk knj ccjnds
<span style="background-color:rgb(97,189,109);">
kjc djck sdjkc
</span>
dsckj dc dc csd c
</p>
Pursing result:
{
"NodeType": "Element",
"Tag": "p",
"Text": "Завдання nj jnjnjk knj ccjnds",
"Attr": {},
"Children": [
{
"NodeType": "Element",
"Tag": "span",
"Text": "kjc djck sdjkc",
"Attr": {
"style": "background-color:rgb(97,189,109);"
},
"Children": []
}
]
}
And it also trims all start and end spaces in tags. So the text can not be set to one object, cause spaces gone
Why if I set EmptyTextNodeHandling
to EmptyTextNodeHandling.Ignore
I don't have "node" property in the resulting JSON? Is it ok or not?
EmptyTextNodeHandling.Include
example:
JsonizeConfiguration jsonizeConfiguration = new JsonizeConfiguration
{
EmptyTextNodeHandling = EmptyTextNodeHandling.Include
};
string html = "<html><head></head><body><form></form><p></p></body></html>";
Jsonize jsonize = new Jsonize(html);
string result = jsonize.ParseHtmlAsJsonString(jsonizeConfiguration);
/*
result:
{
"node":"Document",
"child":[
{
"node":"Element",
"tag":"html",
"child":[
{
"node":"Element",
"tag":"head",
"text":""
},
{
"node":"Element",
"tag":"body",
"child":[
{
"node":"Element",
"tag":"form",
"text":""
},
{
"node":"Element",
"tag":"p",
"text":""
}
]
}
]
}
]
}
*/
EmptyTextNodeHandling.Ignore
example:
JsonizeConfiguration jsonizeConfiguration = new JsonizeConfiguration
{
EmptyTextNodeHandling = EmptyTextNodeHandling.Ignore
};
string html = "<html><head></head><body><form></form><p></p></body></html>";
Jsonize jsonize = new Jsonize(html);
string result = jsonize.ParseHtmlAsJsonString(jsonizeConfiguration);
/*
result:
{
"node": "Document",
"child": [
{
"tag": "html",
"child": [
{
"tag": "head"
},
{
"tag": "body",
"child": [
{
"tag": "form"
},
{
"tag": "p"
}
]
}
]
}
]
}
*/
As I can see JsonizeNode.Node
property is sets only if innerText is not empty or if EmptyTextNodeHandling == EmptyTextNodeHandling.Include
:
// Jsonize.GetChildren method
// ...
if (_emptyTextNodeHandling == EmptyTextNodeHandling.Include || !String.IsNullOrWhiteSpace(innerText))
{
if (!htmlNode.HasChildNodes)
{
childJsonizeNode.Text = innerText;
}
childJsonizeNode.Node = htmlNode.NodeType.ToString();
addToParent = true;
}
// ...
Is it bug or feature?
Currently the library ignores empty text nodes. Need to add support for whether to add them to output or not.
I want to set up configuration options as an object to be passed to the method so that parameters on the method aren't constantly changing. A good example is Newtonsoft.Json's JsonSerializer class.
I start working on tests and found some problem (or may be I misunderstanding
tag processing).My test:
Jsonize jsonize = new Jsonize("<html><head></head><body><form></form></body></html>");
var result = jsonize.ParseHtmlAsJsonString(jsonizeConfiguration);
Result JSON string:
{
"node": "Document",
"child": [
{
"node": "Element",
"tag": "html",
"child": [
{
"tag": "head"
},
{
"node": "Element",
"tag": "body",
"child": [
{
"tag": "form"
},
{
"node": "Text",
"text": "</form>"
}
]
}
]
}
]
}
Is it true that node "Text" with text "</form>" is not correct? And it should look so:
{
"node": "Document",
"child": [
{
"tag": "html",
"child": [
{
"tag": "head"
},
{
"tag": "body",
"child": [
{
"tag": "form"
}
]
}
]
}
]
}
I'm getting above error when running dotnet restore
and can't getting it to work (whether on Windows or on macOS). Any ideas?
dotnet --info on Windows
.NET Command Line Tools (1.0.0-preview2-003131)
Product Information:
Version: 1.0.0-preview2-003131
Commit SHA-1 hash: (hidden)
Runtime Environment:
OS Name: Windows
OS Version: 10.0.14393
OS Platform: Windows
RID: win10-x64
dotnet --info on macOS
.NET Command Line Tools (1.0.0-preview2-003131)
Product Information:
Version: 1.0.0-preview2-003131
Commit SHA-1 hash: (hidden)
Runtime Environment:
OS Name: Mac OS X
OS Version: 10.12
OS Platform: Darwin
RID: osx.10.12-x64
Currently the test package is calling the http endpoint once per test and this slow and wasteful. The tests need to be refactored to only call it once at the start of the test program.
Switched to xUnit in version 2.0.0 (new jsonize-2.0.0 branch) to reduce dependencies and make use of visual studio (& vscode) debugging and test runner features. Porting of tests is required. I know the tests are primitive, but some of them are useful. Discard any that require manual inspection (primarily the ones that download a string from a url).
All new tests should have a defined input and assertions that the resulting output matches a given predefined expected output.
Currently all exceptions are thrown to the calling program. Catch expected exceptions e.g. anything that calls remote data sources should attempt to catch any exceptions and resolve them at first chance. If required to re-throw: Wrap them in a new type of exception, with the thrown exception as an inner exception on the new type of exception.
Types of Exceptions to create:
Suggestions about implementation and exception types is welcome.
Return the root Node object as an output option. Add options to traverse the child Nodes within.
Investigate whether migration to the AngleSharp HTML parsing engine is feasible.
Points to consider:
Currently the tests are very basic and don't accurately cover the methods being tested (they just run the methods and the output is inspected manually). Proper test cases need to be set up. Currently we are using nUnitlite.
Add methods to load HTML from a source. See the Test project or README for ideas on how to do it.
Currently, the inner-text of html text nodes is trimmed to remove white-space as it can be excessive due to the file formatting if loading HTML from a URL source. This should be optional through a JsonizeConfiguration
setting. The default should be to trim the white-space.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.