
ashi009 / node-fast-html-parser


A very fast HTML parser, generating a simplified DOM, with basic element query support.

License: MIT License

JavaScript 100.00%

node-fast-html-parser's Issues

Problems with TS validator

TS2345: Argument of type '{ script: boolean; }' is not assignable to parameter of type '{ lowerCaseTagName?: boolean; noFix?: boolean; }'.
Object literal may only specify known properties, and 'script' does not exist in type '{ lowerCaseTagName?: boolean; noFix?: boolean; }'

The function is declared like
export function parse(data: string, options?: {
lowerCaseTagName?: boolean;
noFix?: boolean;
}) {

So the 'script' option is not declared in the function's parameter type, and the call does not pass TS validation.

How to read the text inside a tag?

How can I read the text inside a tag? For example:

<div id="ERROR_MESSAGE" hidden="true">Please check the information</div>

I want to extract "Please check the information" from the tag above, which has the id "ERROR_MESSAGE".

Let me know how I can do that.

Error "data.substring is not a function"

node_modules\node-html-parser\dist\index.js:576
const text = data.substring(lastTextPos, kMarkupPattern.lastIndex - match[0].length);
^
TypeError: data.substring is not a function
at Object.parse (D:\xampp\htdocs\nodejs\parse\node_modules\node-html-parser\dist\index.js:576:35)
at D:\xampp\htdocs\nodejs\parse\index.js:92:25
at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:511:3)
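
The trace passes through readFileAfterClose in fs.js, which suggests parse() was called inside an fs.readFile callback. One common cause of this error, though the report does not confirm it, is calling readFile without an encoding: the callback then receives a Buffer rather than a string, and a Buffer has no substring method. A minimal sketch of the check and the conversion, using only Node built-ins (the sample markup is made up for illustration):

```javascript
// Stand-in for what fs.readFile(path, callback) yields when no encoding is given.
const raw = Buffer.from('<div id="a">hello</div>', 'utf8');

// A Buffer is not a string, so string methods such as substring are missing.
const bufferHasSubstring = typeof raw.substring === 'function';

// Converting to a string first gives a parser the input type it expects.
const markup = raw.toString('utf8');
const firstFour = markup.substring(0, 4);

console.log(bufferHasSubstring, firstFour);
```

Passing 'utf8' as the second argument to fs.readFile has the same effect as the explicit toString call.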

Could not get attribute

If the attribute value contains a ", the attribute breaks, like this:

var root = parseHTML('<p a=12 data-id="!$$&amp;" yAz=\'1\' data-props="abc\"d"></p>');
root.firstChild.attributes.should.eql({
	'a': '12',
	'data-id': '!$$&',
	'yAz': '1',
	'data-props': 'abc"d'  // abc
});
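
Note that inside a single-quoted JavaScript string, \" is simply a double quote, so the parser never sees a backslash: the markup it receives is data-props="abc"d", where the second quote closes the attribute value. HTML has no backslash escapes; a literal double quote inside a double-quoted attribute is normally written as &quot;. A short check of what the string literal actually contains (the variable names are illustrative):

```javascript
// The \" escape collapses to a plain " when the literal is evaluated.
const asWritten = '<p data-props="abc\"d"></p>';
const withoutBackslash = '<p data-props="abc"d"></p>';
const sameString = asWritten === withoutBackslash;

// No backslash survives into the string handed to the parser.
const hasBackslash = asWritten.includes('\\');

// Writing the quote as an HTML entity keeps the attribute value well-formed.
const withEntity = '<p data-props="abc&quot;d"></p>';

console.log(sameString, hasBackslash, withEntity);
```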

How to convert back to HTML

Hi, great library. I'm using it to quickly parse HTML. Is there a utility method that can convert the modified tree back to HTML?

What react native version do you use?

I have tried from ver.43... to ver.44 but have the same issue with module "fast-html-parser": UnableToResolveError: Unable to resolve module util from /.../node_modules/apollojs/server.js

What versions do you have in your 'package.json' on these lines:
"react": "^16.0.0-alpha.6",
"react-native": "^0.43.4"
?

Plans to move away from dynamic functions?

return pMatchFunctionCache[matcher] = new Function('el', source);

I'm not able to use this library in a serverless environment because of these dynamic functions. Any plans to refactor the code to not depend on these?
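
One way to avoid new Function, sketched here under the assumption that a matcher only needs to test a tag name, an id, and class names, is to return a closure that captures the parsed selector parts. The buildMatcher helper and the element shape ({ tagName, id, classNames }) are hypothetical and are not the library's actual code:

```javascript
// Cache keyed by selector text, mirroring the idea of a matcher cache.
const matcherCache = new Map();

// Builds a predicate for simple selectors such as "div", "#main", "a.btn.big".
function buildMatcher(selector) {
  if (matcherCache.has(selector)) return matcherCache.get(selector);

  // Split into the optional tag name and the "#id" / ".class" tokens.
  const tag = (selector.match(/^[a-z][a-z0-9]*/i) || [''])[0].toLowerCase();
  const tokens = selector.slice(tag.length).match(/[#.][\w-]+/g) || [];
  const ids = tokens.filter(t => t[0] === '#').map(t => t.slice(1));
  const classes = tokens.filter(t => t[0] === '.').map(t => t.slice(1));

  // The closure captures the parsed parts; no source string is evaluated.
  const matcher = el =>
    (tag === '' || el.tagName.toLowerCase() === tag) &&
    ids.every(id => el.id === id) &&
    classes.every(c => el.classNames.includes(c));

  matcherCache.set(selector, matcher);
  return matcher;
}

// Hypothetical element used only to exercise the matcher.
const sampleEl = { tagName: 'a', id: 'home', classNames: ['btn', 'big'] };
console.log(buildMatcher('a.btn')(sampleEl), buildMatcher('div')(sampleEl));
```

Because nothing is compiled from a string at run time, a matcher like this also works in environments that disallow eval-style code generation.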

Expose .addClass/.removeClass for HTMLElement?

I have been trying to inject a class name into a parsed HTMLElement, but when I call rootNode.toString(), the class does not appear. Printing the element that gets fed into .exchangeChild does in fact show the class, though.

Am I doing something wrong, or is there perhaps a workaround?

const targetElement = node as HTMLElement;
const parentElement = targetElement.parentNode as HTMLElement;

const component = createComponent(targetElement.tagName);

const renderedComponentHtml = minify(component.render(targetElement.attributes, targetElement.innerHTML));
const renderedElement = parse(renderedComponentHtml) as HTMLElement;

if (options.injectClass) {
	const childElement = renderedElement.childNodes[0] as HTMLElement;
	childElement.classNames = [targetElement.tagName];

	console.log(renderedElement.childNodes[0]);
}

parentElement.exchangeChild(targetElement, renderedElement);

Output:

   HTMLElement {
      childNodes:
       [ HTMLElement {
           childNodes: [Array],
           tagName: 'table',
           rawAttrs: 'class="table"',
           parentNode: [Circular],
           classNames: [Array],
           nodeType: 1 },
         HTMLElement {
           childNodes: [Array],
           tagName: 'div',
           rawAttrs: '',
           parentNode: [Circular],
           classNames: [],
           nodeType: 1 },
         HTMLElement {
           childNodes: [Array],
           tagName: 'table',
           rawAttrs: 'class="table"',
           parentNode: [Circular],
           classNames: [Array],
           nodeType: 1 } ],
      tagName: 'body',
      rawAttrs: '',
      parentNode: null,
      classNames: [ 'aur-base' ],
      nodeType: 1 }

Refactor

This package hasn't been updated for years. Although the code still works, some bits should be refactored to embrace the new specs.

Some todo/rules for the refactor:

  • no new dependency (package management for node sucks, and I don't want to be part of it)
  • adopt TypeScript, and drop apollo.js (TypeScript provides everything apollo.js does and it's typed)
  • address some of the issues in this repo

Licence file is missing

Amazing work on the parser's performance.
I couldn't find a licence file anywhere. I know it's MIT, but if possible, could you please add a LICENCE file to the repo?
Thanks a lot

Issue when attribute contains html tag

When parsing the following HTML

<a href="http://example.com" title="<strong>Sample</strong> Text">
Sample Text
</a>

root.querySelector('a').innerHTML returns Sample Text"> Sample Text
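
This output is consistent with a tag-matching pattern that treats the first > as the end of the tag, even when it sits inside a quoted attribute value. Whether that is the exact cause in this library is an assumption; the two patterns below are simplified illustrations, not the library's own regular expressions. The second one skips over quoted runs before looking for the closing >:

```javascript
const anchorTag = '<a href="http://example.com" title="<strong>Sample</strong> Text">';

// Naive: the attribute section ends at the first ">", wherever it occurs.
const naive = /<a\s+([^>]*)>/;

// Quote-aware: consume whole "..." or '...' runs, so a ">" inside them is skipped.
const quoteAware = /<a\s+((?:"[^"]*"|'[^']*'|[^'">])*)>/;

const naiveAttrs = anchorTag.match(naive)[1];
const quoteAwareAttrs = anchorTag.match(quoteAware)[1];

console.log(naiveAttrs);
console.log(quoteAwareAttrs);
```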

How to select by a property?

Thanks for making this package. It does work easily using selectors but I don't understand what's a simple way to select a node by node name and attribute (jQuery style), e.g.

<meta property="og:title" content="Apple iPad mini 4 128Gb WiFi+4G Gold (MK782)">

I tried parsed.querySelectorAll('meta[property="og:title"]') but it returns an empty result.
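
If the selector engine does not handle attribute selectors (an assumption; the report only shows that this query returns nothing), one workaround is to walk the tree and compare attributes directly. The findByAttribute helper and the hand-built tree below are hypothetical; the node shape assumes the tagName, attributes and childNodes properties that appear in other reports on this page.

```javascript
// Depth-first search over nodes shaped like { tagName, attributes, childNodes }.
function findByAttribute(node, tagName, attrName, attrValue, found = []) {
  const attrs = node.attributes || {};
  if (node.tagName === tagName && attrs[attrName] === attrValue) found.push(node);
  for (const child of node.childNodes || []) {
    findByAttribute(child, tagName, attrName, attrValue, found);
  }
  return found;
}

// Hand-built stand-in for a parsed document (made-up data, not parser output).
const tree = {
  tagName: null,
  childNodes: [
    { tagName: 'meta', attributes: { property: 'og:title', content: 'Apple iPad mini 4' }, childNodes: [] },
    { tagName: 'meta', attributes: { property: 'og:type', content: 'product' }, childNodes: [] },
  ],
};

const hits = findByAttribute(tree, 'meta', 'property', 'og:title');
console.log(hits.length, hits[0].attributes.content);
```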

Error in apollojs server.js

When I import the library with var HTMLParser = require('fast-html-parser'); I get this error.

node: 8.9.1
npm: 5.8.0
webpack-dev-server: 2.7.1
webpack: 3.10.0
OS: debian

ERROR in ./node_modules/apollojs/server.js
Module not found: Error: Can't resolve './package' in '/home/vlad/workspace/riotjsbyfly/node_modules/apollojs'
 @ ./node_modules/apollojs/server.js 270:11-31
 @ ./node_modules/fast-html-parser/index.js
 @ ./src/pages/index/ri/app.tag
 @ ./src/pages/index/index.js
 @ multi (webpack)-dev-server/client?http://localhost:5555 ./src/pages/index/index.js
webpack: Failed to compile.

Has anybody else run into this?

Typescript

VSC, nodejs project.

const root = parse(innerDivHtml)

const names = (root.querySelectorAll(".bui-commands"))

In tsconfig.json

"typeRoots": [
        "C:/Users/Mone/AppData/Roaming/npm/node_modules/@types",
        "C:/Users/Mone/AppData/Roaming/npm/node_modules/@types/node",
        "node_modules/node-html-parser/dist",
        "node_modules/node-html-parser/dist/umd"]

querySelectorAll is present in index.d.ts

querySelectorAll is marked red with "Property 'querySelectorAll' does not exist on type '(TextNode & { valid: boolean; }) | (HTMLElement & { valid: boolean; })'.
Property 'querySelectorAll' does not exist on type 'TextNode & { valid: boolean; }'.ts(2339)"

Add a note to the README that this is no longer maintained

Hi,

I saw in an issue here that this library is no longer maintained. It would be super helpful to add this at the top of the README, so people know to search for alternatives and/or know they will need to fork the project if they need any changes.

Thanks for the awesome project,
Matt

Browser version

Is there a prebuilt min.js browser version of this library?

The method I'm currently using for getting the node tree is this:

var wrapper = document.createElement('div');
wrapper.innerHTML = html;
var children = wrapper.children

Will this library be faster?

Improve Documentation

What is needed?
There's nothing in the document that talks about accessing elements other than first or last element.
That is, there's no mention of the childNodes property.

What needs to be done?
Include the same in README.md

Issues with parsing rawText

I'm having an issue while loading external content. I have gotten everything to work fine when loading a single dom node, but when loading a list of nodes I'm having issues. I'm trying to create an object of text after scraping the html that has been loaded.

  • If I break the loop by just returning, it returns the rawText fine
  • If I do NOT use rawText and allow the loop to proceed it works (but I have HTMLElement, childNodes, TextNode, rawText and all the rest of the attribute stuff)
  • If I append ".rawText" to that previous selector within the loop that I know works, I get an error that null is not an object.

I am using this inside of a react native project, if that's any concern.

Header tags should be considered block elements

I would expect the result of invoking structuredText on the following HTML snippet to be "content inside html", but instead I'm getting "content insidehtml".

<p>content</p><span><u><h1>inside</h1><i>htm<u>l</u></i></u></span>

Going by how the HTML is rendered, there should be an extra space after the header element, correct?

https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements

<h1>, <h2>, <h3>, <h4>, <h5>, <h6>
Heading levels 1-6.

HTMLParser.parse("<p>content</p><span><u><h1>inside</h1><i>htm<u>l</u></i></u></span>").structuredText;

parse id

var root = parseHTML('<a id="id1" data-id="id2"></a>');

root.id will be id2, not id1
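
A plausible explanation, not confirmed in the report, is that the id is extracted from the raw attribute string with a pattern anchored on a word boundary. Because \b also matches between - and i, such a pattern matches the id inside data-id, and the later match overwrites the earlier one. The sketch below reproduces that behaviour with illustrative patterns (not copied from the library) and shows an anchor that requires whitespace or the start of the string before the name:

```javascript
const rawAttrs = 'id="id1" data-id="id2"';

// Returns the value of the last match, as a loop that overwrites on each hit would.
function lastIdMatch(pattern, text) {
  let value = null;
  for (const m of text.matchAll(pattern)) value = m[1];
  return value;
}

// \b sits between "-" and "i", so this also matches the tail of data-id.
const loose = /\bid\s*=\s*"([^"]*)"/g;

// Requiring start-of-string or whitespace before the name excludes data-id.
const strict = /(?:^|\s)id\s*=\s*"([^"]*)"/g;

const looseResult = lastIdMatch(loose, rawAttrs);
const strictResult = lastIdMatch(strict, rawAttrs);
console.log(looseResult, strictResult);
```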
