GithubHelp home page GithubHelp logo

purplepapa / myhtml Goto Github PK

View Code? Open in Web Editor NEW

This project forked from lexborisov/myhtml

0.0 2.0 0.0 16.28 MB

Fast C/C++ HTML 5 Parser. Using threads.

License: GNU Lesser General Public License v2.1

CMake 0.20% Makefile 0.27% C 65.50% C++ 0.48% Objective-C 33.54%

myhtml's Introduction

MyHTML — a pure C HTML parser

Build Status Donate Private Maven, RPM, DEB, PyPi and RubyGem Repository | packagecloud

MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies.

This is one of module of the Modest project.

Now

The current version is 4.0.5. Last stable version

See Releases

Features

  • Asynchronous Parsing, Build Tree and Indexation
  • Fully conformant with the HTML5 specification
  • Two API - high and low-level
  • Manipulation of elements: add, change, delete and other
  • Manipulation of elements attributes: add, change, delete and other
  • Support 39 character encoding by specification encoding.spec.whatwg.org
  • Support detecting character encodings
  • Support Single Mode parsing
  • Support Build without POSIX Threads
  • Support for fragment parsing
  • Support for parsing by chunks
  • No outside dependencies
  • C99 support
  • Passes all tree construction tests from html5lib-tests
  • Tested by 1 billion HTML pages (by commoncrawl.org)

Changes

Please, see CHANGELOG.md file

Further developments

  • Modest — Modest is a fast HTML Render implemented as a pure C99 library with no outside dependencies
  • MyCSS — Fast C/C++ CSS Parser (Cascading Style Sheets Parser)

Support encodings for InputStream

X_USER_DEFINED, UTF_8, UTF_16LE, UTF_16BE, BIG5, EUC_KR, GB18030,
IBM866, ISO_8859_10, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, ISO_8859_2, ISO_8859_3,
ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, KOI8_R, KOI8_U, MACINTOSH,
WINDOWS_1250, WINDOWS_1251, WINDOWS_1252, WINDOWS_1253, WINDOWS_1254, WINDOWS_1255, WINDOWS_1256,
WINDOWS_1257, WINDOWS_1258, WINDOWS_874, X_MAC_CYRILLIC, ISO_2022_JP, GBK, SHIFT_JIS, EUC_JP, ISO_8859_8_I

Support encodings for output

Program working in UTF-8 and returns all in UTF-8

Detecting character encodings

Now it UTF-8, UTF-16LE, UTF16BE and russian windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic, ibm866

Installation

See INSTALL.md

Introduction

Introduction

Benchmark

Dependencies

None

External Bindings and Wrappers

Examples

See examples directory

Simple example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <myhtml/api.h>

int main(int argc, const char * argv[])
{
    char html[] = "<div><span>HTML</span></div>";
    
    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    
    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    
    // parse html
    myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));
    
    // print result
    // or see serialization function with callback: myhtml_serialization_tree_callback
    mycore_string_raw_t str = {0};
    myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str);
    printf("%s\n", str.data);
    
    // release resources
    mycore_string_raw_destroy(&str, false);
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    
    return 0;
}

AUTHOR

Alexander Borisov [email protected]

COPYRIGHT AND LICENSE

Copyright (C) 2015-2018 Alexander Borisov

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

See the LICENSE file.

myhtml's People

Contributors

lexborisov avatar azq2 avatar f2404 avatar wrfsh avatar kostya avatar vtorri avatar brano543 avatar emielbruijntjes avatar pascal-cuoq avatar drudru avatar madcapjake avatar overbryd avatar mody avatar mrsorcus avatar justanotheranonymoususer avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.