gumbo-php's Introduction

Gumbo PHP

Gumbo PHP is a low-level extension for HTML5 parsing.

Gumbo PHP builds a DOMDocument using the Gumbo HTML5 parser. It is intended to solve the problems commonly hit when parsing HTML5 or pages with inline JavaScript.

use Layershifter\Gumbo\Parser;

$document = Parser::load('<a>Apples and bananas.</a>');
var_dump($document->saveHTML());

string(33) "<a>Apples and bananas.</a>
"

Requirements

The following versions of PHP are supported.

  • PHP 5.6
  • PHP 7.0

Install

To build the gumbo-php extension, the PHP development package (PHP-devel) is required. The package should contain the phpize utility.
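
Before building, it can help to confirm that the prerequisites are actually present. The following is a minimal sketch and not part of the project: `check_tool` is a hypothetical helper name, and the `pkg-config` lookup assumes the Gumbo library was installed together with a `gumbo.pc` file, which may not hold for every package.

```shell
# Hypothetical helper: report whether a command needed for the build is on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in phpize make; do
    check_tool "$tool" || missing=1
done

# Assumption: libgumbo registers itself with pkg-config under the name "gumbo".
if command -v pkg-config >/dev/null 2>&1 && pkg-config --exists gumbo; then
    echo "ok: libgumbo"
else
    echo "not found via pkg-config: libgumbo (it may still be installed elsewhere)"
fi

[ "$missing" -eq 0 ] || echo "install the missing tools before running phpize"
```

A negative `pkg-config` result is only a hint: the configure script does its own search for the library, so a copy installed without a `.pc` file may still be found.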

$ git clone https://github.com/layershifter/gumbo-php.git
$ cd gumbo-php
$ phpize
$ ./configure
$ make
$ make install

This builds a 'gumbo.so' shared extension. Load it in php.ini using:

[gumbo]
extension = gumbo.so
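
Once the ini entry is in place, one way to confirm the extension was picked up is to look for it in the module list printed by `php -m`. This is a small sketch under two assumptions: the `php` CLI on PATH reads the same php.ini that was edited, and the module registers under the name `gumbo` (inferred from the file name `gumbo.so`, not confirmed by this README). `php_has_extension` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named module is listed by `php -m`.
# grep -x matches whole lines only, so startup warnings do not count as a hit.
php_has_extension() {
    php -m 2>/dev/null | grep -qix -- "$1"
}

if php_has_extension gumbo; then
    echo "gumbo is loaded"
else
    echo "gumbo is not loaded; check the extension line and extension_dir in php.ini"
fi
```

If the module is missing, `php --ini` shows which configuration files the CLI actually loaded.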

Known issues

  • double encoding of entities (#6)
$doc = \Layershifter\Gumbo\Parser::load('<h1>Hello&nbsp;world!</h1>');
var_dump($doc->saveHTML());

string "<h1>Hello&amp;nbsp;world!</h1>"

Testing

$ composer install
$ composer test

Sponsors

SORGE - website tracking tool

License

This library is released under the Apache 2.0 license. Please see License File for more information.

gumbo-php's People

Contributors

layershifter

gumbo-php's Issues

Build on OSX

Hello. Running into an issue compiling on OSX. I can't seem to get past the ./configure step, any suggestions?

./configure
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for a sed that does not truncate output... /usr/bin/sed
checking for cc... cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether cc accepts -g... yes
checking for cc option to accept ISO C89... none needed
checking how to run the C preprocessor... cc -E
checking for icc... no
checking for suncc... no
checking whether cc understands -c and -o together... yes
checking for system library directory... lib
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... x86_64-apple-darwin15.5.0
checking host system type... x86_64-apple-darwin15.5.0
checking target system type... x86_64-apple-darwin15.5.0
checking for PHP prefix... /usr/local/Cellar/php70/7.0.7
checking for PHP includes... -I/usr/local/Cellar/php70/7.0.7/include/php -I/usr/local/Cellar/php70/7.0.7/include/php/main -I/usr/local/Cellar/php70/7.0.7/include/php/TSRM -I/usr/local/Cellar/php70/7.0.7/include/php/Zend -I/usr/local/Cellar/php70/7.0.7/include/php/ext -I/usr/local/Cellar/php70/7.0.7/include/php/ext/date/lib
checking for PHP extension directory... /usr/local/Cellar/php70/7.0.7/lib/php/extensions/no-debug-non-zts-20151012
checking for PHP installed headers prefix... /usr/local/Cellar/php70/7.0.7/include/php
checking if debug is enabled... no
checking if zts is enabled... no
checking for re2c... no
configure: WARNING: You will need re2c 0.13.4 or later if you want to regenerate PHP parsers.
checking for gawk... no
checking for nawk... no
checking for awk... awk
checking if awk is broken... no
checking for gumbo support... yes, shared
checking for gumbo files in default path... checking for gumbo_destroy_output in -lgumbo... no
configure: error: wrong version of gumbo of it's not found

Double encoding of entities.

When a string with HTML entities like &nbsp; is loaded, the returned document has the ampersands encoded again as &amp;nbsp;, breaking the original HTML.

$payload = <<< 'HTML'
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>
      Hello&nbsp;world!
    </title>
  </head>
  <body>
    <h1>
      Hello&nbsp;world!
    </h1>
  </body>
</html>
HTML;
$doc = \Layershifter\Gumbo\Parser::load($payload);
var_dump($doc->saveHTML());

Output:

string(162) "<html lang="en"><head><meta charset="utf-8"><title>
      Hello&amp;nbsp;world!
    </title></head><body><h1>
      Hello&amp;nbsp;world!
    </h1></body></html>
"

Whitespace ignored in pre element

First of all, thanks a lot for Gumbo PHP! It's very useful!

I noticed what looks like a bug in the handling of the pre element where whitespace is important and should be preserved. Here's code to reproduce:

<?php
$whitespace_test =
'<html><body><pre>
<span>Line 1</span>
    <span>Line 2</span>
<span>Line 3</span>
</pre>
<pre>
Line 1
    Line 2
Line 3
</pre></body></html>';

// Gumbo PHP
$doc = \Layershifter\Gumbo\Parser::load($whitespace_test);
echo $doc->saveHTML();

// Regular DOMDocument
$doc = new DOMDocument();
$doc->loadHTML($whitespace_test, LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Gumbo PHP outputs:

<html><body><pre><span>Line 1</span><span>Line 2</span><span>Line 3</span></pre><pre>Line 1
    Line 2
Line 3
</pre></body></html>

Regular DOMDocument outputs:

<html><body><pre>
<span>Line 1</span>
    <span>Line 2</span>
<span>Line 3</span>
</pre>
<pre>
Line 1
    Line 2
Line 3
</pre></body></html>

I noticed there's a section related to whitespace commented out in parser.c, could that be what's causing this? And if so, is it safe to uncomment it?

Any help appreciated. Thanks!

Unable to load gumbo.so on CentOS

Hi.
Loading gumbo.so fails on CentOS (PHP 5.6.30).

A PHP Error was encountered
Severity: Core Warning
Message: PHP Startup: Unable to load dynamic library '/usr/lib64/php/modules/gumbo.so' - /usr/lib64/php/modules/gumbo.so: undefined symbol: php_dom_create_object

I think that the build is successful.
Do you have any suggestions?

$ ./configure
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for a sed that does not truncate output... /bin/sed
checking for cc... cc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether cc accepts -g... yes
checking for cc option to accept ISO C89... none needed
checking how to run the C preprocessor... cc -E
checking for icc... no
checking for suncc... no
checking whether cc understands -c and -o together... yes
checking for system library directory... lib
checking if compiler supports -R... no
checking if compiler supports -Wl,-rpath,... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking target system type... x86_64-unknown-linux-gnu
checking for PHP prefix... /usr
checking for PHP includes... -I/usr/include/php -I/usr/include/php/main -I/usr/include/php/TSRM -I/usr/include/php/Zend -I/usr/include/php/ext -I/usr/include/php/ext/date/lib
checking for PHP extension directory... /usr/lib64/php/modules
checking for PHP installed headers prefix... /usr/include/php
checking if debug is enabled... no
checking if zts is enabled... no
checking for re2c... re2c
checking for re2c version... 0.13.5 (ok)
checking for gawk... gawk
checking for gumbo support... yes, shared
checking for gumbo files in default path... found in /usr
checking for gumbo_destroy_output in -lgumbo... yes
checking for xml2-config path... /usr/bin/xml2-config
checking whether libxml build works... yes
checking for a sed that does not truncate output... (cached) /bin/sed
checking for fgrep... /bin/grep -F
checking for ld used by cc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 1966080
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from cc object... ok
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if cc supports -fno-rtti -fno-exceptions... no
checking for cc option to produce PIC... -fPIC -DPIC
checking if cc PIC flag -fPIC -DPIC works... yes
checking if cc static flag -static works... no
checking if cc supports -c -o file.o... yes
checking if cc supports -c -o file.o... (cached) yes
checking whether the cc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... no
configure: creating ./config.status
config.status: creating config.h
config.status: executing libtool commands
