Could you share a piece of code and what you want it to achieve?
from php-readability.
Well, it is Readability.php.
I think the function is clean().
Now, if I have this content in an HTML page:
<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>
I want the function to return:
This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" />
Instead, it currently returns:
This is my HTML code and this is a link and this is an image
url and image shouldn't be removed.
Could you share your real example so I can try to reproduce what you said?
For example, this URL: http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php contains some h3 and some a elements. The main goal is to keep them (as well as the formatting) and return an HTML document, not a plain-text file.
Using that example, I can't get content from that website 😕
<?php

require 'vendor/autoload.php';

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

$logger = new Logger('log');
$logger->pushHandler(new StreamHandler(fopen('php://stderr', 'a+')));

$readability = new Readability($html, $url);
$readability->setLogger($logger);

$result = $readability->init();

if ($result) {
    var_export($readability->getContent()->ownerDocument->saveXML($readability->getContent()));
    die();
} else {
    echo "Looks like we couldn't find the content. :(\n";
}
Did you?
I use this code, the one in the repository.
<?php
require_once '../Readability.php';
header('Content-Type: text/html; charset=utf-8');

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

if (function_exists('tidy_parse_string')) {
    $tidy = tidy_parse_string($html, array(), 'UTF8');
    $tidy->cleanRepair();
    $html = $tidy->value;
}

$readability = new Readability($html, $url);
$readability->debug = true;
$readability->convertLinksToFootnotes = true;
$result = $readability->init();

if ($result) {
    echo "== Title =====================================\n";
    echo $readability->getTitle()->textContent, "\n\n";
    echo "== Body ======================================\n";
    $content = $readability->getContent()->textContent;
    echo($content);
} else {
    echo 'Looks like we couldn\'t find the content. :(';
}
?>
Thanks!
This is what I have been asking for since my first question.
And it looks like I still can't get the content: your script, like mine, displays "Looks like we couldn't find the content. :("
I will try to find something.
The main idea is to keep the HTML of the content with its original HTML tags, such as links, images, and possibly paragraphs, bold text, etc.
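One possible explanation for the difference, offered as a sketch rather than a confirmed diagnosis: the two snippets in this thread read the extracted content differently. One serializes the node with saveXML(), the other echoes textContent, which by definition returns only the text of a node. The stand-alone example below uses PHP's DOM extension only (no php-readability involved) and the sample paragraph from the first comment; the variable names are illustrative.

```php
<?php
// Stand-alone illustration with PHP's DOM extension; php-readability is not used here.
$html = '<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$paragraph = $doc->getElementsByTagName('p')->item(0);

// textContent concatenates the text nodes only, so the <a> and <img> tags are lost.
$plainText = $paragraph->textContent;

// Serializing each child node keeps the markup (the "inner HTML" of the paragraph).
$innerHtml = '';
foreach ($paragraph->childNodes as $child) {
    $innerHtml .= $doc->saveHTML($child);
}

echo $plainText, "\n";
echo $innerHtml, "\n";
```

If the same reasoning applies to the extracted article, serializing the node as in the earlier snippet, `$readability->getContent()->ownerDocument->saveXML($readability->getContent())`, should keep the links and images, whereas `->textContent` will not.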
Hello there, is this issue still relevant with current master?