Hi, I have tried my best to adapt the given code in order to keep urls into the do

Keep URLS into documents (php-readability issue, closed)

j0k3r commented on May 29, 2024

Keep URLS into documents

from php-readability.

Comments (9)

j0k3r commented on May 29, 2024

Could you share a piece of code and what you want it to achieve?


boulama commented on May 29, 2024

Well, it is Readability.php.
I think the function is clean().
Now, if I have this HTML in a page:
<p>This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" /></p>.

I want the function to return
This is my HTML code and <a href="https://www.mywebsite.com/">this is a link</a> and this is an image <img src="image.png" />.

Instead, now it returns this:
This is my HTML code and this is a link and this is an image.


j0k3r commented on May 29, 2024

URLs and images shouldn't be removed.
Could you share your real example so I can try to reproduce what you said?


boulama commented on May 29, 2024

For example, this URL: http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php contains some <h3> and <a> elements. The main goal is to keep them (as well as the formatting) and return an HTML document rather than a plain text file.


j0k3r commented on May 29, 2024

Using that example, I can't get content from that website 😕

<?php

require 'vendor/autoload.php';

use Readability\Readability;
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

$logger = new Logger('log');
$logger->pushHandler(new StreamHandler(fopen('php://stderr', 'a+')));

$readability = new Readability($html, $url);
$readability->setLogger($logger);
$result = $readability->init();

if ($result) {
    var_export($readability->getContent()->ownerDocument->saveXML($readability->getContent()));
    die();
} else {
    echo "Looks like we couldn't find the content. :(\n";
}

Did you?


boulama commented on May 29, 2024

I use this code, the one in the repository.

<?php
require_once '../Readability.php';
header('Content-Type: text/html; charset=utf-8');

$url = 'http://www.lepoint.fr/high-tech-internet/angry-birds-veut-faire-son-nid-a-la-bourse-d-helsinki-08-09-2017-2155514_47.php';
$html = file_get_contents($url);

if (function_exists('tidy_parse_string')) {
	$tidy = tidy_parse_string($html, array(), 'UTF8');
	$tidy->cleanRepair();
	$html = $tidy->value;
}


$readability = new Readability($html, $url);

$readability->debug = true;

$readability->convertLinksToFootnotes = true;

$result = $readability->init();

if ($result) {
	echo "== Title =====================================\n";
	echo $readability->getTitle()->textContent, "\n\n";
	echo "== Body ======================================\n";
	$content = $readability->getContent()->textContent;

	echo $content;
} else {
	echo 'Looks like we couldn\'t find the content. :(';
}
?>
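
A note on the script above: it reads the body through `textContent`, which in PHP's DOM extension returns only the concatenated text of a node, so tags such as `<a>` and `<img>` never reach the output. The sketch below shows the difference using plain `DOMDocument`, with no php-readability involved. The `<div id="content">` wrapper is a made-up stand-in for whatever node `getContent()` returns, so treat this as an illustration of DOM behaviour rather than a confirmed fix for this issue.

```php
<?php
// Stand-in for the extracted content node; built with plain DOM only.
$doc = new DOMDocument();
$doc->loadHTML(
    '<div id="content"><p>This is my HTML code and '
    . '<a href="https://www.mywebsite.com/">this is a link</a> '
    . 'and this is an image <img src="image.png" /></p></div>'
);
$content = $doc->getElementsByTagName('div')->item(0);

// textContent concatenates the text nodes only, so tags and attributes are lost.
$plain = $content->textContent;

// saveHTML($node) serializes the node itself, markup included.
$markup = $doc->saveHTML($content);

echo $plain, "\n";
echo $markup, "\n";
```

The first line printed contains no markup, while the second keeps the link and the image. This matches the other script in this thread, which serializes the node with `saveXML()` instead of reading `textContent`.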


j0k3r commented on May 29, 2024

Thanks!
This is what I have been asking for since my first question.
And it looks like I still can't get the content: your script, like mine, displays "Looks like we couldn't find the content. :("


boulama commented on May 29, 2024

I will try to find something.
The main idea is to keep the HTML of the content with its original HTML tags, such as links, images, and possibly paragraphs, bold text, etc.
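
One way to approximate that goal once an HTML string of the content is available is an allowlist of tags. The sketch below uses PHP's built-in `strip_tags()`; the helper name `keepBasicFormatting` and the sample fragment are hypothetical and not part of php-readability, and the tag list is only a guess at what "links, images, paragraphs, bold" would cover.

```php
<?php
/**
 * Hypothetical helper (not part of php-readability): keep only an allowlist
 * of tags in an HTML fragment and drop every other tag.
 */
function keepBasicFormatting(string $html): string
{
    // strip_tags() removes every tag that is not listed in the second argument.
    // Attributes on kept tags are left untouched, so this is not a sanitizer.
    $allowed = '<a><img><p><h3><b><strong><i><em>';

    return strip_tags($html, $allowed);
}

// Made-up fragment standing in for extracted article HTML.
$fragment = '<div class="article"><h3>Title</h3>'
    . '<p>Some <span>text</span> with <a href="https://www.mywebsite.com/">a link</a>'
    . ' and <img src="image.png" /></p></div>';

$kept = keepBasicFormatting($fragment);
echo $kept, "\n";
```

The wrapper `<div>` and the `<span>` are dropped while their text stays in place, and the heading, paragraph, link, and image tags survive with their attributes.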


Kdecherf commented on May 29, 2024

Hello there, is this issue still relevant with current master?
