
vanderlee / phpsyllable


PHP Syllable splitter/counter and Hyphenator for text and HTML. Multi-language, customisable, cached and fast!

Home Page: http://vanderlee.github.io/phpSyllable/

PHP 3.63% TeX 96.21% HTML 0.15%
hyphens tex syllables language php split hyphen-marker hyphenation hyphenation-algorithm hyphenation-rules

phpsyllable's Introduction

Syllable

Version 1.7


Copyright © 2011-2023 Martijn van der Lee. MIT Open Source license applies.

Introduction

PHP Syllable splitting and hyphenation. Or rather... PHP Syl-la-ble split-ting and hy-phen-ation.

Based on the work by Frank M. Liang (http://www.tug.org/docs/liang/) and the many volunteers in the TeX community.

Many languages are supported, e.g. English (US/UK), Spanish, German, French, Dutch, Italian, Romanian and Russian; 76 languages in total.

Language sources: http://tug.org/tex-hyphen/#languages

Supports PHP 5.6 and up, so you can use it on older servers.

Installation

Install phpSyllable via Composer

composer require vanderlee/syllable

or simply add phpSyllable to your project and set up the project's autoloader for phpSyllable's src/ directory.

Usage

Instantiate a Syllable object and start hyphenation.

Minimal example:

$syllable = new \Vanderlee\Syllable\Syllable('en-us');
echo $syllable->hyphenateText('Provide a plethora of paragraphs');

Extended example:

use Vanderlee\Syllable\Syllable;
use Vanderlee\Syllable\Hyphen;

// Globally set the directory where Syllable can store cache files.
// By default, this is the cache/ folder in this package, but usually
// you want to have the folder outside the package. Note that the cache
// folder must be created beforehand.
Syllable::setCacheDir(__DIR__ . '/cache');

// Globally set the directory where the .tex files are stored.
// By default, this is the languages/ folder of this package and
// usually does not need to be adapted.
Syllable::setLanguageDir(__DIR__ . '/languages');

// Create a new instance for the language.
$syllable = new Syllable('en-us');

// Set the style of the hyphen. In this case it is the "-" character.
// By default, it is the soft hyphen "­".
$syllable->setHyphen(new Hyphen\Dash());

// Set the minimum word length required for hyphenation.
// By default, all words are hyphenated.
$syllable->setMinWordLength(5);

// Output hyphenated text ..
echo $syllable->hyphenateText('Provide your own paragraphs...');
// .. or hyphenated HTML.
echo $syllable->hyphenateHtmlText('<b>... with highlighted text.</b>');

See the demo.php file for a working example.

Syllable API reference

The following describes the API of the main Syllable class. In most cases, you will not use any other functions. Browse the code under src/ for all available functions.

public __construct($language = 'en-us', string|Hyphen $hyphen = null)

Create a new Syllable class, with defaults.

public static setCacheDir(string $dir)

Set the directory where compiled language files may be stored. Defaults to the cache subdirectory of the current directory.

public static setEncoding(string|null $encoding = null)

Set the character encoding to use. Specify null encoding to not apply any encoding at all.

public static setLanguageDir(string $dir)

Set the directory where language source files can be found. Defaults to the languages subdirectory of the current directory.

public setLanguage(string $language)

Set the language whose rules will be used for hyphenation.

public setHyphen(mixed $hyphen)

Set the hyphen text or object to use as a hyphen marker.

public getHyphen(): Hyphen

Get the current hyphen object.

public setCache(Cache $cache = null)

public getCache(): Cache

public setSource($source)

public getSource(): Source

public setMinWordLength(int $length = 0)

Words need to contain at least this many characters to be hyphenated.

public getMinWordLength(): int

public setLibxmlOptions(int $libxmlOptions)

Options to use for HTML parsing by libxml. See: https://www.php.net/manual/de/libxml.constants.php.

public excludeAll()

Exclude all elements.

public excludeElement(string|string[] $elements)

Add one or more elements to exclude from HTML.

public excludeAttribute(string|string[] $attributes, $value = null)

Add one or more elements with attributes to exclude from HTML.

public excludeXpath(string|string[] $queries)

Add one or more xpath queries to exclude from HTML.

public includeElement(string|string[] $elements)

Add one or more elements to include from HTML.

public includeAttribute(string|string[] $attributes, $value = null)

Add one or more elements with attributes to include from HTML.

public includeXpath(string|string[] $queries)

Add one or more xpath queries to include from HTML.

public splitWord(string $word): array

Split a single word on where the hyphenation would go. Punctuation is not supported, only simple words. For parsing whole sentences please use Syllable::splitWords() or Syllable::splitText().

public splitWords(string $text): array

Split a text into an array of punctuation marks and words, splitting each word on where the hyphenation would go.

public splitText(string $text): array

Split a text on where the hyphenation would go.

public hyphenateWord(string $word): string

Hyphenate a single word.

public hyphenateText(string $text): string

Hyphenate all words in the plain text.

public hyphenateHtml(string $html): string

Hyphenate all readable text in the HTML, excluding HTML tags and attributes. Deprecated: Use the UTF-8 capable hyphenateHtmlText() instead. This method is kept only for backward compatibility and will be removed in the next major version 2.0.

public hyphenateHtmlText(string $html): string

Hyphenate all readable text in the HTML, excluding HTML tags and attributes. This method is UTF-8 capable and should be preferred over hyphenateHtml().

public histogramText(string $text): array

Count the number of syllables in the text and return a map with syllable count as key and number of words for that syllable count as the value.
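As a rough illustration of the histogram shape described above (syllable count as key, number of words as value), the sketch below sums such a map into a total syllable count. The helper name `totalSyllables` and the sample data are made up for this example; it does not call the library, and assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: sums a histogram shaped like the one histogramText()
// is documented to return (syllable count => number of words).
function totalSyllables(array $histogram): int
{
    $total = 0;
    foreach ($histogram as $syllablesPerWord => $wordCount) {
        $total += $syllablesPerWord * $wordCount;
    }
    return $total;
}

// Made-up sample: 3 one-syllable words, 2 two-syllable words, 1 four-syllable word.
$histogram = [1 => 3, 2 => 2, 4 => 1];
echo totalSyllables($histogram), PHP_EOL; // 3*1 + 2*2 + 1*4 = 11
```

In practice countSyllablesText() already provides this total directly; the sketch only shows how the histogram relates to it.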

public countWordsText(string $text): int

Count the number of words in the text.

public countSyllablesText(string $text): int

Count the number of syllables in the text.

public countPolysyllablesText(string $text): int

Count the number of polysyllables in the text.

Development

Update language files

Run

composer dump-autoload --dev
./build/update-language-files

to fetch the latest language files remotely and optionally use environment variables to customize the update process:

CONFIGURATION_FILE

Specify the absolute path of the configuration file where the language files to be downloaded are defined. The configuration file has the following format:

{
	"files": [
		{
			"_comment": "<comment>",
			"fromUrl": "<absolute-remote-file-url>",
			"toPath": "<relative-local-file-path>",
			"disabled": <true|false>
		}
	]
}

where the attributes are self-explanatory and _comment and disabled are optional. See for example build/update-language-files.json. Default: The build/update-language-files.json file of this package.
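To make the format concrete, here is a minimal sketch that reads a configuration of this shape and keeps only the entries that are not disabled. The URLs, paths and the helper name `enabledFiles` are placeholders invented for the example; this is not the code used by build/update-language-files, and it assumes PHP 7.3 or later.

```php
<?php
// Sample configuration in the documented format; URLs and paths are placeholders.
$json = <<<'JSON'
{
    "files": [
        {
            "_comment": "Example entry",
            "fromUrl": "https://example.org/hyph-en-us.tex",
            "toPath": "languages/hyph-en-us.tex"
        },
        {
            "fromUrl": "https://example.org/hyph-xx.tex",
            "toPath": "languages/hyph-xx.tex",
            "disabled": true
        }
    ]
}
JSON;

// Hypothetical helper: returns the entries that are not disabled.
function enabledFiles(string $json): array
{
    $config = json_decode($json, true);
    if (!is_array($config) || !isset($config['files']) || !is_array($config['files'])) {
        throw new InvalidArgumentException('Configuration must contain a "files" array.');
    }
    return array_values(array_filter(
        $config['files'],
        static function (array $file): bool {
            // "disabled" is optional; a missing key counts as enabled.
            return empty($file['disabled']);
        }
    ));
}

foreach (enabledFiles($json) as $file) {
    echo $file['fromUrl'], ' -> ', $file['toPath'], PHP_EOL;
}
```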

MAX_REDIRECTS

Specify the maximum number of URL redirects allowed when retrieving a language file. Default: 1.

WITH_COMMIT

Create (1) or skip (0) a Git commit from the updated language files. Default: 0.

LOG_LEVEL

Set the verbosity of the script to verbose (6), warnings and errors (4), errors only (3) or silent (0). Default: 6.

For example use

composer dump-autoload --dev
LOG_LEVEL=0 ./build/update-language-files

to silently run the script without outputting any logging.

Update API documentation

Run

composer dump-autoload --dev
./build/generate-docs

to update the API documentation in this README.md. This should be done when the Syllable class has been modified. Optionally, you can use environment variables to modify the documentation update process:

WITH_COMMIT

Create (1) or skip (0) a Git commit from the adapted files. Default: 0.

LOG_LEVEL

Set the verbosity of the script to verbose (6), warnings and errors (4), errors only (3) or silent (0). Default: 6.

Create release

Run

composer dump-autoload --dev
./build/create-release

to create a local release of the project by adding a changelog to this README.md. Optionally, you can use environment variables to modify the release process:

RELEASE_TYPE

Set the release type to major (0), minor (1) or patch (2) release. Default: 2.

WITH_COMMIT

Create (1) or skip (0) a Git commit from the adapted files and apply the release tag. Default: 0.

LOG_LEVEL

Set the verbosity of the script to verbose (6), warnings and errors (4), errors only (3) or silent (0). Default: 6.

Tests

Run

composer install
./vendor/bin/phpunit

to execute the tests.

Changes

1.7

  • Use \hyphenations case-insensitive (like \patterns)
  • Correct handling of UTF-8 character sets when hyphenating HTML using the new Syllable::hyphenateHtmlText()
  • Replace invalid "en" with "en-us" as default language of Syllable
  • Update of hyph-de.tex

1.6

  • Revert renaming of API method names
  • Use cache version as string instead of number
  • Cover caching with tests
  • Reduce the PHP test matrix to the latest versions of PHP 5, 7 and 8
  • Check via GitHub Action if the API documentation is up-to-date
  • Update API reference
  • Fix API documentation of an array as parameter default value
  • Satisfy StyleCI
  • Commit changed files of entire working tree in build context
  • Support for generation of API documentation in README.md
  • Add words with reduced hyphenation to collection from PR #26
  • Satisfy StyleCI
  • Add test for collection of words with reduced hyphenation
  • Refactor splitWord(), splitWords() and splitText() of Syllable class
  • Remove @covers annotation in tests
  • Added splitWords and various code quality improvements
  • Update the README.md copyright claim on release
  • Skip GitHub Action scheduler in forks and run tests only in PR context
  • Allow GitHub Action "Update languages" workflow to bypass reviews
  • Use German orthography from 2006 as standard orthography

1.5.5

  • Automatic update of 74 languages

1.5.4

  • Automatically run tests for every push and pull request
  • Automatic monthly update and release of language files
  • Fix small typo in README and add 'use' in example.
  • Use same code format as in src/Source/File.php
  • Fix opening brace
  • Remove whitespace
  • Fix closing brace
  • Use PHP syntax highlighting

1.5.3

  • Fixed PHP 7.4 compatibility (#37) by @Dargmuesli.

1.5.2

  • Fixed bug reverted in refactoring (continue 3) by @Dargmuesli.

1.5.1

  • Fixed bug reverted in refactoring (continue 2).

1.5

  • Refactored for modern PHP and support for current PHP version.

1.4.6

  • Added setMinWordLength($length) and getMinWordLength() to limit hyphenation to words with at least the specified number of characters.

1.4.5

  • Fixes for composer.

1.4.4

  • Composer autoloader added

1.4.3

  • Improved documentation

1.4.2

  • Updated spanish language files.
  • Initial PHPDoc.

1.4.1

  • More fixes for apostrophes in splitting.

1.4

  • Fix for French language handling
  • Refactor .tex loading into source class.
  • Massive cache performance increase (excessive writes).

1.3.1

  • Fix slow initial cache writing; too many writes (only one was needed).
  • Removed min_hyphenation; mb_strlen takes more time than hashmap lookup.

1.3

  • Added array histogramText($text), integer countWordsText($text) and integer countPolysyllableText($text) methods.
  • Refactored cache interface.
  • Improved unittests.

1.2

  • Deprecated treshold feature. Was based on misinterpretation of the algorithm. Methods, constants and constructor signature unchanged, although you can now omit the treshold if you want (or leave it in, it's detected as a "fake" treshold).

phpsyllable's People

Contributors

alexander-nitsche, blackskyliner, curtisgibby, dargmuesli, jarilehtinen, salagir, stylecibot, telixj, vanderlee


phpsyllable's Issues

Split sentence into array of arrays of syllables of each word

Hi Van der Lee,

Is it possible for any of your functions to return an array of arrays of syllables of each word?

Example: "I am working".
Output:

array(

    0 => array(

        0 => "I"

    ), 

    1 => array(

        0 => "am"

    ), 

    2 => array(

        0 => "work", 

        1 => "ing"

    )

);
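One way to get this nested shape is to split the sentence into words and apply a word splitter to each of them. The sketch below takes the splitter as a callable so that it runs on its own; `splitSentence` and the toy splitter are hypothetical names, and with the library loaded something like `[$syllable, 'splitWord']` could presumably be passed instead. Punctuation handling is left out on purpose, and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: applies a word splitter to every word of a sentence.
// $splitWord is any callable that turns one word into an array of syllables.
function splitSentence(string $text, callable $splitWord): array
{
    // Split on whitespace only; punctuation is deliberately not handled.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_map($splitWord, $words);
}

// Stand-in splitter with a tiny hard-coded table, for demonstration only.
$toySplitter = static function (string $word): array {
    $table = ['working' => ['work', 'ing']];
    $key = strtolower($word);
    return isset($table[$key]) ? $table[$key] : [$word];
};

print_r(splitSentence('I am working', $toySplitter));
```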

Problems with hiatuses, words beginning with a vowel and more in Spanish

I know that the TeX files contain the information used to split the words, but I don't know how to fix them. Maybe you can do something with them.

Simple Hiatus

caoba -> cao-ba instead of ca-o-ba
saeta -> sae-ta instead of sa-e-ta
chiita -> chií-ta instead of chi-i-ta

Accentual hiatus

saúco -> saú-co instead of sa-ú-co.
sabía -> sa-bía instead of sa-bí-a.

Sporadic hiatus

píe -> píe instead of pí-e
río -> río instead of rí-o

  • The other problem is with words beginning with a vowel: the initial vowel is not split from the rest of the word

azul -> azul instead of a-zul
etéreo -> eté-reo instead of e-té-re-o
iluminado -> ilu-mi-na-do instead of i-lu-mi-na-do
otero -> ote-ro instead of o-te-ro
uniforme -> uni-for-me instead of u-ni-for-me

  • And finally, some rarer examples (I have many more)

ábaco -> áb-a-co instead of á-ba-co
emanciparse -> em-an-ci-par-se instead of e-man-ci-par-se
separar -> se-p-a-rar instead of se-pa-rar

Post-processing of first run of language update and tests workflow

Today's first scheduled run of the new GitHub Action workflows requires some additional fixes:

  1. Reduce the events that trigger GitHub Action runs, e.g. when pushing into a PR, the tests run twice.
  2. Bypass manual review requests on automatic PR merge during language update.
  3. Update year of copyright claim in README.md automatically on language update release.

phpSyllable does not work with Russian language

  1. [:alpha:] does not work with Cyrillic characters. You must use [a-zA-Zа-яА-ЯёЁ] instead.
  2. $subword .= $text[$index];
    output: ��� symbols
  3. utf8_encode output:
    неÑ�Ñ�Ñ�иÑ�Ñ�ваÑ�аколиÑ�еÑ
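The symptoms listed above are typical of byte-wise string handling applied to UTF-8 text. The following sketch is independent of the library and only illustrates the general PHP behaviour, assuming the mbstring extension is available: indexing a string with `[]` returns single bytes, `mb_substr()` returns whole characters, and `\p{L}` with the `/u` modifier matches letters of any script. Note also that `utf8_encode()` converts from ISO-8859-1 to UTF-8, so applying it to text that is already UTF-8 double-encodes it.

```php
<?php
$text = 'привет мир';

// Byte indexing cuts multi-byte characters apart: $text[0] is only the first
// byte of the two-byte character "п" and is not valid UTF-8 on its own.
$firstByte = $text[0];
var_dump(mb_check_encoding($firstByte, 'UTF-8')); // bool(false)

// mb_substr() works on characters instead of bytes.
$firstChar = mb_substr($text, 0, 1, 'UTF-8');
echo $firstChar, PHP_EOL;

// \p{L} with the /u modifier matches letters of any script, including Cyrillic.
preg_match_all('/\p{L}+/u', $text, $matches);
print_r($matches[0]);
```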

Min word count after hyphenation

Hello,

at first I'd like to thank you for the great module you've built!

Only one thing bothers me: is there a way to set a minimum number of characters that must precede the hyphen, in order to avoid hyphenations like this:

This is a very good do-
cument.

Better would be to have:

This is a very good docu-
ment.

So the minWordCountAfterHyphenation would be 4 (or 3) in this case.

Is this already possible or would it be an option for the future?

Thanks in advance!

Bye Defcon0
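A possible way to think about this request: given the syllables of a word, keep only the hyphenation points that leave enough characters on each side. The sketch below is not part of the library; `hyphenateWithMargins`, its parameters and the split of "document" into do/cu/ment are assumptions taken from the example in this issue. It assumes PHP 7 or later with mbstring.

```php
<?php
// Hypothetical helper: joins syllables, keeping only hyphenation points that
// leave at least $minBefore characters before and $minAfter characters after.
function hyphenateWithMargins(array $parts, int $minBefore, int $minAfter, string $hyphen = '-'): string
{
    $parts = array_values($parts);
    $length = mb_strlen(implode('', $parts), 'UTF-8');
    $last = count($parts) - 1;
    $result = '';
    $position = 0; // number of characters emitted so far

    foreach ($parts as $index => $part) {
        $result .= $part;
        $position += mb_strlen($part, 'UTF-8');
        // Only insert a hyphen between syllables, and only if both margins hold.
        if ($index < $last && $position >= $minBefore && $length - $position >= $minAfter) {
            $result .= $hyphen;
        }
    }
    return $result;
}

echo hyphenateWithMargins(['do', 'cu', 'ment'], 3, 3), PHP_EOL; // docu-ment
echo hyphenateWithMargins(['do', 'cu', 'ment'], 1, 1), PHP_EOL; // do-cu-ment
```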

Seems to be totally buggy in French

Totally buggy in French

$syllable = new Syllable('fr');

//etc...
$syllable->splitWord('constitution'); returns Array ( [0] => constitution )
$syllable->splitWord('alphabet'); returns Array ( [0] => alphabet ) instead of al-pha-bet
$syllable->splitWord('formation'); Array ( [0] => for [1] => mation ) instead of for-ma-tion

Now, what I get when I set to Finnish. new Syllable('fi');
Array ( [0] => cons [1] => ti [2] => tu [3] => tion )
Array ( [0] => alp [1] => ha [2] => bet )
Array ( [0] => for [1] => ma [2] => tion )

How can I fix that ?

Need absolute path to cache language files

I believe this to be an issue, or at least something that should be looked into.

phpSyllable-master/src/Cache/File.php on line 53:

file_put_contents($file, $this->encode(self::$data));

file_put_contents requires an absolute path. Instead, what it gets is a relative path.

Warning: file_put_contents(/home/customer/www/mydomain/apps/phpSyllable-master/src/cache/syllable.en-us.json): failed to open stream: No such file or directory in /home/customer/www/mydomain/apps/phpSyllable-master/src/Cache/File.php on line 53.

I was able to work around it by hard-coding an absolute path in File.php:

private function filename()
{
    $dir = '/home/customer/www/mydomain/apps/phpSyllable-master/src/Cache';
    return $dir.'/'.$this->getFilename(self::$language);
}

Note: I'm not sure why 'syllable.en-us.json' must be written to. For developing purposes? As far as I know, the file never changes, except the first time when it is generated.

Also "file_put_contents" works fine with a relative path on Windows WAMPServer. Only in a shared server virtual hosting environment I get an error/warning. If you cannot reproduce the issue, feel free to query.

lowercase vs uppercase hyphenation word list

Using the word list \hyphenation{...} in language files works only with words in upper or lower case, but not generalized (e.g. gegenstand does not match with Gegenstand). Maybe the script could just basically match in lowercase the hyphenation{..} word list?

Example: add some custom words in a language-file in a section
\hyphenation{
German
}
to define that "German" should not be split.

So only "German" is matched and not split, but "german" is split by regular rules of the language file. Thus, redundant rules for "german" and "German" have to be inserted in language files in the end.
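The suggestion above amounts to normalising case before looking a word up in the exception list, then applying the stored split to the original spelling. The sketch below illustrates that idea only; it is not the library's implementation, and `splitByException` as well as the sample list are hypothetical. It assumes PHP 7 or later with mbstring.

```php
<?php
// Hypothetical exception list keyed by lowercase word. A single-element value
// means "do not split"; the second entry is purely illustrative.
$exceptions = [
    'german'     => ['german'],
    'gegenstand' => ['ge', 'gen', 'stand'],
];

// Hypothetical lookup: matches case-insensitively, but cuts the parts out of
// the original word so that its capitalisation is preserved.
function splitByException(array $exceptions, string $word)
{
    $key = mb_strtolower($word, 'UTF-8');
    if (!isset($exceptions[$key])) {
        return null; // not an exception; regular patterns would apply
    }
    $parts = [];
    $offset = 0;
    foreach ($exceptions[$key] as $pattern) {
        $length = mb_strlen($pattern, 'UTF-8');
        $parts[] = mb_substr($word, $offset, $length, 'UTF-8');
        $offset += $length;
    }
    return $parts;
}

print_r(splitByException($exceptions, 'Gegenstand'));
print_r(splitByException($exceptions, 'GERMAN'));
```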

Cleanup

Hi @vanderlee ,

these branches might be ready for deletion as they are already merged or outdated:

and these issues might be ready for closing as they are addressed in the current master branch:

and this pull request might be ready for closing as it is also addressed in the current master branch:

Greetings
Alex

White-/Blacklist certain HTML Elements in hyphenateHtml()

Is it possible to add some kind of white- and/or blacklist for certain HTML elements while processing the HTML to hyphenate?

Background: I am trying to use this on whole content areas of dynamically generated pages. Those pages may contain <script> elements, and the JavaScript is then destroyed by the hyphenation process, since there is currently no white-/blacklist.

I could also write a PR for this feature. I guess it should be simple, as we would only need to modify the recursive HTML DOM walker.

Create cache directory if it does not exist

When the configured cache directory does not exist PHP will throw an error. I guess because file_put_contents relies on the file path to exist.

Sure it's not a big deal to create the directory manually. I assume most folks don't want to track the cache files in version control though so they are likely to list the cache directory in .gitignore. Since you can't really track the directory itself while excluding all files within, you have to create the cache directory every time you pull a fresh copy of the project.

Although there are workarounds I think it's slightly more convenient to create the directory when it does not exist. Unless of course there is a specific disadvantage I can't think of right now.
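Until the library does this itself, the directory can be created in user code before the cache is configured. The following is a minimal sketch using only standard PHP functions; `ensureDirectory` and the demo path are made-up names, and PHP 7 or later is assumed.

```php
<?php
// Hypothetical helper: makes sure a directory exists and returns its absolute path.
function ensureDirectory(string $dir): string
{
    // The second is_dir() check covers another process creating it concurrently.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    return realpath($dir);
}

$cacheDir = ensureDirectory(sys_get_temp_dir() . '/syllable-cache-demo');
echo $cacheDir, PHP_EOL;
// With the library loaded, this path could then be passed to Syllable::setCacheDir().
```

Returning the result of realpath() also yields an absolute path, which avoids depending on the current working directory.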

Brazilian Portuguese updates

Hello Vanderlee!
First of all, congrats on your project, it saves a lot of time!
I'm including phpSyllable in my personal project to count poetic syllables and check certain types of poems.

I ran into the same problems that another user reported for the Spanish language. I have already updated the .tex file in the folder from http://tug.org/tex-hyphen/#languages.

But the situation is the same... any ideas how to solve this?

Best regards and congrats again.

Error when trying to use the functions

Hi Van der Lee,

I really appreciate your project, well done!
There's just one problem: whenever I try to use one of the functions of the class Syllable, it gives me this error:
Warning: file_put_contents(/Users/victor/Sites/pyglatin/phpSyllable-master/classes/../cache/syllable.en-us.json): failed to open stream: No such file or directory in /Users/victor/Sites/pyglatin/phpSyllable-master/classes/Syllable_Cache_FileAbstract.php on line 43

hyphenateHtml messes up certain symbols

Hello, when I use this method on some HTML text, it breaks certain characters, but when I use hyphenateText it doesn't break these characters (though it obviously breaks the HTML).
Here is an example text:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Here is how it comes out after I use this method on them:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Notice how these long dashes turned into — . (It hyphenates it fine, I just removed it to make it easier to see the problem)
This problem is caused by it only being partial html and loadHTML being unable to tell which encoding it is.
Possible solution would be something like $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8")); //
Or this dirty, dirty hack $dom->loadHTML('<?xml encoding="UTF-8">' . $html) .

P.S. <script> tag should probably be excluded from hyphenation by default.
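The second workaround mentioned above can be tried in isolation. The sketch below uses only PHP's DOM extension, not the library, and shows that prefixing an XML encoding declaration makes `DOMDocument::loadHTML()` read a partial UTF-8 fragment correctly. The sample fragment is shortened from the text in this issue. As a side note, the `'HTML-ENTITIES'` target of `mb_convert_encoding()` used in the first workaround is deprecated as of PHP 8.2.

```php
<?php
$html = '<p>last October — then newly transitioned</p>';

$dom = new DOMDocument();
// Partial HTML may trigger parser warnings; collect them instead of printing.
libxml_use_internal_errors(true);
// The XML declaration prefix tells libxml to read the fragment as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Read the paragraph text back; the long dash should survive unchanged.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```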

Showing the stress on syllables

First I would like to thank you for this awesome library,
Second do you think it is possible to also specify the stressed syllable? for many purposes having the stressed syllable is also important and can help a lot,

Thanks a lot

Uppercase words

It seems that phpSyllable can hyphenate Finnish word "kolmivaihekilowattituntimittari" very well, but fails (returns the whole word) if written in uppercase letters ("KOLMIVAIHEKILOWATTITUNTIMITTARI").

I'm using:

$syllable->setTreshold(Syllable::TRESHOLD_MOST);

Ideas?

Replace test execution by Travis CI with GitHub Action

Travis CI has more or less limited its service to paying customers and withdrawn from activities in open source communities.

The current Travis configuration lets GitHub checks fail with

continuous-integration/travis-ci Expected — Waiting for status to be reported

Replace / remove outdated German language file hyph-de.tex

The

  • hyph-de.tex

is 12 years old, unlike the more specific language files

  • hyph-de-1901.tex
  • hyph-de-1996.tex
  • hyph-de-ch-1901.tex

which are 8 years old.

Also, hyph-de.tex is no longer available on https://tug.org/tex-hyphen/#languages or on the corresponding CTAN mirrors. Last but not least, the author of the hyphen patterns, Werner Lemberg, has confirmed that they once replaced the first general one with the more specific ones.

The question is technically whether the hyph-de.tex should be replaced by one of the others, since many projects might rely on it, or whether it should be removed completely?

Request for feedback; deprecated splitWord

Looking into the data, the splitWord method in Syllable appears to be somewhat useless; it produces incorrect results at times.

When provided with the same single-word input with random punctuation, splitText produces the results expected for splitWord.

Propose to mark splitWord as deprecated and route to splitText internally.

Similarly, the proposed splitWords would be better based on splitText.

Allow auto merge for this repository

Hi Martijn,

for the automatic update of languages to work, this repository must be allowed to automatically merge PRs. It is explained here and can only be allowed by the administrator. Currently the "Update languages" workflow fails with "Pull request - Auto merge is not allowed for this repository".

Greetings
Alex

Reworking API internals

I would like to discuss and track changes to the project in this ticket regarding an update to the internals and adding the possibility for more interchangeable Components like PSR-6 Caching without breaking legacy compatibility as mentioned in #18.

I extracted the public facing API of the Syllable class into an interface which one of the new components would have to adhere to so we keep compatibility.

<?php

namespace Vanderlee\PhpSyllable;

/**
 * This interface describes the Syllable class prior X.Y.Z
 * 
 * To keep compatibility with already existing implementations into applications
 * this interface is a necessary evil. It will make sure that the public facing
 * class will at least adhere to the promise of pre X.Y.Z versions regarding the
 * public API.
 * 
 * The internals will be reworked to a whole new base w/o breaking old usages.
 * PHP 5.6 Compatibility should be kept until EOL or have a version_compare switch.
 */
interface LegacyInterface{
    // Global Configuration statics
    public static function setCacheDir($dir);
    public static function setEncoding($encoding = null);
    public static function setLanguageDir($dir);

    // Configurable Language
    public function setLanguage($language);

    // Hyphen-to-use Configuration
    public function setHyphen(Syllable_Hypen_Interface $hyphen);
    public function getHyphen();

    // Cache Configuration
    public function setCache(Syllable_Cache_Interface $Cache = null);
    public function getCache();

    // Source configuration
    public function setSource(Syllable_Source_Interface $Source);
    public function getSource();

    // Hyphenation Configuration
    public function setMinWordLength($length = 0);
    public function getMinWordLength();

    // HTML Interface
    public function excludeAll();
    public function excludeElement($elements);
    public function excludeAttribute($attributes, $value = null);
    public function excludeXpath($queries);
    public function includeElement($elements);
    public function includeAttribute($attributes, $value = null);
    public function includeXpath($queries);
    public function hyphenateHtml($html);

    // Text interface
    public function splitWord($word);
    public function splitText($text);
    public function hyphenateWord($word);
    public function hyphenateText($text);

    // Stats
    public function histogramText($text);
    public function countWordsText($text);
    public function countSyllablesText($text);
    public function countPolysyllablesText($text);

    // Already deprecated!
    public function setTreshold($treshold);
    public function getTreshold();
}

Each commented block of functions should get its own implementation handler classes, such as the hyphenation algorithm, the HTML processing, etc. That way we remove all the logic from the "master class" we currently have.

To tackle the problem with the class namespacing for legacy projects I would suggest modifying the project autoloader. There we could, a bit like the Twig project does in their class files, register class_alias to the old Syllable class names. We also need to add at least PSR-0 or PSR-4 compatible autoloading for those not using composer.

/**
* Classloader for the library
* @param string $class
*/
function Syllable_autoloader($class) {
	// The new classes will reside in PROJECT_ROOT/src/
	// Whereas the \\Vanderlee\\PhpSyllable namespace is the root namespace of src/
	// So a \\Vanderlee\\PhpSyllable\\Hyphen\\Dash would be in src/Hyphen/Dash.php

	$classWithoutRootNamespace = str_replace('Vanderlee\\PhpSyllable\\', '', $class);
	$classFile = __DIR__ 
		. DIRECTORY_SEPARATOR . '..' 
		. DIRECTORY_SEPARATOR . 'src' 
		. DIRECTORY_SEPARATOR . str_replace('\\', DIRECTORY_SEPARATOR, $classWithoutRootNamespace).'.php';

	if (file_exists($classFile)) {
		require $classFile;

		return true; // This will help class_exists to work properly
	}
	
	return false; // This will help class_exists to work properly
}

spl_autoload_register('Syllable_autoloader');

// Bind old Class Names to be backwards compatible
// All files in /classes will be deleted and bound to their new equivalent
// which will then reside within /src
class_alias('\\Vanderlee\\PhpSyllable\\Syllable', '\\Syllable');

I would say this whole rework should have multiple stages.

  1. Maybe write some more tests so we get a better coverage and have a more solid refactoring base to test against.
  2. Rewrite the class structure to the new namespacing and greenify all tests w/o adjusting them
  3. Deprecate the usage of the old "master class" interface and document the proper, but more configuration-verbose, wiring of all the separate classes into a service class with a smaller public interface. (It should really only have a hyphenate($text, $language) or so; the rest should be initialization work, whereas languages would have to be registered to be known in the process.)

What are your thoughts on this? Did I miss something important? Or would you tackle the whole problem in another way?

Results differ from syllable.toyls.com

Hi, thanks for this library.

I assume that the website https://syllable.toyls.com/ uses the same library underneath? If so I'm not sure why I am getting different (incorrect) results on my end whereas correct results on the website.

I am using the following string:

Muchos años después, frente al pelotón de fusilamiento, el coronel Aureliano Buendía había de recordar aquella tarde remota en que su padre lo llevó a conocer el hielo. Macondo era entonces una aldea de veinte casas de barro y cañabrava construidas a la orilla de un río de aguas diáfanas que se precipitaban por un lecho de piedras pulidas, blancas y enormes como huevos prehistóricos. El mundo era tan reciente, que muchas cosas carecían de nombre, y para mencionarlas había que señalarlas con el dedo. Todos los años, por el mes de marzo, una familia de gitanos desarrapados plantaba su carpa cerca de la aldea, y con un grande alboroto de pitos y timbales daban a conocer los nuevos inventos. Primero llevaron el imán. Un gitano corpulento, de barba montaraz y manos de gorrión, que se presentó con el nombre de Melquíades, hizo una truculenta demostración pública de lo que él mismo llamaba la octava maravilla de los sabios alquimistas de Macedonia.

In code I am getting this result:

Mu-chos años de-s-pués, fre-n-te al pe-lo-tón de fu-si-la-mie-n-to, el co-ro-nel Au-re-liano Bue-n-día ha-bía de re-co-r-dar aque-lla ta-r-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-co-n-do era en-to-n-ces una al-dea de vei-n-te ca-sas de ba-rro y ca-ña-bra-va co-n-s-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, bla-n-cas y eno-r-mes co-mo hue-vos pre-hi-s-tó-ri-cos. El mu-n-do era tan re-cie-n-te, que mu-chas co-sas ca-re-cían de no-m-bre, y pa-ra me-n-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de ma-r-zo, una fa-mi-lia de gi-ta-nos des-arra-pa-dos pla-n-ta-ba su ca-r-pa ce-r-ca de la al-dea, y con un gra-n-de al-bo-ro-to de pi-tos y ti-m-ba-les da-ban a co-no-cer los nue-vos in-ve-n-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano co-r-pu-le-n-to, de ba-r-ba mo-n-ta-raz y ma-nos de go-rrión, que se pre-se-n-tó con el no-m-bre de Me-l-quía-des, hi-zo una tru-cu-le-n-ta de-mo-s-tra-ción pú-bli-ca de lo que él mi-s-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mi-s-tas de Ma-ce-do-nia.

Whereas on the website I am getting:

Mu-chos años des-pués, fren-te al pe-lo-tón de fu-si-la-mien-to, el co-ro-nel Au-re-liano Buen-día ha-bía de re-cor-dar aque-lla tar-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-con-do era en-ton-ces una al-dea de vein-te ca-sas de ba-rro y ca-ña-bra-va cons-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, blan-cas y enor-mes co-mo hue-vos prehis-tó-ri-cos. El mun-do era tan re-cien-te, que mu-chas co-sas ca-re-cían de nom-bre, y pa-ra men-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de mar-zo, una fa-mi-lia de gi-ta-nos desarra-pa-dos plan-ta-ba su car-pa cer-ca de la al-dea, y con un gran-de al-bo-ro-to de pi-tos y tim-ba-les da-ban a co-no-cer los nue-vos in-ven-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano cor-pu-len-to, de bar-ba mon-ta-raz y ma-nos de go-rrión, que se pre-sen-tó con el nom-bre de Mel-quía-des, hi-zo una tru-cu-len-ta de-mos-tra-ción pú-bli-ca de lo que él mis-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mis-tas de Ma-ce-do-nia.

Notice that in code the hyphenations "Ma-co-n-do" and "Bue-n-día" are incorrect, whereas on the website they are correct.

Could you help? Am I doing something wrong on my end that's causing incorrect hyphenation?

I am simply doing:

(new Syllable('es', '-'))->hyphenateText($string);

Add a method to count the number of syllables within the text

Is it possible to add a method to count the number of syllables within a text?

I solved my particular case this way:

// The histogram maps "syllables per word" to "number of words with that many syllables".
foreach ($syllable->histogramText($input) as $number_of_syllables => $words_with_that_syllables) {
    $this->syllables_count += $words_with_that_syllables * $number_of_syllables;
}

but a native method that returns the number of syllables directly would avoid this extra work.
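The workaround above can be isolated into a small helper. This is only a sketch: `totalSyllablesFromHistogram` is a hypothetical name, not part of phpSyllable, and it assumes the histogram has the shape `[syllablesPerWord => numberOfWords]` as used in the loop above.

```php
<?php
/**
 * Sum the total number of syllables from a histogram of the form
 * [syllablesPerWord => numberOfWords].
 * Hypothetical helper, not part of the library.
 */
function totalSyllablesFromHistogram(array $histogram)
{
    $total = 0;
    foreach ($histogram as $syllablesPerWord => $wordCount) {
        $total += $syllablesPerWord * $wordCount;
    }
    return $total;
}

// Made-up histogram: 3 one-syllable words, 2 two-syllable words, 1 four-syllable word.
$histogram = array(1 => 3, 2 => 2, 4 => 1);
echo totalSyllablesFromHistogram($histogram), "\n"; // 3 + 4 + 4 = 11
```

With the library, the array returned by `$syllable->histogramText($input)` would presumably be passed in place of the made-up histogram.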

German umlauts broken after using ->hyphenateHtml

I'm trying to use phpSyllable, but have a problem with the German umlauts.

My example:
$html = 'Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deutschen</b> umlauten. ä ö ü ß';
$syllable = new Syllable('de');
echo $syllable->hyphenateHtml($html);

results in

<p>Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deut&shy;schen</b> um&shy;lau&shy;ten. ä ö ü ß</p>

Maybe you can help me?

My Environment:
Ubuntu with Apache 2.4 and PHP 5.6.23

Min char count

Hi there

Great project. Thank you!

Is it possible to only hyphenate words with more than X characters?

So with a min length of 10 characters it would be:
"Provide your own paragraphs..." -> "Provide your own pa-ra-graphs..."

Kind regards,
Nico
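One way to get this behaviour without touching the hyphenator itself is to pre-filter words by length. The sketch below is an assumption-laden illustration: `hyphenateLongWords` and the `$hyphenateWord` callable are hypothetical names, and the demo hyphenator is a fixed lookup table standing in for a real one.

```php
<?php
/**
 * Apply $hyphenateWord only to runs of letters with at least $minLength
 * characters; shorter words are left untouched.
 * $hyphenateWord is a hypothetical callable: word in, hyphenated word out.
 */
function hyphenateLongWords($text, $minLength, callable $hyphenateWord)
{
    // \p{L}{N,} only matches letter runs of N or more characters.
    $pattern = '/\p{L}{' . max(1, (int) $minLength) . ',}/u';

    return preg_replace_callback(
        $pattern,
        function (array $match) use ($hyphenateWord) {
            return $hyphenateWord($match[0]);
        },
        $text
    );
}

// Stand-in hyphenator for demonstration only; it looks words up in a fixed table.
$demoHyphenator = function ($word) {
    $table = array('Provide' => 'Pro-vide', 'paragraphs' => 'pa-ra-graphs');
    return isset($table[$word]) ? $table[$word] : $word;
};

echo hyphenateLongWords('Provide your own paragraphs...', 10, $demoHyphenator), "\n";
// Provide your own pa-ra-graphs...
```

"Provide" has only 7 letters, so it is skipped, while "paragraphs" has exactly 10 and is hyphenated.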

Cache version in JSON cache file can be infinite decimal

Depending on the PHP version used, the cache version encoding in a JSON cache file may result in 1.399999999999 instead of 1.4. This is due to the internal handling of floating point numbers. It might be best to add the cache version as a string, as this is safe with JSON encoding.

I have observed the wrong behavior on Debian 11 with PHP 7.4.33 (cli) (built: Feb 14 2023 18:01:29) ( NTS ).
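A minimal sketch of the string-based approach suggested above. The payload layout and key names are assumptions for illustration, not the library's actual cache format. As background, in PHP 7.1 and later `json_encode()` formats floats according to the `serialize_precision` ini setting, which may explain why the output differs between environments; strings are not affected by it.

```php
<?php
// Hypothetical cache payload; the key names are assumptions for illustration.
$payload = array('version' => '1.4', 'patterns' => array());

// A string version is written exactly as given, independent of float settings.
$json = json_encode($payload);
echo $json, "\n"; // {"version":"1.4","patterns":[]}

// On load, compare as strings so no float formatting or rounding is involved.
$loaded = json_decode($json, true);
$isCurrent = isset($loaded['version']) && $loaded['version'] === '1.4';
var_dump($isCurrent); // bool(true)
```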

syllable to its own element?

Is it possible to pack each syllable into its own element?

From this (with whitespace after each word!)

<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy li</span>
<span class="hyphen">ved in the midst of the gre</span>

to this

<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy </span> (with Whitespace)
<span class="hyphen">li</span>
<span class="hyphen">ved </span> (with Whitespace)
<span class="hyphen">in </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">midst </span> (with Whitespace)
<span class="hyphen">of </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">gre</span>
<span class="hyphen">en </span> (with Whitespace)
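The desired output could be produced by splitting each word into syllables and wrapping the parts individually. The following is a sketch under stated assumptions: `wrapSyllables` and the `$splitWord` callable are hypothetical, and the demo splitter is a lookup table rather than a real hyphenation engine. Leading whitespace in the input is dropped in this simplified version.

```php
<?php
/**
 * Wrap every syllable in its own <span>, keeping the whitespace that
 * follows a word inside the span of that word's last syllable.
 * $splitWord is a hypothetical callable: word in, array of syllables out.
 */
function wrapSyllables($text, callable $splitWord, $class = 'hyphen')
{
    $html = '';
    $open = '<span class="' . htmlspecialchars($class, ENT_QUOTES, 'UTF-8') . '">';

    // Each token is a run of non-whitespace plus the whitespace after it.
    preg_match_all('/(\S+)(\s*)/u', $text, $tokens, PREG_SET_ORDER);
    foreach ($tokens as $token) {
        $parts = $splitWord($token[1]);
        if (count($parts) === 0) {
            $parts = array($token[1]);
        }
        // Attach the trailing whitespace to the last syllable of the word.
        $parts[count($parts) - 1] .= isset($token[2]) ? $token[2] : '';

        foreach ($parts as $part) {
            $html .= $open . htmlspecialchars($part, ENT_QUOTES, 'UTF-8') . "</span>\n";
        }
    }
    return $html;
}

// Stand-in splitter for demonstration only.
$demoSplitter = function ($word) {
    $table = array('Dorothy' => array('Do', 'ro', 'thy'), 'lived' => array('li', 'ved'));
    return isset($table[$word]) ? $table[$word] : array($word);
};

echo wrapSyllables('Dorothy lived in the midst', $demoSplitter);
```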

Processing really slow. How to configure cache correctly?

Processing large texts is incredibly slow for me. Calling hyphenateHtml takes 25 seconds and more. Is this reasonable for text with roughly 3100 characters? If so, any idea how I could speed this up?

I should mention that I'm not quite sure that caching is working properly. Although files are successfully created in the directory I configured...

$syllable = new Syllable();
$syllable->getCache()->setPath('/app/syllable_cache');
$syllable->setLanguage('de-1996');

$someText = $syllable->hyphenateHtml($someText);

... it doesn't seem to make a significant difference. With and without cache files, processing time is approximately the same. Is there something else I have to do to activate the cache? How is cache invalidation triggered? My input text doesn't change.

Update language files

Hi @vanderlee,

I was wondering if there is a way to update the language files. For example, the German language file /languages/hyph-de-1996.tex was last updated in 2016, while a 2021 version is available at its original location http://mirror.ctan.org/language/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-de-1996.tex.

If this work was previously done by hand and you think a script might be helpful to update language files at the push of a button, I would be happy to provide one.

Thanks for this great package!
Alex
