vanderlee / phpsyllable

PHP Syllable splitter/counter and Hyphenator for text and HTML. Multi-language, customisable, cached and fast!

Home Page: http://vanderlee.github.io/phpSyllable/

PHP 3.63% TeX 96.21% HTML 0.15%
hyphens tex syllables language php split hyphen-marker hyphenation hyphenation-algorithm hyphenation-rules

phpsyllable's Issues

Create cache directory if it does not exist

When the configured cache directory does not exist, PHP raises a warning, presumably because file_put_contents requires the parent directory of the target file to exist.

Sure, it's not a big deal to create the directory manually. I assume most folks don't want to track the cache files in version control, though, so they are likely to list the cache directory in .gitignore. Since you can't really track the directory itself while excluding all files within it, you have to create the cache directory every time you pull a fresh copy of the project.

Although there are workarounds, I think it would be slightly more convenient if the library created the directory when it does not exist. Unless, of course, there is a specific disadvantage I can't think of right now.
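A minimal sketch of the requested behaviour, for illustration only. The helper name ensureCacheDir and the demo path are hypothetical and not part of phpSyllable; the idea is simply to create the directory (including parents) before anything is written into it.

```php
<?php
// Hypothetical helper, not part of phpSyllable: make sure a cache directory
// exists before any cache file is written into it.
function ensureCacheDir(string $dir): string
{
    // mkdir() with $recursive = true also creates missing parent directories.
    // The second is_dir() check covers the case where another process
    // created the directory in the meantime.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Unable to create cache directory: $dir");
    }

    return $dir;
}

// Demo path under the system temp directory (an assumption for this example).
$cacheDir = ensureCacheDir(sys_get_temp_dir() . '/syllable-cache-demo/nested');
file_put_contents($cacheDir . '/example.json', json_encode(['ok' => true]));
```

With something like this in place, the cache directory can stay in .gitignore without a manual step after each fresh checkout.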

lowercase vs uppercase hyphenation word list

Words listed in \hyphenation{...} in a language file are matched case-sensitively: an entry only applies to the exact upper- or lowercase spelling given (e.g. gegenstand does not match Gegenstand). Maybe the script could simply compare against the \hyphenation{...} word list in lowercase?

Example: add a custom word to a language file in a section
\hyphenation{
German
}
to define that "German" should not be split.

As a result, only "German" is matched and left unsplit, while "german" is split by the regular rules of the language file. Redundant entries for both "german" and "German" therefore have to be added to the language file.
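The lowercase matching suggested above could look roughly like this. It is a sketch under assumptions: buildExceptionMap and findException are hypothetical names, the input array stands in for a parsed \hyphenation{...} list, and the mbstring extension is assumed to be available.

```php
<?php
// Hypothetical sketch: build a lookup table from a \hyphenation{...} word
// list whose keys are lowercased, so lookups ignore case.
function buildExceptionMap(array $words): array
{
    $map = [];
    foreach ($words as $word) {
        // Hyphens mark permitted break points; the key is the bare word.
        $key = mb_strtolower(str_replace('-', '', $word), 'UTF-8');
        $map[$key] = mb_strtolower($word, 'UTF-8');
    }

    return $map;
}

// Returns the (lowercased) exception entry for $word, or null if none exists.
function findException(array $map, string $word): ?string
{
    return $map[mb_strtolower($word, 'UTF-8')] ?? null;
}

$map = buildExceptionMap(['German', 'ge-gen-stand']);
var_dump(findException($map, 'german'));     // found via the entry "German"
var_dump(findException($map, 'Gegenstand')); // found via "ge-gen-stand"
```

The caller would still need to map the break positions back onto the original casing of the word; that step is left out here.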

phpSyllable does not work with Russian language

  1. [:alpha:] does not work with Cyrillic characters. You must use [a-zA-Zа-яА-ЯёЁ] instead.
  2. $subword .= $text[$index];
    output: ��� symbols
  3. utf8_encode output:
    неÑ�Ñ�Ñ�иÑ�Ñ�ваÑ�аколиÑ�еÑ
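The three symptoms above are typical of byte-oriented string handling applied to UTF-8 text. The following is a sketch of Unicode-aware alternatives, not phpSyllable's actual code; the sample text is made up and the mbstring extension is assumed. Note that utf8_encode() converts from ISO-8859-1, so applying it to text that is already UTF-8 encodes it a second time.

```php
<?php
// Sketch only: made-up sample text, not taken from the library.
$text = 'привет мир';

// \p{L} with the /u modifier matches letters of any script, so no
// per-language character list is needed.
preg_match_all('/\p{L}+/u', $text, $matches);
$words = $matches[0];

// $text[$index] returns a single byte; mb_substr() returns a whole character.
$firstChar = mb_substr($words[0], 0, 1, 'UTF-8');

// mb_strlen() counts characters, strlen() counts bytes.
$charCount = mb_strlen($words[0], 'UTF-8');
$byteCount = strlen($words[0]);
```

Here $byteCount is twice $charCount, because each Cyrillic letter in the sample takes two bytes in UTF-8.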

Add a method to count the number of syllables within the text

Is it possible to add a method that counts the number of syllables in a text?

I solved my particular case this way:

foreach ($syllable->histogramText($input) as $number_of_syllables => $words_with_that_syllables) {
    $this->syllables_count += $words_with_that_syllables * $number_of_syllables;
}

but a native method that returns the number of syllables would avoid the extra work.
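The arithmetic in the workaround can be isolated into a small standalone function. This sketch assumes the histogram has the shape used in the loop above (syllables per word as key, number of such words as value); the function name and the sample histogram are made up.

```php
<?php
// Assumed histogram shape: [syllables per word => number of such words].
function countSyllablesFromHistogram(array $histogram): int
{
    $total = 0;
    foreach ($histogram as $syllablesPerWord => $wordCount) {
        $total += $syllablesPerWord * $wordCount;
    }

    return $total;
}

// Made-up histogram: three 1-syllable words, two 2-syllable words and
// one 3-syllable word: 3*1 + 2*2 + 1*3 = 10.
echo countSyllablesFromHistogram([1 => 3, 2 => 2, 3 => 1]), PHP_EOL;
```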

Need absolute path to cache language files

I believe this to be an issue, or at least something that should be looked into.

phpSyllable-master/src/Cache/File.php on line 53:

file_put_contents($file, $this->encode(self::$data));

It looks as if file_put_contents needs an absolute path here but is given a relative one.

Warning: file_put_contents(/home/customer/www/mydomain/apps/phpSyllable-master/src/cache/syllable.en-us.json): failed to open stream: No such file or directory in /home/customer/www/mydomain/apps/phpSyllable-master/src/Cache/File.php on line 53.

I was able to work around it by hard-coding an absolute path in File.php:

private function filename()
{
    $dir = '/home/customer/www/mydomain/apps/phpSyllable-master/src/Cache';
    return $dir.'/'.$this->getFilename(self::$language);
}

Note: I'm not sure why 'syllable.en-us.json' must be written to. For development purposes? As far as I know, the file never changes after it is first generated.

Also, "file_put_contents" works fine with a relative path on Windows with WAMPServer. I only get the warning in a shared virtual hosting environment. If you cannot reproduce the issue, feel free to ask.

Brazilian Portuguese updates

Hello Vanderlee!
First of all, congratulations on your project; it saves a lot of time!
I'm including phpSyllable in a personal project to count poetic syllables and check certain types of poems.

I ran into the same problems that another user reported for Spanish. I have already updated the tex file in the languages folder from http://tug.org/tex-hyphen/#languages.

But the situation is the same. Any ideas on how to solve this?

Best regards and congrats again.

Reworking API internals

I would like to discuss and track changes to the project in this ticket regarding an update to the internals and adding the possibility for more interchangeable Components like PSR-6 Caching without breaking legacy compatibility as mentioned in #18.

I extracted the public facing API of the Syllable class into an interface which one of the new components would have to adhere to so we keep compatibility.

<?php

namespace Vanderlee\PhpSyllable;

/**
 * This interface describes the Syllable class prior to X.Y.Z
 * 
 * To keep compatibility with already existing implementations into applications
 * this interface is a necessary evil. It will make sure that the public facing
 * class will at least adhere to the promise of pre X.Y.Z versions regarding the
 * public API.
 * 
 * The internals will be reworked to a whole new base w/o breaking old usages.
 * PHP 5.6 Compatibility should be kept until EOL or have a version_compare switch.
 */
interface LegacyInterface{
    // Global Configuration statics
    public static function setCacheDir($dir);
    public static function setEncoding($encoding = null);
    public static function setLanguageDir($dir);

    // Configurable Language
    public function setLanguage($language);

    // Hyphen-to-use configuration
    public function setHyphen(Syllable_Hyphen_Interface $hyphen);
    public function getHyphen();

    // Cache Configuration
    public function setCache(Syllable_Cache_Interface $Cache = null);
    public function getCache();

    // Source configuration
    public function setSource(Syllable_Source_Interface $Source);
    public function getSource();

    // Hyphenation configuration
    public function setMinWordLength($length = 0);
    public function getMinWordLength();

    // HTML Interface
    public function excludeAll();
    public function excludeElement($elements);
    public function excludeAttribute($attributes, $value = null);
    public function excludeXpath($queries);
    public function includeElement($elements);
    public function includeAttribute($attributes, $value = null);
    public function includeXpath($queries);
    public function hyphenateHtml($html);

    // Text interface
    public function splitWord($word);
    public function splitText($text);
    public function hyphenateWord($word);
    public function hyphenateText($text);

    // Stats
    public function histogramText($text);
    public function countWordsText($text);
    public function countSyllablesText($text);
    public function countPolysyllablesText($text);

    // Already deprecated!
    public function setTreshold($treshold);
    public function getTreshold();
}

Each commented block of functions should get its own implementation handler class, e.g. one for the hyphenation algorithm, one for the HTML processing, and so on. That way we remove all the logic from the "master class" we currently have.

To tackle the problem with the class namespacing for legacy projects I would suggest modifying the project autoloader. There we could, a bit like the Twig project does in their class files, register class_alias to the old Syllable class names. We also need to add at least PSR-0 or PSR-4 compatible autoloading for those not using composer.

/**
* Classloader for the library
* @param string $class
*/
function Syllable_autoloader($class) {
	// The new classes will reside in PROJECT_ROOT/src/
	// where \Vanderlee\PhpSyllable is the root namespace of src/.
	// So \Vanderlee\PhpSyllable\Hyphen\Dash would be in src/Hyphen/Dash.php

	$classWithoutRootNamespace = str_replace('Vanderlee\\PhpSyllable\\', '', $class);
	$classFile = __DIR__ 
		. DIRECTORY_SEPARATOR . '..' 
		. DIRECTORY_SEPARATOR . 'src' 
		. DIRECTORY_SEPARATOR . str_replace('\\', DIRECTORY_SEPARATOR, $classWithoutRootNamespace).'.php';

	if (file_exists($classFile)) {
		require $classFile;

		return true; // This will help class_exists to work properly
	}
	
	return false; // This will help class_exists to work properly
}

spl_autoload_register('Syllable_autoloader');

// Bind old Class Names to be backwards compatible
// All files in /classes will be deleted and bound to their new equivalent
// which will then reside within /src
class_alias('\\Vanderlee\\PhpSyllable\\Syllable', '\\Syllable');

I would say this whole rework should have multiple stages.

  1. Maybe write some more tests so we get better coverage and a more solid base to refactor against.
  2. Rewrite the class structure to the new namespacing and get all tests green without adjusting them.
  3. Deprecate the old "master class" interface and document the proper, though more configuration-verbose, wiring of the separate classes into a service class with a smaller public interface. (It should really only have something like hyphenate($text, $language); the rest should be initialization work, and languages would have to be registered to be known in the process.)

What are your thoughts on this? Did I miss something important? Or would you tackle the whole problem in another way?

Post-processing of first run of language update and tests workflow

The first scheduled run of the new GitHub Action workflows today requires some additional fixes:

  1. Reduce the events that trigger GitHub Action runs, e.g. when pushing into a PR, the tests run twice.
  2. Bypass manual review requests on automatic PR merge during language update.
  3. Update year of copyright claim in README.md automatically on language update release.

Showing the stress on syllables

First, I would like to thank you for this awesome library.
Second, do you think it would be possible to also mark the stressed syllable? For many purposes, knowing the stressed syllable is important and can help a lot.

Thanks a lot

Replace test execution by Travis CI with GitHub Action

Travis CI has more or less limited its service to paying customers and withdrawn from activities in open source communities.

The current Travis configuration lets GitHub checks fail with

continuous-integration/travis-ci Expected — Waiting for status to be reported

hyphenateHtml messes up certain symbols

Hello, when I use this method on some HTML text, it breaks certain characters, but when I use hyphenateText it doesn't break these characters (though it obviously breaks the HTML).
Here is an example text:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Here is how it comes out after I use this method on them:

<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>

Notice how these long dashes turned into — . (It hyphenates it fine, I just removed it to make it easier to see the problem)
The problem is caused by the input being only partial HTML, so loadHTML cannot tell which encoding it uses.
A possible solution would be something like $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
or this dirty, dirty hack: $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

P.S. <script> tag should probably be excluded from hyphenation by default.
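The second workaround can be tried in isolation. This is a sketch with a made-up fragment, assuming the DOM extension is installed; it does not show how phpSyllable itself loads HTML. The dash is written as the escape \u{2014} so the example does not depend on the encoding of the source file.

```php
<?php
// Sketch only: $html is a made-up UTF-8 fragment containing a long dash (U+2014).
$html = "<p>last October \u{2014} then newly transitioned</p>";

$dom = new DOMDocument();
// Prefixing an XML encoding declaration hints that the fragment is UTF-8,
// so multi-byte characters are not read as single-byte Latin-1 characters.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// textContent is returned as UTF-8 and should contain the original dash.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

The mb_convert_encoding variant with 'HTML-ENTITIES' is deprecated as of PHP 8.2, which is one reason to prefer the prefix on current PHP versions.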

syllable to its own element?

Is it possible to wrap each syllable in its own element?

From this (with whitespace after each word!)

<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy li</span>
<span class="hyphen">ved in the midst of the gre</span>

to this

<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy </span> (with Whitespace)
<span class="hyphen">li</span>
<span class="hyphen">ved </span> (with Whitespace)
<span class="hyphen">in </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">midst </span> (with Whitespace)
<span class="hyphen">of </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">gre</span>
<span class="hyphen">en </span> (with Whitespace)

German umlauts broken after using ->hyphenateHtml

I'm trying to use phpSyllable, but have a problem with the German umlauts.

My example
$html = 'Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deutschen</b> umlauten. ä ö ü ß';
$syllable = new Syllable('de');
echo $syllable->hyphenateHtml($html);

results in

<p>Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deut&shy;schen</b> um&shy;lau&shy;ten. ä ö ü ß</p>

Maybe you can help me?

My Environment:
Ubuntu with Apache 2.4 and PHP 5.6.23

Uppercase words

It seems that phpSyllable hyphenates the Finnish word "kolmivaihekilowattituntimittari" very well, but fails (returns the whole word) if it is written in uppercase letters ("KOLMIVAIHEKILOWATTITUNTIMITTARI").

I'm using:

$syllable->setTreshold(Syllable::TRESHOLD_MOST);

Ideas?

Min word count after hyphenation

Hello,

First, I'd like to thank you for the great module you've built!

Only one thing bothers me: is there a way to set a minimum word count after hyphenation in order to avoid hyphenations like this:

This is a very good do-
cument.

Better would be to have:

This is a very good docu-
ment.

So the minWordCountAfterHyphenation would be 4 (or 3) in this case.

Is this already possible or would it be an option for the future?

Thanks in advance!

Bye Defcon0
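One way to read this request is as a minimum number of characters on each side of a break. The sketch below works on an already split word and merges parts that are too short at either edge. The function name, its parameters and the sample split are hypothetical and not part of phpSyllable; the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: merge syllables at the start and end of a word until
// the first and last chunk have at least the requested number of characters.
function mergeShortEdges(array $parts, int $minBefore, int $minAfter): array
{
    // Merge leading parts until the first chunk is long enough.
    while (count($parts) > 1 && mb_strlen($parts[0], 'UTF-8') < $minBefore) {
        $first = array_shift($parts);
        $parts[0] = $first . $parts[0];
    }

    // Merge trailing parts until the last chunk is long enough.
    while (count($parts) > 1 && mb_strlen($parts[count($parts) - 1], 'UTF-8') < $minAfter) {
        $last = array_pop($parts);
        $parts[count($parts) - 1] .= $last;
    }

    return $parts;
}

// Made-up split of "document"; with at least 4 characters before the first
// break, "do" and "cu" are merged.
print_r(mergeShortEdges(['do', 'cu', 'ment'], 4, 3));
```

A word that cannot satisfy both limits collapses into a single chunk, i.e. it is not hyphenated at all.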

Allow auto merge for this repository

Hi Martijn,

for the automatic update of languages to work, this repository must be allowed to automatically merge PRs. It is explained here and can only be allowed by the administrator. Currently the "Update languages" workflow fails with "Pull request - Auto merge is not allowed for this repository".

Greetings
Alex

Min char count

Hi there

Great project. Thank you!

Is it possible to only hyphenate words with more than X characters?

So with a min length of 10 characters it would be:
"Provide your own paragraphs..." -> "Provide your own pa-ra-graphs..."

Kind regards,
Nico

Update language files

Hi @vanderlee ,

I was wondering if there is a way to update the language files. For example, the German language file /languages/hyph-de-1996.tex was last updated in 2016, and there is a 2021 version at its original location http://mirror.ctan.org/language/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-de-1996.tex.

If this work was previously done by hand and you think a script might be helpful to update language files at the push of a button, I would be happy to provide one.

Thanks for this great package!
Alex

Error when trying to use the functions

Hi Van der Lee,

I really appreciate your project, well done!
There's just one problem: whenever I try to use one of the functions of the class Syllable, it gives me this error:
Warning: file_put_contents(/Users/victor/Sites/pyglatin/phpSyllable-master/classes/../cache/syllable.en-us.json): failed to open stream: No such file or directory in /Users/victor/Sites/pyglatin/phpSyllable-master/classes/Syllable_Cache_FileAbstract.php on line 43

Split sentence into array of arrays of syllables of each word

Hi Van der Lee,

Is it possible for any of your functions to return an array of arrays of syllables of each word?

Example: "I am working".
Output:

array(

    0 => array(

        0 => "I"

    ), 

    1 => array(

        0 => "am"

    ), 

    2 => array(

        0 => "work", 

        1 => "ing"

    )

);

Results differ from syllable.toyls.com

Hi, thanks for this library.

I assume that the website https://syllable.toyls.com/ uses the same library underneath? If so I'm not sure why I am getting different (incorrect) results on my end whereas correct results on the website.

I am using the following string:

Muchos años después, frente al pelotón de fusilamiento, el coronel Aureliano Buendía había de recordar aquella tarde remota en que su padre lo llevó a conocer el hielo. Macondo era entonces una aldea de veinte casas de barro y cañabrava construidas a la orilla de un río de aguas diáfanas que se precipitaban por un lecho de piedras pulidas, blancas y enormes como huevos prehistóricos. El mundo era tan reciente, que muchas cosas carecían de nombre, y para mencionarlas había que señalarlas con el dedo. Todos los años, por el mes de marzo, una familia de gitanos desarrapados plantaba su carpa cerca de la aldea, y con un grande alboroto de pitos y timbales daban a conocer los nuevos inventos. Primero llevaron el imán. Un gitano corpulento, de barba montaraz y manos de gorrión, que se presentó con el nombre de Melquíades, hizo una truculenta demostración pública de lo que él mismo llamaba la octava maravilla de los sabios alquimistas de Macedonia.

In code I am getting this result:

Mu-chos años de-s-pués, fre-n-te al pe-lo-tón de fu-si-la-mie-n-to, el co-ro-nel Au-re-liano Bue-n-día ha-bía de re-co-r-dar aque-lla ta-r-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-co-n-do era en-to-n-ces una al-dea de vei-n-te ca-sas de ba-rro y ca-ña-bra-va co-n-s-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, bla-n-cas y eno-r-mes co-mo hue-vos pre-hi-s-tó-ri-cos. El mu-n-do era tan re-cie-n-te, que mu-chas co-sas ca-re-cían de no-m-bre, y pa-ra me-n-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de ma-r-zo, una fa-mi-lia de gi-ta-nos des-arra-pa-dos pla-n-ta-ba su ca-r-pa ce-r-ca de la al-dea, y con un gra-n-de al-bo-ro-to de pi-tos y ti-m-ba-les da-ban a co-no-cer los nue-vos in-ve-n-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano co-r-pu-le-n-to, de ba-r-ba mo-n-ta-raz y ma-nos de go-rrión, que se pre-se-n-tó con el no-m-bre de Me-l-quía-des, hi-zo una tru-cu-le-n-ta de-mo-s-tra-ción pú-bli-ca de lo que él mi-s-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mi-s-tas de Ma-ce-do-nia.

Whereas on the website I am getting:

Mu-chos años des-pués, fren-te al pe-lo-tón de fu-si-la-mien-to, el co-ro-nel Au-re-liano Buen-día ha-bía de re-cor-dar aque-lla tar-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-con-do era en-ton-ces una al-dea de vein-te ca-sas de ba-rro y ca-ña-bra-va cons-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, blan-cas y enor-mes co-mo hue-vos prehis-tó-ri-cos. El mun-do era tan re-cien-te, que mu-chas co-sas ca-re-cían de nom-bre, y pa-ra men-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de mar-zo, una fa-mi-lia de gi-ta-nos desarra-pa-dos plan-ta-ba su car-pa cer-ca de la al-dea, y con un gran-de al-bo-ro-to de pi-tos y tim-ba-les da-ban a co-no-cer los nue-vos in-ven-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano cor-pu-len-to, de bar-ba mon-ta-raz y ma-nos de go-rrión, que se pre-sen-tó con el nom-bre de Mel-quía-des, hi-zo una tru-cu-len-ta de-mos-tra-ción pú-bli-ca de lo que él mis-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mis-tas de Ma-ce-do-nia.

Notice that in code the hyphenation "Ma-co-n-do" and "Bue-n-día" are incorrect whereas on the website they are correct.

Could you help? Am I doing something wrong on my end that's causing incorrect hyphenation?

I am simply doing:

(new Syllable( 'es', '-' ))->hyphenateText($string);

Seems to be totally buggy in French

Totally buggy in French

$syllable = new Syllable('fr');

//etc...
$syllable->splitWord('constitution'); returns Array ( [0] => constitution )
$syllable->splitWord('alphabet'); returns Array ( [0] => alphabet ) instead of al-pha-bet
$syllable->splitWord('formation'); Array ( [0] => for [1] => mation ) instead of for-ma-tion

Now, what I get when I set to Finnish. new Syllable('fi');
Array ( [0] => cons [1] => ti [2] => tu [3] => tion )
Array ( [0] => alp [1] => ha [2] => bet )
Array ( [0] => for [1] => ma [2] => tion )

How can I fix that ?

Request for feedback; deprecated splitWord

Looking into the data, the splitWord method in Syllable appears to be somewhat useless; it produces incorrect results at times.

When provided with the same single-word input with random punctuation, splitText produces the results expected for splitWord.

Propose to mark splitWord as deprecated and route to splitText internally.

Similarly, the proposed splitWords would be better based on splitText.

White-/Blacklist certain HTML Elements in hyphenateHtml()

Is it possible to add some kind of white- and/or blacklist for certain HTML elements while processing the HTML to hyphenate?

Background: I am trying to use this on whole content areas of dynamically generated pages. Those pages may contain <script> elements, and the JavaScript is then destroyed by the hyphenation process, since there is currently no white-/blacklist behaviour.

I could also write a PR for this feature. I guess it should be simple, as we would only need to modify the recursive HTML DOM walker.

Cleanup

Hi @vanderlee ,

these branches might be ready for deletion as they are already merged or outdated:

and these issues might be ready for closing as they are addressed in the current master branch:

and this pull request might be ready for closing as it is also addressed in the current master branch:

Greetings
Alex

Cache version in JSON cache file can be infinite decimal

Depending on the PHP version used, the cache version encoding in a JSON cache file may result in 1.399999999999 instead of 1.4. This is due to the internal handling of floating point numbers. It might be best to add the cache version as a string, as this is safe with JSON encoding.

I have observed the wrong behavior in Debian 11, PHP 7.4.33 (cli) (built: Feb 14 2023 18:01:29) ( NTS ).
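The string-based approach suggested above can be sketched as follows. The payload layout and key names are assumptions for illustration and do not reflect the actual cache file format.

```php
<?php
// Sketch: store the cache version as a string so that JSON encoding
// round-trips it verbatim, independent of float precision settings.
// The keys 'version' and 'patterns' are made up for this example.
$payload = ['version' => '1.4', 'patterns' => []];

$json = json_encode($payload);
$decoded = json_decode($json, true);

// version_compare() compares version strings without float arithmetic.
$isCurrent = version_compare($decoded['version'], '1.4', '==');

echo $json, PHP_EOL;
```

Comparing with version_compare also keeps working if the version later gains a third component such as 1.4.1, which a float could not represent.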

Replace / remove outdated German language file hyph-de.tex

The

  • hyph-de.tex

is 12 years old, unlike the more specific language files

  • hyph-de-1901.tex
  • hyph-de-1996.tex
  • hyph-de-ch-1901.tex

which are 8 years old.

Also, hyph-de.tex is no longer available on https://tug.org/tex-hyphen/#languages or on the corresponding CTAN mirrors. Last but not least, the author of the hyphen patterns, Werner Lemberg, has confirmed that they once replaced the first general one with the more specific ones.

The question is technically whether the hyph-de.tex should be replaced by one of the others, since many projects might rely on it, or whether it should be removed completely?

Processing really slow. How to configure cache correctly?

Processing large texts is incredibly slow for me. Calling hyphenateHtml takes 25 seconds and more. Is this reasonable for text with roughly 3100 characters? If so, any idea how I could speed this up?

I should mention that I'm not quite sure that caching is working properly. Although files are successfully created in the directory I configured...

$syllable = new Syllable();
$syllable->getCache()->setPath('/app/syllable_cache');
$syllable->setLanguage('de-1996');

$someText = $syllable->hyphenateHtml($someText);

... it doesn't seem to make a significant difference. With and without cache files, processing time is approximately the same. Is there something else I have to do to activate the cache? How is cache invalidation triggered? My input text doesn't change.

Problems with hiatuses, words beginning with a vowel and more in Spanish

I know that the TeX files contain the information used to split the words, but I don't know how to fix them; maybe you can do something with them.

Simple Hiatus

caoba -> cao-ba instead of ca-o-ba
saeta -> sae-ta instead of sa-e-ta
chiita -> chií-ta instead of chi-i-ta

Accentual hiatus

saúco -> saú-co instead of sa-ú-co.
sabía -> sa-bía instead of sa-bí-a.

Sporadic hiatus

píe -> píe instead of pí-e
río -> río instead of rí-o

  • The other problem is with words beginning with a vowel: the initial vowel is not split from the rest of the word

azul -> azul instead of a-zul
etéreo -> eté-reo instead of e-té-re-o
iluminado -> ilu-mi-na-do instead of i-lu-mi-na-do
otero -> ote-ro instead of o-te-ro
uniforme -> uni-for-me instead of u-ni-for-me

  • And finally, some rarer examples (I have many more)

ábaco -> áb-a-co instead of á-ba-co
emanciparse -> em-an-ci-par-se instead of e-man-ci-par-se
separar -> se-p-a-rar instead of se-pa-rar