vanderlee / phpsyllable
PHP Syllable splitter/counter and Hyphenator for text and HTML. Multi-language, customisable, cached and fast!
Home Page: http://vanderlee.github.io/phpSyllable/
When the configured cache directory does not exist, PHP reports an error, presumably because file_put_contents requires the parent directory to exist.
Sure, it's not a big deal to create the directory manually. I assume most folks don't want to track the cache files in version control though, so they are likely to list the cache directory in .gitignore. Since you can't really track the directory itself while excluding all files within, you have to create the cache directory every time you pull a fresh copy of the project.
Although there are workarounds, I think it would be slightly more convenient if the library created the directory when it does not exist. Unless of course there is a specific disadvantage I can't think of right now.
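A minimal sketch of the suggested behaviour, assuming the cache writer knows the full target file name. The helper name writeCacheFile is hypothetical and not part of phpSyllable; it only shows creating the missing directory before calling file_put_contents.

```php
<?php
/**
 * Write cache data to $file, creating the parent directory first if needed.
 * Hypothetical helper for illustration, not part of phpSyllable.
 */
function writeCacheFile(string $file, string $contents): bool
{
    $dir = dirname($file);
    // mkdir() with $recursive = true also creates missing parent directories.
    // The second is_dir() check covers a concurrent process creating it first.
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false; // the directory could not be created
    }
    return file_put_contents($file, $contents) !== false;
}

// Usage: the nested "cache" directory does not exist before the call.
$file = sys_get_temp_dir() . '/syllable_demo_' . uniqid() . '/cache/syllable.en-us.json';
$ok = writeCacheFile($file, '{"version":"1.4"}');
```

With this in place a fresh checkout would not need the cache directory to be created by hand.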
The \hyphenation{...} word list in language files matches words only in the exact case in which they are written; the match is not case-insensitive (e.g. gegenstand does not match Gegenstand). Maybe the script could simply match the \hyphenation{...} word list in lowercase?
Example: add a custom word to a language file in a section
\hyphenation{
German
}
to define that "German" should not be split.
Only "German" is matched and left unsplit, while "german" is split by the regular rules of the language file. As a result, redundant entries for both "german" and "German" have to be added to the language file.
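A rough sketch of the lowercase matching idea, assuming the \hyphenation{...} entries are available as a plain list of strings. The functions buildExceptionTable and lookupException are hypothetical and do not reflect how phpSyllable parses language files. The sketch also assumes that lowercasing does not change the number of characters in a word, which holds for most but not all Unicode characters.

```php
<?php
/**
 * Build a lookup table keyed by the lowercased word without hyphens.
 * Hypothetical helper for illustration, not phpSyllable's parser.
 *
 * @param string[] $entries e.g. ['German', 'ge-gen-stand']
 * @return array<string, string[]> lowercased word => its lowercased parts
 */
function buildExceptionTable(array $entries): array
{
    $table = [];
    foreach ($entries as $entry) {
        $key = mb_strtolower(str_replace('-', '', $entry), 'UTF-8');
        $table[$key] = explode('-', mb_strtolower($entry, 'UTF-8'));
    }
    return $table;
}

/**
 * Look a word up regardless of case. Returns the parts cut from the word
 * itself (so the input casing is kept), or null if no entry matches.
 */
function lookupException(array $table, string $word): ?array
{
    $key = mb_strtolower($word, 'UTF-8');
    if (!isset($table[$key])) {
        return null;
    }
    $parts = [];
    $offset = 0;
    foreach ($table[$key] as $part) {
        $length = mb_strlen($part, 'UTF-8');
        $parts[] = mb_substr($word, $offset, $length, 'UTF-8');
        $offset += $length;
    }
    return $parts;
}

$table = buildExceptionTable(['German', 'ge-gen-stand']);
$a = lookupException($table, 'german');     // matched: kept as one part
$b = lookupException($table, 'Gegenstand'); // matched: parts keep input casing
```

A single entry per word would then cover every casing of that word.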
Is it possible to add a method to count the number of syllables within a text?
I solved my particular case this way:
foreach ($syllable->histogramText($input) as $number_of_syllables => $words_with_that_syllables) {
    // Each histogram entry maps a syllable count to the number of words with that count.
    $this->syllables_count += $words_with_that_syllables * $number_of_syllables;
}
but a native method that returns the number of syllables would avoid the extra work.
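The same summation as a standalone function, as a small sketch. It assumes the histogram has the shape [syllables per word => number of words], which is inferred from the loop in this issue; the function name is hypothetical.

```php
<?php
/**
 * Sum a syllable histogram of the shape [syllablesPerWord => numberOfWords].
 * Hypothetical helper; the array shape is an assumption taken from the issue.
 */
function countSyllablesFromHistogram(array $histogram): int
{
    $total = 0;
    foreach ($histogram as $syllablesPerWord => $wordCount) {
        $total += $syllablesPerWord * $wordCount;
    }
    return $total;
}

// Three 1-syllable words, two 2-syllable words and one 4-syllable word.
$total = countSyllablesFromHistogram([1 => 3, 2 => 2, 4 => 1]);
```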
I believe this to be an issue, or at least something that should be looked into.
phpSyllable-master/src/Cache/File.php on line 53:
file_put_contents($file, $this->encode(self::$data));
file_put_contents requires an absolute path. Instead, what it gets is a relative path.
Warning: file_put_contents(/home/customer/www/mydomain/apps/phpSyllable-master/src/cache/syllable.en-us.json): failed to open stream: No such file or directory in /home/customer/www/mydomain/apps/phpSyllable-master/src/Cache/File.php on line 53.
I was able to work around it by defining an absolute path in File.php:
private function filename()
{
    $dir = '/home/customer/www/mydomain/apps/phpSyllable-master/src/Cache';
    return $dir.'/'.$this->getFilename(self::$language);
}
Note: I'm not sure why 'syllable.en-us.json' must be written to. For development purposes? As far as I know, the file never changes after it is first generated.
Also, "file_put_contents" works fine with a relative path on Windows WAMPServer. I only get the error/warning in a shared virtual hosting environment. If you cannot reproduce the issue, feel free to ask.
Hello Vanderlee!
First of all, congrats on your project; it saves a lot of time!
I'm including phpSyllable in my personal project to count poetic syllables and check certain types of poems.
I ran into the same trouble that another user found with the Spanish language. I already updated the tex file in the folder from http://tug.org/tex-hyphen/#languages.
But the situation is the same. Any ideas to solve this?
Best regards and congrats again.
I would like to discuss and track changes to the project in this ticket regarding an update to the internals and adding the possibility for more interchangeable Components like PSR-6 Caching without breaking legacy compatibility as mentioned in #18.
I extracted the public facing API of the Syllable class into an interface which one of the new components would have to adhere to so we keep compatibility.
<?php
namespace Vanderlee\PhpSyllable;
/**
* This interface describes the Syllable class prior to X.Y.Z
*
* To keep compatibility with already existing implementations into applications
* this interface is a necessary evil. It will make sure that the public facing
* class will at least adhere to the promise of pre X.Y.Z versions regarding the
* public API.
*
* The internals will be reworked to a whole new base w/o breaking old usages.
* PHP 5.6 Compatibility should be kept until EOL or have a version_compare switch.
*/
interface LegacyInterface{
// Global Configuration statics
public static function setCacheDir($dir);
public static function setEncoding($encoding = null);
public static function setLanguageDir($dir);
// Configurable Language
public function setLanguage($language);
// Hyphen-to-use configuration
public function setHyphen(Syllable_Hyphen_Interface $hyphen);
public function getHyphen();
// Cache Configuration
public function setCache(Syllable_Cache_Interface $Cache = null);
public function getCache();
// Source configuration
public function setSource(Syllable_Source_Interface $Source);
public function getSource();
// Hyphenation configuration
public function setMinWordLength($length = 0);
public function getMinWordLength();
// HTML Interface
public function excludeAll();
public function excludeElement($elements);
public function excludeAttribute($attributes, $value = null);
public function excludeXpath($queries);
public function includeElement($elements);
public function includeAttribute($attributes, $value = null);
public function includeXpath($queries);
public function hyphenateHtml($html);
// Text interface
public function splitWord($word);
public function splitText($text);
public function hyphenateWord($word);
public function hyphenateText($text);
// Stats
public function histogramText($text);
public function countWordsText($text);
public function countSyllablesText($text);
public function countPolysyllablesText($text);
// Already deprecated!
public function setTreshold($treshold);
public function getTreshold();
}
Each commented block of functions should get its own implementation handler classes, like the hyphenation algorithm, the HTML processing, etc. That way we move all the logic out of the "master class" we currently have.
To tackle the problem with class namespacing for legacy projects, I would suggest modifying the project autoloader. There we could, a bit like the Twig project does in its class files, register a class_alias to the old Syllable class names. We also need to add at least PSR-0 or PSR-4 compatible autoloading for those not using Composer.
/**
* Classloader for the library
* @param string $class
*/
function Syllable_autoloader($class) {
// The new classes will reside in PROJECT_ROOT/src/
// whereas the \Vanderlee\PhpSyllable namespace is the root namespace of src/
// So \Vanderlee\PhpSyllable\Hyphen\Dash would be in src/Hyphen/Dash.php
$classWithoutRootNamespace = str_replace('Vanderlee\\PhpSyllable\\', '', $class);
$classFile = __DIR__
. DIRECTORY_SEPARATOR . '..'
. DIRECTORY_SEPARATOR . 'src'
. DIRECTORY_SEPARATOR . str_replace('\\', DIRECTORY_SEPARATOR, $classWithoutRootNamespace).'.php';
if (file_exists($classFile)) {
require $classFile;
return true; // This will help class_exists to work properly
}
return false; // This will help class_exists to work properly
}
spl_autoload_register('Syllable_autoloader');
// Bind old Class Names to be backwards compatible
// All files in /classes will be deleted and bound to their new equivalent
// which will then reside within /src
class_alias('\\Vanderlee\\PhpSyllable\\Syllable', '\\Syllable');
I would say this whole rework should have multiple stages.
hyphenate($text, $language)
or so; the rest should be initialization work, whereas languages would have to be registered to be known in the process)
What are your thoughts on this? Did I miss something important? Or would you tackle the whole problem in another way?
The first scheduled run of the new GitHub Action workflows today requires some additional fixes:
First, I would like to thank you for this awesome library.
Second, do you think it would be possible to also mark the stressed syllable? For many purposes, knowing the stressed syllable is also important and can help a lot.
Thanks a lot
Travis CI has more or less limited its service to paying customers and withdrawn from activities in open source communities.
The current Travis configuration causes GitHub checks to fail with
continuous-integration/travis-ci Expected — Waiting for status to be reported
Hello, when I use this method on some HTML text, it breaks certain characters, but when I use hyphenateText it doesn't break these characters (though it obviously breaks the HTML).
Here is an example text:
<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>
Here is how it comes out after I use this method on them:
<p>When Revolution Medicines absorbed fellow Third Rock startup Warp Drive Bio into its operations last October — then newly transitioned from antifungal to oncology — the exec team was still reviewing options for the genome mining platform, which was the subject of a deal with Roche.</p>
Notice how these long dashes turned into — . (It hyphenates it fine, I just removed it to make it easier to see the problem)
This problem is caused by the input being only partial HTML, so loadHTML is unable to tell which encoding it is.
A possible solution would be something like $dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
Or this dirty, dirty hack: $dom->loadHTML('<?xml encoding="UTF-8">' . $html).
P.S. The <script> tag should probably be excluded from hyphenation by default.
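A sketch of a related workaround that sidesteps encoding detection entirely. It swaps in mb_encode_numericentity for the 'HTML-ENTITIES' conversion suggested above, since that conversion target is deprecated as of PHP 8.2. The function name loadUtf8Fragment is hypothetical, and the sketch assumes the input is valid UTF-8 and that the dom and mbstring extensions are available.

```php
<?php
/**
 * Load a UTF-8 HTML fragment without relying on libxml's encoding detection.
 * Hypothetical helper for illustration, not part of phpSyllable.
 */
function loadUtf8Fragment(string $html): DOMDocument
{
    // Turn every non-ASCII character into a numeric entity, so the parser
    // only ever sees ASCII bytes (for example U+2014 becomes &#8212;).
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$dom = loadUtf8Fragment('<p>last October — then newly transitioned</p>');
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
```

The DOM decodes the entities back into characters, so text nodes read from the document are UTF-8 again.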
Is it possible to pack each syllable into its own element?
From this (with whitespace after each word!)
<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy li</span>
<span class="hyphen">ved in the midst of the gre</span>
to this
<span class="hyphen">Do</span>
<span class="hyphen">ro</span>
<span class="hyphen">thy </span> (with Whitespace)
<span class="hyphen">li</span>
<span class="hyphen">ved </span> (with Whitespace)
<span class="hyphen">in </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">midst </span> (with Whitespace)
<span class="hyphen">of </span> (with Whitespace)
<span class="hyphen">the </span> (with Whitespace)
<span class="hyphen">gre</span>
<span class="hyphen">en </span> (with Whitespace)
I'm trying to use phpSyllable, but have a problem with the German umlauts.
My example
$html = 'Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deutschen</b> umlauten. ä ö ü ß';
$syllable = new Syllable('de');
echo $syllable->hyphenateHtml($html);
results in
<p>Dies ist ein <a href="http://example.com" target="_blank">Text</a> mit <b>Deut­schen</b> um­lau­ten. ä ö ü ß</p>
Maybe you can help me?
My Environment:
Ubuntu with Apache 2.4 and PHP 5.6.23
It seems that phpSyllable can hyphenate the Finnish word "kolmivaihekilowattituntimittari" very well, but fails (returns the whole word) if it is written in uppercase letters ("KOLMIVAIHEKILOWATTITUNTIMITTARI").
I'm using:
$syllable->setTreshold(Syllable::TRESHOLD_MOST);
Ideas?
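One possible workaround, sketched under assumptions: lowercase the word before splitting, then cut the original word at the same character offsets so its casing survives. The function splitPreservingCase and the stand-in splitter are hypothetical. In real use the callable would presumably be something like [$syllable, 'splitWord'], which is untested here, and the sketch assumes lowercasing does not change the character count.

```php
<?php
/**
 * Split the lowercased word, then cut the original at the same offsets.
 * Hypothetical helper for illustration, not part of phpSyllable.
 *
 * @param callable $split function(string $lowercaseWord): string[]
 * @return string[] parts of $word in its original casing
 */
function splitPreservingCase(string $word, callable $split): array
{
    $parts = [];
    $offset = 0;
    foreach ($split(mb_strtolower($word, 'UTF-8')) as $part) {
        $length = mb_strlen($part, 'UTF-8');
        $parts[] = mb_substr($word, $offset, $length, 'UTF-8');
        $offset += $length;
    }
    return $parts;
}

// Stand-in splitter for the demo: it only knows one lowercase word.
$fakeSplit = function (string $word): array {
    return $word === 'kilowatti' ? ['ki', 'lo', 'wat', 'ti'] : [$word];
};
$parts = splitPreservingCase('KILOWATTI', $fakeSplit);
```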
abeyant => abeyant
pipeline => pipeline
abradant => abradant
abraxas => abraxas
Beautiful => Beau-ti-ful
Engineering => En-gi-neer-ing
Doctor => Doc-tor
trade => trade
business => busi-ness
abortion gets presented as abor-tion when I was expecting a-bor-tion.
Hello,
first, I'd like to thank you for the great module you've built!
Only one thing bothers me: is there a way to set a minimum number of characters before a hyphenation point, in order to avoid hyphenations like this:
This is a very good do-
cument.
Better would be to have:
This is a very good docu-
ment.
So the minWordCountAfterHyphenation would be 4 (or 3) in this case.
Is this already possible or would it be an option for the future?
Thanks in advance!
Bye Defcon0
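A sketch of how such a limit could be applied after splitting, assuming the word is already available as a list of syllables. The function enforceEdgeLengths and its parameters are hypothetical, and the split ['do', 'cu', 'ment'] is assumed for the demo rather than taken from library output.

```php
<?php
/**
 * Merge syllables at both ends of a word until the first part has at least
 * $minLeft characters and the last part has at least $minRight characters.
 * Hypothetical helper for illustration, not part of phpSyllable.
 *
 * @param string[] $parts syllables of one word
 * @return string[] syllables with short edge parts merged into neighbours
 */
function enforceEdgeLengths(array $parts, int $minLeft, int $minRight): array
{
    // Merge from the left while the first part is too short.
    while (count($parts) > 1 && mb_strlen($parts[0], 'UTF-8') < $minLeft) {
        $first = array_shift($parts);
        $parts[0] = $first . $parts[0];
    }
    // Merge from the right while the last part is too short.
    while (count($parts) > 1 && mb_strlen($parts[count($parts) - 1], 'UTF-8') < $minRight) {
        $last = array_pop($parts);
        $parts[count($parts) - 1] .= $last;
    }
    return $parts;
}

// Assumed split of "document"; at least 4 characters must precede a hyphen.
$parts = enforceEdgeLengths(['do', 'cu', 'ment'], 4, 3);
$hyphenated = implode('-', $parts);
```

Words too short to satisfy both limits collapse into a single part and are not hyphenated at all.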
Is there a single include file that can be used to load all of the classes?
Hi Martijn,
for the automatic update of languages to work, this repository must be allowed to automatically merge PRs. It is explained here and can only be allowed by the administrator. Currently the "Update languages" workflow fails with "Pull request - Auto merge is not allowed for this repository".
Greetings
Alex
Hi there
Great project. Thank you!
Is it possible to only hyphenate words with more than X characters?
So with a min length of 10 characters it would be:
"Provide your own paragraphs..." -> "Provide your own pa-ra-graphs..."
Kind regards,
Nico
Hi @vanderlee ,
I was wondering if there is a way to update the language files. For example, the German language file /languages/hyph-de-1996.tex was last updated in 2016, and there is a 2021 version at its original location http://mirror.ctan.org/language/hyph-utf8/tex/generic/hyph-utf8/patterns/tex/hyph-de-1996.tex.
If this work was previously done by hand and you think a script might be helpful to update language files at the push of a button, I would be happy to provide one.
Thanks for this great package!
Alex
Hi Van der Lee,
I really appreciate your project, well done!
There's just one problem: whenever I try to use one of the functions of the class Syllable, it gives me this error:
Warning: file_put_contents(/Users/victor/Sites/pyglatin/phpSyllable-master/classes/../cache/syllable.en-us.json): failed to open stream: No such file or directory in /Users/victor/Sites/pyglatin/phpSyllable-master/classes/Syllable_Cache_FileAbstract.php on line 43
Hi Van der Lee,
Is it possible for any of your functions to return an array of arrays of syllables of each word?
Example: "I am working".
Output:
array(
0 => array(
0 => "I"
),
1 => array(
0 => "am"
),
2 => array(
0 => "work",
1 => "ing"
)
);
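A small sketch of how that structure could be produced, assuming a per-word splitter is available as a callable. The function splitWordsInText and the stand-in splitter are hypothetical; words are separated on whitespace only, so punctuation stays attached to its word.

```php
<?php
/**
 * Split a text into words and each word into syllables.
 * Hypothetical helper for illustration, not part of phpSyllable.
 *
 * @param callable $splitWord function(string $word): string[]
 * @return string[][] one array of syllables per word
 */
function splitWordsInText(string $text, callable $splitWord): array
{
    // Split on runs of whitespace and drop empty pieces.
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_map($splitWord, $words);
}

// Stand-in splitter for the demo: it only knows one word.
$fakeSplit = function (string $word): array {
    return $word === 'working' ? ['work', 'ing'] : [$word];
};
$result = splitWordsInText('I am working', $fakeSplit);
```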
Hi, thanks for this library.
I assume that the website https://syllable.toyls.com/ uses the same library underneath? If so, I'm not sure why I am getting different (incorrect) results on my end but correct results on the website.
I am using the following string:
Muchos años después, frente al pelotón de fusilamiento, el coronel Aureliano Buendía había de recordar aquella tarde remota en que su padre lo llevó a conocer el hielo. Macondo era entonces una aldea de veinte casas de barro y cañabrava construidas a la orilla de un río de aguas diáfanas que se precipitaban por un lecho de piedras pulidas, blancas y enormes como huevos prehistóricos. El mundo era tan reciente, que muchas cosas carecían de nombre, y para mencionarlas había que señalarlas con el dedo. Todos los años, por el mes de marzo, una familia de gitanos desarrapados plantaba su carpa cerca de la aldea, y con un grande alboroto de pitos y timbales daban a conocer los nuevos inventos. Primero llevaron el imán. Un gitano corpulento, de barba montaraz y manos de gorrión, que se presentó con el nombre de Melquíades, hizo una truculenta demostración pública de lo que él mismo llamaba la octava maravilla de los sabios alquimistas de Macedonia.
In code I am getting this result:
Mu-chos años de-s-pués, fre-n-te al pe-lo-tón de fu-si-la-mie-n-to, el co-ro-nel Au-re-liano Bue-n-día ha-bía de re-co-r-dar aque-lla ta-r-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-co-n-do era en-to-n-ces una al-dea de vei-n-te ca-sas de ba-rro y ca-ña-bra-va co-n-s-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, bla-n-cas y eno-r-mes co-mo hue-vos pre-hi-s-tó-ri-cos. El mu-n-do era tan re-cie-n-te, que mu-chas co-sas ca-re-cían de no-m-bre, y pa-ra me-n-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de ma-r-zo, una fa-mi-lia de gi-ta-nos des-arra-pa-dos pla-n-ta-ba su ca-r-pa ce-r-ca de la al-dea, y con un gra-n-de al-bo-ro-to de pi-tos y ti-m-ba-les da-ban a co-no-cer los nue-vos in-ve-n-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano co-r-pu-le-n-to, de ba-r-ba mo-n-ta-raz y ma-nos de go-rrión, que se pre-se-n-tó con el no-m-bre de Me-l-quía-des, hi-zo una tru-cu-le-n-ta de-mo-s-tra-ción pú-bli-ca de lo que él mi-s-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mi-s-tas de Ma-ce-do-nia.
Whereas on the website I am getting:
Mu-chos años des-pués, fren-te al pe-lo-tón de fu-si-la-mien-to, el co-ro-nel Au-re-liano Buen-día ha-bía de re-cor-dar aque-lla tar-de re-mo-ta en que su pa-dre lo lle-vó a co-no-cer el hie-lo. Ma-con-do era en-ton-ces una al-dea de vein-te ca-sas de ba-rro y ca-ña-bra-va cons-trui-das a la ori-lla de un río de aguas diá-fa-nas que se pre-ci-pi-ta-ban por un le-cho de pie-dras pu-li-das, blan-cas y enor-mes co-mo hue-vos prehis-tó-ri-cos. El mun-do era tan re-cien-te, que mu-chas co-sas ca-re-cían de nom-bre, y pa-ra men-cio-nar-las ha-bía que se-ña-lar-las con el de-do. To-dos los años, por el mes de mar-zo, una fa-mi-lia de gi-ta-nos desarra-pa-dos plan-ta-ba su car-pa cer-ca de la al-dea, y con un gran-de al-bo-ro-to de pi-tos y tim-ba-les da-ban a co-no-cer los nue-vos in-ven-tos. Pri-me-ro lle-va-ron el imán. Un gi-tano cor-pu-len-to, de bar-ba mon-ta-raz y ma-nos de go-rrión, que se pre-sen-tó con el nom-bre de Mel-quía-des, hi-zo una tru-cu-len-ta de-mos-tra-ción pú-bli-ca de lo que él mis-mo lla-ma-ba la oc-ta-va ma-ra-vi-lla de los sa-bios al-qui-mis-tas de Ma-ce-do-nia.
Notice that in code the hyphenations "Ma-co-n-do" and "Bue-n-día" are incorrect, whereas on the website they are correct.
Could you help? Am I doing something wrong on my end that's causing incorrect hyphenation?
I am simply doing:
(new Syllable( 'es', '-' ))->hyphenateText($string);
Totally buggy in French
$syllable = new Syllable('fr');
//etc...
$syllable->splitWord('constitution'); // returns Array ( [0] => constitution )
$syllable->splitWord('alphabet'); // returns Array ( [0] => alphabet ) instead of al-pha-bet
$syllable->splitWord('formation'); // returns Array ( [0] => for [1] => mation ) instead of for-ma-tion
Now, here is what I get when I set the language to Finnish with new Syllable('fi'):
Array ( [0] => cons [1] => ti [2] => tu [3] => tion )
Array ( [0] => alp [1] => ha [2] => bet )
Array ( [0] => for [1] => ma [2] => tion )
How can I fix that?
Looking into the data, the splitWord method in Syllable appears to be somewhat useless; it produces incorrect results at times.
When provided with the same single-word input with random punctuation, splitText produces the results expected for splitWord.
Propose to mark splitWord as deprecated and route to splitText internally.
Similarly, the proposed splitWords would be better based on splitText.
Is it possible to add some kind of whitelist and/or blacklist for certain HTML elements while processing the HTML to hyphenate?
Background: I am trying to use this on whole content areas of dynamically generated pages. Those pages may contain <script> elements, and without a whitelist/blacklist the hyphenation process destroys the JavaScript in them.
I could also write a PR for this feature. I guess it should be simple, as we would only need to modify the recursive HTML DOM walker.
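A minimal sketch of a recursive DOM walker with a blacklist, to illustrate the idea. The function transformTextNodes is hypothetical and not the library's walker; strtoupper stands in for the hyphenation step so the effect is easy to see, and the dom extension is assumed to be available.

```php
<?php
/**
 * Recursively apply $transform to text nodes, skipping excluded elements
 * together with everything inside them.
 * Hypothetical helper for illustration, not part of phpSyllable.
 *
 * @param string[] $excluded lowercase element names to leave untouched
 */
function transformTextNodes(DOMNode $node, callable $transform, array $excluded): void
{
    // Copy the child list first, because childNodes is a live list.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child instanceof DOMText) {
            $child->data = $transform($child->data);
        } elseif ($child instanceof DOMElement
            && !in_array(strtolower($child->nodeName), $excluded, true)) {
            transformTextNodes($child, $transform, $excluded);
        }
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<div><p>hello</p><script>var hello = 1;</script></div>');
transformTextNodes($dom->documentElement, 'strtoupper', ['script', 'style']);

$p = $dom->getElementsByTagName('p')->item(0)->textContent;
$script = $dom->getElementsByTagName('script')->item(0)->textContent;
```

A whitelist variant would invert the check and only descend into listed elements.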
Hi @vanderlee ,
these branches might be ready for deletion as they are already merged or outdated:
and these issues might be ready for closing as they are addressed in the current master branch:
and this pull request might be ready for closing as it is also addressed in the current master branch:
Greetings
Alex
PHP 7.4. I installed the composer version of the library.
I'm getting the "Array and string offset access syntax with curly braces is deprecated" warning for /vanderlee/syllable/classes/Syllable.php:483
Depending on the PHP version used, the cache version encoding in a JSON cache file may result in 1.399999999999 instead of 1.4. This is due to the internal handling of floating point numbers. It might be best to add the cache version as a string, as this is safe with JSON encoding.
I have observed the wrong behavior in Debian 11, PHP 7.4.33 (cli) (built: Feb 14 2023 18:01:29) ( NTS ).
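A sketch of the string-based version check, under assumptions: the payload layout and the names CACHE_VERSION, encodeCache and isCacheCurrent are hypothetical and may not match the real cache file. As far as I know, how a float is written by json_encode depends on the serialize_precision ini setting, which would explain why the result differs between environments; a string avoids that dependency.

```php
<?php
// Hypothetical cache payload; the real cache file layout may differ.
const CACHE_VERSION = '1.4'; // a string, so JSON stores it verbatim

/** Encode cache data together with the version string. */
function encodeCache(array $data): string
{
    return json_encode(['version' => CACHE_VERSION, 'data' => $data]);
}

/** A cache is current only if its version is exactly the expected string. */
function isCacheCurrent(string $json): bool
{
    $decoded = json_decode($json, true);
    return is_array($decoded)
        && isset($decoded['version'])
        && $decoded['version'] === CACHE_VERSION;
}

$json = encodeCache(['patterns' => []]);
```

With the strict comparison, older cache files that stored the version as a float are treated as stale and would simply be regenerated.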
The hyph-de.tex is 12 years old, unlike the more specific language files, which are 8 years old.
Also, hyph-de.tex is no longer available on https://tug.org/tex-hyphen/#languages or on the corresponding CTAN mirrors. Last but not least, the author of the hyphen patterns, Werner Lemberg, has confirmed that they once replaced the first general one with the more specific ones.
The question is technically whether the hyph-de.tex should be replaced by one of the others, since many projects might rely on it, or whether it should be removed completely?
Processing large texts is incredibly slow for me. Calling hyphenateHtml takes 25 seconds and more. Is this reasonable for text with roughly 3100 characters? If so, any idea how I could speed this up?
I should mention that I'm not quite sure that caching is working properly. Although files are successfully created in the directory I configured...
$syllable = new Syllable();
$syllable->getCache()->setPath('/app/syllable_cache');
$syllable->setLanguage('de-1996');
$someText = $syllable->hyphenateHtml($someText);
... it doesn't seem to make a significant difference. With and without cache files, processing time is approximately the same. Is there something else I have to do to activate the cache? How is cache invalidation triggered? My input text doesn't change.
I know that the TeX files contain the information needed to split the words, but I don't know how to fix it. Maybe you can do something with them.
Simple Hiatus
caoba -> cao-ba instead of ca-o-ba
saeta -> sae-ta instead of sa-e-ta
chiita -> chií-ta instead of chi-i-ta
Accentual hiatus
saúco -> saú-co instead of sa-ú-co.
sabía -> sa-bía instead of sa-bí-a.
Sporadic hiatus
píe -> píe instead of pí-e
río -> río instead of rí-o
azul -> azul instead of a-zul
etéreo -> eté-reo instead of e-té-re-o
iluminado -> ilu-mi-na-do instead of i-lu-mi-na-do
otero -> ote-ro instead of o-te-ro
uniforme -> uni-for-me instead of u-ni-for-me
ábaco -> áb-a-co instead of á-ba-co
emanciparse -> em-an-ci-par-se instead of e-man-ci-par-se
separar -> se-p-a-rar instead of se-pa-rar