Normalizing allomorphs (?) so that BPE can find identical morphemes across words.

Running morphological analyzers in English and Japanese about bpe_analysis HOT 13 OPEN

justhalf commented on August 18, 2024

Running morphological analyzers in English and Japanese

from bpe_analysis.

Comments (13)

notani commented on August 18, 2024 1

How about this?
https://github.com/knowitall/morpha

justhalf commented on August 18, 2024

Should we keep the plus sign?
In my Indonesian morphology normalizer script I didn't. Also, are you working on this?

notani commented on August 18, 2024

> Should we keep the plus sign?

No, the + is just for illustration purposes.

I will do English and Japanese normalization. Were you also working on this?

justhalf commented on August 18, 2024

Yes, the script for Indonesian is at https://github.com/justhalf/bpe_analysis/blob/master/morphind/process_txt.py
It uses MorphInd as the morphological analyzer.

> No, the + is just for illustration purposes.

I asked because in the Indonesian one I explicitly remove +. I think we should remove the plus sign, yeah.
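A minimal sketch of what removing the plus sign could look like, assuming the analyzer emits normalized morphemes joined by `+` (as in `un+test+able+ly`). The function name `join_morphemes` and the `sep` parameter are hypothetical and are not taken from `process_txt.py`; whether morphemes should be concatenated or kept as separate tokens is left open here.

```python
def join_morphemes(segmented, sep=""):
    """Drop the '+' boundary markers from a segmented word.

    `segmented` is assumed to look like "un+test+able+ly".
    `sep` is a hypothetical knob: "" concatenates the normalized
    morphemes, " " keeps them as separate tokens.
    Note: a token that is literally "+" would be reduced to "".
    """
    morphemes = [m for m in segmented.split("+") if m]
    return sep.join(morphemes)


if __name__ == "__main__":
    print(join_morphemes("un+test+able+ly"))
    print(join_morphemes("un+test+able+ly", sep=" "))
```

With the default separator the normalized word keeps the canonical spelling of each morpheme, so BPE sees the same character sequence for a morpheme regardless of the allomorph used in the original word.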

notani commented on August 18, 2024

Did you start English and Japanese normalization, too?

justhalf commented on August 18, 2024

> Did you start English and Japanese normalization, too?

No, I haven't started. I didn't know which morphological analyzer to use. But once we have analyzers for them, we can simply replace the subprocess call in the Indonesian script with the corresponding call.
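One way to make the analyzer swappable is to pass the command line in as a parameter, so that only the command changes per language. This is a sketch under assumptions: `run_analyzer` and `ECHO_UPPER` are hypothetical names, the analyzer is assumed to read text on stdin and write its analysis to stdout, and the stand-in command below is not a real morphological analyzer.

```python
import subprocess
import sys


def run_analyzer(text, command):
    """Pipe `text` through an external analyzer and return its stdout.

    `command` is the argument list of the analyzer to run. The actual
    command for MorphInd or any other tool is not shown here; it has
    to be filled in by whoever wires up the language.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the analyzer fails
    )
    return result.stdout


# Stand-in "analyzer" for demonstration only: it upper-cases its input.
ECHO_UPPER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

if __name__ == "__main__":
    print(run_analyzer("happiness\n", ECHO_UPPER))
```

Supporting a new language then means supplying a different `command` and, if needed, a different parser for that tool's output format.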

justhalf commented on August 18, 2024

In this paper it says the lexicon (19MB) is large:

Does anyone know a good English morphological analyzer? I was surprised to find none, only Morfessor, which is automatic (unsupervised).

justhalf commented on August 18, 2024

Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

notani commented on August 18, 2024

> Maybe this one? http://wiki.apertium.org/wiki/Lttoolbox

Does this output surface forms of morphemes?

This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
https://github.com/ryancotterell/treeseg

We can find similar studies by searching for "morphological segmentation" rather than "morphological analysis".

justhalf commented on August 18, 2024

Based on my cursory look, it seems so.

> This statistical morphological segmenter can generate normalized surface forms like un+test+able+ly:
> https://github.com/ryancotterell/treeseg

That's a good one, since it is modern. I was looking for something that relies more on manual analysis and is less automatic, e.g., an FST, but I couldn't find an FST for English.

justhalf commented on August 18, 2024

That looks good. Do you have one for Japanese as well? (I guess we don't need this for Chinese?)

notani commented on August 18, 2024

Fortunately, Japanese segmentation by UDPipe is already morpheme segmentation and has normalized forms. I think we don't need normalization for Chinese.
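If the UDPipe output is saved in CoNLL-U format, the normalized forms are in the LEMMA column (the third tab-separated field). The following is a rough sketch of extracting them; `lemmas_from_conllu` is a hypothetical helper, and the sample below is hand-written for illustration rather than actual UDPipe output.

```python
def lemmas_from_conllu(conllu_text):
    """Return one list of lemmas per sentence from CoNLL-U text.

    CoNLL-U columns are tab-separated: ID, FORM, LEMMA, UPOS, ...
    Comment lines start with '#', and a blank line ends a sentence.
    """
    sentences = []
    current = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 3:
            continue  # malformed line, skip it
        token_id, form, lemma = cols[0], cols[1], cols[2]
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if "-" in token_id or "." in token_id:
            continue
        # Fall back to the surface form when the lemma is unspecified.
        current.append(lemma if lemma != "_" else form)
    if current:
        sentences.append(current)
    return sentences


# Hand-written illustrative sample, not real UDPipe output.
SAMPLE = (
    "# text = 食べた\n"
    "1\t食べ\t食べる\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tた\tた\tAUX\t_\t_\t1\taux\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for sentence in lemmas_from_conllu(SAMPLE):
        print(" ".join(sentence))
```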

Can you do English normalization?

justhalf commented on August 18, 2024

I am trying. Morpha apparently handles only noun plurals and verb inflections, not derivation, so "happiness" stays as is.
