Simple Python 3 class for matching strings whose letters only look
like the letters of the original string. unicode.org provides a nice list of
"confusable" characters.
This class uses that list to turn a string into a regular expression
pattern that covers all of these confusable variations.
E.g. "đâŽđĨ1āŗĻ" matches "Hello"
"Hello" gets turned into the following regex of character classes:
[H\īŧ¨\â\â\â\đ\đģ\đ¯\đ\đŗ\đ§\đ\đ\đ\đˇ\Î\đŽ\đ¨\đĸ\đ\đ\â˛\Đ\áģ\áŧ\ę§\đ\⹧\Ōĸ\ÄĻ\Ķ\Ķ]
[e\âŽ\īŊ
\â¯\â
\đ\đ\đ\đŽ\đĸ\đ\đ\đž\đ˛\đĻ\đ\đ\ęŦ˛\Đĩ\ŌŊ\É\Ōŋ]
[l\â\|\âŖ\âŊ\īŋ¨1\â\Ûą\đ \â\đ\đ\đŖ\đ\đˇI\īŧŠ\â
\â\â\đ\đŧ\đ°\đ\đ\đ´\đ¨\đ\đ\đ\đ¸\Æ\īŊ\â
ŧ\â\đĨ\đ\đ\đ\đĩ\đŠ\đ\đ\đ
\đš\đ\đĄ\đ\Į\Î\đ°\đĒ\đ¤\đ\đ\â˛\Đ\Ķ\â\â\â\â\â\â\â\â\âĩ\á\ę˛\đŧ¨\đ\đ\â\â\Å\É\Æ\Æ\ÉĢ\â\â\â\â\Å\Äŋ\áˇ\đ\â\â\â\ãĢ\ã\ã¤\â\ãŦ\ãĨ\â\ã\ãĻ\â\ãŽ\ã§\â\ã¯\ã¨\â\ã°\ãŠ\â\ãą\ãĒ\â\ã˛\ãĢ\Į\IJ\â\âĨ\â
Ą\Į\â\đ\â\â
ĸ\đ\ãĒ\ã\ãŖ\ĐŽ\â\ãŠ\ã\ãĸ\ĘĒ\âļ\â
Ŗ\â
¨\ÉŽ\ĘĢ\ã \ã\ã]
[l\â\|\âŖ\âŊ\īŋ¨1\â\Ûą\đ \â\đ\đ\đŖ\đ\đˇI\īŧŠ\â
\â\â\đ\đŧ\đ°\đ\đ\đ´\đ¨\đ\đ\đ\đ¸\Æ\īŊ\â
ŧ\â\đĨ\đ\đ\đ\đĩ\đŠ\đ\đ\đ
\đš\đ\đĄ\đ\Į\Î\đ°\đĒ\đ¤\đ\đ\â˛\Đ\Ķ\â\â\â\â\â\â\â\â\âĩ\á\ę˛\đŧ¨\đ\đ\â\â\Å\É\Æ\Æ\ÉĢ\â\â\â\â\Å\Äŋ\áˇ\đ\â\â\â\ãĢ\ã\ã¤\â\ãŦ\ãĨ\â\ã\ãĻ\â\ãŽ\ã§\â\ã¯\ã¨\â\ã°\ãŠ\â\ãą\ãĒ\â\ã˛\ãĢ\Į\IJ\â\âĨ\â
Ą\Į\â\đ\â\â
ĸ\đ\ãĒ\ã\ãŖ\ĐŽ\â\ãŠ\ã\ãĸ\ĘĒ\âļ\â
Ŗ\â
¨\ÉŽ\ĘĢ\ã \ã\ã]
[o\ā°\ā˛\ā´\āļ\āĨĻ\āŠĻ\āĢĻ\ā¯Ļ\āąĻ\āŗĻ\āĩĻ\āš\āģ\á\â\Ûĩ\īŊ\â´\đ¨\đ\đ\đ¸\đŦ\đ \đ\đ\đŧ\đ°\đ¤\đ\á´\á´\ęŦŊ\Îŋ\đ\đ\đ\đž\đ¸\Ī\đ\đ\đ\đ\đŧ\â˛\Đž\áŋ\Ö
\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\â\ā´ \á\đĒ\đŖ\đŖ\đŦ\â\ø\ęŦž\Éĩ\ę\ĶŠ\Ņŗ\ęŽ\ęŽģ\ę´\â\ÆĄ\Å\Éļ\â\ę\ę\āĩ\á]
Note: Some of the characters above may not render correctly in your browser.
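
The conversion described above, one character class per input character, can be sketched roughly as follows. This is a minimal illustration and not the class itself: the `CONFUSABLES` table and the `confusable_pattern` function are hypothetical names, and the table is a tiny hand-picked subset rather than the full unicode.org list. Non-ASCII characters are written as `\u` escapes so the example survives rendering problems.

```python
import re

# Hypothetical, hand-picked subset of look-alikes (an assumption for this
# sketch); the real list published by unicode.org is far larger.
CONFUSABLES = {
    "H": ["\u0397", "\u041d"],  # Greek capital Eta, Cyrillic capital En
    "e": ["\u0435"],            # Cyrillic small Ie
    "l": ["1", "I", "|"],       # digit one, capital I, vertical bar
    "o": ["\u03bf", "\u043e"],  # Greek small omicron, Cyrillic small o
}


def confusable_pattern(text):
    """Return a regex string with one character class per input character."""
    parts = []
    for ch in text:
        # The original character always stays in its own class.
        variants = [ch] + CONFUSABLES.get(ch, [])
        # Escape each variant so characters like "|" are taken literally.
        parts.append("[" + "".join(re.escape(v) for v in variants) + "]")
    return "".join(parts)


pattern = re.compile(confusable_pattern("Hello"))

# Cyrillic En, Cyrillic Ie, "1", "1", Greek omicron
lookalike = "\u041d\u043511\u03bf"
print(pattern.fullmatch(lookalike) is not None)
```

Characters without an entry in the table simply become a single-character class, so the pattern still matches the plain original string.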
It is probably best to also remove accents from the text being searched before matching. Several ways to do that are explained here: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string
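
One common approach to accent removal uses only the standard library: decompose the text with `unicodedata.normalize` and drop the combining marks. The sketch below assumes that this is an acceptable level of cleanup; `strip_accents` is a hypothetical helper name, not part of the class.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks (accents) after canonical decomposition."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns 0 for base characters, non-zero for marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("H\u00e9ll\u00f4"))  # Hello
```

Letters that carry a stroke rather than a combining accent have no canonical decomposition, so they pass through this helper unchanged.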