codebox / homoglyph
A big list of homoglyphs and some code to detect them
Home Page: https://codebox.net/pages/homoglyph-detection
License: MIT License
There are other libraries, such as https://github.com/life4/homoglyphs, that do something similar, but there is no standard for what constitutes a "homoglyph".
Similar character tables for reference: https://github.com/life4/homoglyphs/tree/master/homoglyphs
Hello there,
Do you know of a PHP counterpart to this project? It is indeed very useful, and I'm looking for something like it for a new project of mine.
Thanks in advance!
The method HomoglyphBuilder.build should close the created BufferedReader in a finally block.
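A minimal sketch of the kind of fix being requested, assuming the builder reads its character table line by line through a BufferedReader. The class and method names here (ReaderClosingSketch, readLines) are hypothetical and are not the library's actual code. A try-with-resources statement closes the reader even when reading throws, which has the same effect as closing it in a finally block.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the homoglyph library.
final class ReaderClosingSketch {

    // Reads every line from the source. The BufferedReader (and the Reader it
    // wraps) is closed automatically, even if readLine() throws.
    static List<String> readLines(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }
}
```

On Java versions without try-with-resources, the same guarantee would need an explicit finally block that calls close() on the reader.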
Hi, first off, amazing library! Thanks for putting this together.
The test code from the README works perfectly, but I have an issue when trying it with some real user input:
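import java.io.IOException;
import java.util.List;
import net.codebox.homoglyph.Homoglyph;
import net.codebox.homoglyph.HomoglyphBuilder;

class Scratch {
    public static void main(String[] args) throws IOException {
        // Four Canadian Syllabics characters that visually resemble "CLUB"
        String textToSearch = "ᑕᒪᑌᗷ";
        String[] bannedWords = new String[]{"club"};
        Homoglyph homoglyph = HomoglyphBuilder.build();
        List<Homoglyph.SearchResult> results = homoglyph.search(textToSearch, bannedWords);
        System.out.println(results.size());
    }
}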
(In case it's not visible on GitHub, the letters should be: \N{CANADIAN SYLLABICS TA}\N{CANADIAN SYLLABICS MA}\N{CANADIAN SYLLABICS TE}\N{CANADIAN SYLLABICS CARRIER KHE})
This returns 0 matches, even though I can find all the characters in https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt
What am I missing?
Expected: homoglyph.search("1", "i") and homoglyph.search("i", "1") should each return 1 result, because the two characters are homoglyphs.
Actual: homoglyph.search("1", "i") returns 1 search result: '1' at position 0 matches 'i'. homoglyph.search("i", "1") returns nothing.
As the test below shows, the same expectation holds for O and 0.
import io.kotest.assertions.assertSoftly
import io.kotest.core.spec.style.DescribeSpec
import io.kotest.matchers.collections.shouldHaveSize
import net.codebox.homoglyph.HomoglyphBuilder

class HomoglyphTest : DescribeSpec({
    describe("Homoglyph.search() is symmetric") {
        it("for characters 1 & i") {
            // Arrange
            val homoglyph = HomoglyphBuilder.build()
            // Act
            val actual1 = homoglyph.search("1", "i")
            val actual2 = homoglyph.search("i", "1")
            // Assert
            assertSoftly {
                actual1 shouldHaveSize 1
                actual2 shouldHaveSize 1
            }
        }
        it("for characters 0 & O") {
            // Arrange
            val homoglyph = HomoglyphBuilder.build()
            // Act
            val actual1 = homoglyph.search("0", "O")
            val actual2 = homoglyph.search("O", "0")
            // Assert
            assertSoftly {
                actual1 shouldHaveSize 1
                actual2 shouldHaveSize 1
            }
        }
    }
})
It would be nice if the search() function in the npm package accepted some options. For example, it would be useful to have the search ignore whitespace. I know this is easy to do on my end, and I am currently doing so, but it would definitely be a nice feature to have as part of the package!
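The caller-side workaround mentioned above could look roughly like the following. This is only a sketch, written in Java to match the other examples on this page rather than in JavaScript, and the class WhitespaceStripper is hypothetical and not part of any package. It removes whitespace before searching and remembers where each remaining character came from, so that match positions can be mapped back to the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-processing helper; not part of the npm package or the Java library.
final class WhitespaceStripper {
    final String stripped;                    // input text with all whitespace removed
    private final List<Integer> originalIdx;  // originalIdx.get(i) = input index of stripped.charAt(i)

    WhitespaceStripper(String text) {
        StringBuilder sb = new StringBuilder();
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c)) {
                sb.append(c);
                idx.add(i);
            }
        }
        this.stripped = sb.toString();
        this.originalIdx = idx;
    }

    // Maps a match position in the stripped text back to the original text.
    int toOriginalIndex(int strippedIndex) {
        return originalIdx.get(strippedIndex);
    }
}
```

The stripped field would be passed to the search in place of the raw input, and toOriginalIndex would translate any reported match position afterwards.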
Thank you for providing a dataset for homoglyphs. I notice that Unicode's confusables list has multi-letter homoglyphs as well. I envisage converting your list into JSON or some sort of comma-separated format to accommodate typical cases (the proverbial rn / m and cl / d).
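To illustrate what handling multi-letter cases might involve, here is a rough sketch that collapses the two sequences named above into the single letters they imitate. The class MultiLetterConfusables and its two-entry table are assumptions made for this example; they do not reflect the format of the project's data files or of Unicode's confusables data.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example; the table holds only the two pairs mentioned in the text.
final class MultiLetterConfusables {
    private static final Map<String, String> SEQUENCES = new LinkedHashMap<>();
    static {
        SEQUENCES.put("rn", "m");
        SEQUENCES.put("cl", "d");
    }

    // Replaces every occurrence of a known multi-letter sequence with the
    // single letter it can be mistaken for.
    static String collapse(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : SEQUENCES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

Blind replacement also rewrites legitimate words (collapse("corn") yields "com"), so the collapsed string is better treated as an extra candidate to check alongside the original text than as a replacement for it.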
Should the following two generated lists be merged?
x×хᕁᕽ᙮ⅹ⤫⤬⨯x𝐱𝑥𝒙𝓍𝔁𝔵𝕩𝖝𝗑𝘅𝘹𝙭𝚡
XΧХ᙭ᚷⅩ╳ⲬⵝꓫꞳX𐊐𐊴𐌗𐌢𐔧𑣬𝐗𝑋𝑿𝒳𝓧𝔛𝕏𝖃𝖷𝗫𝘟𝙓𝚇𝚾𝛸𝜲𝝬𝞦
Please provide a toASCII API that maps each character to its closest equivalent in the ASCII range and returns the resulting string. For example, the following should hold true:
Homoglyph homoglyph = HomoglyphBuilder.build();
assertEquals("The quick brown fox jumps over the lazy dog",
homoglyph.toASCII("Τһе ԛυіϲκ Ьгоѡɴ ғох јυⅿрѕ оⅴег τһе ⅼаzу ԁоɡ"));
This is useful in scenarios where we want to run complex regex rules on an (approximate) ASCII representation of the text. Building the equivalent of a complex regex with the Homoglyph.search() API is not convenient, at least in certain cases.
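A minimal sketch of what such a method might do, assuming a lookup table from confusable code points to ASCII characters. The class AsciiFoldingSketch, its toAscii method, and the five-entry table are hypothetical. A real implementation would presumably build the table from the project's homoglyph data and would need a policy for characters that resemble more than one ASCII letter.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the library's API. The table is a tiny hand-picked sample.
final class AsciiFoldingSketch {
    private static final Map<Integer, Character> TO_ASCII = new HashMap<>();
    static {
        TO_ASCII.put(0x0430, 'a'); // CYRILLIC SMALL LETTER A
        TO_ASCII.put(0x0435, 'e'); // CYRILLIC SMALL LETTER IE
        TO_ASCII.put(0x043E, 'o'); // CYRILLIC SMALL LETTER O
        TO_ASCII.put(0x0440, 'p'); // CYRILLIC SMALL LETTER ER
        TO_ASCII.put(0x0455, 's'); // CYRILLIC SMALL LETTER DZE
    }

    // Replaces each mapped code point with its ASCII look-alike and leaves
    // every other code point unchanged.
    static String toAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            Character ascii = TO_ASCII.get(cp);
            if (ascii != null) {
                out.append(ascii.charValue());
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

The folded string can then be handed to ordinary java.util.regex patterns. Iterating over code points rather than chars keeps supplementary-plane characters, such as the mathematical alphanumerics in the lists above, intact.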
In the generated confusables table there is no conversion from lowercase L to capital I, as in, for example, Intel vs lntel. I couldn't figure out how to modify the generator to include this homoglyph. Any ideas?