codebox / homoglyph

A big list of homoglyphs and some code to detect them

Home Page: https://codebox.net/pages/homoglyph-detection

License: MIT License

Python 4.72% JavaScript 84.63% Java 9.83% HTML 0.83%
homoglyphs

homoglyph's People

Contributors

codebox, emschorsch, jlleitschuh, phax, vilisimo

homoglyph's Issues

PHP counterpart?

Hello there,

Do you know of a PHP counterpart to this project? It is very useful, and I'm looking for something like it for a new project of mine.

Thanks in advance!

no matches, but characters are in `chars.txt` ? [java]

Hi, first off, amazing library! Thanks for putting this together.

The test code from the README works perfectly, but I have an issue when trying it with some real user input:

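import java.io.IOException;
import java.util.List;

import net.codebox.homoglyph.Homoglyph;
import net.codebox.homoglyph.HomoglyphBuilder;

class Scratch {
    public static void main(String[] args) throws IOException {
        // Input written entirely in Canadian Syllabics look-alikes of "CLUB"
        String textToSearch = "ᑕᒪᑌᗷ";
        String[] bannedWords = new String[]{"club"};
        Homoglyph homoglyph = HomoglyphBuilder.build();
        List<Homoglyph.SearchResult> results = homoglyph.search(textToSearch, bannedWords);
        // Prints 0, although a match was expected
        System.out.println(results.size());
    }
}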

(In case it's not visible on GitHub, the letters should be: \N{CANADIAN SYLLABICS TA}\N{CANADIAN SYLLABICS MA}\N{CANADIAN SYLLABICS TE}\N{CANADIAN SYLLABICS CARRIER KHE})

This returns 0 matches, even though I can find all the characters in https://github.com/codebox/homoglyph/blob/master/raw_data/chars.txt

What am I missing?
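A quick way to narrow this down is to check which line of the data file each character actually sits on, since a character can be present in the file but grouped with a different base letter than the one being searched for. The sketch below is a diagnostic helper only, not part of the library. It assumes one homoglyph set per line and `#` for comment lines, and the sample lines are hypothetical stand-ins for the real file.

```python
def load_sets(lines):
    """Parse lines into a list of character sets (assumed format: one set per line)."""
    sets = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and (assumed) comment lines
        sets.append(set(line))
    return sets


def sets_containing(ch, sets):
    """Return the indices of every set that contains the character."""
    return [i for i, s in enumerate(sets) if ch in s]


# Hypothetical sample data, not copied from chars.txt
sample_lines = ["# comment", "c\u03f2", "C\u1455"]
sample_sets = load_sets(sample_lines)

for ch in "c\u1455":
    print(repr(ch), sets_containing(ch, sample_sets))
```

Running this against the real file for each input character and each letter of the banned word shows whether they share a line at all.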

Homoglyph#search() is not symmetric for 1 & i characters

Expectation

homoglyph.search("1", "i") and homoglyph.search("i", "1") each return 1 result, because the two characters are homoglyphs of each other.

Reality

homoglyph.search("1", "i") returns 1 search result '1' at position 0 matches 'i'.
homoglyph.search("i", "1") returns nothing.

As you can see in the test below, the same expectation holds for O & 0.

Test

import io.kotest.assertions.assertSoftly
import io.kotest.core.spec.style.DescribeSpec
import io.kotest.matchers.collections.shouldHaveSize
import net.codebox.homoglyph.HomoglyphBuilder

class HomoglyphTest : DescribeSpec({
    describe("Homoglyph.search() is symmetric") {
        it("for characters 1 & i") {
            // Arrange
            val homoglyph = HomoglyphBuilder.build()

            // Act
            val actual1 = homoglyph.search("1", "i")
            val actual2 = homoglyph.search("i", "1")

            // Assert
            assertSoftly {
                actual1 shouldHaveSize 1
                actual2 shouldHaveSize 1
            }
        }

        it("for characters 0 & O") {
            // Arrange
            val homoglyph = HomoglyphBuilder.build()

            // Act
            val actual1 = homoglyph.search("0", "O")
            val actual2 = homoglyph.search("O", "0")

            // Assert
            assertSoftly {
                actual1 shouldHaveSize 1
                actual2 shouldHaveSize 1
            }
        }
    }
})
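For comparison, here is a minimal sketch of a lookup that is symmetric by construction: every character in a set is indexed against the whole set, so the match does not depend on which character is treated as the text and which as the target. This does not describe how the library builds its index. The group strings are hypothetical sample data.

```python
from collections import defaultdict


def build_symmetric_index(groups):
    """Map every character to the union of all groups it appears in."""
    index = defaultdict(set)
    for group in groups:
        for ch in group:
            index[ch].update(group)
    return index


def matches(index, text_char, target_char):
    """True if the two characters are equal or share at least one group."""
    return text_char == target_char or target_char in index.get(text_char, ())


# Hypothetical groups, for illustration only
index = build_symmetric_index(["1i", "0O"])
print(matches(index, "1", "i"), matches(index, "i", "1"))
```

An index keyed only by the first character of each group would answer one direction and miss the other, which is one general way an asymmetry like this can appear.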

Ignore whitespace in npm search() function

It would be nice if the search() function in the npm package accepted some options, in particular one to ignore whitespace when searching. I know this is easy to do on my end, and I am currently doing so, but it would be a nice feature to have as part of the package.
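The client-side workaround can be sketched as follows: strip whitespace before searching, but remember where each kept character came from so that hits can be reported against the original text. This is an illustration in Python rather than the npm package's code, and plain `str.find` stands in for the homoglyph-aware search.

```python
def strip_whitespace(text):
    """Return the text without whitespace plus the original index of each kept char."""
    kept, positions = [], []
    for i, ch in enumerate(text):
        if not ch.isspace():
            kept.append(ch)
            positions.append(i)
    return "".join(kept), positions


def find_ignoring_whitespace(text, word):
    """Find word in text ignoring whitespace; return (start, end) in original indices."""
    if not word:
        return []
    compact, positions = strip_whitespace(text)
    hits = []
    start = compact.find(word)
    while start != -1:
        end = start + len(word) - 1
        hits.append((positions[start], positions[end]))
        start = compact.find(word, start + 1)
    return hits


print(find_ignoring_whitespace("c r e d i t now", "credit"))
```

Keeping the position map is the part that is easy to forget: without it, the reported indices refer to the stripped string and no longer line up with the user's input.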

Multi-char homoglyphs?

Thank you for providing a dataset for homoglyphs. I notice that Unicode's confusables list includes multi-letter homoglyphs as well. I envisage converting your list into JSON or some sort of comma-separated format to accommodate typical cases (the proverbial rn / m and cl / d).
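As a rough idea of what such a format could look like, the sketch below stores each base character with a list of multi-character look-alikes and serializes it as JSON. The table layout and the function names are hypothetical. They are not an existing format in this repository.

```python
import json

# Hypothetical layout: base character -> multi-character sequences that resemble it
MULTI_CHAR = {"m": ["rn"], "d": ["cl"]}


def to_json(table):
    """Serialize the table, keeping non-ASCII characters readable."""
    return json.dumps(table, ensure_ascii=False, sort_keys=True)


def collapse(text, table):
    """Replace each look-alike sequence with its base character, longest first."""
    pairs = sorted(
        ((seq, base) for base, seqs in table.items() for seq in seqs),
        key=lambda pair: -len(pair[0]),
    )
    for seq, base in pairs:
        text = text.replace(seq, base)
    return text


print(to_json(MULTI_CHAR))
print(collapse("rnoclern", MULTI_CHAR))
```

Note that blind replacement also rewrites legitimate text (for example, "corn" becomes "com"), so a real detector would more likely use the table to generate candidate spellings of a target word than to normalize the input.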

Merge 'x' lists

Should the following two generated lists be merged?

x×хᕁᕽ᙮ⅹ⤫⤬⨯x𝐱𝑥𝒙𝓍𝔁𝔵𝕩𝖝𝗑𝘅𝘹𝙭𝚡
XΧХ᙭ᚷⅩ╳ⲬⵝꓫꞳX𐊐𐊴𐌗𐌢𐔧𑣬𝐗𝑋𝑿𝒳𝓧𝔛𝕏𝖃𝖷𝗫𝘟𝙓𝚇𝚾𝛸𝜲𝝬𝞦
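If merging is wanted, it could be done as a post-processing step over the generated lists. The sketch below assumes that the first character of each list is its base letter and merges lists whose base letters differ only by case. The sample strings are shortened, hypothetical versions of the two lists above.

```python
def merge_by_case(groups):
    """Merge groups whose first character is the same letter ignoring case."""
    merged = {}
    for group in groups:
        key = group[0].lower()  # assumed: the first character is the base letter
        bucket = merged.setdefault(key, [])
        for ch in group:
            if ch not in bucket:  # keep order, drop duplicates
                bucket.append(ch)
    return ["".join(chars) for chars in merged.values()]


# Shortened sample groups: x, multiplication sign, Cyrillic kha / X, Greek Chi, Cyrillic Kha
groups = ["x\u00d7\u0445", "X\u03a7\u0425", "o0"]
print(merge_by_case(groups))
```

The trade-off is that a merged list can no longer tell an upper-case look-alike from a lower-case one, which matters if matching is meant to be case-sensitive.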

API request: String Homoglyph.toASCII(String)

Please provide a toASCII API that tries to map each character to a look-alike in the ASCII range and returns the resulting string. For example, the following should hold true:

Homoglyph homoglyph = HomoglyphBuilder.build();
assertEquals("The quick brown fox jumps over the lazy dog", 
    homoglyph.toASCII("Τһе ԛυіϲκ Ьгоѡɴ ғох јυⅿрѕ оⅴег τһе ⅼаzу ԁоɡ"));

This is useful in scenarios where we want to run complex regex rules on an (approximate) ASCII representation. Building an equivalent regex tree with the Homoglyph.search() API is not convenient, at least in certain cases.
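Until something like this exists, the idea can be approximated outside the library. The sketch below is written in Python, not against the Java API, and assumes the first character of each homoglyph set is its base character. It builds a reverse map from each look-alike to an ASCII base and leaves unknown characters unchanged. The names and sample sets are hypothetical.

```python
def build_ascii_map(groups):
    """Map every non-base character to its group's base, if that base is ASCII."""
    mapping = {}
    for group in groups:
        base = group[0]  # assumed: the first character is the base
        if base.isascii():
            for ch in group[1:]:
                mapping.setdefault(ch, base)  # first group wins on conflicts
    return mapping


def to_ascii(text, mapping):
    """Replace known look-alikes with their ASCII base; keep everything else."""
    return "".join(mapping.get(ch, ch) for ch in text)


# Sample sets: T + Greek Tau, h + Cyrillic shha, e + Cyrillic ie
ascii_map = build_ascii_map(["T\u03a4", "h\u04bb", "e\u0435"])
print(to_ascii("\u03a4\u04bb\u0435 fox", ascii_map))
```

A character that belongs to more than one set is ambiguous, so the result is only an approximation. Here the first set that mentions the character decides.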

lowercase L to capital i homoglyph

In the generated confusables table there is a conversion from lowercase L to capital I, for example in Intel vs lntel. I couldn't figure out how to modify the generator to include this homoglyph. Any ideas?
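One option that avoids touching the generator is to post-process its output and force a given pair into the same set. The sketch below makes no assumption about the generator's internals. It treats each set as a plain string, and `add_pair` is a hypothetical helper name.

```python
def add_pair(groups, a, b):
    """Merge every group containing a or b into one group that holds both."""
    touched = [g for g in groups if a in g or b in g]
    rest = [g for g in groups if g not in touched]
    merged = []
    for ch in "".join(touched) + a + b:
        if ch not in merged:  # keep order, drop duplicates
            merged.append(ch)
    return rest + ["".join(merged)]


# Hypothetical sample groups: l/1, I/Greek Iota, o/0
groups_before = ["l1", "I\u0399", "o0"]
print(add_pair(groups_before, "l", "I"))
```

Merging is transitive, so everything that looked like lowercase l now also counts as a look-alike of capital I, which may or may not be the desired behaviour.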
