
A component-based CJK character search engine

Home Page: https://radically.bryankok.com

License: GNU General Public License v2.0

cjk chinese-characters japanese-characters pwa react offline-app search-algorithm

radically's Introduction

Radically


Lookup Preprocessing Algorithm

A basic implementation of a lookup system from constituent radical to character is simple: depth-first search starting from the root node, adding every character encountered in every IDS string to a hash table / dictionary / similar data structure. Indeed, this is done in this project.

車 -> 蓮,櫣,輿,轉 ...
车 -> 莲,舆,转 ...
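That DFS-and-hash-table indexing step can be sketched as follows, over a hypothetical decomposition table (the table and function names here are illustrative stand-ins, not the project's actual code):

```typescript
// Hypothetical decomposition table: character -> constituent characters
// drawn from its IDS string(s). Assumes the data is acyclic.
const decompositions: Record<string, string[]> = {
  "蓮": ["艹", "連"],
  "連": ["辶", "車"],
  "艹": ["十", "丨"],
};

// DFS from each character, attributing it to every component reached,
// to build the component -> characters index described above.
function buildForwardMap(
  decomp: Record<string, string[]>
): Record<string, Set<string>> {
  const forward: Record<string, Set<string>> = {};
  const visit = (root: string, char: string): void => {
    for (const component of decomp[char] ?? []) {
      (forward[component] ??= new Set()).add(root);
      visit(root, component);
    }
  };
  for (const char of Object.keys(decomp)) visit(char, char);
  return forward;
}
```

With this toy data, `buildForwardMap(decompositions)["車"]` contains both 蓮 and 連, mirroring the 車 -> 蓮, ... example above.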

The complication arises if we want to include radical frequency in our search - e.g. "find all characters with at least 3 occurrences of 人" should return 傘, 众, and 齒, but not 从, 纵, 齿, etc.
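On the query side, that frequency-aware search then reduces to checking each character's precomputed frequency sets (a sketch; the function name and data shape are illustrative):

```typescript
type Freqs = Record<string, number>;

// A character matches "at least n occurrences of radical r" if ANY of its
// decomposition paths yields r with a count of n or more.
function hasAtLeast(charFreqs: Freqs[], radical: string, n: number): boolean {
  return charFreqs.some((freqs) => (freqs[radical] ?? 0) >= n);
}
```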

Decomposition of 蓮

  • There may be more than one IDS string, and thus more than one decomposition, per character.
  • The total number of characters in each head-to-leaf-node path varies, as does the number of unique characters. We may (and I have!) run into a character with a very elaborate decomposition down one branch, and a decomposition that terminates early down another1.

The approach taken is to generate a list of the powersets2 of radicals, one for every possible head-to-leaf-node path, EXCLUDING THE ROOT ITSELF (important!), e.g. for 蓮 the expected output would be

[
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 1, '丨': 1 },
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 2 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 1, '丨': 1 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 2 }
]

where there are 4 sets, or rather hash tables, depicted by the 4 different colors above. The one in green corresponds to the 1st item in the list.

Define a recursive function, rec(char), where char is a single character and the output is as described above.

Taking the left branch from the root, 艹 can be decomposed into either 十十 or 十丨, hence rec(艹) -> [ { '十': 1, '丨': 1 }, { '十': 2 } ], and similarly, rec(連) -> [ { '辶': 1, '車': 1 } ].

The key observation is that for each character in an IDS string, we must pick 1 set from its list of sets before merging 1 set per character into a bigger set, to avoid double-counting radicals. Ergo, we must find the Cartesian product of the lists of powersets of the constituent characters, then merge each resulting n-tuple.

Hence, the powersets of the IDS string 艹連, although not explicitly defined here, are [ { '十': 1, '丨': 1, '辶': 1, '車': 1 }, { '十': 2, '辶': 1, '車': 1 } ].

Define another function mergeTwoFreqs(powersetA, powersetB) which merges two powersets, adding the frequencies together for radicals present in both. Merging 3 or more sets is done by nesting, i.e. mergeTwoFreqs(powersetA, mergeTwoFreqs(powersetB, powersetC)), and so on.

Note how the first item of rec(艹), and then the second item, is merged with the only item of rec(連) in turn. In other words, for a given IDS string ABC, where rec(A) -> [ A0, A1, A2 ], rec(B) -> [ B0, B1, B2 ], and rec(C) -> [ C0, C1, C2 ], this is a permutation-generation problem where we need to generate [ mergeFreqs((A0, B0, C0)), mergeFreqs((A0, B0, C1)), ... mergeFreqs((A2, B2, C2)) ].

Define another function permGen(lengths) which takes the length of each list of powersets and outputs int[][], describing each permutation by its indices, e.g. [ [0,0,0], [0,0,1], ... [2,2,2] ] (keep on reading!)
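permGen itself is never spelled out below, so here is one possible implementation (a sketch; the project's actual version may differ):

```typescript
// Given the list sizes, emit every index tuple in lexicographic order,
// e.g. permGen([3, 3, 3]) starts [0,0,0], [0,0,1] and ends [2,2,2].
function permGen(sizes: number[]): number[][] {
  let acc: number[][] = [[]];
  for (const size of sizes) {
    const next: number[][] = [];
    for (const prefix of acc)
      for (let i = 0; i < size; i++) next.push([...prefix, i]);
    acc = next;
  }
  return acc;
}
```

Despite the name, this enumerates the Cartesian product of index ranges rather than permutations in the combinatorial sense.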

Now, the only thing left to do is to merge the powersets of the IDS string with the frequencies of the IDS string itself.

Going back to our example where the two powersets of 艹連 are [ { '十': 1, '丨': 1, '辶': 1, '車': 1 }, { '十': 2, '辶': 1, '車': 1 } ], we need to merge each of the powersets with { '艹': 1, '連': 1 } using mergeTwoFreqs to get two possible powersets for 蓮, namely,

[
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 1, '丨': 1 },
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 2 },
]

We do the same for the right subtree, to get the blue and green powersets.

[
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 1, '丨': 1 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 2 }
]

Finally, we append the left and right subtree results immediately above into a list, and flatten it to obtain our expected result far above.

The termination condition: a node is considered a leaf when its IDS string(s) contain only the character itself, i.e. there are no constituent characters, and thus we return an empty list of powersets.

Once again, in pseudocode:

def mergeTwoFreqs(freqsA, freqsB) {
    res = { ...freqsB };
    for (let char of Object.keys(freqsA)) {
        if (!res[char]) res[char] = 0;
        res[char] += freqsA[char];
    }
    return res;
};

// given [ [ A0, A1, A2 ], [ B0, B1, B2 ], [ C0, C1, C2 ] ], return [ mergeFreqs((A0, B0, C0)), mergeFreqs((A0, B0, C1)), ... mergeFreqs((A2, B2, C2)) ]
// A0 .. C2 are *powersets*.
def permuteAndMerge(freqsArr) {
    // [ [0,0,0], [0,0,1], ... [2,2,2] ]
    permutations = permGen(freqsArr.map(getSize))

    res = []
    for permutation in permutations {
        mergedPowerset = {}
        // take one from each powerset
        elements = permutation.map((elem, i) => freqsArr[i][elem])
        for element in elements {
            mergedPowerset = mergeTwoFreqs(mergedPowerset, element)
        }
        res.append(mergedPowerset)
    }
    return res
}

def rec(char) {
    // each index on this array corresponds to a fork in the black arrow, i.e. a different decomposition in the picture above
    freqsAtThisNode = []

    // ['⿰木⿺辶莗', '⿰木蓮']
    idsStrings = getIDSStrings(char)

    // record the frequencies of each IDS string
    // itself in the array
    for i = 0; i < idsStrings.length; i++ {
        freqsAtThisNode.append({});
        for idsChar in idsStrings[i] {
            if idsChar !== char and isValidHanChar(idsChar) {
                if (!freqsAtThisNode[i][idsChar]) {
                    freqsAtThisNode[i][idsChar] = 0;
                }
                freqsAtThisNode[i][idsChar] += 1;
            }
        }
    }

    res = []
    for i = 0; i < idsStrings.length; i++ {
        // eureka
        freqs = permuteAndMerge(Object.keys(freqsAtThisNode[i]).map((key) => rec(key)))
        // merge the powersets with the freqs of the corresponding IDS string
        freqs = freqs.map((freq) => mergeTwoFreqs(freq, freqsAtThisNode[i]))
        // flatten
        res = res.concat(freqs)
    }
    return res
}

N.B. if there are no valid characters in the IDS string, freqs in the 2nd for loop will be empty, and it will vanish during flattening.
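Putting the pieces together, here is a runnable TypeScript sketch of the whole pass over a toy decomposition table for 蓮. The table and helper names are illustrative stand-ins for the real IDS data, and in this sketch a leaf character yields a single empty powerset so that the Cartesian merges compose; the project's actual code may differ.

```typescript
type Freqs = Record<string, number>;

// Toy decomposition data; real IDS strings come from the CHISE dataset.
const toyIDS: Record<string, string[]> = {
  "蓮": ["⿱艹連", "⿺辶莗"],
  "連": ["⿺辶車"],
  "莗": ["⿱艹車"],
  "艹": ["⿰十十", "⿰十丨"],
};

// A leaf character's only IDS string is itself.
const getIDSStrings = (char: string): string[] => toyIDS[char] ?? [char];
// Treat everything except the IDC operators as a valid Han character.
const isValidHanChar = (c: string): boolean =>
  !"⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻".includes(c);

const mergeTwoFreqs = (a: Freqs, b: Freqs): Freqs => {
  const res: Freqs = { ...b };
  for (const k of Object.keys(a)) res[k] = (res[k] ?? 0) + a[k];
  return res;
};

// All index tuples [i0, i1, ...] with 0 <= ik < sizes[k].
const permGen = (sizes: number[]): number[][] =>
  sizes.reduce<number[][]>(
    (acc, size) =>
      acc.flatMap((p) => Array.from({ length: size }, (_, i) => [...p, i])),
    [[]]
  );

// Pick one powerset per constituent character and merge each combination.
const permuteAndMerge = (freqsArr: Freqs[][]): Freqs[] =>
  permGen(freqsArr.map((f) => f.length)).map((idxs) =>
    idxs.reduce<Freqs>((m, idx, i) => mergeTwoFreqs(m, freqsArr[i][idx]), {})
  );

function rec(char: string): Freqs[] {
  const res: Freqs[] = [];
  for (const ids of getIDSStrings(char)) {
    // Frequencies of the IDS string itself, excluding the character.
    const freqs: Freqs = {};
    for (const c of ids)
      if (c !== char && isValidHanChar(c)) freqs[c] = (freqs[c] ?? 0) + 1;
    // Recurse into each constituent, merge every combination, then fold
    // in the IDS string's own frequencies.
    for (const merged of permuteAndMerge(Object.keys(freqs).map(rec)))
      res.push(mergeTwoFreqs(merged, freqs));
  }
  return res;
}
```

On this toy table, rec("蓮") yields the 4 powersets listed far above (ordering aside).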

[1] AFAIK, there is a lot of human subjectivity involved in the CHISE IDS dataset.

[2] "Powerset" here refers to a hash table, the keys of which are ALL the constituent radicals of a particular character as we traverse the tree, and the values of which are their frequencies. I refer to this concept as freqs in the code often.

Generated Datasets

These JSON datasets are generated by npm run etl using my preprocessed IDS data and output into the public/json folder for consumption by the frontend.

baseRadicals.json: string[], a list of radicals which cannot be further decomposed.

forwardMap.json: { [key: string]: string[]; }, a map of component to characters that it is present in.
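As an illustration of how forwardMap.json can be consumed, a multi-component query is just an intersection of its entries (a sketch with toy data, not the frontend's actual search code):

```typescript
// Intersect the candidate character lists of each queried component.
function searchByComponents(
  forwardMap: Record<string, string[]>,
  components: string[]
): string[] {
  let candidates: Set<string> | null = null;
  for (const comp of components) {
    const next = new Set(forwardMap[comp] ?? []);
    candidates = candidates
      ? new Set([...candidates].filter((c) => next.has(c)))
      : next;
  }
  return candidates ? [...candidates] : [];
}

// Toy excerpt of a forward map.
const toyForwardMap: Record<string, string[]> = {
  "艹": ["蓮", "莲", "藝"],
  "車": ["蓮", "轉", "輿"],
};
```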

variantsIslandsLookup.json: { islands: string[][]; chars: { [key: string]: number[] }; }, islands is a list of lists of related characters. chars is a map of each individual character to the index (or indices) in islands of the island(s) containing it.
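A sketch of consuming variantsIslandsLookup.json to fetch all variants of a character (toy data; the number[] values suggest a character may map to more than one island, so all of them are collected):

```typescript
interface VariantsIslandsLookup {
  islands: string[][];
  chars: { [key: string]: number[] };
}

// Look up the island index(es) for a character, then return every other
// member of those islands.
function getVariants(lookup: VariantsIslandsLookup, char: string): string[] {
  const variants = new Set<string>();
  for (const i of lookup.chars[char] ?? [])
    for (const c of lookup.islands[i]) if (c !== char) variants.add(c);
  return [...variants];
}

// Toy island for the 衛 family.
const toyLookup: VariantsIslandsLookup = {
  islands: [["卫", "衛", "衞"]],
  chars: { "卫": [0], "衛": [0], "衞": [0] },
};
```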

variantsMap.json: { [key: string]: number[]; }, map of character to list of types that it is classified as (e.g. Shinjitai, Simplified Chinese)

readings.json: { [key: string]: { [key: string]: string; }; }, map of character to Unihan reading fields.

reverseMap.json (large! >150MB. Not actually used by the frontend directly.):

{
  [key: string]: {
    utf_code: string;
    ids_strings: {
      ids: string;
      locales: string;
    }[];
    charFreqs?: powerset[];
  };
}

ids_strings is a list of IDSs and the locales that this IDS decomposition corresponds to.

Refer above2 for what a "powerset" means here.

reverseMapCharFreqsOnly.json and reverseMapIDSOnly.json are optimized versions of reverseMap.json in order to fit within Apple's 50MB service worker cache limit for PWAs.

reverseMap.json is gzipped and served as reverseMap.json.gz in the deployed version.

Rules of thumb

The IDS sequences as provided by CHISE use Kangxi radicals and their actual CJK character counterparts interchangeably.

Decomposition which uses U+2E81 (Kangxi radical)

Decomposition which uses U+20086 (CJK character)

All Kangxi radicals are converted to their corresponding CJK characters as part of the ETL process.

All variants of a character, including transitive ones, should be retrievable from any of the characters involved, e.g. searching for 发 (SC) should return 發 (TC), 髮 (TC), and 発 (JA), and searching for 鄕 (KR) should return 郷 (JA), 鄉 (TC), and 乡 (SC), along with other less-commonly used variants.
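This transitive behaviour is naturally modelled as connected components over pairwise variant relations; a union-find sketch (illustrative; the project's actual "find all islands" implementation may differ):

```typescript
// Group characters into "islands" given pairwise variant relations.
function findIslands(pairs: [string, string][]): string[][] {
  const parent = new Map<string, string>();
  const find = (x: string): string => {
    if (!parent.has(x)) parent.set(x, x);
    const p = parent.get(x)!;
    if (p === x) return x;
    const root = find(p);
    parent.set(x, root); // path compression
    return root;
  };
  // Union each pair of variants.
  for (const [a, b] of pairs) parent.set(find(a), find(b));
  // Collect members by their final root.
  const groups = new Map<string, string[]>();
  for (const ch of parent.keys()) {
    const r = find(ch);
    if (!groups.has(r)) groups.set(r, []);
    groups.get(r)!.push(ch);
  }
  return [...groups.values()];
}
```

Note how 發 and 発 end up in the same island even if the source data only relates each of them to 髮.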

Licenses

SPDX-License-Identifier: GPL-2.0-or-later

  • The products of npm run etl (most of the files in the public/json folder) are licensed under the GNU General Public License v2.0 or later.

  • I have plans to dual-license Radically under the MIT license.

    • The main obstacles are the lack of a non-copyleft, small, 100% CJK Unicode coverage webfont, and most of the etl folder relying on said GPLv2 licensed data.
    • You are more than welcome to contact me via email, the chat links above, or opening an issue if you have a need to use (parts of) it, or the algorithm in non-copyleft settings.
    • If you wish to contribute to Radically, please keep this in consideration.

radically's People

Contributors

transfusion


radically's Issues

Improve the quality / comprehensiveness of variants

斥 and 叱
http://ccamc.org/cjkv.php?cjkv=%E5%8F%B1

康熙字典 (Kangxi Dictionary)
【丑集上】【口字部】
【唐韻】【集韻】【韻會】【正韻】𠀤尺栗切,音𩾔。【說文】訶也。【倉頡篇】大訶爲叱。 (The rime dictionaries all give the fanqie 尺栗切; the Shuowen glosses it as "to berate"; the Cangjiepian says that to berate loudly is 叱.)

漢語大字典 (Hanyu Da Zidian)
1.大聲呵斥、責罵。(To berate or scold loudly.)
2.呼喊;吆喝。(To shout; to call out.)

叱 has connotations of loudness, in addition to being used in compounds that mean rebuking, scolding, etc, e.g. 斥/叱責,怒斥/叱,斥/叱罵 etc

https://www.kanjipedia.jp/kanji/0002920800
斥 in Japanese appears to be primarily used in the "repulsion" sense, e.g. 排斥, 斥力, 斥/退ける, in addition to other classical phrases like 斥候, the only semantic overlap with 叱 being 指斥, 叱 being used otherwise

Identify variants which aren't used in any decompositions / which people are more likely to use

The premise of this tool is to leverage knowledge of character variants, which includes the source characters of various radicals, whether dictionary-indexed, simplified, or otherwise transformed in a consistent manner.

The right part of 価, the top part of 要 and 栗, the top-right part of 湮, etc. are all semantically a 西; however, there are two identical-looking "覀"s that are shown.


The approach taken to handle issue #2 is to convert all IDS decompositions to Unihan characters outside of the CJK Radicals Supplement in the ETL process, hence these characters are retrievable by U+8980 and not by U+2EC3.

Identify Japanese and Korean unsimplified, canonical characters


卫、衛、衞󠄀

Note that in Japan, https://www.kanjipedia.jp/kanji/0000403800 衞󠄀 is the 旧字 (kyūjitai, old form) of 衛 (!!)

One cannot simply go hunting in the Unihan database directly, since these exist as preexisting variants in G sources too - https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=U%2B885E

「說文解字」has https://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QTAyNzY4 眞, and furthermore goes on to say: 僊人變形而登天也。从从目从乚。 ("an immortal changing form and ascending to heaven"). Korea and Japan consider this variant to be canonically traditional.


The case of 既 and 即 is strange in Japanese: they are 既 and 卽 respectively.


Experiment with react-router instead of the current snap scrolling experience

Mobile browsers already use edge swiping for navigating forward and backward (!)

Background routes should have no problem being unmounted because all the information, besides scroll positions, perhaps, is in the DataContextProvider.

It would avoid all of the issues associated with Chrome mobile jumping back to the left of the StickyOutputBarWrapper.

Implement scripts to fetch and preprocess raw Unihan data

Processing Unihan data (Unihan_IRGSources.txt, Unihan_Readings.txt, and Unihan_Variants.txt) into the ReverseMap, VariantRadicals, and BaseRadicals in the browser already takes more than 10 seconds, which will only get worse with the upcoming implementation of the "find all islands" algorithm and the addition of Ext. G characters.

Include all 1st level orthographic variants in the base radicals list

艹 (U+8279, aka radical variant of 艸) and 艹 (U+FA5E, aka compatibility orthographic variant explicitly made up of two 十 十) should both be base radicals. As of 418fe90, the far more frequently-used radical, 艹 (U+8279), is NOT included in the base radical list (!!!)


The IDS database records most common characters with the 艸 radical using U+8279, even in locales that use the 十 十 orthographic style, e.g. 華、艺 (SC)、藝 (TC)、芸 (JP - simplified)、満 (JP - simplified), etc.

https://www.kanjipedia.jp/kanji/0000684000

The only characters which have U+FA5E in their IDS strings are "艹":["劳","𦬼","荣","蓳","𧏊","𧆾","𮓖"]

To complicate matters,

CJK Compatibility Ideographs | 艹 (FA5D)
CJK Compatibility Ideographs | 艹 (FA5E)
CJK Radicals Supplement | ⺾ (2EBE)
CJK Radicals Supplement | ⺿ (2EBF)
CJK Radicals Supplement | ⻀ (2EC0)

End-usage suggestions

Hi,

I like this project, and I think it deserves more recognition.

I have a few suggestions to help those without a technical background, which I hope would give it a wider audience. I appreciate it is only at version 0.1.0.

  1. A 'guide' / 'help' button? Perhaps beneath 'Radically A component-based CJK character search engine', or beneath the Search button. To show a pop-up usage guide, or to scroll the page down to the usage guide below. I know the usage guide is just there if you scroll down, but when you are tired or very much not at your best, and in a rush, as I was...
  2. How about variants being on a separate line to the decomposition? The demarcation is obvious to experienced readers, but a tad confusing to beginners to Asian languages.
  3. A few possible duplicates, e.g. searching for 一 shows, under 3 strokes: 兀 and 兀. You probably spotted these already.
  4. Usage guide could be rewritten for quicker understanding by non-technical users. I could contribute a few sentences.

Best wishes,

L

Load JSON into IndexedDB in JS if we exceed 50 MB

json]$ ls -lah
total 30M
drwxrwxr-x. 2 transfusion transfusion 4.0K Jan 30 19:51 .
drwxrwxr-x. 4 transfusion transfusion 4.0K Jan 27 03:13 ..
-rw-rw-r--. 1 transfusion transfusion 3.2K Jan 30 19:50 baseRadicals.json
-rw-rw-r--. 1 transfusion transfusion 1.5M Jan 30 19:50 forwardMap.json
-rw-rw-r--. 1 transfusion transfusion   73 Jan 30 19:50 processedIDSMetadata.json
-rw-rw-r--. 1 transfusion transfusion 5.5M Jan 30 19:50 readings.json
-rw-rw-r--. 1 transfusion transfusion  21M Jan 30 19:50 reverseMap.json
-rw-rw-r--. 1 transfusion transfusion 873K Jan 30 19:50 strokeCount.json
-rw-rw-r--. 1 transfusion transfusion 576K Jan 30 19:50 variantsIslandsLookup.json
-rw-rw-r--. 1 transfusion transfusion 298K Jan 30 19:50 variantsMap.json

https://stackoverflow.com/questions/60376429/where-in-my-pwa-should-i-be-loading-a-somewhat-large-json-file
iOS has a 50 MiB cache limit

https://developers.google.com/web/ilt/pwa/live-data-in-the-service-worker

For example HTML, CSS, and JS files should be stored in the cache, while JSON data should be stored in IndexedDB.
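Following that guidance, the routing decision could be sketched as a simple predicate (illustrative only; not the project's actual service-worker code):

```typescript
// Per the guidance above: keep the app shell (HTML/CSS/JS) in the Cache
// Storage API, and route the large JSON datasets to IndexedDB instead.
function storageFor(url: string): "cache" | "indexeddb" {
  const pathname = url.split("?")[0];
  return pathname.endsWith(".json") || pathname.endsWith(".json.gz")
    ? "indexeddb"
    : "cache";
}
```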

Bug in the regex-based IDCs matching step

Searching for 產亇 and ⿰ alone returns 𬎻 (U+2C3BB, 14 strokes), where the decomposition is ⿰產亇, but searching for 產 and ⿰ alone does not return U+2C3BB.
