
A component-based CJK character search engine

Home Page: https://radically.bryankok.com

License: GNU General Public License v2.0

cjk chinese-characters japanese-characters pwa react offline-app search-algorithm

radically's Introduction

Radically


Lookup Preprocessing Algorithm

A basic implementation of a lookup system from constituent radical to character is simple: depth-first search starting from the root node, adding every character encountered in every IDS string to a hash table / dictionary / similar data structure. Indeed, this is done in this project.

車 -> 蓮,櫣,輿,轉 ...
车 -> 莲,舆,转 ...
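That DFS-and-hash-table indexing step can be sketched as follows, over a hypothetical decomposition table (the table and function names here are illustrative stand-ins, not the project's actual code):

```typescript
// Hypothetical decomposition table: character -> constituent characters
// drawn from its IDS string(s). Assumes the data is acyclic.
const decompositions: Record<string, string[]> = {
  "蓮": ["艹", "連"],
  "連": ["辶", "車"],
  "艹": ["十", "丨"],
};

// DFS from each character, attributing it to every component reached,
// to build the component -> characters index described above.
function buildForwardMap(
  decomp: Record<string, string[]>
): Record<string, Set<string>> {
  const forward: Record<string, Set<string>> = {};
  const visit = (root: string, char: string): void => {
    for (const component of decomp[char] ?? []) {
      (forward[component] ??= new Set()).add(root);
      visit(root, component);
    }
  };
  for (const char of Object.keys(decomp)) visit(char, char);
  return forward;
}
```

With this toy data, `buildForwardMap(decompositions)["車"]` contains both 蓮 and 連, mirroring the 車 -> 蓮, ... example above.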

The complication arises if we want to include radical frequency in our search - e.g. "find all characters with at least 3 occurrences of 人" should return 傘, 众, and 齒, but not 从, 纵, 齿, etc.
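On the query side, that frequency-aware search then reduces to checking each character's precomputed frequency sets (a sketch; the function name and data shape are illustrative):

```typescript
type Freqs = Record<string, number>;

// A character matches "at least n occurrences of radical r" if ANY of its
// decomposition paths yields r with a count of n or more.
function hasAtLeast(charFreqs: Freqs[], radical: string, n: number): boolean {
  return charFreqs.some((freqs) => (freqs[radical] ?? 0) >= n);
}
```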

Decomposition of 蓮

  • There may be more than one IDS string, and thus more than one decomposition, per character.
  • The total number of characters in each head-to-leaf-node path varies, as does the number of unique characters. We may (and I have!) run into a character with a very elaborate decomposition down one branch, and a decomposition that terminates early down another1.

The approach taken is to generate a list of the powersets2 of radicals, one for every possible head-to-leaf-node path, EXCLUDING THE ROOT ITSELF (important!), e.g. for 蓮 the expected output would be

[
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 1, '丨': 1 },
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 2 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 1, '丨': 1 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 2 }
]

where there are 4 sets, or rather hash tables, depicted by the 4 different colors above. The one in green corresponds to the 1st item in the list.

Define a recursive function, rec(char), where char is a single character and the output is as described above.

Taking the left branch from the root, 艹 can be decomposed into either 十十 or 十丨, hence rec(艹) -> [ { '十': 1, '丨': 1 }, { '十': 2 } ], and similarly, rec(連) -> [ { '辶': 1, '車': 1 } ].

The key observation is that for each character in an IDS string, we must pick 1 set from its list of sets before merging 1 set per character into a bigger set, to avoid double-counting radicals. Ergo, we must find the Cartesian product of the lists of powersets of the constituent characters, then merge each resulting n-tuple.

Hence, the powersets of the IDS string 艹連, although not explicitly defined here, are [ { '十': 1, '丨': 1, '辶': 1, '車': 1 }, { '十': 2, '辶': 1, '車': 1 } ].

Define another function mergeTwoFreqs(powersetA, powersetB) which merges two powersets, adding the frequencies together for radicals present in both. Merging 3 or more sets is done by nesting, i.e. mergeTwoFreqs(powersetA, mergeTwoFreqs(powersetB, powersetC)), and so on.

Note how the first item of rec(艹), and then the second item, is merged with the only item of rec(連) in turn. In other words, for a given IDS string ABC, where rec(A) -> [ A0, A1, A2 ], rec(B) -> [ B0, B1, B2 ], and rec(C) -> [ C0, C1, C2 ], this is a permutation-generation problem where we need to generate [ mergeFreqs((A0, B0, C0)), mergeFreqs((A0, B0, C1)), ... mergeFreqs((A2, B2, C2)) ].

Define another function permGen(lengths) which takes the length of each list of powersets and outputs int[][], describing each permutation by its indices, e.g. [ [0,0,0], [0,0,1], ... [2,2,2] ] (keep on reading!)
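permGen itself is never spelled out below, so here is one possible implementation (a sketch; the project's actual version may differ):

```typescript
// Given the list sizes, emit every index tuple in lexicographic order,
// e.g. permGen([3, 3, 3]) starts [0,0,0], [0,0,1] and ends [2,2,2].
function permGen(sizes: number[]): number[][] {
  let acc: number[][] = [[]];
  for (const size of sizes) {
    const next: number[][] = [];
    for (const prefix of acc)
      for (let i = 0; i < size; i++) next.push([...prefix, i]);
    acc = next;
  }
  return acc;
}
```

Despite the name, this enumerates the Cartesian product of index ranges rather than permutations in the combinatorial sense.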

Now, the only thing left to do is to merge the powersets of the IDS string with the frequencies of the IDS string itself.

Going back to our example where the two powersets of 艹連 are [ { '十': 1, '丨': 1, '辶': 1, '車': 1 }, { '十': 2, '辶': 1, '車': 1 } ], we need to merge each of the powersets with { '艹': 1, '連': 1 } using mergeTwoFreqs to get two possible powersets for 蓮, namely,

[
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 1, '丨': 1 },
  { '艹': 1, '連': 1, '辶': 1, '車': 1, '十': 2 },
]

We do the same for the right subtree, to get the blue and green powersets.

[
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 1, '丨': 1 },
  { '辶': 1, '莗': 1, '艹': 1, '車': 1, '十': 2 }
]

Finally, we append the left and right subtree results immediately above into a list, and flatten it to obtain our expected result far above.

The termination condition: a node is considered a leaf when its IDS string(s) contain only the character itself, i.e. there are no constituent characters, and thus we return an empty list of powersets.

Once again, in pseudocode:

def mergeTwoFreqs(freqsA, freqsB) {
    res = { ...freqsB };
    for (let char of Object.keys(freqsA)) {
        if (!res[char]) res[char] = 0;
        res[char] += freqsA[char];
    }
    return res;
};

// given [ [ A0, A1, A2 ], [ B0, B1, B2 ], [ C0, C1, C2 ] ], return [ mergeFreqs((A0, B0, C0)), mergeFreqs((A0, B0, C1)), ... mergeFreqs((A2, B2, C2)) ]
// A0 .. C2 are *powersets*.
def permuteAndMerge(freqsArr) {
    // [ [0,0,0], [0,0,1], ... [2,2,2] ]
    permutations = permGen(freqsArr.map(getSize))

    res = []
    for permutation in permutations {
        mergedPowerset = {}
        // take one from each powerset
        elements = permutation.map((elem, i) => freqsArr[i][elem])
        for element in elements {
            mergedPowerset = mergeTwoFreqs(mergedPowerset, element)
        }
        res.append(mergedPowerset)
    }
    return res
}

def rec(char) {
    // each index on this array corresponds to a fork in the black arrow, i.e. a different decomposition in the picture above
    freqsAtThisNode = []

    // ['⿰木⿺辶莗', '⿰木蓮']
    idsStrings = getIDSStrings(char)

    // record the frequencies of each IDS string
    // itself in the array
    for i = 0; i < idsStrings.length; i++ {
        freqsAtThisNode.append({});
        for idsChar in idsStrings[i] {
            if idsChar !== char and isValidHanChar(idsChar) {
                if (!freqsAtThisNode[i][idsChar]) {
                    freqsAtThisNode[i][idsChar] = 0;
                }
                freqsAtThisNode[i][idsChar] += 1;
            }
        }
    }

    res = []
    for i = 0; i < idsStrings.length; i++ {
        // eureka
        freqs = permuteAndMerge(Object.keys(freqsAtThisNode[i]).map((key) => rec(key)))
        // merge the powersets with the freqs of the corresponding IDS string
        freqs = freqs.map((freq) => mergeTwoFreqs(freq, freqsAtThisNode[i]))
        // flatten
        res = res.concat(freqs)
    }
    return res
}

N.B. if there are no valid characters in the IDS string, freqs in the 2nd for loop will be empty, and it will vanish during flattening.
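Putting the pieces together, here is a runnable TypeScript sketch of the whole pass over a toy decomposition table for 蓮. The table and helper names are illustrative stand-ins for the real IDS data, and in this sketch a leaf character yields a single empty powerset so that the Cartesian merges compose; the project's actual code may differ.

```typescript
type Freqs = Record<string, number>;

// Toy decomposition data; real IDS strings come from the CHISE dataset.
const toyIDS: Record<string, string[]> = {
  "蓮": ["⿱艹連", "⿺辶莗"],
  "連": ["⿺辶車"],
  "莗": ["⿱艹車"],
  "艹": ["⿰十十", "⿰十丨"],
};

// A leaf character's only IDS string is itself.
const getIDSStrings = (char: string): string[] => toyIDS[char] ?? [char];
// Treat everything except the IDC operators as a valid Han character.
const isValidHanChar = (c: string): boolean =>
  !"⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻".includes(c);

const mergeTwoFreqs = (a: Freqs, b: Freqs): Freqs => {
  const res: Freqs = { ...b };
  for (const k of Object.keys(a)) res[k] = (res[k] ?? 0) + a[k];
  return res;
};

// All index tuples [i0, i1, ...] with 0 <= ik < sizes[k].
const permGen = (sizes: number[]): number[][] =>
  sizes.reduce<number[][]>(
    (acc, size) =>
      acc.flatMap((p) => Array.from({ length: size }, (_, i) => [...p, i])),
    [[]]
  );

// Pick one powerset per constituent character and merge each combination.
const permuteAndMerge = (freqsArr: Freqs[][]): Freqs[] =>
  permGen(freqsArr.map((f) => f.length)).map((idxs) =>
    idxs.reduce<Freqs>((m, idx, i) => mergeTwoFreqs(m, freqsArr[i][idx]), {})
  );

function rec(char: string): Freqs[] {
  const res: Freqs[] = [];
  for (const ids of getIDSStrings(char)) {
    // Frequencies of the IDS string itself, excluding the character.
    const freqs: Freqs = {};
    for (const c of ids)
      if (c !== char && isValidHanChar(c)) freqs[c] = (freqs[c] ?? 0) + 1;
    // Recurse into each constituent, merge every combination, then fold
    // in the IDS string's own frequencies.
    for (const merged of permuteAndMerge(Object.keys(freqs).map(rec)))
      res.push(mergeTwoFreqs(merged, freqs));
  }
  return res;
}
```

On this toy table, rec("蓮") yields the 4 powersets listed far above (ordering aside).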

[1] AFAIK, there is a lot of human subjectivity involved in the CHISE IDS dataset.

[2] "Powerset" here refers to a hash table, the keys of which are ALL the constituent radicals of a particular character as we traverse the tree, and the values of which are their frequencies. I refer to this concept as freqs in the code often.

Generated Datasets

These JSON datasets are generated by npm run etl using my preprocessed IDS data and output into the public/json folder for consumption by the frontend.

baseRadicals.json: string[], a list of radicals which cannot be further decomposed.

forwardMap.json: { [key: string]: string[]; }, a map of component to characters that it is present in.
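As an illustration of how forwardMap.json can be consumed, a multi-component query is just an intersection of its entries (a sketch with toy data, not the frontend's actual search code):

```typescript
// Intersect the candidate character lists of each queried component.
function searchByComponents(
  forwardMap: Record<string, string[]>,
  components: string[]
): string[] {
  let candidates: Set<string> | null = null;
  for (const comp of components) {
    const next = new Set(forwardMap[comp] ?? []);
    candidates = candidates
      ? new Set([...candidates].filter((c) => next.has(c)))
      : next;
  }
  return candidates ? [...candidates] : [];
}

// Toy excerpt of a forward map.
const toyForwardMap: Record<string, string[]> = {
  "艹": ["蓮", "莲", "藝"],
  "車": ["蓮", "轉", "輿"],
};
```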

variantsIslandsLookup.json: { islands: string[][]; chars: { [key: string]: number[] }; }, islands is a list of lists of related characters. chars is a map of each individual character to the index (or indices) in islands of the island(s) containing it.
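A sketch of consuming variantsIslandsLookup.json to fetch all variants of a character (toy data; the number[] values suggest a character may map to more than one island, so all of them are collected):

```typescript
interface VariantsIslandsLookup {
  islands: string[][];
  chars: { [key: string]: number[] };
}

// Look up the island index(es) for a character, then return every other
// member of those islands.
function getVariants(lookup: VariantsIslandsLookup, char: string): string[] {
  const variants = new Set<string>();
  for (const i of lookup.chars[char] ?? [])
    for (const c of lookup.islands[i]) if (c !== char) variants.add(c);
  return [...variants];
}

// Toy island for the 衛 family.
const toyLookup: VariantsIslandsLookup = {
  islands: [["卫", "衛", "衞"]],
  chars: { "卫": [0], "衛": [0], "衞": [0] },
};
```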

variantsMap.json: { [key: string]: number[]; }, map of character to list of types that it is classified as (e.g. Shinjitai, Simplified Chinese)

readings.json: { [key: string]: { [key: string]: string; }; }, map of character to Unihan reading fields.

reverseMap.json (large! >150MB. Not actually used by the frontend directly.):

{
  [key: string]: {
    utf_code: string;
    ids_strings: {
      ids: string;
      locales: string;
    }[];
    charFreqs?: powerset[];
  };
}

ids_strings is a list of IDSs and the locales that this IDS decomposition corresponds to.

Refer above2 for what a "powerset" means here.

reverseMapCharFreqsOnly.json and reverseMapIDSOnly.json are optimized versions of reverseMap.json in order to fit within Apple's 50MB service worker cache limit for PWAs.

reverseMap.json is gzipped and served as reverseMap.json.gz in the deployed version.

Rules of thumb

The IDS sequences as provided by CHISE use Kangxi radicals and their actual CJK character counterparts interchangeably.

Decomposition which uses U+2E81 (Kangxi radical)

Decomposition which uses U+20086 (CJK character)

All Kangxi radicals are converted to their corresponding CJK characters as part of the ETL process.

All variants of a character, including transitive ones, should be retrievable from any of the characters involved, e.g. searching for 发 (SC) should return 發 (TC), 髮 (TC), and 発 (JA), and searching for 鄕 (KR) should return 郷 (JA), 鄉 (TC), and 乡 (SC), along with other less-commonly used variants.
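This transitive behaviour is naturally modelled as connected components over pairwise variant relations; a union-find sketch (illustrative; the project's actual "find all islands" implementation may differ):

```typescript
// Group characters into "islands" given pairwise variant relations.
function findIslands(pairs: [string, string][]): string[][] {
  const parent = new Map<string, string>();
  const find = (x: string): string => {
    if (!parent.has(x)) parent.set(x, x);
    const p = parent.get(x)!;
    if (p === x) return x;
    const root = find(p);
    parent.set(x, root); // path compression
    return root;
  };
  // Union each pair of variants.
  for (const [a, b] of pairs) parent.set(find(a), find(b));
  // Collect members by their final root.
  const groups = new Map<string, string[]>();
  for (const ch of parent.keys()) {
    const r = find(ch);
    if (!groups.has(r)) groups.set(r, []);
    groups.get(r)!.push(ch);
  }
  return [...groups.values()];
}
```

Note how 發 and 発 end up in the same island even if the source data only relates each of them to 髮.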

Licenses

SPDX-License-Identifier: GPL-2.0-or-later

  • The products of npm run etl (most of the files in the public/json folder) are licensed under the GNU General Public License v2.0 or later.

  • I have plans to dual-license Radically under the MIT license.

    • The main obstacles are the lack of a non-copyleft, small, 100% CJK Unicode coverage webfont, and most of the etl folder relying on said GPLv2 licensed data.
    • You are more than welcome to contact me via email, the chat links above, or opening an issue if you have a need to use (parts of) it, or the algorithm in non-copyleft settings.
    • If you wish to contribute to Radically, please keep this in consideration.

radically's People

Contributors

transfusion


radically's Issues

Improve the quality / comprehensiveness of variants

斥 and 叱
http://ccamc.org/cjkv.php?cjkv=%E5%8F%B1

康熙字典 (Kangxi Dictionary)
【丑集上】【口字部】
【唐韻】【集韻】【韻會】【正韻】𠀤尺栗切,音𩾔。【說文】訶也。【倉頡篇】大訶爲叱。 (The rime dictionaries all give the fanqie 尺栗切; the Shuowen glosses it as "to berate"; the Cangjiepian says that to berate loudly is 叱.)

漢語大字典 (Hanyu Da Zidian)
1.大聲呵斥、責罵。(To berate or scold loudly.)
2.呼喊;吆喝。(To shout; to call out.)

叱 has connotations of loudness, in addition to being used in compounds that mean rebuking, scolding, etc, e.g. 斥/叱責,怒斥/叱,斥/叱罵 etc

https://www.kanjipedia.jp/kanji/0002920800
斥 in Japanese appears to be primarily used in the "repulsion" sense, e.g. 排斥, 斥力, 斥/退ける, in addition to other classical phrases like 斥候, the only semantic overlap with 叱 being 指斥, 叱 being used otherwise

Identify variants which aren't used in any decompositions / which people are more likely to use

The premise of this tool is to leverage knowledge of character variants, which includes the source characters of various radicals, whether dictionary-indexed, simplified, or otherwise transformed in a consistent manner.

The right part of 価, the top part of 要 and 栗, the top-right part of 湮, etc. are all semantically a 西; however, there are two identical-looking "覀"s that are shown.


The approach taken to handle issue #2 is to convert all IDS decompositions to Unihan characters outside of the CJK Radicals Supplement in the ETL process, hence these characters are retrievable by U+8980 and not by U+2EC3.

Identify Japanese and Korean unsimplified, canonical characters


卫、衛、衞󠄀

Note that in Japan, https://www.kanjipedia.jp/kanji/0000403800 衞󠄀 is the 旧字 (kyūjitai, old form) of 衛 (!!)

One cannot simply go hunting in the Unihan database directly, since these exist as preexisting variants in G sources too - https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=U%2B885E

「說文解字」has https://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QTAyNzY4 眞, and furthermore goes on to say: 僊人變形而登天也。从从目从乚。 ("an immortal changing form and ascending to heaven"). Korea and Japan consider this variant to be canonically traditional.


The case of 既 and 即 is strange in Japanese: they are 既 and 卽 respectively.


Experiment with react-router instead of the current snap scrolling experience

Mobile browsers already use edge swiping for navigating forward and backward (!)

Background routes should have no problem being unmounted because all the information, besides scroll positions, perhaps, is in the DataContextProvider.

It would avoid all of the issues associated with Chrome mobile jumping back to the left of the StickyOutputBarWrapper.

Implement scripts to fetch and preprocess raw Unihan data

Processing Unihan data (Unihan_IRGSources.txt, Unihan_Readings.txt, and Unihan_Variants.txt) into the ReverseMap, VariantRadicals, and BaseRadicals in the browser already takes more than 10 seconds, which will only get worse with the upcoming implementation of the "find all islands" algorithm and the addition of Ext. G characters.

Include all 1st level orthographic variants in the base radicals list

艹 (U+8279, aka radical variant of 艸) and 艹 (U+FA5E, aka compatibility orthographic variant explicitly made up of two 十 十) should both be base radicals. As of 418fe90, the far more frequently-used radical, 艹 (U+8279), is NOT included in the base radical list (!!!)


The IDS database records most common characters with the 艸 radical using U+8279, even in locales that use the 十 十 orthographic style, e.g. 華、艺 (SC)、藝 (TC)、芸 (JP - simplified)、満 (JP - simplified), etc.

https://www.kanjipedia.jp/kanji/0000684000

The only characters which have U+FA5E in their IDS strings are "艹":["劳","𦬼","荣","蓳","𧏊","𧆾","𮓖"]

To complicate matters,

CJK Compatibility Ideographs | 艹 (FA5D)
CJK Compatibility Ideographs | 艹 (FA5E)
CJK Radicals Supplement | ⺾ (2EBE)
CJK Radicals Supplement | ⺿ (2EBF)
CJK Radicals Supplement | ⻀ (2EC0)

End-usage suggestions

Hi,

I like this project, and I think it deserves more recognition.

I have a few suggestions to help those without a technical background, which I hope would give it a wider audience. I appreciate it is only at version 0.1.0.

  1. A 'guide' / 'help' button? Perhaps beneath 'Radically A component-based CJK character search engine', or beneath the Search button. To show a pop-up usage guide, or to scroll the page down to the usage guide below. I know the usage guide is just there if you scroll down, but when you are tired or very much not at your best, and in a rush, as I was...
  2. How about variants being on a separate line to the decomposition? The demarcation is obvious to experienced readers, but a tad confusing to beginners to Asian languages.
  3. A few possible duplicates, e.g. searching for 一 shows, under 3 strokes: 兀 and 兀. You probably spotted these already.
  4. Usage guide could be rewritten for quicker understanding by non-technical users. I could contribute a few sentences.

Best wishes,

L

Load JSON into IndexedDB in JS if we exceed 50 MB

json]$ ls -lah
total 30M
drwxrwxr-x. 2 transfusion transfusion 4.0K Jan 30 19:51 .
drwxrwxr-x. 4 transfusion transfusion 4.0K Jan 27 03:13 ..
-rw-rw-r--. 1 transfusion transfusion 3.2K Jan 30 19:50 baseRadicals.json
-rw-rw-r--. 1 transfusion transfusion 1.5M Jan 30 19:50 forwardMap.json
-rw-rw-r--. 1 transfusion transfusion   73 Jan 30 19:50 processedIDSMetadata.json
-rw-rw-r--. 1 transfusion transfusion 5.5M Jan 30 19:50 readings.json
-rw-rw-r--. 1 transfusion transfusion  21M Jan 30 19:50 reverseMap.json
-rw-rw-r--. 1 transfusion transfusion 873K Jan 30 19:50 strokeCount.json
-rw-rw-r--. 1 transfusion transfusion 576K Jan 30 19:50 variantsIslandsLookup.json
-rw-rw-r--. 1 transfusion transfusion 298K Jan 30 19:50 variantsMap.json

https://stackoverflow.com/questions/60376429/where-in-my-pwa-should-i-be-loading-a-somewhat-large-json-file
iOS has a 50 MiB cache limit

https://developers.google.com/web/ilt/pwa/live-data-in-the-service-worker

For example HTML, CSS, and JS files should be stored in the cache, while JSON data should be stored in IndexedDB.
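Following that guidance, the routing decision could be sketched as a simple predicate (illustrative only; not the project's actual service-worker code):

```typescript
// Per the guidance above: keep the app shell (HTML/CSS/JS) in the Cache
// Storage API, and route the large JSON datasets to IndexedDB instead.
function storageFor(url: string): "cache" | "indexeddb" {
  const pathname = url.split("?")[0];
  return pathname.endsWith(".json") || pathname.endsWith(".json.gz")
    ? "indexeddb"
    : "cache";
}
```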

Bug in the regex-based IDCs matching step

Searching for 產亇 and ⿰ alone returns 𬎻 (U+2C3BB, 14 strokes), where the decomposition is ⿰產亇, but searching for 產 and ⿰ alone does not return U+2C3BB.
