Chinese2Pinyin

The program borrows SDS, the string library from Redis, to simplify the code and improve safety and portability.

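This Chinese-to-pinyin converter only supports input strings encoded as UTF-8. Input in any other encoding is not processed and is written to the output unchanged.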

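The pinyin of the 20902 common Chinese characters is written to the file PinyinData.txt, sorted by Unicode code point. The longest pinyin takes 6 bytes (chuang, for example), so every pinyin occupies a fixed 6 bytes in the file, and shorter entries are padded with trailing spaces.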

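The smallest code point among these characters is 19968 (0x4E00). Subtract 19968 from a character's code point and multiply by 6 to get its offset in PinyinData.txt. Then seek to that position with fseek, read 6 bytes, and strip any trailing spaces to obtain the character's pinyin.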

    // tmp is the character's Unicode code point; start = 19968, MAX_PINYIN_LEN = 6
    long offset = (long)(tmp - start) * MAX_PINYIN_LEN;
    // fp is the FILE * stream opened on PinyinData.txt
    fseek(fp, offset, SEEK_SET);
    fread(buf, MAX_PINYIN_LEN, 1, fp);
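
The lookup described above can be sketched as a small function with range and error checks. This is only an illustration under stated assumptions: the name `lookup_pinyin` and the constants `PINYIN_START` and `PINYIN_COUNT` are hypothetical and not taken from the project, and the file is assumed to hold exactly one 6-byte, space-padded record per character.

```c
#include <assert.h>
#include <stdio.h>

#define PINYIN_START   19968u  /* U+4E00, first code point in the table */
#define PINYIN_COUNT   20902u  /* number of fixed-width records (assumed) */
#define MAX_PINYIN_LEN 6       /* bytes per record, space-padded */

/*
 * Hypothetical helper: copy the pinyin for code point cp into out,
 * which must hold at least MAX_PINYIN_LEN + 1 bytes.
 * Returns the pinyin length, or -1 if cp is outside the table
 * or the seek/read fails.
 */
int lookup_pinyin(FILE *fp, unsigned int cp, char *out)
{
    assert(fp != NULL && out != NULL);

    if (cp < PINYIN_START || cp - PINYIN_START >= PINYIN_COUNT)
        return -1;

    long offset = (long)(cp - PINYIN_START) * MAX_PINYIN_LEN;
    if (fseek(fp, offset, SEEK_SET) != 0)
        return -1;
    if (fread(out, 1, MAX_PINYIN_LEN, fp) != MAX_PINYIN_LEN)
        return -1;

    /* Strip the trailing space padding. */
    int len = MAX_PINYIN_LEN;
    while (len > 0 && out[len - 1] == ' ')
        len--;
    out[len] = '\0';
    return len;
}
```

A caller would open PinyinData.txt once with `fopen(path, "rb")` and reuse the stream for every lookup, so each character costs one seek and one 6-byte read.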

How do we get a character's Unicode code point?

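Unicode assigns every character a numeric code point. Storing each code point in a fixed two bytes (as UCS-2 does) doubles the space needed for plain English text. UTF-8 is a variable-length encoding of Unicode and the most common one for network transmission.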

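UTF-8 text is scanned one byte at a time:

A value in 00000000~01111111 (0~127) is a single-byte character; advance the pointer by one position.

A value in 11000000~11011111 (192~223) starts a 2-byte character; advance the pointer by two positions.

A value in 11100000~11101111 (224~239) starts a 3-byte character; advance the pointer by three positions.

A value in 11110000~11110111 (240~247) starts a 4-byte character; advance the pointer by four positions.

A value in 11111000~11111011 (248~251) starts a 5-byte character; advance the pointer by five positions.

A value in 11111100~11111101 (252~253) starts a 6-byte character; advance the pointer by six positions.

A byte in 10000000~10111111 (128~191) can never be the first byte of a character; it can only be a continuation byte.

The 5- and 6-byte forms come from the original UTF-8 design. RFC 3629 limits UTF-8 to at most 4 bytes, so lead bytes 248~253 do not occur in valid modern UTF-8.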

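The Chinese characters handled here each occupy three bytes in UTF-8. So when a byte value falls in 224~239, read the next two bytes as well; together these three bytes represent one Chinese character.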

First, the rules for converting a Unicode code point to UTF-8:

    0000 – 007F
    0xxxxxxx
    0080 – 07FF
    110xxxxx 10xxxxxx
    0800 – FFFF
    1110xxxx 10xxxxxx 10xxxxxx

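For example, the code point of "汉" is 0x6C49. Since 0x6C49 lies in 0800~FFFF, it uses the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx. In binary, 6C49 is 0110 1100 0100 1001. Split it the way the 3-byte template requires (4, 6, and 6 bits): 0110 110001 001001. Substituting these groups for the x positions in order gives 1110-0110 10-110001 10-001001, that is, E6 B1 89, which is the UTF-8 encoding of "汉".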

Conversely, given the UTF-8 encoding of "汉", 11100110 10110001 10001001, the following rule recovers its Unicode code point:

    /*
     * words (an unsigned char *) points to the current byte, whose value is in 224~239.
     * (((int)(*words & 0x0F)) << 12)    => 0110 0000 0000 0000
     * (((int)(*(words+1) & 0x3F)) << 6) => 1100 0100 0000
     * (*(words+2) & 0x3F)               => 00 1001
     * 0110 0000 0000 0000 | 1100 0100 0000 | 00 1001 => 0110 1100 0100 1001 => 6C49
     */
    int tmp = (((int)(*words & 0x0F)) << 12) | (((int)(*(words+1) & 0x3F)) << 6) | (*(words+2) & 0x3F);

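This gives 0x6C49 as the code point of "汉", which is 27721 in decimal. (27721 - 19968) * 6 = 46518, so we use fseek to position fp at offset 46518, read 6 bytes, and strip the trailing spaces to get "han". The pinyin lookup succeeds.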
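
Putting the steps together, a whole-string conversion might look like the sketch below. It is not the project's implementation: the name `to_pinyin`, the single space written after each pinyin, and the caller-supplied output buffer are all assumptions made for illustration (the project itself builds strings with SDS). Bytes that are not part of a character found in the table are copied through unchanged.

```c
#include <assert.h>
#include <stdio.h>

#define PINYIN_START   19968L  /* U+4E00 */
#define PINYIN_COUNT   20902L  /* records in the table (assumed) */
#define MAX_PINYIN_LEN 6

/*
 * Hypothetical driver: walk the UTF-8 string in, write the pinyin plus
 * one space for every character found in the table, and copy every
 * other byte unchanged. Returns the output length, or -1 if out is
 * too small.
 */
long to_pinyin(FILE *fp, const char *in, char *out, size_t outsz)
{
    assert(fp != NULL && in != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (outsz == 0)
        return -1;

    while (*p != '\0') {
        char rec[MAX_PINYIN_LEN];
        int len = 0;

        /* Three-byte template: 1110xxxx 10xxxxxx 10xxxxxx. */
        if ((p[0] & 0xF0) == 0xE0 && (p[1] & 0xC0) == 0x80 &&
            (p[2] & 0xC0) == 0x80) {
            long cp = ((long)(p[0] & 0x0F) << 12)
                    | ((long)(p[1] & 0x3F) << 6)
                    |  (long)(p[2] & 0x3F);
            long idx = cp - PINYIN_START;
            if (idx >= 0 && idx < PINYIN_COUNT &&
                fseek(fp, idx * MAX_PINYIN_LEN, SEEK_SET) == 0 &&
                fread(rec, 1, MAX_PINYIN_LEN, fp) == MAX_PINYIN_LEN) {
                len = MAX_PINYIN_LEN;
                while (len > 0 && rec[len - 1] == ' ')
                    len--;  /* strip the space padding */
            }
        }

        if (len > 0) {
            /* Room for the pinyin, the separator, and the final NUL. */
            if (n + (size_t)len + 2 > outsz)
                return -1;
            for (int i = 0; i < len; i++)
                out[n++] = rec[i];
            out[n++] = ' ';
            p += 3;
        } else {
            /* Not a table character: pass the byte through. */
            if (n + 2 > outsz)
                return -1;
            out[n++] = (char)*p++;
        }
    }
    out[n] = '\0';
    return (long)n;
}
```

With a table that stores "han" at offset 46518, the input "a汉b" yields "ahan b" under these assumptions.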

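In summary, the program converts characters to pinyin by seeking within a file. The table never has to be loaded into memory, so memory use is small, and each lookup costs only one seek and one 6-byte read.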

chinese2pinyin's People

Contributors

jishipu
