modified-pmi-be's Introduction

Unsupervised Chinese terminology detection based on real-world statistics

We propose a new method based on Pointwise Mutual Information (PMI), Branch Entropy (BE) and real-world data with over 16 billion Chinese characters. Compared with plain PMI, the method achieves better results in both precision and recall.

Data

  • Real-World data: 8 gigabytes of news data containing over 1.6 billion words, covering sports, entertainment, politics, science, arts and culture, from Sina, China News, Tencent, Baidu and People's Daily
  • Test Data: drug-name data collected from 38 3A-grade hospitals.

Method

PMI is defined as
$$
\begin{align*}
  pmi(x,y) = \log_2{\frac{p(x,y)}{p(x)p(y)}}
\end{align*}
$$
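As a minimal sketch of the PMI formula above, the following Python function estimates it from raw co-occurrence counts. The function name, the count-based probability estimates and the example numbers are assumptions made for illustration; the source does not say how the probabilities are estimated.

```python
import math


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Base-2 pointwise mutual information from raw counts.

    Assumption: probabilities are plain relative frequencies over
    `total` observations (not taken from the source).
    """
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x and y always occur together in half of the data.
print(pmi(count_xy=2, count_x=2, count_y=2, total=4))
```

A value above 0 means x and y co-occur more often than independence would predict; a value of 0 means they behave as if independent.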
Define $\theta$ as the modification factor
$$
\begin{equation}
  \theta = \frac{p(x',y')}{p(x')p(y')}
\end{equation}
$$
PMI' can be defined as
$$
pmi' = \log_2{\left\{\frac{p(x,y)}{p(x)p(y)}\cdot\frac{1}{\theta}\right\}}
$$
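The modification amounts to dividing the PMI ratio by $\theta$ before taking the logarithm, so pmi' equals the plain PMI minus $\log_2\theta$. A minimal Python sketch follows; how x' and y' are chosen is not specified here, so their probabilities are simply passed in by the caller, and all function names are hypothetical.

```python
import math


def modification_factor(p_xy_ref: float, p_x_ref: float, p_y_ref: float) -> float:
    """theta = p(x', y') / (p(x') p(y')); the choice of x', y' is left to the caller."""
    return p_xy_ref / (p_x_ref * p_y_ref)


def modified_pmi(p_xy: float, p_x: float, p_y: float, theta: float) -> float:
    """pmi' = log2( p(x,y) / (p(x) p(y)) * 1/theta )."""
    return math.log2(p_xy / (p_x * p_y) / theta)


# Hypothetical probabilities, for illustration only.
theta = modification_factor(0.5, 0.5, 0.5)
print(modified_pmi(0.5, 0.5, 0.5, theta))
```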
Normalized pmi'
$$
\mathcal{N}(pmi') = \frac{{pmi'}-\min{(pmi')}}{\max{(pmi')}-\min{(pmi')}}
$$
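The min-max normalization above maps the pmi' values of all candidates onto the interval [0, 1]. A small Python sketch, assuming the scores are held in a list; the handling of the degenerate case where all scores are equal is an assumption, not something the source specifies.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with no spread, every candidate gets 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_normalize([1.0, 2.0, 3.0]))
```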
Branch Entropy
$$
\mbox{H}(x) = -\sum\limits_{i} p(x_i)\log_{2}{p(x_i)}
$$
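The entropy above can be estimated from the characters observed next to a candidate string. The following Python sketch collects the right-hand neighbours of a term in a text and computes their entropy; the helper names and the toy string are hypothetical, and the frequency-based estimate of $p(x_i)$ is an assumption.

```python
import math
from collections import Counter


def right_neighbors(text: str, term: str) -> list[str]:
    """Collect the character that follows each occurrence of `term` in `text`."""
    found = []
    start = text.find(term)
    while start != -1:
        end = start + len(term)
        if end < len(text):
            found.append(text[end])
        start = text.find(term, start + 1)
    return found


def entropy(neighbors: list[str]) -> float:
    """H = -sum_i p(x_i) log2 p(x_i), with p estimated by relative frequency."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy example: "ab" is followed by three different characters.
print(entropy(right_neighbors("abxabyabz", "ab")))
```

A candidate that is followed (or preceded) by many different characters has high entropy on that side, which suggests a word boundary there.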

$$
\mathcal{H}_{outer}(x) = \min \lbrace \mbox{H}_r(x),\mbox{H}_l(x) \rbrace \\
 \mbox{BE}(x) = -\mathcal{H}_{outer}(x)+\mathcal{H}_{inner}(x) \\
$$
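Combining the pieces, the BE value takes the smaller of the left and right entropies as the outer entropy and adds the inner entropy with the signs given above. The Python sketch below takes all three entropies as inputs, since the computation of the inner entropy is not spelled out here; the function names are hypothetical.

```python
def outer_entropy(h_left: float, h_right: float) -> float:
    """H_outer = min{H_r, H_l}: the less varied side of the candidate."""
    return min(h_left, h_right)


def branch_entropy_score(h_left: float, h_right: float, h_inner: float) -> float:
    """BE = -H_outer + H_inner, following the signs in the formula above."""
    return -outer_entropy(h_left, h_right) + h_inner


# Hypothetical entropy values, for illustration only.
print(branch_entropy_score(h_left=1.5, h_right=2.0, h_inner=0.5))
```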

Normalization 
$$
\mathcal{N}(\mbox{BE}(x)) = \frac{\mbox{BE}(x)-\overline{\mbox{BE}(x)}}{\sigma(\mbox{BE}(x))}
$$
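This is a standard z-score: each BE value is centred on the mean over all candidates and divided by the standard deviation. A Python sketch using the standard library is shown below; whether the population or the sample standard deviation is intended is not stated, so the use of the population form (`statistics.pstdev`) is an assumption.

```python
import statistics


def z_score_normalize(values: list[float]) -> list[float]:
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population, not sample, deviation
    if sd == 0:
        # Assumption: with no spread, every candidate gets 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


print(z_score_normalize([1.0, 2.0, 3.0]))
```

Unlike the min-max normalization used for pmi', the result is not confined to [0, 1], so the two normalized quantities are on different scales when they are combined.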
Final Score
$$
\mbox{Score}(x) = \lambda\mathcal{N}(\mbox{BE}(x)) + (1-\lambda)\mathcal{N}(pmi')
$$
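The final score is a weighted combination of the two normalized quantities. A minimal Python sketch follows; the default weight and the restriction of $\lambda$ to [0, 1] are assumptions, as the source does not state which value of $\lambda$ was used.

```python
def final_score(norm_be: float, norm_pmi: float, lam: float = 0.5) -> float:
    """Score = lam * N(BE) + (1 - lam) * N(pmi')."""
    if not 0.0 <= lam <= 1.0:
        # Assumption: lambda is a mixing weight between 0 and 1.
        raise ValueError("lam must lie in [0, 1]")
    return lam * norm_be + (1.0 - lam) * norm_pmi


# Hypothetical normalized values, for illustration only.
print(final_score(norm_be=2.0, norm_pmi=0.5, lam=0.5))
```

Setting `lam` to 0 ranks candidates by normalized pmi' alone, and setting it to 1 ranks them by normalized BE alone.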

Result

| | Modified PMI+BE | PMI |
|---|---|---|
| Detected words | 1003 | 271 |
| Correctly detected words | 586 | 105 |
| Words under criterion | 390 | 87 |
| Detected words under criterion | 322 | 60 |
| Precision | 58.4% | 38.7% |
| Recall | 82.6% | 70.0% |
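The percentages for the Modified PMI+BE column can be reproduced from the counts if precision is read as correctly detected words over detected words, and recall as detected words under criterion over words under criterion. That reading is an assumption inferred from the numbers; the Python sketch below applies it to that column only.

```python
def precision(correct: int, detected: int) -> float:
    """Share of detected words that are correct."""
    return correct / detected


def recall(found_under_criterion: int, total_under_criterion: int) -> float:
    """Share of words under the criterion that were detected."""
    return found_under_criterion / total_under_criterion


# Counts from the Modified PMI+BE column of the result table.
print(f"precision: {precision(586, 1003):.1%}")
print(f"recall:    {recall(322, 390):.1%}")
```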

Environment

  • Real world data: ~8GB news data
  • Test file: ~700KB drug name data
  • Experiment IDE: RStudio Version 1.1.414
  • OS: Ubuntu 14.04

modified-pmi-be's People

Contributors

schirp