laomagic / bert-utils

This project forked from terrifyzhao/bert-utils

Generate sentence vectors with BERT in one line of code; use BERT for text classification and text similarity computation

License: Apache License 2.0

Python 100.00%

bert-utils

This project further simplifies Google's open-source BERT code to make it easier to generate sentence vectors and to do text classification.


***** New July 1st, 2019 *****

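  • Changed how the sentence-vector graph file is generated, which speeds up startup. The graph is no longer regenerated as a temporary file on every run: the first run of extract_feature.py creates tmp/result/graph, and later runs read that file directly. If you change the contents of args.py, delete tmp/result/graph.
  • Fixed a bug that raised an error when two processes generating sentence vectors were started at the same time.
  • Changed the text-matching dataset to QA_corpus, which is more authoritative than the Ant Financial data.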
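The caching behaviour in the first item can be pictured with a minimal sketch. It is not the project's code: `load_or_build`, `invalidate`, and `build_fn` are hypothetical names, and the write-then-rename step is a design choice of this sketch, not something the notes above confirm. Only the path tmp/result/graph comes from the release notes.

```python
import os
import tempfile


def load_or_build(graph_path, build_fn):
    """Return the cached graph bytes, building and saving them on first use.

    Hypothetical helper: `build_fn` stands in for the expensive step that
    produces the graph. It is called only when `graph_path` does not exist.
    """
    if os.path.exists(graph_path):
        with open(graph_path, "rb") as f:
            return f.read()
    data = build_fn()
    target_dir = os.path.dirname(graph_path) or "."
    os.makedirs(target_dir, exist_ok=True)
    # Write to a temporary file and rename it afterwards, so that another
    # process never reads a half-written graph file.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, graph_path)
    return data


def invalidate(graph_path):
    """Delete the cached graph, for example after editing args.py."""
    if os.path.exists(graph_path):
        os.remove(graph_path)
```

With this shape, a call such as `load_or_build("tmp/result/graph", build_fn)` is slow once and fast afterwards, and `invalidate("tmp/result/graph")` plays the role of deleting the file by hand.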

1. Download the BERT Chinese model

Download link: https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip

2. Add the downloaded model to the current directory
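Steps 1 and 2 can also be scripted. The sketch below uses only the Python standard library; the helper names `download` and `extract` are hypothetical, and it assumes the archive should be unpacked into the current directory, which the steps above do not state explicitly.

```python
import os
import urllib.request
import zipfile

# URL taken from step 1.
MODEL_URL = ("https://storage.googleapis.com/bert_models/2018_11_03/"
             "chinese_L-12_H-768_A-12.zip")


def download(url, zip_path):
    """Download `url` to `zip_path` unless the file is already there."""
    if not os.path.exists(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    return zip_path


def extract(zip_path, dest_dir="."):
    """Unpack the archive into `dest_dir` and return the member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
        return zf.namelist()


# Usage (needs network access, so it is left commented out):
# extract(download(MODEL_URL, "chinese_L-12_H-768_A-12.zip"))
```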

3. Generate sentence vectors

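Generating sentence vectors does not require fine-tuning; the pre-trained model is enough. See the main method in extract_feature.py for reference. Note that the argument must be a list.

The first time sentence vectors are generated, the graph has to be loaded and a new graph file is written under the output_dir path, so the call is slow. Subsequent calls are much faster.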

from bert.extract_feature import BertVector

bv = BertVector()
# The argument must be a list, even for one sentence ("The weather is nice today")
bv.encode(['今天天气不错'])
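Once two sentences have been encoded, one common way to compare their vectors is cosine similarity. The sketch below shows that generic technique, not the project's own similarity model (which is the fine-tuned classifier in similarity.py). It is written in plain Python with toy 3-dimensional vectors, and it assumes each sentence vector can be treated as a flat sequence of floats.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for real sentence vectors.
print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # prints 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # prints 0.0
```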

4. Text classification

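Text classification requires fine-tuning. First prepare the data and put it in the data directory: the training set must be named train.csv, the validation set dev.csv, and the test set test.csv. You must call set_mode before anything else; see the main method in similarity.py for reference.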
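Because the three file names are fixed, it can help to check for them before starting a long fine-tuning run. The helper below is a hypothetical convenience, not part of the project; it assumes the data directory is named data, relative to the working directory.

```python
import os

# File names required by the instructions above.
REQUIRED_FILES = ("train.csv", "dev.csv", "test.csv")


def missing_data_files(data_dir="data"):
    """Return the required file names that are absent from `data_dir`."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]
```

An empty list from `missing_data_files()` means all three files are in place; any returned names are the files still to be added.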

Training:

from similarity import BertSim
import tensorflow as tf

bs = BertSim()
bs.set_mode(tf.estimator.ModeKeys.TRAIN)
bs.train()

Evaluation:

from similarity import BertSim
import tensorflow as tf

bs = BertSim()
bs.set_mode(tf.estimator.ModeKeys.EVAL)
bs.eval()

Testing:

from similarity import BertSim
import tensorflow as tf

bs = BertSim()
bs.set_mode(tf.estimator.ModeKeys.PREDICT)
bs.test()

5. The demo ships with the QA_corpus dataset (address given here). For how this data was generated, see the paper The BQ Corpus.pdf in the attachments.
