rokid / elmo-chinese Goto Github PK



Deep contextualized word representations for Chinese

Python 100.00%

tensorflow word-embedding nlp wordvectors

elmo-chinese's Introduction

ELMo-chinese

Deep contextualized word representations for Chinese.

This repository only outputs context-independent word embeddings.

Dependencies

python3
tensorflow >= 1.10
jieba

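Usage

1. Prepare the data, following the data and vocab directories. pre_data/vocab.py can be used to build the vocabulary. Keep each data file small, otherwise the process runs out of memory.
2. Train the model with train_elmo.py.
3. Export the model weights with dump_weights.py.
4. Change 261 to 262 in options.json.
5. Write the word embeddings to an hdf5 file with usage_token.py.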
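
Step 4 can be scripted instead of edited by hand. The following is a minimal sketch, assuming options.json follows the upstream bilm-tf layout in which the value is stored under `char_cnn` as `n_characters`; check your own file before relying on it. `set_n_characters` is a hypothetical helper, not part of this repository.

```python
import copy
import json


def set_n_characters(options, n_characters=262):
    """Return a copy of the options dict with char_cnn.n_characters replaced.

    The key names are an assumption taken from upstream bilm-tf.
    """
    updated = copy.deepcopy(options)
    updated["char_cnn"]["n_characters"] = n_characters
    return updated


# Toy options dict standing in for the contents of options.json.
training_options = {
    "char_cnn": {"n_characters": 261, "max_characters_per_token": 50}
}
inference_options = set_n_characters(training_options)
print(json.dumps(inference_options, indent=2))
```

To apply this to a real file, load it with `json.load`, pass the dict through the helper, and write the result back with `json.dump`.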

Results

The embeddings look reasonable in a visualization tool, and they raise AUC on a textmatch task by 1-2.

License

MIT

elmo-chinese's Issues

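Why are Chinese characters also encoded into 255 ids?

In the UnicodeCharsVocabulary class, your comment on the _convert_word_to_char_ids function says that Chinese can also be mapped to 255 ids. But there are clearly far more than 255 Chinese characters, so isn't it inappropriate to turn them into ids by UTF decoding?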
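
One way to read this question: the ids are UTF-8 byte values, not character indices, so each id fits in 0-255 and a single Chinese character simply occupies several ids. The sketch below illustrates that idea. The special ids (258, 259, 260) and the word length of 50 are assumptions borrowed from upstream bilm-tf, and `word_to_char_ids` is a hypothetical stand-in, not this repository's function.

```python
MAX_WORD_LENGTH = 50                     # assumed, mirrors upstream bilm-tf
BOW_ID, EOW_ID, PAD_ID = 258, 259, 260   # assumed special ids


def word_to_char_ids(word, max_word_length=MAX_WORD_LENGTH):
    """Encode a word as UTF-8 byte ids framed by begin/end-of-word markers."""
    # Truncating at the byte level can cut a multi-byte character in half.
    encoded = word.encode("utf-8", "ignore")[: max_word_length - 2]
    ids = [PAD_ID] * max_word_length
    ids[0] = BOW_ID
    for position, byte_value in enumerate(encoded, start=1):
        ids[position] = byte_value       # every byte value is in 0..255
    ids[len(encoded) + 1] = EOW_ID
    return ids


ids = word_to_char_ids("中文")
# Each of the two characters takes three bytes in UTF-8.
print(ids[:9])  # [258, 228, 184, 173, 230, 150, 135, 259, 260]
```

Under this scheme the character CNN sees byte sequences, so the number of distinct Chinese characters does not limit the id space.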

Question about usage

Hello!
What is the reason for changing 261 to 262 in options.json? Step 5, usage_token.py, which writes the word embeddings to an hdf5 file, does not seem to use options.json.
Many thanks!
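
For context, upstream bilm-tf documents the same 261 to 262 change, and the arithmetic behind it can be sketched as follows. This is an illustration under the assumption that this repository keeps the upstream id scheme: 256 byte values plus five special ids during training, with every id shifted up by one at inference so that 0 can serve as a mask value. The names below are illustrative, not taken from the repository.

```python
N_BYTE_VALUES = 256
# Assumed special ids, following upstream bilm-tf.
SPECIAL_IDS = {"bos": 256, "eos": 257, "bow": 258, "eow": 259, "pad": 260}

n_characters_training = N_BYTE_VALUES + len(SPECIAL_IDS)


def shift_for_inference(char_ids):
    """Add 1 to every id so that 0 is left free as the mask value."""
    return [char_id + 1 for char_id in char_ids]


shifted = shift_for_inference([258, 228, 184, 173, 259, 260])
# One extra embedding row is needed because the largest id grew by one.
n_characters_inference = max(SPECIAL_IDS.values()) + 1 + 1

print(n_characters_training, n_characters_inference)  # 261 262
```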

76, in <module>
    main(args)
  File "train_elmo.py", line 66, in main
    train(options, data, n_gpus, tf_save_dir, tf_log_dir)
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bilm/training.py", line 766, in train
    allow_soft_placement=True)) as sess:
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bin/testenv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1494, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/data/sde/jiaxin_hu/git_project/ELMo-chinese/bin/testenv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 626, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

Traceback (most recent call last):
  File "train_elmo.py", line 72, in <module>
    main(args)
  File "train_elmo.py", line 62, in main
    train(options, data, n_gpus, tf_save_dir, tf_log_dir)
  File "/mnt/disk2/data/wp/Word2vec/model/ELMo-chinese/bin/bilm/training.py", line 838, in train
    for batch_no, batch in enumerate(data_gen, start=1):
  File "/mnt/disk2/data/wp/Word2vec/model/ELMo-chinese/bin/bilm/data.py", line 469, in iter_batches
    num_steps, max_word_length)
  File "/mnt/disk2/data/wp/Word2vec/model/ELMo-chinese/bin/bilm/data.py", line 311, in _get_batch
    :how_many]
ValueError: could not broadcast input array from shape (18,50) into shape (19,50)

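I eventually traced the error to this part of the _get_batch function in bilm/data.py:

    inputs[i, cur_pos:next_pos] = cur_stream[i][0][:how_many]
    if max_word_length is not None:
        char_inputs[i, cur_pos:next_pos] = cur_stream[i][1][:how_many]

This is the section that raises the error.
I don't understand the logic of _get_batch well enough to fix it myself. Could you give me some pointers? Thanks.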
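
The shapes in the error, (18,50) into (19,50), suggest that for some sentence the character-id array has fewer rows than the token-id array, so the two slices of length how_many no longer match. A minimal diagnostic sketch follows, assuming each item in the data stream is a (token_ids, char_ids) pair as the snippet's indexing implies. `find_length_mismatches` is a hypothetical helper and the data is a toy example; this locates offending sentences but does not identify the root cause.

```python
def find_length_mismatches(stream):
    """Return (index, n_token_ids, n_char_rows) for every mismatched pair."""
    mismatches = []
    for index, (token_ids, char_ids) in enumerate(stream):
        if len(token_ids) != len(char_ids):
            mismatches.append((index, len(token_ids), len(char_ids)))
    return mismatches


# Toy stream: the second sentence has one char-id row fewer than token ids.
good = ([1, 2, 3], [[0] * 50 for _ in range(3)])
bad = (list(range(20)), [[0] * 50 for _ in range(19)])

report = find_length_mismatches([good, bad])
print(report)  # [(1, 20, 19)]
```

Running a check like this over the sentences produced by the dataset iterator, before they reach _get_batch, would show which input lines tokenize to different lengths in the two encodings.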
