
How to use this? about elmo-chinese HOT 11 OPEN

Decalogue commented on June 9, 2024

How do I use this?


Comments (11)

guotong1988 commented on June 9, 2024

This file is auto-generated.


Decalogue commented on June 9, 2024

Thanks a lot! How much GPU memory does running this need, at a minimum? My GPU doesn't have enough.


guotong1988 commented on June 9, 2024

Try reducing the batch_size?
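
(A sketch of where to change it, assuming the fork keeps the layout of allenai/bilm-tf's bin/train_elmo.py; unroll_steps and the sampled-softmax batch also affect peak memory:)

# bin/train_elmo.py -- layout assumed from allenai/bilm-tf
batch_size = 128  # per-GPU batch size; lower this first when memory runs out
n_gpus = 1

options = {
    # ... model settings unchanged ...
    'batch_size': batch_size,
    'unroll_steps': 20,                # shorter unrolls also cut peak memory
    'n_negative_samples_batch': 8192,  # the sampled softmax is another large consumer
}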


gailysun commented on June 9, 2024

Has anyone run this on Chinese without the char-cnn, using only the word-based LSTM?
I trained the bilm on word-segmented data, but after training I hit problems at the usage_token_embedding.py stage: dump_token_embeddings.py fails because that function still builds the char-based vocab = UnicodeCharsVocabulary(vocab_file, max_word_length).
If anyone has gotten it running without the char-cnn, please help. Many thanks.


guotong1988 commented on June 9, 2024

@gailysun Yes, I've run a pure word-level one.


gailysun commented on June 9, 2024

@guotong1988 How big is your vocabulary and what embedding size? Mine is 1.2M+ words with embedding size 512. For the token embeddings I didn't use the dump_token_embedding.py script; I extracted them from the model directly, with the code below:
import json

import h5py
import tensorflow as tf

# LanguageModel lives in bilm/training.py in allenai/bilm-tf
from bilm.training import LanguageModel

option_dir = "./out_models/options.json"
ckpt_dir = "./out_models"
token_embed_file = "./hdf5_model/vocab_embedding.hdf5"

def dump_token_embedding(options, ckpt_dir, token_embed_file, batch_size=1):
    bidirectional = options.get('bidirectional', False)
    char_inputs = 'char_cnn' in options
    if char_inputs:
        max_chars = options['char_cnn']['max_characters_per_token']

    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        with tf.device('/gpu:0'), tf.variable_scope('lm'):
            test_options = dict(options)
            # NOTE: the number of tokens we skip in the last incomplete
            # batch is bounded above by batch_size * unroll_steps
            test_options['batch_size'] = batch_size
            test_options['unroll_steps'] = 1
            model = LanguageModel(test_options, False)
            # we use the "Saver" class to load the variables
            ckpt_file = tf.train.latest_checkpoint(ckpt_dir)
            print("ckpt_file: {0}".format(ckpt_file))
            loader = tf.train.Saver()
            loader.restore(sess, ckpt_file)
        # pull the word-embedding matrix out of the restored model
        embed = sess.run(model.embedding_weights)
        print("embedding shape: {0}".format(embed.shape))
        with h5py.File(token_embed_file, "w") as fout:
            fout.create_dataset("embedding", embed.shape, dtype="float32", data=embed)
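
(For completeness, a minimal driver for the function above, assuming options.json is the file written at training time:)

if __name__ == '__main__':
    with open(option_dir) as f:
        options = json.load(f)
    dump_token_embedding(options, ckpt_dir, token_embed_file)
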
The saved token embedding is 2.4 GB:
2.4G Mar 27 10:38 vocab_embedding.hdf5
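
(That size is consistent: 1.2M words × 512 dims × 4 bytes of float32 is roughly 2.4 GB, which is already past TensorFlow 1.x's 2 GB limit on a single tensor proto; halving either the vocabulary or the embedding size would bring it under the limit.)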

Then I modified usage_token.py, mainly to feed it the saved token embedding (line 41 below):
24 # Our small dataset.
25 raw_context = ['[ 三生 三 世 ] 缘 绾 君 心 作者 : 冷月 撩人 类型 : 衍生 - 言情 - 架空 历史 - 东方 衍生 进度 : 已 完成 风格 : 正剧 视角 : 女主 字数 :254156 字 收藏']
26 raw_question = ['《 [ 三生 三 世 ] 缘 绾 君 心 》 冷月 撩人 _ 【 衍生 小说 | 言情 小说 】']
27 tokenized_context = [sentence.split() for sentence in raw_context]
28 tokenized_question = [sentence.split() for sentence in raw_question]
29
30 # Create the vocabulary file with all unique tokens and
31 # the special <S>, </S> tokens (case sensitive).
32 vocab_file = './my_data/vocab/vocab.txt'
33
34
35 # Location of pretrained LM. Here we use the test fixtures.
36 modeldir = "./out_models"
37 options_file = os.path.join(modeldir, 'options.json')
38 weightdir = "./hdf5_model"
39 weight_file = os.path.join(weightdir, 'weights.hdf5')
40 # token embedding
41 token_embedding_file = "./hdf5_model/vocab_embedding.hdf5"
42
43 with h5py.File(token_embedding_file, "r") as fin:
44     token_embedding_mat = fin["embedding"]
45     print("token_embed shape: {0}".format(token_embedding_mat.shape))
46
47 ## Now we can do inference.
48 # Create a TokenBatcher to map text to token ids.
49 batcher = TokenBatcher(vocab_file)
50
51 # Input placeholders to the biLM.
52 context_token_ids = tf.placeholder('int32', shape=(None, None))
53 question_token_ids = tf.placeholder('int32', shape=(None, None))
54
55 # Build the biLM graph.
56 bilm = BidirectionalLanguageModel(
57     options_file,
58     weight_file,
59     use_character_inputs=False,
60     embedding_weight_file=token_embedding_file
61 )
62 print("finish building bilm")
63
64 # Get ops to compute the LM embeddings.
65 context_embeddings_op = bilm(context_token_ids)
66 question_embeddings_op = bilm(question_token_ids)
67 print("finish get opts")
68
69 # Get an op to compute ELMo (weighted average of the internal biLM layers)
70 # Our SQuAD model includes ELMo at both the input and output layers
71 # of the task GRU, so we need 4x ELMo representations for the question
72 # and context at each of the input and output.
73 # We use the same ELMo weights for both the question and context
74 # at each of the input and output.
75 elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
76 print(elmo_context_input)
77 with tf.variable_scope('', reuse=True):
78     # the reuse=True scope reuses weights from the context for the question
79     elmo_question_input = weight_layers(
80         'input', question_embeddings_op, l2_coef=0.0
81     )
82     elmo_question_embedding = weight_layers(
83         'embedding', question_embeddings_op, l2_coef=0.0
84     )
85
86 elmo_context_output = weight_layers(
87     'output', context_embeddings_op, l2_coef=0.0
88 )

Everything else is the same as the original code. Running it fails with:

File "usage_token.py", line 65, in <module>
    context_embeddings_op = bilm(context_token_ids)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 105, in __call__
    max_batch_size=self._max_batch_size)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 288, in __init__
    self._build()
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 294, in _build
    self._build_word_embeddings()
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 481, in _build_word_embeddings
    dtype=DTYPE,
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 277, in custom_getter
    return getter(name, *args, **kwargs)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 278, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 676, in convert_to_tensor
    as_ref=False)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 741, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 437, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Have you run into this? It fails here with the large token embedding. Is there a good way around it? Many thanks.
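
(A minimal generic sketch of the usual TF1 workaround for the 2 GB tensor-proto limit: create the variable without baking the matrix into the graph as a constant, then assign it through a placeholder at run time. This is a generic pattern, not bilm's code; _build_word_embeddings in bilm/model.py would need patching along these lines:)

import h5py
import tensorflow as tf

# path matches the snippet above
with h5py.File("./hdf5_model/vocab_embedding.hdf5", "r") as fin:
    embed = fin["embedding"][...]  # load the full (n_tokens, embed_dim) matrix

n_tokens, embed_dim = embed.shape
embedding_weights = tf.get_variable(
    "embedding", [n_tokens, embed_dim], dtype=tf.float32, trainable=False)
embedding_ph = tf.placeholder(tf.float32, [n_tokens, embed_dim])
embedding_init = embedding_weights.assign(embedding_ph)

with tf.Session() as sess:
    # values passed via feed_dict are never serialized into the GraphDef,
    # so they are not subject to the 2 GB protobuf limit
    sess.run(embedding_init, feed_dict={embedding_ph: embed})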


guotong1988 commented on June 9, 2024

@gailysun I use character-level input; the vocabulary is only 30K.


guotong1988 commented on June 9, 2024

@gailysun You really should format what you pasted.


gailysun commented on June 9, 2024

@guotong1988 Sorry, the formatting was a mess. So with your 30K vocabulary the author's code runs fine as-is?
My problem is that after word segmentation the vocabulary embedding exceeds 2 GB and breaks in build_word_embedding; I don't know how to get such a large embedding into the graph. Any suggestions?

Traceback (most recent call last):
  File "usage_token.py", line 65, in <module>
    context_embeddings_op = bilm(context_token_ids)
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 105, in __call__
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 288, in __init__
    self._build()
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 294, in _build
    self._build_word_embeddings()
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 481, in _build_word_embeddings
ValueError: Cannot create a tensor proto whose content is larger than 2GB.


guotong1988 commented on June 9, 2024

@gailysun You should keep only the high-frequency words.
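
(A minimal sketch of that trimming step, assuming a whitespace-segmented corpus and the bilm vocab layout with the special tokens first; the file names are illustrative:)

from collections import Counter

counts = Counter()
with open("corpus_segmented.txt") as f:   # illustrative path
    for line in f:
        counts.update(line.split())

min_count = 30
with open("vocab_trimmed.txt", "w") as out:
    # bilm vocab files start with the special tokens
    out.write("<S>\n</S>\n<UNK>\n")
    for word, c in counts.most_common():
        if c >= min_count:
            out.write(word + "\n")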


gailysun commented on June 9, 2024

@guotong1988 I already cut the vocabulary at a frequency threshold of 30. I'll try making the hidden size smaller. Thanks!
