
How to use this? about elmo-chinese HOT 11 OPEN

Decalogue commented on June 9, 2024

How do I use this?


Comments (11)

guotong1988 commented on June 9, 2024

This file is auto-generated.


Decalogue commented on June 9, 2024

Thanks a lot! How much GPU memory does running this need, at a minimum? My GPU doesn't have enough.


guotong1988 commented on June 9, 2024

Try reducing the batch_size?
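
(A sketch of where to change it, assuming the fork keeps the layout of allenai/bilm-tf's bin/train_elmo.py; unroll_steps and the sampled-softmax batch also affect peak memory:)

# bin/train_elmo.py -- layout assumed from allenai/bilm-tf
batch_size = 128  # per-GPU batch size; lower this first when memory runs out
n_gpus = 1

options = {
    # ... model settings unchanged ...
    'batch_size': batch_size,
    'unroll_steps': 20,                # shorter unrolls also cut peak memory
    'n_negative_samples_batch': 8192,  # the sampled softmax is another large consumer
}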


gailysun commented on June 9, 2024

Has anyone run this on Chinese without the char-cnn, using only the word-based LSTM?
I trained the bilm on word-segmented data, but after training I hit problems at the usage_token_embedding.py stage: dump_token_embeddings.py fails because that function still builds the char-based vocab = UnicodeCharsVocabulary(vocab_file, max_word_length).
If anyone has gotten it running without the char-cnn, please help. Many thanks.


guotong1988 commented on June 9, 2024

@gailysun Yes, I've run a pure word-level one.


gailysun commented on June 9, 2024

@guotong1988 How big is your vocabulary and what embedding size? Mine is 1.2M+ words with embedding size 512. For the token embeddings I didn't use the dump_token_embedding.py script; I extracted them from the model directly, with the code below:
import json

import h5py
import tensorflow as tf

# LanguageModel lives in bilm/training.py in allenai/bilm-tf
from bilm.training import LanguageModel

option_dir = "./out_models/options.json"
ckpt_dir = "./out_models"
token_embed_file = "./hdf5_model/vocab_embedding.hdf5"

def dump_token_embedding(options, ckpt_dir, token_embed_file, batch_size=1):
    bidirectional = options.get('bidirectional', False)
    char_inputs = 'char_cnn' in options
    if char_inputs:
        max_chars = options['char_cnn']['max_characters_per_token']

    config = tf.ConfigProto(allow_soft_placement=True)
    with tf.Session(config=config) as sess:
        with tf.device('/gpu:0'), tf.variable_scope('lm'):
            test_options = dict(options)
            # NOTE: the number of tokens we skip in the last incomplete
            # batch is bounded above by batch_size * unroll_steps
            test_options['batch_size'] = batch_size
            test_options['unroll_steps'] = 1
            model = LanguageModel(test_options, False)
            # we use the "Saver" class to load the variables
            ckpt_file = tf.train.latest_checkpoint(ckpt_dir)
            print("ckpt_file: {0}".format(ckpt_file))
            loader = tf.train.Saver()
            loader.restore(sess, ckpt_file)
        # pull the word-embedding matrix out of the restored model
        embed = sess.run(model.embedding_weights)
        print("embedding shape: {0}".format(embed.shape))
        with h5py.File(token_embed_file, "w") as fout:
            fout.create_dataset("embedding", embed.shape, dtype="float32", data=embed)
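
(For completeness, a minimal driver for the function above, assuming options.json is the file written at training time:)

if __name__ == '__main__':
    with open(option_dir) as f:
        options = json.load(f)
    dump_token_embedding(options, ckpt_dir, token_embed_file)
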
The saved token embedding is 2.4 GB:
2.4G Mar 27 10:38 vocab_embedding.hdf5
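
(That size is consistent: 1.2M words × 512 dims × 4 bytes of float32 is roughly 2.4 GB, which is already past TensorFlow 1.x's 2 GB limit on a single tensor proto; halving either the vocabulary or the embedding size would bring it under the limit.)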

Then I modified usage_token.py, mainly to feed it the saved token embedding (line 41 below):
24 # Our small dataset.
25 raw_context = ['[ 三生 三 世 ] 缘 绾 君 心 作者 : 冷月 撩人 类型 : 衍生 - 言情 - 架空 历史 - 东方 衍生 进度 : 已 完成 风格 : 正剧 视角 : 女主 字数 :254156 字 收藏']
26 raw_question = ['《 [ 三生 三 世 ] 缘 绾 君 心 》 冷月 撩人 _ 【 衍生 小说 | 言情 小说 】']
27 tokenized_context = [sentence.split() for sentence in raw_context]
28 tokenized_question = [sentence.split() for sentence in raw_question]
29
30 # Create the vocabulary file with all unique tokens and
31 # the special <S>, </S> tokens (case sensitive).
32 vocab_file = './my_data/vocab/vocab.txt'
33
34
35 # Location of pretrained LM. Here we use the test fixtures.
36 modeldir = "./out_models"
37 options_file = os.path.join(modeldir, 'options.json')
38 weightdir = "./hdf5_model"
39 weight_file = os.path.join(weightdir, 'weights.hdf5')
40 # token embedding
41 token_embedding_file = "./hdf5_model/vocab_embedding.hdf5"
42
43 with h5py.File(token_embedding_file, "r") as fin:
44     token_embedding_mat = fin["embedding"]
45     print("token_embed shape: {0}".format(token_embedding_mat.shape))
46
47 ## Now we can do inference.
48 # Create a TokenBatcher to map text to token ids.
49 batcher = TokenBatcher(vocab_file)
50
51 # Input placeholders to the biLM.
52 context_token_ids = tf.placeholder('int32', shape=(None, None))
53 question_token_ids = tf.placeholder('int32', shape=(None, None))
54
55 # Build the biLM graph.
56 bilm = BidirectionalLanguageModel(
57     options_file,
58     weight_file,
59     use_character_inputs=False,
60     embedding_weight_file=token_embedding_file
61 )
62 print("finish building bilm")
63
64 # Get ops to compute the LM embeddings.
65 context_embeddings_op = bilm(context_token_ids)
66 question_embeddings_op = bilm(question_token_ids)
67 print("finish get opts")
68
69 # Get an op to compute ELMo (weighted average of the internal biLM layers)
70 # Our SQuAD model includes ELMo at both the input and output layers
71 # of the task GRU, so we need 4x ELMo representations for the question
72 # and context at each of the input and output.
73 # We use the same ELMo weights for both the question and context
74 # at each of the input and output.
75 elmo_context_input = weight_layers('input', context_embeddings_op, l2_coef=0.0)
76 print(elmo_context_input)
77 with tf.variable_scope('', reuse=True):
78     # the reuse=True scope reuses weights from the context for the question
79     elmo_question_input = weight_layers(
80         'input', question_embeddings_op, l2_coef=0.0
81     )
82     elmo_question_embedding = weight_layers(
83         'embedding', question_embeddings_op, l2_coef=0.0
84     )
85
86 elmo_context_output = weight_layers(
87     'output', context_embeddings_op, l2_coef=0.0
88 )

Everything else is the same as the original code. Running it fails with:

File "usage_token.py", line 65, in <module>
    context_embeddings_op = bilm(context_token_ids)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 105, in __call__
    max_batch_size=self._max_batch_size)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 288, in __init__
    self._build()
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 294, in _build
    self._build_word_embeddings()
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 481, in _build_word_embeddings
    dtype=DTYPE,
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 360, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 277, in custom_getter
    return getter(name, *args, **kwargs)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/variables.py", line 278, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 676, in convert_to_tensor
    as_ref=False)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 741, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/data1/gailsun/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/tensor_util.py", line 437, in make_tensor_proto
    "Cannot create a tensor proto whose content is larger than 2GB.")
ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Have you run into this? It fails here with the large token embedding. Is there a good way around it? Many thanks.
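
(A minimal generic sketch of the usual TF1 workaround for the 2 GB tensor-proto limit: create the variable without baking the matrix into the graph as a constant, then assign it through a placeholder at run time. This is a generic pattern, not bilm's code; _build_word_embeddings in bilm/model.py would need patching along these lines:)

import h5py
import tensorflow as tf

# path matches the snippet above
with h5py.File("./hdf5_model/vocab_embedding.hdf5", "r") as fin:
    embed = fin["embedding"][...]  # load the full (n_tokens, embed_dim) matrix

n_tokens, embed_dim = embed.shape
embedding_weights = tf.get_variable(
    "embedding", [n_tokens, embed_dim], dtype=tf.float32, trainable=False)
embedding_ph = tf.placeholder(tf.float32, [n_tokens, embed_dim])
embedding_init = embedding_weights.assign(embedding_ph)

with tf.Session() as sess:
    # values passed via feed_dict are never serialized into the GraphDef,
    # so they are not subject to the 2 GB protobuf limit
    sess.run(embedding_init, feed_dict={embedding_ph: embed})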


guotong1988 commented on June 9, 2024

@gailysun I use character-level input; the vocabulary is only 30K.


guotong1988 commented on June 9, 2024

@gailysun You really should format what you pasted.


gailysun commented on June 9, 2024

@guotong1988 Sorry, the formatting was a mess. So with your 30K vocabulary the author's code runs fine as-is?
My problem is that after word segmentation the vocabulary embedding exceeds 2 GB and breaks in build_word_embedding; I don't know how to get such a large embedding into the graph. Any suggestions?

Traceback (most recent call last):
  File "usage_token.py", line 65, in <module>
    context_embeddings_op = bilm(context_token_ids)
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 105, in __call__
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 288, in __init__
    self._build()
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 294, in _build
    self._build_word_embeddings()
  File "/data1/gailsun/transfer_learning/bilm-tf-revised/bilm/model.py", line 481, in _build_word_embeddings
ValueError: Cannot create a tensor proto whose content is larger than 2GB.


guotong1988 commented on June 9, 2024

@gailysun You should keep only the high-frequency words.
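
(A minimal sketch of that trimming step, assuming a whitespace-segmented corpus and the bilm vocab layout with the special tokens first; the file names are illustrative:)

from collections import Counter

counts = Counter()
with open("corpus_segmented.txt") as f:   # illustrative path
    for line in f:
        counts.update(line.split())

min_count = 30
with open("vocab_trimmed.txt", "w") as out:
    # bilm vocab files start with the special tokens
    out.write("<S>\n</S>\n<UNK>\n")
    for word, c in counts.most_common():
        if c >= min_count:
            out.write(word + "\n")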


gailysun commented on June 9, 2024

@guotong1988 I already cut the vocabulary at a frequency threshold of 30. I'll try making the hidden size smaller. Thanks!
