
CodeXGLUE

Introduction

According to Evans Data Corporation, there were 23.9 million professional developers in 2019, and the population is expected to reach 28.7 million in 2024. With the growing population of developers, code intelligence, which aims to leverage AI to help software developers improve the productivity of the development process, is growing increasingly important in both the software engineering and artificial intelligence communities.

When developers want to find code written by others with the same intent, code search systems can help automatically retrieve semantically relevant code given natural language queries. When developers are confused about what to write next, code completion systems can help by automatically completing the following tokens given the context of the edits being made. When developers want to implement Java code with the same functionality as an existing body of Python code, code-to-code translation systems can help translate from one programming language (Python) to another (Java).

Code intelligence therefore plays a vital role in Microsoft’s mission to empower developers. As highlighted by Microsoft CEO Satya Nadella at Microsoft Build 2020, the role of developers is more important than ever. GitHub is increasingly the default home for source code, and Visual Studio Code is the most popular code editor. Microsoft offers the most complete toolchain for developers, bringing together the best of GitHub, Visual Studio, and Microsoft Azure to help developers to go from idea to code and code to cloud.

Recent years have seen a surge in applying statistical models, including neural nets, to code intelligence tasks. Most recently, inspired by the great success of large pre-trained models like BERT and GPT in natural language processing (NLP), pre-trained models learned from big programming language data have emerged. These models, including IntelliCode and CodeBERT, obtain further improvements on code understanding and generation problems. However, the area of code intelligence lacks a benchmark suite that covers a wide range of tasks. We have seen that a diversified benchmark dataset is significant for the growth of an area of applied AI research, like ImageNet for computer vision and GLUE for NLP.

To address this, researchers from Microsoft Research Asia, Developer Division, and Bing introduce CodeXGLUE, a benchmark dataset and open challenge for code intelligence. It includes a collection of code intelligence tasks and a platform for model evaluation and comparison. CodeXGLUE stands for General Language Understanding Evaluation benchmark for CODE. It includes 14 datasets for 10 diversified code intelligence tasks covering the following scenarios:

  • code-code (clone detection, defect detection, cloze test, code completion, code repair, and code-to-code translation)
  • text-code (natural language code search, text-to-code generation)
  • code-text (code summarization)
  • text-text (documentation translation)

A brief summary of CodeXGLUE is given below, including tasks, datasets, languages, sizes of the various splits, baseline systems, providers, and short definitions of each task. Datasets highlighted in BLUE are newly introduced.

To make it easy for participants, we provide three baseline models to support these tasks, including a BERT-style pre-trained model (in this case, CodeBERT), which is good at understanding problems. We also include a GPT-style pre-trained model, which we call CodeGPT, to support completion and generation problems. Finally, we include an Encoder-Decoder framework that supports sequence-to-sequence generation problems.

Three pipelines (CodeBERT, CodeGPT, and Encoder-Decoder) are provided to make it easy for participants.
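As a minimal sketch, the two pre-trained baselines can be loaded through the Hugging Face transformers library (the checkpoint names below are the published hub IDs; the Encoder-Decoder pipeline builds a sequence-to-sequence model around the CodeBERT encoder):

# Minimal sketch: loading the pre-trained baselines from the Hugging Face hub.
# Requires: pip install transformers
from transformers import AutoTokenizer, AutoModel, GPT2LMHeadModel

# BERT-style encoder, suited to understanding tasks (clone detection, search, ...).
codebert_tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
codebert = AutoModel.from_pretrained("microsoft/codebert-base")

# GPT-style decoder, suited to completion and generation tasks.
codegpt_tok = AutoTokenizer.from_pretrained("microsoft/CodeGPT-small-java-adaptedGPT2")
codegpt = GPT2LMHeadModel.from_pretrained("microsoft/CodeGPT-small-java-adaptedGPT2")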

With CodeXGLUE, we seek to support the development of models that can be applied to various code intelligence problems, with the goal of increasing the productivity of software developers. We encourage researchers to participate in the open challenges to continue progress in code intelligence. Moving forward, we’ll extend CodeXGLUE to more programming languages and downstream tasks while continuing to push forward pre-trained models by exploring new model structures, introducing new pre-training tasks, using different types of data, and more.

Relevant Links

Leaderboard | CodeXGLUE paper | Hugging Face Datasets

Tasks and Datasets

Below, we elaborate on the definition of each task and on the newly introduced datasets highlighted in the table above.

  1. Clone detection (BigCloneBench, POJ-104). A model is tasked with measuring the semantic similarity between two pieces of code. Two existing datasets are included: one for binary classification over code pairs and the other for retrieving semantically similar code given code as the query.
  2. Defect detection (Devign). A model is tasked with identifying whether a body of source code contains defects that may be used to attack software systems, such as resource leaks, use-after-free vulnerabilities, and DoS attacks. An existing dataset is included.
  3. Cloze test (CT-all, CT-max/min). A model is tasked with predicting the masked token from code, formulated as a multi-choice classification problem. The two datasets are newly created, one with candidates from the (filtered) vocabulary and the other with candidates among “max” and “min” (see the sketch after this list).
  4. Code completion (PY150, GitHub Java Corpus). A model is tasked with predicting the following tokens given a code context. Both token-level and line-level completion are covered. The token-level task is analogous to language modeling, and we include two influential datasets here. Line-level datasets are newly created to test a model’s ability to autocomplete a line.
  5. Code translation (CodeTrans). A model is tasked with translating the code in one programming language to the code in another one. A dataset between Java and C# is newly created.
  6. Code search (CodeSearchNet, AdvTest; CodeSearchNet, WebQueryTest). A model is given the task of measuring the semantic similarity between text and code. In the retrieval scenario, a test set is newly created in which function names and variables are replaced to test the generalization ability of a model. In the text-code classification scenario, a test set in which natural language queries come from Bing query logs is created to test on real user queries.
  7. Code repair (Bugs2Fix). A model is tasked with trying to automatically refine the code, which could be buggy or complex. An existing dataset is included.
  8. Text-to-code generation (CONCODE). A model is given the task of generating code from a natural language description. An existing dataset is included.
  9. Code summarization (CodeSearchNet). A model is given the task of generating natural language comments for code. Existing datasets are included.
  10. Documentation translation (Microsoft Docs). A model is given the task to translate code documentation between human languages. A dataset, focusing on low-resource multilingual translation, is newly created.
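To make the cloze-test formulation concrete, here is a minimal sketch that scores the “max”/“min” candidates for a masked token with CodeBERT (an illustration only; the official evaluator may score candidates differently):

# Cloze-test sketch: score "max" vs "min" for a masked token with CodeBERT.
import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tok = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base")

code = f"def clamp(x, lo, hi): return {tok.mask_token}(lo, min(x, hi))"
inputs = tok(code, return_tensors="pt")
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

for cand in ["max", "min"]:
    # Approximation: score only the first sub-token of each candidate.
    cand_id = tok.convert_tokens_to_ids(tok.tokenize(" " + cand))[0]
    print(cand, logits[cand_id].item())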

Submission Instructions

Once you have built a model that meets your expectations in evaluation on the dev set, you can submit your test results for official evaluation on the test set. To ensure the integrity of the official test results, we do not release the correct answers for the test set to the public. To submit your model for official evaluation on the test set, follow the steps below:

  1. Generate your prediction output for the dev set.
  2. Run the official evaluation methodologies found in the task-specific git repo and verify your system runs as expected.
  3. Generate your prediction output for the test set.
  4. Submit the following information by emailing [email protected].

Your email should include:

  1. Prediction results on test set. [Required]
  2. Prediction results on dev set. [Recommended]
  3. Individual/Team Name: Name of the individual or the team to appear in the leaderboard. [Required]
  4. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard. [Optional]
  5. Model code: Training code for the model. [Recommended]
  6. Model information: Name of the model/technique to appear in the leaderboard. [Required]
  7. Paper Information: Name, Citation, URL of the paper if model is from a published work to appear in the leaderboard. [Optional]

To avoid "P-hacking" we discourage too many submissions from the same group in a short period of time.

Training and Inference Time Cost

We calculate the training and inference time cost for each dataset with 2 P100 GPUs. Results are shared in the following table.

LICENSE

Our code follows the MIT License.

Our datasets follow the Computational Use of Data Agreement (C-UDA) License.

Reference

If you use this code or CodeXGLUE, please consider citing us.

@article{DBLP:journals/corr/abs-2102-04664,
  author    = {Shuai Lu and
               Daya Guo and
               Shuo Ren and
               Junjie Huang and
               Alexey Svyatkovskiy and
               Ambrosio Blanco and
               Colin B. Clement and
               Dawn Drain and
               Daxin Jiang and
               Duyu Tang and
               Ge Li and
               Lidong Zhou and
               Linjun Shou and
               Long Zhou and
               Michele Tufano and
               Ming Gong and
               Ming Zhou and
               Nan Duan and
               Neel Sundaresan and
               Shao Kun Deng and
               Shengyu Fu and
               Shujie Liu},
  title     = {CodeXGLUE: {A} Machine Learning Benchmark Dataset for Code Understanding
               and Generation},
  journal   = {CoRR},
  volume    = {abs/2102.04664},
  year      = {2021}
}

This research was conducted by Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Daya Guo, Duyu Tang, Junjie Huang, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shuai Lu, Shujie Liu, and Shuo Ren.


Issues

Answer set for WebQuery (codesearch)

Hi,
I failed to find the answer set for WebQueryTest. Could you please let me access it? I already got new SOTA results on AdvTest (MRR 32.88) and would like to see the performance on WebQueryTest.
Thanks,

The loss function of code search

I am trying to write about the code search task, but I did not understand the following part:

         scores=(nl_vec[:,None,:]*code_vec[None,:,:]).sum(-1)  
         loss_fct = CrossEntropyLoss()  
         loss = loss_fct(scores, torch.arange(bs, device=scores.device))  

In:

scores=(nl_vec[:,None,:]*code_vec[None,:,:]).sum(-1)

Can you please explain what type of similarity this is? Is this cosine similarity? And why are you comparing the result with a ranking?
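For reference, here is my reading of that snippet as a self-contained sketch (a bs-by-bs matrix of dot products, where row i's correct "class" is column i, i.e. in-batch negatives):

# Self-contained sketch of the quoted loss: dot-product (not cosine) similarity
# between every query and every code snippet in the batch; the diagonal entries
# are the matching pairs, hence the torch.arange(bs) targets.
import torch
from torch.nn import CrossEntropyLoss

bs, dim = 4, 8
nl_vec = torch.randn(bs, dim)    # natural-language query embeddings
code_vec = torch.randn(bs, dim)  # code embeddings, aligned pairwise with nl_vec

scores = (nl_vec[:, None, :] * code_vec[None, :, :]).sum(-1)  # shape (bs, bs)
loss = CrossEntropyLoss()(scores, torch.arange(bs))
print(scores.shape, loss.item())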

Thanks in advance.

Problem about calculating the MAP score in CodeXGLUE/Code-Code/Clone-detection-POJ-104/

I am following your work CodeXGLUE/Code-Code/Clone-detection-POJ-104/

I notice that the MAP scores are calculated as

def calculate_scores(answers, predictions):
    scores = []
    for key in answers:
        if key not in predictions:
            logging.error("Missing prediction for index {}.".format(key))
            sys.exit()
        a = set(answers[key])
        p = set(predictions[key])
        if len(a) != len(p):
            logging.error("Mismatch the number of answers for index {}.".format(key))
            sys.exit()
        scores.append(len(set(a & p)) / len(a))
    result = {}
    result['MAP'] = round(np.mean(scores), 4)
    return result

It seems that the function calculates the mean of precision@500, instead of Mean Average Precision.


https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision

So I want to know: is the current implementation incorrect?
Or, if it is correct, I would appreciate it if you could provide some materials and references to help me understand the source code.
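To make the difference concrete, here is a toy sketch of Average Precision versus the precision-style score computed above (my own illustration, not the official evaluator):

# Toy sketch: Average Precision is order-sensitive; set-overlap precision is not.
def average_precision(ranked, relevant):
    # AP: mean of precision@i over the ranks i where a relevant item appears.
    hits, precisions = 0, []
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

ranked = ["a", "x", "b", "y", "c"]                   # retrieved list, best first
relevant = {"a", "b", "c"}
print(average_precision(ranked, relevant))            # ~0.7556 (order-sensitive)
print(len(set(ranked) & relevant) / len(relevant))    # 1.0 (order-insensitive)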

Thank you.

Text-to-Text: How to use the right fine-tuned or pre-trained model to translate English annotations into Chinese annotations?

Hi, I have two questions to ask you; please help me figure out how to solve them, I would really appreciate your help!
(1) I used the xlm-roberta-base model provided on the Hugging Face website (https://huggingface.co/xlm-roberta-base) to translate English code comments into Chinese code comments, but the results (in test_0_output.txt) are not good, and I don't know why it turned out like this. Does this model need to be fine-tuned with train.all.src and train.all.tgt, or should I not use this model? Please tell me which model to use and whether I need to fine-tune it, and provide a download link.

(2) I am using the CodeXGLUE project from this GitHub repository, specifically the directory code (run.py). How do I translate Chinese annotations into English annotations? I did not see parameters in the code that specify translating Chinese into English; is the specific language direction determined by the data provided in the test set?

CodeSearch Pretrained Models

Hi y'all, really enjoy the amazing work that went into this repository and uploading the pretrained models for many of the tasks. I was wondering, do you have the code search models available for people to use and if so where?

Additionally, I noticed that some of the models are available via HuggingFace's hub, but not all of them, such as the code search or clone detection models. Are there any plans to move more of these models to the HF hub? If not, or if more work is needed, I'd be extremely excited to help out with a PR that makes it easier to put the models in the hub and pull them down 🤓.

Baseline in Text-to-Code task

Text-to-Code reports 2 results (i.e. CodeGPT, and CodeGPT-adapted). Can you give me details (or refer me to the relevant description) of these two models?

The information that I want to know about these models

  1. Are these the original GPT model adapted for code? Are there any architectural changes from GPT?
  2. How are these models pretrained?
  3. What is the difference between CodeGPT and CodeGPT-adapted?

Thanks in advance.

How does CodeGPT's BPE tokenizer process whitespace in the code completion task?

Hi there,

I'm trying to implement CodeGPT on the code completion task. I tried the BPE tokenizer, but I found that BPE separates the raw source code by whitespace:
https://github.com/rsennrich/subword-nmt/blob/823c880e4bfc4fce5359b8ea87cc14fcf8a60dc7/subword_nmt/get_vocab.py#L40

In source code, there are more separators, such as ., ;, etc.

So my question is: did you consider whitespace as the only separator in CodeGPT? If not, is whitespace regarded as a single token, just like <s> and <EOL>?
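For what it's worth, the CodeGPT checkpoints on the hub appear to ship a GPT-2-style byte-level BPE, whose pre-tokenizer also splits around punctuation such as . and ;, not only whitespace; a quick sketch to inspect this:

# Sketch: inspect how the CodeGPT tokenizer splits code. 'Ġ' marks a leading
# space in GPT-2-style byte-level BPE; punctuation is split without whitespace.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("microsoft/CodeGPT-small-java-adaptedGPT2")
print(tok.tokenize("list.sort(); return list;"))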

Errors in computing Code Bleu with some small code instances

When I try to run the script for computing code bleu with one instance of code only, I sometimes get stuck in a ZeroDivisionError; for example the following one:

target code: private Map<String, ArrayList<Order>> getBuyOrders() { return buyOrders; }
predicted code: private HashMap<String, ArrayList<Order>> getBuyOrders() { return buyOrders; }

trying to run : python calc_code_bleu.py --refs target.txt --hyp prediction.txt --lang java --params 0.25,0.25,0.25,0.25

I get this :

Traceback (most recent call last):
  File "calc_code_bleu.py", line 64, in <module>
    dataflow_match_score = dataflow_match.corpus_dataflow_match(references, hypothesis, args.lang)
  File "/content/CodeXGLUE/Code-Code/code-to-code-trans/evaluator/CodeBLEU/dataflow_match.py", line 58, in corpus_dataflow_match
    score = match_count / total_count
ZeroDivisionError: division by zero

Environment:

sentencepiece==0.1.94
torch==1.4.0 
transformers==3.5.0 
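For now, a guard like the following avoids the crash on my side (just a sketch; it assumes an empty dataflow should contribute a zero score, which may not be the intended semantics):

# Hypothetical guard sketch for dataflow_match.py: avoid dividing by zero when
# no dataflow edges were extracted from the reference.
def safe_dataflow_score(match_count, total_count):
    # Return 0.0 instead of raising ZeroDivisionError on empty dataflow.
    return match_count / total_count if total_count else 0.0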

codecompletion model and input

Hello, thanks for your work. I have two questions.
For line-level code completion, I am trying to use the fine-tuned model (CodeGPT-small-py-adaptedGPT2) to test on the provided data in the evaluator folder, but I cannot get the same results as yours. Could you provide more information about how to use the model for code generation?
Another question is about the input for the token-level completion. In the evaluator folder, the file (answers.txt) can be regarded as both input and ground truth, right? If so, do we use the first token to predict the second token, the first two tokens to predict the third token, and so on (as in the sketch below)?
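In other words, is the evaluation equivalent to the following teacher-forced loop (a sketch of my understanding, not the official script)?

# Sketch: teacher-forced token-level completion accuracy. At step i the model
# sees the gold tokens[:i] as context and must predict tokens[i].
def token_level_accuracy(predict_next, tokens):
    # predict_next is a hypothetical one-step predictor: context -> next token.
    correct = 0
    for i in range(1, len(tokens)):
        correct += (predict_next(tokens[:i]) == tokens[i])
    return correct / (len(tokens) - 1)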

Creating own Data set for code-refinement

Hi Team,

I have my own dataset of buggy/fixed pairs.
I want to understand what pre-processing steps were performed on functions.

To be specific, it is mentioned in the Readme:
"All the function and variable names are normalized"

Do you already have a script to create the same .fixed / .buggy format?
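For context, here is the kind of normalization I imagine, as a rough regex-based sketch (the real preprocessing is presumably parser-based, so this is only an approximation):

# Rough sketch of identifier normalization for .buggy/.fixed files. A faithful
# reproduction would use a Java lexer/parser instead of a regex.
import re

JAVA_KEYWORDS = {"public", "private", "static", "void", "int", "return",
                 "if", "else", "for", "while", "new", "class"}

def normalize(code):
    mapping = {}
    def rename(m):
        name = m.group(0)
        if name in JAVA_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = "VAR_{}".format(len(mapping) + 1)
        return mapping[name]
    return re.sub(r"[A-Za-z_][A-Za-z_0-9]*", rename, code)

print(normalize("int total = a + b; return total;"))
# -> int VAR_1 = VAR_2 + VAR_3; return VAR_1;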

Details of CodeGPT pre-training on CodeSearchNet

Hi @celbree I'm trying to replicate CodeGPT java (adapted, so starting from openai gpt-2 checkpoint) pre-training on CodeSearchNet and would like to clarify some aspects:

  1. Did you use TextDataset class from CodeXGLUE/Code-Code/CodeCompletion-token/code/dataset.py and the run_lm.py script as described here without the --not_pretrain argument for pre-training?
  2. CodeSearchNet has bimodal (~0.5M NL-PL pairs) and unimodal (~1.1M PL-only examples) data. Did you use only PL from these two sources, or did you use the NL-PL pairs too? From the TextDataset class it seems you've used all 1.6M samples, but only the PL part; just wanted to confirm. :)
  3. Did you split the java_dedupe_definitions_v2.pkl file from CodeSearchNet into train, test, val parts or use the entire set?
  4. Are <s> </s> <EOL> tokens used in pre-training? If yes, how? Specifically:
  5. If the answer to 3 is yes and you used NL-PL, how was the pre-processing done? (One way could be <s> + nl_string + <EOL> + pl_string + </s>)
  6. How was the preprocessing done for PL only? One way could be <s> + pl_string + </s>, where pl_string is processed so it contains <EOL> at line breaks?
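To make questions 5 and 6 concrete, here is a sketch of the two layouts I am guessing at (purely my assumption about the preprocessing):

# Sketch of the two candidate pre-training layouts guessed at in questions 5/6.
def pl_with_eol(pl):
    # Replace line breaks with the <EOL> token, as guessed in question 6.
    return " <EOL> ".join(line.strip() for line in pl.splitlines() if line.strip())

def preprocess_nl_pl(nl, pl):
    # Layout guessed in question 5: <s> + nl + <EOL> + pl + </s>.
    return "<s> {} <EOL> {} </s>".format(nl, pl_with_eol(pl))

print(preprocess_nl_pl("Adds two ints.", "int add(int a, int b) {\n  return a + b;\n}"))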

Question regarding "model_type" parameter in finetuning code2Text task

Hi Team,

I had a small confusion, its mentioned in read me that finetune pipeline is for CodeBERT model

"We also provide a pipeline that fine-tunes CodeBERT on this task"

However, when we finetune we provide model_type as "roberta"

python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model

Can we pass codebert as well for fine-tuning? Or does "roberta" refer to CodeBERT itself?

Inconsistent DFG among different types of literals

I noticed that the DFG implementation for C# treats string literals differently from integer literals. Here are the outputs of the function get_data_flow from code-to-code-trans/evaluator/CodeBLEU/dataflow.py on two lines of code:

int val= 1;
Output: [('val', 1, 'comesFrom', ['1'], [3]), ('1', 3, 'comesFrom', [], [])]

string val= "a";
Output: []

The correct output for the string case should instead be, [('val', 1, 'comesFrom', ['"a"'], [3]), ('"a"', 3, 'comesFrom', [], [])], to be consistent with the output for the integer case.

I think the problem might be coming from functions, tree_to_token_index, tree_to_variable_index in parser/utils.py and DFG_csharp in parser/DFG.py.

if (len(root_node.children)==0 or root_node.type=='string') and root_node.type!='comment':
    idx,code=index_to_code[(root_node.start_point,root_node.end_point)]
    if root_node.type==code:
        return [],states

Notice that when deciding whether to terminate on line 366, we check if a node has no children or if its type is string. According to tree-sitter's grammar, the type for string nodes is actually string_literal not string.

This would fail for char as well. Please let me know if I made a mistake or if this is done intentionally.
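If that diagnosis is right, a minimal fix might be to widen the type check along these lines (a sketch only; the exact tree-sitter node-type names for C# would need to be verified):

# Hypothetical patch sketch for parser/DFG.py: also terminate on the literal
# node types that tree-sitter actually emits for C# string/char literals.
LITERAL_TYPES = ('string', 'string_literal', 'character_literal')  # verify names

if (len(root_node.children) == 0 or root_node.type in LITERAL_TYPES) and root_node.type != 'comment':
    idx, code = index_to_code[(root_node.start_point, root_node.end_point)]
    if root_node.type == code:
        return [], states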

Thanks!

Spandan

[Code-To-Text] Issues loading a re-trained model

Hello,

I ran the model training as described in the repository and everything is working fine.
But once I try to do inference with my newly trained model, I get the following error:

01/14/2021 11:32:57 - INFO - main -   Namespace(adam_epsilon=1e-08, beam_size=10, config_name='', dev_filename=None, do_eval=False, do_lower_case=False, do_test=True, do_train=False, eval_batch_size=32, eval_steps=-1, gradient_accumulation_steps=1, learning_rate=5e-05, load_model_path=None, local_rank=-1, max_grad_norm=1.0, max_source_length=256, max_steps=-1, max_target_length=128, model_name_or_path='/models/pytorch_model.bin', model_type='roberta', no_cuda=True, num_train_epochs=10, output_dir='/experiment/output', seed=42, test_filename='/dataset/test.jsonl', tokenizer_name='', train_batch_size=32, train_filename=None, train_steps=-1, warmup_steps=0, weight_decay=0.0)
01/14/2021 11:32:57 - WARNING - main -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False
Traceback (most recent call last):
  File "./run.py", line 518, in <module>
    main()
  File "./run.py", line 255, in main
    config = config_class.from_pretrained(args.config_name if args.config_name else args.model_name_or_path)
  File "/root/anaconda/envs/code-to-text/lib/python3.7/site-packages/transformers/configuration_utils.py", line 347, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/root/anaconda/envs/code-to-text/lib/python3.7/site-packages/transformers/configuration_utils.py", line 391, in get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/root/anaconda/envs/code-to-text/lib/python3.7/site-packages/transformers/configuration_utils.py", line 474, in _dict_from_json_file
    text = reader.read()
  File "/root/anaconda/envs/code-to-text/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
ERROR conda.cli.main_run:execute(33): Subprocess for 'conda run ['python3.7', './run.py', '--do_test', '--model_type', 'roberta', '--model_name_or_path', '/models/pytorch_model.bin', '--test_filename', '/dataset/test.jsonl', '--output_dir', '/experiment/output', '--max_source_length', '256', '--no_cuda','--max_target_length', '128', '--beam_size', '10', '--train_batch_size', '32', '--eval_batch_size', '32', '--learning_rate', '5e-5', '--num_train_epochs', '10']' command failed.  (See above for error)

At the moment I am only trying to run the java-specific model.

Inference using microsoft/codebert-base works fine.

I am using the following anaconda environment, maybe something is wrong there:

name: code-to-text
channels:
  - conda-forge
  - defaults
dependencies:
  - _r-xgboost-mutex=2.0=cpu_0
  - idna=2.10
  - pip=20.3
  - pycparser=2.20
  - pyopenssl=20.0.0
  - python_abi=3.7
  - requests=2.25.0
  - six=1.15.0
  - tqdm=4.51.0
  - wheel=0.35.1
  - pytorch=1.4.0
  - pip:
    - click==7.1.2
    - filelock==3.0.12
    - joblib==0.17.0
    - numpy==1.19.3
    - packaging==20.4
    - protobuf==3.14.0
    - pyparsing==2.4.7
    - regex==2020.11.13
    - sacremoses==0.0.43
    - sentencepiece==0.1.91
    - tokenizers==0.9.3
    - transformers==3.5.0
    - urllib3==1.26.2

I would appreciate a hint if something is wrong there.
If you want, I can make a PR for the conda environment if it seems appropriate.

Issues in replicating text-to-code performance with CodeGPT (RuntimeError: Could not infer dtype of NoneType)

Hi! @celbree When I run CodeXGLUE/Text-Code/text-to-code/code/run.py as per the instructions in the readme here https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/text-to-code#fine-tune, I encounter this error (full log below)

File "/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
    return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType

It seems that the tokenizer (corresponding to the pretrained ckpt: microsoft/CodeGPT-small-java-adaptedGPT2) returns None when given the concode_elem_sep token. I have tried to fix this by adding

special_tokens_dict = {'sep_token': "concode_elem_sep"}     
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)     

here: https://github.com/microsoft/CodeXGLUE/blob/main/Text-Code/text-to-code/code/run.py#L595
With this change I can get the training of CodeGPT (adapted) to run, but the validation performance I get is quite a bit worse than reported in the readme (dev BLEU: 26.29, dev EM: 15.7). Could you please help me understand if there is anything wrong in my replication attempt?

Full log:

02/03/2021 17:38:51 - INFO - __main__ -   ***** Running training *****
02/03/2021 17:38:51 - INFO - __main__ -     Num examples = 100000
02/03/2021 17:38:51 - INFO - __main__ -     Num epoch = 29
02/03/2021 17:38:51 - INFO - __main__ -     Instantaneous batch size per GPU = 6
02/03/2021 17:38:51 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 24
02/03/2021 17:38:51 - INFO - __main__ -     Gradient Accumulation steps = 2
02/03/2021 17:38:51 - INFO - __main__ -     Total optimization steps = 124980
Traceback (most recent call last):
  File "run.py", line 634, in <module>
Traceback (most recent call last):
  File "run.py", line 634, in <module>
    main()
  File "run.py", line 621, in main
    main()
  File "run.py", line 621, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer, fh, pool)
  File "run.py", line 165, in train
    for step, (batch, token_labels) in enumerate(train_dataloader):    
global_step, tr_loss = train(args, train_dataset, model, tokenizer, fh, pool)
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
  File "run.py", line 165, in train
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    for step, (batch, token_labels) in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = self._next_data()
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/CodeXGLUE/Text-Code/text-to-code/code/dataset.py", line 116, in __getitem__
    return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType
    return torch.tensor(self.inputs[item]), torch.tensor(self.token_labels[item])
RuntimeError: Could not infer dtype of NoneType
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'run.py', '--local_rank=1', '--data_dir=/export/share/akhilesh-gotmare/CodeXGLUE/Text-Code/text-to-code/dataset/concode', '--langs=java', '--output_dir=../save/concode_2gpu_adaptedGPT2', '--pretrain_dir=microsoft/CodeGPT-small-java-adaptedGPT2', '--log_file=text2code_concode_2gpu_adaptedGPT2.log', '--model_type=gpt2', '--block_size=512', '--do_train', '--node_index', '0', '--gpu_per_node', '2', '--learning_rate=5e-5', '--weight_decay=0.01', '--evaluate_during_training', '--per_gpu_train_batch_size=6', '--per_gpu_eval_batch_size=12', '--gradient_accumulation_steps=2', '--num_train_epochs=30', '--logging_steps=100', '--save_steps=5000', '--overwrite_output_dir', '--seed=42']' returned non-zero exit status 1.

Max lengths for code documentation

Hi, thanks a lot for the wonderful repository and paper.

In the paper section B.3, it is mentioned that "We set the max length of input and inference as 256 and 64, respectively". However, on checking the code provided in the repository I find that inputs longer than 256 are truncated to a max length of 256, and output are truncated to a max length of 128.

So I was confused about what maximum output length I should use to reproduce the results, as the results provided in the paper and in the repository are the same, despite the difference in output lengths. I am specifically working on the Java language.

Question regarding Licences

Hi there,

for getting code-to-text running I had to make changes to run.py, which is licensed under Apache 2 by NVIDIA.
I do not know where to submit my changes to comply with the license.

Where do I have to open a pull request to reach them?

The changes are for the validation and test steps, where the CUDA LongTensors are used regardless of whether one has CUDA or not.

Using codesearch for my own queries. (How to instantiate the model.py Class)

Hello, I managed to fine-tune CodeBERT for the codesearch task, and I was wondering how I could use the model.bin I just created to perform code search on my own natural language queries.

So the model.bin is a state dictionary containing all the weights, but I don't get how I am supposed to go from that to a working model.

I tried to instantiate it with

model = RobertaModel.from_pretrained("microsoft/codebert-base")

but it returns the following error

RuntimeError: Error(s) in loading state_dict for RobertaModel:
Missing key(s) in state_dict: "embeddings.position_ids", "embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "

etc,etc

And I noticed that there is a Model class in model.py, so it is probably the one I must use, but I don't get how I am supposed to instantiate it in order for the model to work. Can you show me an example or explain it a little bit? (My current attempt is sketched below.)
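My current attempt (a sketch of my understanding; the Model constructor arguments are guesses, since I have not worked out the signature in model.py):

# Unverified sketch: rebuild the wrapper used in training, then load the
# fine-tuned weights into it. Model is the task-specific class in model.py;
# its constructor signature may differ from this guess.
import torch
from transformers import RobertaModel
from model import Model  # the task-specific wrapper, not a transformers class

encoder = RobertaModel.from_pretrained("microsoft/codebert-base")
model = Model(encoder)  # check model.py for the actual constructor arguments
model.load_state_dict(torch.load("model.bin", map_location="cpu"))
model.eval()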

Thank you very much

Questions on the models for code-to-text generation task

Hi, in the code-to-text generation task (https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text), there are baselines called Seq2Seq and Transformer. I have a few questions regarding them.

  1. Is Seq2Seq an RNN-based sequence-to-sequence model? Does it include a copy mechanism?
  2. Does the Transformer baseline use a copy mechanism?
  3. For both the baselines, do you use any special tokenization/vocabulary (bpe, sentencepiece) or just use the top k words as the vocabulary?

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #3 'index' in call to _th_index_select

Hello, I read the paper; thanks a lot for the wonderful repos.
I followed the script to fine-tune the model (https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text)
and I encountered the following problem; here is the output:

(codebert) wangz@RJZLS:~/GenLC/CodeXGLUE/Code-Text/code-to-text/code$ python run.py --do_train --do_eval --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --max_target_length $target_length --beam_size $beam_size --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs
11/04/2020 23:20:44 - INFO - __main__ -   Namespace(adam_epsilon=1e-08, beam_size=10, config_name='', dev_filename='../dataset/ruby/valid.jsonl', do_eval=True, do_lower_case=False, do_test=False, do_train=True, eval_batch_size=32, eval_steps=-1, gradient_accumulation_steps=1, learning_rate=5e-05, load_model_path=None, local_rank=-1, max_grad_norm=1.0, max_source_length=256, max_steps=-1, max_target_length=128, model_name_or_path='microsoft/codebert-base', model_type='roberta', no_cuda=False, num_train_epochs=10, output_dir='model/ruby', seed=42, test_filename=None, tokenizer_name='', train_batch_size=32, train_filename='../dataset/ruby/train.jsonl', train_steps=-1, warmup_steps=0, weight_decay=0.0)
11/04/2020 23:20:44 - WARNING - __main__ -   Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
11/04/2020 23:20:46 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/config.json from cache at /home/wangz/.cache/torch/transformers/1b62771d5f5169b34713b0af1ab85d80e11f7b1812fbf3ee7d03a866c5f58e72.06eb31f0a63f4e8a136733ccac422f0abf9ffa87c3e61104b57e7075a704d008
11/04/2020 23:20:46 - INFO - transformers.configuration_utils -   Model config RobertaConfig {
  "architectures": [
    "RobertaModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

11/04/2020 23:20:46 - INFO - transformers.tokenization_utils_base -   Model name 'microsoft/codebert-base' not found in model shortcut name list (roberta-base, roberta-large, roberta-large-mnli, distilroberta-base, roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'microsoft/codebert-base' is a path, a model identifier, or url to a directory containing tokenizer files.
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/vocab.json from cache at /home/wangz/.cache/torch/transformers/aca4dbdf4f074d4e071c2664901fec33c8aa69c35aa0101bc669ed4b44d1f6c3.6a4061e8fc00057d21d80413635a86fdcf55b6e7594ad9e25257d2f99a02f4be
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/merges.txt from cache at /home/wangz/.cache/torch/transformers/779a2f0c38ba2ff65d9a3ee23e58db9568f44a20865c412365e3dc540f01743f.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/added_tokens.json from cache at None
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/special_tokens_map.json from cache at /home/wangz/.cache/torch/transformers/5a191080da4f00859b5d3d29529f57894583e00ab07b7c940d65c33db4b25d4d.16f949018cf247a2ea7465a74ca9a292212875e5fd72f969e0807011e7f192e4
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/tokenizer_config.json from cache at /home/wangz/.cache/torch/transformers/1b4723c5fb2d933e11c399450ea233aaf33f093b5cbef3ec864624735380e490.70b5dbd5d3b9b4c9bfb3d1f6464291ff52f6a8d96358899aa3834e173b45092d
11/04/2020 23:20:53 - INFO - transformers.tokenization_utils_base -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/microsoft/codebert-base/tokenizer.json from cache at None
11/04/2020 23:20:54 - INFO - transformers.modeling_utils -   loading weights file https://cdn.huggingface.co/microsoft/codebert-base/pytorch_model.bin from cache at /home/wangz/.cache/torch/transformers/0f2ecc21b21d43a029e718179cee640eb64cca32a1f2159703ea36f4a50bdd3e.96251fe4478bac0cff9de8ae3201e5847cee59aebbcafdfe6b2c361f9398b349
11/04/2020 23:20:58 - INFO - transformers.modeling_utils -   All model checkpoint weights were used when initializing RobertaModel.

11/04/2020 23:20:58 - INFO - transformers.modeling_utils -   All the weights of RobertaModel were initialized from the model checkpoint at microsoft/codebert-base.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use RobertaModel for predictions without further training.
11/04/2020 23:21:01 - INFO - __main__ -   *** Example ***
11/04/2020 23:21:01 - INFO - __main__ -   idx: 0
11/04/2020 23:21:01 - INFO - __main__ -   source_tokens: ['<s>', 'def', '_render', '_', 'body', '_(', '_context', '_,', '_options', '_)', '_if', '_options', '_.', '_key', '?', '_(', '_:', 'partial', '_)', '_[', '_render', '_', 'partial', '_(', '_context', '_,', '_options', '_)', '_]', '_else', '_Streaming', 'Template', 'R', 'end', 'erer', '_.', '_new', '_(', '_@', 'look', 'up', '_', 'context', '_)', '_.', '_render', '_(', '_context', '_,', '_options', '_)', '_end', '_end', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   source_ids: 0 9232 19930 1215 9773 36 5377 2156 1735 4839 114 1735 479 762 116 36 4832 45593 4839 646 19930 1215 45593 36 5377 2156 1735 4839 27779 1493 34245 49522 500 1397 7160 479 92 36 787 13724 658 1215 46796 4839 479 19930 36 5377 2156 1735 4839 253 253 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   target_tokens: ['<s>', 'Render', '_but', '_returns', '_a', '_valid', '_Rack', '_body', '_.', '_If', '_fibers', '_are', '_defined', '_we', '_return', '_a', '_streaming', '_body', '_that', '_renders', '_the', '_template', '_piece', '_by', '_piece', '_.', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   target_ids: 0 48440 53 2886 10 8218 34767 809 479 318 32902 32 6533 52 671 10 5230 809 14 33428 5 27663 2125 30 2125 479 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   *** Example ***
11/04/2020 23:21:01 - INFO - __main__ -   idx: 1
11/04/2020 23:21:01 - INFO - __main__ -   source_tokens: ['<s>', 'def', '_attribute', '_', 'missing', '_(', '_match', '_,', '_*', '_args', '_,', '_&', '_block', '_)', '___', 'send', '__', '_(', '_match', '_.', '_target', '_,', '_match', '_.', '_att', 'r', '_', 'name', '_,', '_args', '_,', '_block', '_)', '_end', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   source_ids: 0 9232 21643 1215 41947 36 914 2156 1009 49503 2156 359 1803 4839 27148 37785 30529 36 914 479 1002 2156 914 479 15095 338 1215 13650 2156 49503 2156 1803 4839 253 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   target_tokens: ['<s>', '+', '_attribute', '_', 'missing', '_+', '_is', '_like', '_+', '_method', '_', 'missing', '_+', '_but', '_for', '_attributes', '_.', '_When', '_+', '_method', '_', 'missing', '_+', '_is', '_called', '_we', '_check', '_to', '_see', '_if', '_there', '_is', '_a', '_matching', '_attribute', '_method', '_.', '_If', '_so', '_we', '_tell', '_+', '_attribute', '_', 'missing', '_+', '_to', '_dispatch', '_the', '_attribute', '_.', '_This', '_method', '_can', '_be', '_overloaded', '_to', '_customize', '_the', '_behavior', '_.', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   target_ids: 0 2744 21643 1215 41947 2055 16 101 2055 5448 1215 41947 2055 53 13 16763 479 520 2055 5448 1215 41947 2055 16 373 52 1649 7 192 114 89 16 10 8150 21643 5448 479 318 98 52 1137 2055 21643 1215 41947 2055 7 22903 5 21643 479 152 5448 64 28 40894 7 30447 5 3650 479 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   *** Example ***
11/04/2020 23:21:01 - INFO - __main__ -   idx: 2
11/04/2020 23:21:01 - INFO - __main__ -   source_tokens: ['<s>', 'def', '_matched', '_', 'attribute', '_', 'method', '_(', '_method', '_', 'name', '_)', '_matches', '_=', '_self', '_.', '_class', '_.', '_send', '_(', '_:', 'attribute', '_', 'method', '_', 'mat', 'chers', '_', 'match', 'ing', '_,', '_method', '_', 'name', '_)', '_matches', '_.', '_detect', '_{', '_|', '_match', '_|', '_attribute', '_', 'method', '?', '_(', '_match', '_.', '_att', 'r', '_', 'name', '_)', '_}', '_end', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   source_ids: 0 9232 9184 1215 49202 1215 45416 36 5448 1215 13650 4839 2856 5457 1403 479 1380 479 2142 36 4832 49202 1215 45416 1215 9244 7873 1215 10565 154 2156 5448 1215 13650 4839 2856 479 10933 25522 1721 914 1721 21643 1215 45416 116 36 914 479 15095 338 1215 13650 4839 35524 253 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   target_tokens: ['<s>', 'Returns', '_a', '_struct', '_representing', '_the', '_matching', '_attribute', '_method', '_.', '_The', '_struct', '_s', '_attributes', '_are', '_prefix', '_base', '_and', '_suffix', '_.', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   target_ids: 0 48826 10 29916 4561 5 8150 21643 5448 479 20 29916 579 16763 32 46622 1542 8 47503 479 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   *** Example ***
11/04/2020 23:21:01 - INFO - __main__ -   idx: 3
11/04/2020 23:21:01 - INFO - __main__ -   source_tokens: ['<s>', 'def', '_dem', 'od', 'ul', 'ize', '_(', '_path', '_)', '_path', '_=', '_path', '_.', '_to', '_', 's', '_if', '_i', '_=', '_path', '_.', '_r', 'index', '_(', '_"', '::', '"', '_)', '_path', '_[', '_(', '_i', '_+', '_2', '_)', '_..', '_-', '_1', '_]', '_else', '_path', '_end', '_end', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   source_ids: 0 9232 4410 1630 922 2072 36 2718 4839 2718 5457 2718 479 7 1215 29 114 939 5457 2718 479 910 18480 36 22 38304 113 4839 2718 646 36 939 2055 132 4839 29942 111 112 27779 1493 2718 253 253 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   target_tokens: ['<s>', 'Rem', 'oves', '_the', '_module', '_part', '_from', '_the', '_expression', '_in', '_the', '_string', '_.', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   target_ids: 0 31157 14337 5 20686 233 31 5 8151 11 5 6755 479 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   *** Example ***
11/04/2020 23:21:01 - INFO - __main__ -   idx: 4
11/04/2020 23:21:01 - INFO - __main__ -   source_tokens: ['<s>', 'def', '_const', '_', 're', 'gex', 'p', '_(', '_camel', '_', 'c', 'ased', '_', 'word', '_)', '_parts', '_=', '_camel', '_', 'c', 'ased', '_', 'word', '_.', '_split', '_(', '_"', '::', '"', '_)', '_return', '_Re', 'gex', 'p', '_.', '_escape', '_(', '_camel', '_', 'c', 'ased', '_', 'word', '_)', '_if', '_parts', '_.', '_blank', '?', '_last', '_=', '_parts', '_.', '_pop', '_parts', '_.', '_reverse', '_.', '_inject', '_(', '_last', '_)', '_do', '_|', '_acc', '_,', '_part', '_|', '_part', '_.', '_empty', '?', '_?', '_acc', '_:', '_"#', '{', 'part', '}', '(', '::', '#', '{', 'acc', '})', '?"', '_end', '_end', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   source_ids: 0 9232 10759 1215 241 45767 642 36 35579 1215 438 11835 1215 14742 4839 1667 5457 35579 1215 438 11835 1215 14742 479 3462 36 22 38304 113 4839 671 1223 45767 642 479 5111 36 35579 1215 438 11835 1215 14742 4839 114 1667 479 15818 116 94 5457 1667 479 3495 1667 479 7213 479 17951 36 94 4839 109 1721 7678 2156 233 1721 233 479 5802 116 17487 7678 4832 35290 45152 7755 24303 1640 38304 10431 45152 7904 49424 1917 253 253 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   source_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:01 - INFO - __main__ -   target_tokens: ['<s>', 'Mount', 's', '_a', '_regular', '_expression', '_returned', '_as', '_a', '_string', '_to', '_ease', '_interpol', 'ation', '_that', '_will', '_match', '_part', '_by', '_part', '_the', '_given', '_constant', '_.', '</s>']
11/04/2020 23:21:01 - INFO - __main__ -   target_ids: 0 42036 29 10 1675 8151 1835 25 10 6755 7 5136 46687 1258 14 40 914 233 30 233 5 576 5891 479 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
11/04/2020 23:21:01 - INFO - __main__ -   target_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11/04/2020 23:21:02 - INFO - __main__ -   ***** Running training *****
11/04/2020 23:21:02 - INFO - __main__ -     Num examples = 2000
11/04/2020 23:21:02 - INFO - __main__ -     Batch size = 32
11/04/2020 23:21:02 - INFO - __main__ -     Num epoch = 10
epoch 0 loss 8.0074: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63/63 [01:08<00:00,  1.08s/it]
11/04/2020 23:22:10 - INFO - __main__ -   
***** Running evaluation *****
11/04/2020 23:22:10 - INFO - __main__ -     Num examples = 100
11/04/2020 23:22:10 - INFO - __main__ -     Batch size = 32
11/04/2020 23:22:11 - INFO - __main__ -     eval_ppl = 508.65501
11/04/2020 23:22:11 - INFO - __main__ -     global_step = 64
11/04/2020 23:22:11 - INFO - __main__ -     train_loss = 8.0074
11/04/2020 23:22:11 - INFO - __main__ -     ********************
11/04/2020 23:22:12 - INFO - __main__ -     Best ppl:508.65501
11/04/2020 23:22:12 - INFO - __main__ -     ********************
Traceback (most recent call last):
  File "run.py", line 518, in <module>
    main()
  File "run.py", line 434, in main
    preds = model(source_ids=source_ids,source_mask=source_mask)  
  File "/home/wangz/miniconda3/envs/codebert/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/wangz/GenLC/CodeXGLUE/Code-Text/code-to-text/code/model.py", line 97, in forward
    input_ids.data.copy_(input_ids.data.index_select(0, beam.getCurrentOrigin()))
RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #3 'index' in call to _th_index_select

Code to text

Hi

I am trying to generate the corpus for code-to-text for my model. However, I am not able to remove all the "\" characters in the argument strings. Also, does the preprocessing remove comments or not? In my case it does not.

Thank you

[Code2Text] Unable to start training even with batch size 1 for large custom dataset on colab

Hi There
I am working on the code2text problem. I have created my own dataset of JavaScript code/comment pairs in ".jsonl" format with 500,000+ (5 lakh+) data points.
However, I am unable to start training on a P100 Google Colab GPU (with 16 GB VRAM) even with batch size 1, due to memory issues.

If I reduce the data points to around 250,000 from the original 500,000, I am able to start training.

Any thoughts on which step in the code consumes so much memory, such that training cannot start even with batch size 1? I want to train on the entire 500,000 files on Google Colab.

POJ-104 dataset problem descriptions

Hi,

I successfully downloaded the POJ-104 dataset, but I was not able to find the descriptions of those 104 problems.

Can you please share description of those 104-problems?

Baselines in Code Defect Detection task

Hi,

The following two baseline methods are listed for the code defect (vulnerability) detection task (REF).

Model | Accuracy
RoBERTa | 61.05
CodeBERT | 62.08

Which RoBERTa model is used here? Also, CodeBERT is trained on CodeSearchNet which doesn't have C++ functions. So, is CodeBERT directly fine-tuned on the Devign dataset? Please provide more information.

Can't find the code for filtering the CodeSearchNet dataset

Hi,

  • Remove examples whose code cannot be parsed into an abstract syntax tree.
  • Remove examples where the number of document tokens is < 3 or > 256.
  • Remove examples whose documents contain special tokens (e.g. <img ...> or https:...).
  • Remove examples whose documents are not English.

It shows that you filtered the CodeSearchNet dataset in the way above, but I cannot find the corresponding code in this repository. Also, in the folder 'dataset.zip', I only found the files 'train.txt, valid.txt, test.txt, test_code.jsonl' for the python programming language without the other 5 programming languages. I think they are the results of filtering the dataset, right?

Therefore, could you please provide the code to filter the dataset or provide the filtered data for the other 5 programming languages (java, javascript, php, go, ruby)? I would appreciate it.
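In case it helps others, here is my rough reimplementation of filters 2-4 (my own sketch, not the repository's code; the AST-parse filter would additionally need a parser such as tree-sitter, and the English check here is only a crude ASCII proxy):

# Rough sketch of the documented doc-side filters.
import re

def keep_example(doc_tokens, doc_text):
    if not (3 <= len(doc_tokens) <= 256):            # filter: token count in [3, 256]
        return False
    if re.search(r"<[^>]+>|https?:", doc_text):      # filter: special tokens / URLs
        return False
    if not doc_text.isascii():                        # crude stand-in for "is English"
        return False
    return True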

Looking forward to your reply.

No pretrained weights for code-to-text

Description

Only the CodeBERT base model seems to be available. However, this downstream task (code-to-text) builds a model around the base. No weights are provided for the complete model and seemingly everyone needs to finetune the model themselves before being able to get started.

Is it possible to add a download link to the README for the weights?

Files

A list of relevant files for this issue:

Inference cost code search task.

Hello, I saw that the codesearch task had an inference cost of about 7 minutes for AdvTest, and I was wondering how you compute that.

Is it the time you got using a single query on 2x P100, or on multiple queries?

And is the inference cost similar in https://github.com/microsoft/CodeBERT? If not, why does it take so much time, and is there a way to reduce it?

CodeXGLUE -- Code-To-Text running problem

I am confused: does this new pipeline only change the number of epochs and batch size compared to the old pipeline (CodeBERT/code2nl)? I just want to confirm, thank you.

code-code CodeCompletion input

Hello~ thx for this project.
Is it possible to use initial raw code as input to test the completion model?
Currently the input format is not suitable for all models, and it is difficult to restore the input format back to the raw code.

Question regarding tokenizers in Code to Text

Hi there,

I am currently looking into the Code-To-Text parts, particularly I am looking into Inferring some results for new java files.
I just have an issue with understanding some of the pipeline and data.

When I look into the datasets JSONl's I see entries such as:

[...]
"code": "@Override\n    public ImageSource apply(ImageSource input) {\n        final int[][] pixelMatrix = new int[3][3];\n\n        int w = input.getWidth() [...]
"code_tokens": ["@", "Override", "public", "ImageSource", "apply", "(", "ImageSource", "input", ")", "{", "final", "int", "[", "]", "[", "]", "pixelMatrix", "=", "new", "int", "[", "3", "]", "[", "3", "]", ";" [...]
"docstring": "Expects a height mat as input\n\n@param input - A grayscale height map\n@return edges", 
"docstring_tokens": ["Expects", "a", "height", "mat", "as", "input"],
[...]

When I apply the Hugging Face tokenizer to the JSONL's code field, I get vastly different results (namely a ton of G's and C's with Spanish hats above them).

So, my question is, am I right that the standard Huggingface Roberta-Tokenizer is used and the Code-Tokens are ignored for this task, or is there some trick I am missing?
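For what it's worth, those G's and C's with hats look like the Ġ (leading space) and Ċ (newline) markers of the byte-level BPE behind the RoBERTa tokenizer; a quick sketch to see this:

# The RoBERTa/CodeBERT tokenizer is a byte-level BPE: 'Ġ' encodes a leading
# space and 'Ċ' a newline, which explains the odd-looking tokens.
from transformers import RobertaTokenizer

tok = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
print(tok.tokenize("@Override\npublic int getWidth()"))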

Question about dataset of ClozeTesting-maxmin

Hi,
I am looking through the ClozeTesting-maxmin dataset. However, I did not see the ground truth labels for the dataset anywhere. Can someone help clarify the dataset and its usage?
Many thanks!

Pre-Processing for Code Completion Task

Hi Team,

I want to understand what the ideal steps are for pre-processing a code corpus downloaded from GitHub before training for code completion.

As per my understanding, I mined around 11 GB of varied-domain Java code from GitHub.
Next, I removed all string literals from the code.
I removed all comments from the code.
I included an end-of-line character after each line.

What other ideal steps would you suggest?
Also, should we remove comments or not? (My current pipeline is sketched below.)
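Concretely, my current pipeline looks roughly like this (simplified; the comment and string stripping is regex-based and surely misses edge cases):

# Simplified sketch of my current Java preprocessing for code completion.
import re

def preprocess_java(src):
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.S)   # strip block comments
    src = re.sub(r"//[^\n]*", " ", src)                 # strip line comments
    src = re.sub(r'"(\\.|[^"\\])*"', '""', src)         # empty out string literals
    lines = [l.strip() for l in src.splitlines() if l.strip()]
    return " <EOL> ".join(lines)                         # end-of-line markers

print(preprocess_java('int n = 0; // counter\nString s = "hi";'))
# -> int n = 0; <EOL> String s = "";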

Many Thanks

Raw Code Search (AdvTest) dataset

Since the name of the methods, the name of the variables, and also the signature of the methods all carry semantic signals that are important for code representation and retrieval, is there a version of the Code Search (AdvTest) dataset where these names defined by the programmers are not anonymized?

Text-Code MRR

I have two questions about the Text-Code readme:

  1. Do you consider the MRR of both the validation and test datasets? The dataset (test.txt) includes both the validation and test datasets of CodeSearchNet.
  2. The MRR of CodeBERT is reported as "0.8685" here, but it is indicated as {'MRR': 0.2719} in the readme.
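For reference, the MRR I am computing is the mean of the reciprocal rank of the correct snippet per query:

# MRR sketch: one reciprocal rank per query, averaged.
def mrr(ranks):
    # ranks[i] is the 1-based rank of the correct snippet for query i.
    return sum(1.0 / r for r in ranks) / len(ranks)

print(mrr([1, 2, 10]))  # (1 + 0.5 + 0.1) / 3 = 0.5333...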

There is no answer data in ClozeTesting part?

Hi,
I want to run some experiments based on the Cloze Test task, but it seems that there is no answer data for either ClozeTesting-all or ClozeTesting-maxmin in this project.
According to evaluator.py, the calculate_scores function needs answers.txt and the model's predictions in predictions.txt to calculate the final scores. But all the evaluator/answers/<lang>/answers.txt files look like:
maxmin-1<CODESPLIT>token1
maxmin-2<CODESPLIT>token2
maxmin-3<CODESPLIT>token3
maxmin-4<CODESPLIT>token4
maxmin-5<CODESPLIT>token5
I do not understand what this means. I think answers.txt should contain the real answers corresponding to cloze-maxmin/<lang>/clozeTest.json. Is there somewhere else I can access the related data?

Thanks.
