
Datasets, tools, and benchmarks for representation learning of code.

Home Page: https://arxiv.org/abs/1909.09436

License: MIT License


codesearchnet's Introduction


The CodeSearchNet challenge has been concluded

We would like to thank all participants for their submissions, and we hope that this challenge provided insights to practitioners and researchers about the challenges in semantic code search and motivated new research. We encourage everyone to continue using the dataset and the human evaluations, which we now provide publicly. Please see below for details, specifically the Evaluation section.

No new submissions to the challenge will be accepted.


Quickstart

If this is your first time reading this, we recommend skipping this section and reading the following sections. The below commands assume you have Docker and Nvidia-Docker, as well as a GPU that supports CUDA 9.0 or greater. Note: you should only have to run script/setup once to download the data.

# clone this repository
git clone https://github.com/github/CodeSearchNet.git
cd CodeSearchNet/
# download data (~3.5GB) from S3; build and run the Docker container
script/setup
# this will drop you into the shell inside a Docker container
script/console
# optional: log in to W&B to see your training metrics,
# track your experiments, and submit your models to the benchmark
wandb login

# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run with default values,
# and explore other model variants by extending this baseline script
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py -r github/CodeSearchNet/0123456 # this is the org/project_name/run_id

Finally, you can submit your run to the community benchmark by following these instructions.

Introduction

Project Overview

CodeSearchNet is a collection of datasets and benchmarks that explore the problem of code retrieval using natural language. This research is a continuation of some ideas presented in this blog post and is a joint collaboration between GitHub and the Deep Program Understanding group at Microsoft Research - Cambridge. We aim to provide a platform for community research on semantic code search via the following:

  1. Instructions for obtaining large corpora of relevant data
  2. Open source code for a range of baseline models, along with pre-trained weights
  3. Baseline evaluation metrics and utilities
  4. Mechanisms to track progress on a shared community benchmark hosted by Weights & Biases

We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.

More context regarding the motivation for this problem is in this technical report. Please cite the dataset and the challenge as

@article{husain2019codesearchnet,
  title={{CodeSearchNet} challenge: Evaluating the state of semantic code search},
  author={Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
  journal={arXiv preprint arXiv:1909.09436},
  year={2019}
}

Data

The primary dataset consists of 2 million (comment, code) pairs from open source libraries. Concretely, a comment is a top-level function or method comment (e.g. docstrings in Python), and code is an entire function or method. Currently, the dataset contains Python, JavaScript, Ruby, Go, Java, and PHP code. Throughout this repo, we refer to the terms docstring and query interchangeably. We partition the data into train, validation, and test splits such that code from the same repository can only exist in one partition. Currently this is the only dataset on which we train our model. Summary statistics about this dataset can be found in this notebook.

For more information about how to obtain the data, see this section.

Evaluation

The metric we use for evaluation is Normalized Discounted Cumulative Gain. Please reference this paper for further details regarding model evaluation. The evaluation script can be found here.
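
For intuition, here is a minimal NDCG sketch in plain NumPy. This is not the repository's evaluation script (which, among other things, decides how to treat results that were never annotated), and the relevance grades below are made up for illustration:

import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of results listed in ranked order."""
    relevances = np.asarray(relevances, dtype=float)
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(relevances / np.log2(ranks + 1)))

def ndcg(ranked_relevances):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical 0-3 relevance grades of the top five snippets a model returned
# for a single query, in the order the model ranked them.
print(ndcg([3, 0, 2, 1, 0]))  # 1.0 would mean the model produced the ideal order

Note that other formulations use an exponential gain (2^rel - 1) instead of the linear gain above; one of the issue threads later in this page uses that variant.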

Annotations

We manually annotated retrieval results for the six languages from 99 general queries. This dataset is used as groundtruth data for evaluation only. Please refer to this paper for further details on the annotation process. These annotations were used to compute the scores in the leaderboard. Now that the competition has been concluded, you can find the annotations, along with the annotator comments here.

Setup

You should only have to perform the setup steps once to download the data and prepare the environment.

  1. Due to the complexity of installing all dependencies, we prepared Docker containers to run this code. You can find instructions on how to install Docker in the official docs. Additionally, you must install Nvidia-Docker to satisfy GPU-compute related dependencies. For those who are new to Docker, this blog post provides a gentle introduction focused on data science.

  2. After installing Docker, you need to download the pre-processed datasets, which are hosted on S3. You can do this by running script/setup.

    script/setup
    

    This will build Docker containers and download the datasets. By default, the data is downloaded into the resources/data/ folder inside this repository, with the directory structure described here.

The datasets you will download (most of them compressed) have a combined size of only ~ 3.5 GB.

  3. To start the Docker container, run script/console:
    script/console
    
    This will land you inside the Docker container, starting in the /src directory. You can detach from/attach to this container to pause/continue your work.

For more about the data, see Data Details below, as well as this notebook.

Data Details

Data Acquisition

If you have run the setup steps above you will already have the data, and nothing more needs to be done. The data will be available in the /resources/data folder of this repository, with this directory structure.

Schema & Format

Data is stored in jsonlines format. Each line in the uncompressed file represents one example (usually a function with an associated comment). A prettified example of one row is illustrated below.

  • repo: the owner/repo
  • path: the full path to the original file
  • func_name: the function or method name
  • original_string: the raw string before tokenization or parsing
  • language: the programming language
  • code: the part of the original_string that is code
  • code_tokens: tokenized version of code
  • docstring: the top-level comment or docstring, if it exists in the original string
  • docstring_tokens: tokenized version of docstring
  • sha: this field is not being used [TODO: add note on where this comes from?]
  • partition: a flag indicating which partition this datum belongs to (train, valid, or test). This is not used by the model; instead, we rely on the directory structure to denote the partition of the data.
  • url: the url for the code snippet including the line numbers

Code, comments, and docstrings are extracted in a language-specific manner, removing artifacts of that language.

{
  'code': 'def get_vid_from_url(url):\n'
          '        """Extracts video ID from URL.\n'
          '        """\n'
          "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
          "          match1(url, r'youtube\\.com/embed/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
          "          match1(url, r'youtube\\.com/watch/([^/?]+)') or \\\n"
          "          parse_query_param(url, 'v') or \\\n"
          "          parse_query_param(parse_query_param(url, 'u'), 'v')",
  'code_tokens': ['def',
                  'get_vid_from_url',
                  '(',
                  'url',
                  ')',
                  ':',
                  'return',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtu\\.be/([^?/]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/embed/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/v/([^/?]+)'",
                  ')',
                  'or',
                  'match1',
                  '(',
                  'url',
                  ',',
                  "r'youtube\\.com/watch/([^/?]+)'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'v'",
                  ')',
                  'or',
                  'parse_query_param',
                  '(',
                  'parse_query_param',
                  '(',
                  'url',
                  ',',
                  "'u'",
                  ')',
                  ',',
                  "'v'",
                  ')'],
  'docstring': 'Extracts video ID from URL.',
  'docstring_tokens': ['Extracts', 'video', 'ID', 'from', 'URL', '.'],
  'func_name': 'YouTube.get_vid_from_url',
  'language': 'python',
  'original_string': 'def get_vid_from_url(url):\n'
                      '        """Extracts video ID from URL.\n'
                      '        """\n'
                      "        return match1(url, r'youtu\\.be/([^?/]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/embed/([^/?]+)') or "
                      '\\\n'
                      "          match1(url, r'youtube\\.com/v/([^/?]+)') or \\\n"
                      "          match1(url, r'youtube\\.com/watch/([^/?]+)') or "
                      '\\\n'
                      "          parse_query_param(url, 'v') or \\\n"
                      "          parse_query_param(parse_query_param(url, 'u'), "
                      "'v')",
  'partition': 'test',
  'path': 'src/you_get/extractors/youtube.py',
  'repo': 'soimort/you-get',
  'sha': 'b746ac01c9f39de94cac2d56f665285b0523b974',
  'url': 'https://github.com/soimort/you-get/blob/b746ac01c9f39de94cac2d56f665285b0523b974/src/you_get/extractors/youtube.py#L135-L143'
}

Summary statistics such as row counts and token length histograms can be found in this notebook.
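
If you want to inspect the data yourself, a minimal loading sketch looks like this (it assumes the default resources/data layout described above; whether the files are still gzipped depends on how you obtained them):

import gzip
import json
from pathlib import Path

def read_jsonl(path):
    """Yield one example dict per line from a .jsonl or .jsonl.gz file."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

train_dir = Path("resources/data/python/final/jsonl/train")  # adjust to your checkout
for path in sorted(train_dir.glob("*.jsonl*")):
    example = next(read_jsonl(path))
    print(path.name, example["func_name"], len(example["code_tokens"]))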

Downloading Data from S3

The shell script /script/setup will automatically download these files into the /resources/data directory. Here are the links to the relevant files for visibility:

The s3 links follow this pattern:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{python,java,go,php,javascript,ruby}.zip

For example, the link for Java is:

https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/java.zip

The size of the dataset is approximately 20 GB. The various files and the directory structure are explained here.
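
script/setup is the supported way to fetch everything, but if you only want a single language archive, a rough sketch of downloading and extracting it directly looks like this (destination paths are up to you):

import urllib.request
import zipfile

lang = "ruby"  # one of: python, java, go, php, javascript, ruby
url = f"https://s3.amazonaws.com/code-search-net/CodeSearchNet/v2/{lang}.zip"
archive = f"{lang}.zip"

urllib.request.urlretrieve(url, archive)   # archives are large (hundreds of MB to GBs)
with zipfile.ZipFile(archive) as zf:
    zf.extractall("resources/data")        # mirrors the default data directory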

Human Relevance Judgements

To train neural models with a large dataset we use the documentation comments (e.g. docstrings) as a proxy. For evaluation (and the leaderboard), we collected human relevance judgements of pairs of realistic-looking natural language queries and code snippets. Now that the challenge has been concluded, we provide the data here as a .csv, with the following fields (a short loading sketch follows the field list):

  • Language: The programming language of the snippet.
  • Query: The natural language query.
  • GitHubUrl: The URL of the target snippet. This matches the url key in the data (see here).
  • Relevance: The 0-3 human relevance judgement, where "3" is the highest score (very relevant) and "0" is the lowest (irrelevant).
  • Notes: A free-text field with notes that annotators optionally provided.
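
As referenced above, a minimal pandas sketch for working with these annotations; the local filename annotations.csv is a placeholder for wherever you saved the linked file:

import pandas as pd

annotations = pd.read_csv("annotations.csv")  # placeholder filename

# How the 0-3 relevance grades are distributed per language.
print(annotations.groupby(["Language", "Relevance"]).size().unstack(fill_value=0))

# All judged snippets for one query, most relevant first.
query = annotations["Query"].iloc[0]
judged = annotations[annotations["Query"] == query]
print(judged.sort_values("Relevance", ascending=False)[["GitHubUrl", "Relevance"]])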

Running Our Baseline Model

We encourage you to reproduce and extend these models, though most variants take several hours to train (and some take more than 24 hours on an AWS P3-V100 instance).

Model Architecture

Our baseline models ingest a parallel corpus of (comments, code) and learn to retrieve a code snippet given a natural language query. Specifically, comments are top-level function and method comments (e.g. docstrings in Python), and code is an entire function or method. Throughout this repo, we refer to the terms docstring and query interchangeably.

The query has a single encoder, whereas each programming language has its own encoder. The available encoders are Neural-Bag-Of-Words, RNN, 1D-CNN, Self-Attention (BERT), and a 1D-CNN+Self-Attention Hybrid.
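
To make the joint-embedding idea concrete, here is a toy retrieval sketch. It is not the repository's TensorFlow implementation; the hashed bag-of-words encoder below is a stand-in for the real NBoW/RNN/CNN/self-attention encoders, and only the overall shape of the computation (embed the query, embed each snippet, rank by similarity) matches the baselines:

import numpy as np

def toy_encode(tokens, dim=128):
    """Stand-in encoder: hashed bag-of-words, L2-normalised (illustration only)."""
    vec = np.zeros(dim)
    for tok in tokens:
        vec[hash(tok) % dim] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

query_vec = toy_encode("extract video id from url".split())
snippets = {
    "get_vid_from_url": "def get_vid_from_url ( url ) : return match1 ( url , ... )".split(),
    "read_file": "def read_file ( path ) : return open ( path ) . read ( )".split(),
}
scores = {name: float(toy_encode(tokens) @ query_vec) for name, tokens in snippets.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))  # highest similarity first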

The diagram below illustrates the general architecture of our baseline models:

alt text

Training

This step assumes that you have a suitable Nvidia-GPU with Cuda v9.0 installed. We used AWS P3-V100 instances (a p3.2xlarge is sufficient).

  1. Start the model run environment by running script/console:

    script/console
    

    This will drop you into the shell of a Docker container with all necessary dependencies installed, including the code in this repository, along with data that you downloaded earlier. By default, you will be placed in the src/ folder of this GitHub repository. From here you can execute commands to run the model.

  2. Set up W&B (free for open source projects) per the instructions below if you would like to share your results on the community benchmark. This is optional but highly recommended.

  3. The entry point to this model is src/train.py. You can see various options by executing the following command:

    python train.py --help
    

    To test if everything is working on a small dataset, you can run the following command:

    python train.py --testrun
    
  4. Now you are prepared for a full training run. Example commands to kick off training runs:

  • Training a neural-bag-of-words model on all languages

    python train.py --model neuralbow
    

    The above command will assume default values for the location(s) of the training data and a destination where you would like to save the output model. The default location for training data is specified in /src/data_dirs_{train,valid,test}.txt. These files each contain a list of paths where data for the corresponding partition exists. If more than one path is specified (separated by a newline), the data from all the paths will be concatenated together. For example, this is the content of src/data_dirs_train.txt:

    $ cat data_dirs_train.txt
    ../resources/data/python/final/jsonl/train
    ../resources/data/javascript/final/jsonl/train
    ../resources/data/java/final/jsonl/train
    ../resources/data/php/final/jsonl/train
    ../resources/data/ruby/final/jsonl/train
    ../resources/data/go/final/jsonl/train
    

    By default, models are saved in the resources/saved_models folder of this repository.

  • Training a 1D-CNN model on Python data only:

    python train.py --model 1dcnn /trained_models ../resources/data/python/final/jsonl/train ../resources/data/python/final/jsonl/valid ../resources/data/python/final/jsonl/test
    

    The above command overrides the default locations for saving the model to trained_models and also overrides the source of the train, validation, and test sets.

Additional notes:

  • Options for --model are currently listed in src/model_restore_helper.get_model_class_from_name.

  • Hyperparameters are specific to the respective model/encoder classes. A simple trick to discover them is to kick off a run without specifying hyperparameter choices, as that will print a list of all used hyperparameters with their default values (in JSON format).

References

Benchmark

We are using a community benchmark for this project to encourage collaboration and improve reproducibility. It is hosted by Weights & Biases (W&B), which is free for open source projects. Our entries in the benchmark link to detailed logs of our training and evaluation metrics, as well as model artifacts, and we encourage other participants to provide as much detail as possible.

We invite the community to submit their runs to this benchmark to facilitate transparency by following these instructions.

How to Contribute

We anticipate that the community will design custom architectures and use frameworks other than Tensorflow. Furthermore, we anticipate that additional datasets will be useful. It is not our intention to integrate these models, approaches, and datasets into this repository as a superset of all available ideas. Rather, we intend to maintain the baseline models and links to the data in this repository as a central place of reference. We are accepting PRs that update the documentation, link to your project(s) with improved benchmarks, fix bugs, or make minor improvements to the code. Here are more specific guidelines for contributing to this repository; note particularly our Code of Conduct. Please open an issue if you are unsure of the best course of action.

Other READMEs

W&B Setup

To initialize W&B:

  1. Navigate to the /src directory in this repository.

  2. If it's your first time using W&B on a machine, you will need to log in:

    $ wandb login
    
  3. You will be asked for your API key, which appears on your W&B profile settings page.

Licenses

The licenses for source code used as data for this project are provided with the data download for each language in _licenses.pkl files.
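
A quick way to peek at one of those files is shown below; the path is a placeholder and the internal structure is not documented here, so inspect the loaded object before relying on it:

import pickle

# Placeholder path: adjust to wherever the download placed the file for your language.
with open("resources/data/python_licenses.pkl", "rb") as f:
    licenses = pickle.load(f)

print(type(licenses))
if hasattr(licenses, "items"):          # if it is a mapping, show a few entries
    for key, value in list(licenses.items())[:5]:
        print(key, value)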

This code and documentation for this project are released under the MIT License.


codesearchnet's Issues

Groundtruth

Is there a plan to release the annotations?

NDCG computation

Hi there! In the function corresponding to NDCG calculation I noticed that ignore_rank_of_non_annotated_urls flag is present.

ignore_rank_of_non_annotated_urls: bool=True) -> float:

The question is: are your results in the paper calculated with ignore_rank_of_non_annotated_urls=True?

question: calculating mrr and loss

I'm making a model for this using my own codebase. I wanted to confirm if this way of calculating loss and mrr is correct.

import tensorflow as tf
from tensorflow.keras import backend as K


def softmax_loss(y_true, y_pred):
    q, c = y_pred
    similarity_score = tf.matmul(q, K.transpose(c))
    per_sample_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=similarity_score,
        labels=tf.range(q.shape[0])
    )
    return tf.reduce_sum(per_sample_loss) / q.shape[0]


def mrr(y_true, y_pred):
    q, c = y_pred
    similarity_scores = tf.matmul(q, K.transpose(c))
    correct_scores = tf.linalg.diag_part(similarity_scores)
    compared_scores = similarity_scores >= tf.expand_dims(correct_scores, axis=-1)
    compared_scores = tf.cast(compared_scores, tf.float16)
    return K.mean(tf.math.reciprocal(tf.reduce_sum(compared_scores, axis=1)))

Here, q and c are the query and code feature vectors of shape (batch_size, vector_dimension).
I'm a bit hesitant because I feel that MRR and loss depend on the batch_size and on the kind of examples in the batch (closely related examples might give a low MRR, while far-apart ones can give a high MRR).

EDIT:
I've looked around the issues already present. Is my understanding correct that during testing we will have a batch_size of 1000 and we won't be shuffling the data?

Shuffle Validation Set when calculating MRR in included scripts

from @mallamanis


One master's student at Berkeley has asked me the following question about CodeSearchNet.

The validation loss logging shows that the MRR performance decreases as it is being computed (at the end of the epoch). This seems to be the case with many of the runs on W&B. Do you have any idea why this might be happening? I don't see anything obviously wrong.

For example,

11 (valid): Batch     0 (has 200 samples). Processed 0 samples. Loss so far: 0.0000.  MRR so far: 0.0000 
11 (valid): Batch     1 (has 200 samples). Processed 200 samples. Loss so far: 2.3159.  MRR so far: 0.5468 
11 (valid): Batch     2 (has 200 samples). Processed 400 samples. Loss so far: 2.3237.  MRR so far: 0.5623 
11 (valid): Batch     3 (has 200 samples). Processed 600 samples. Loss so far: 2.3163.  MRR so far: 0.5652 
11 (valid): Batch     4 (has 200 samples). Processed 800 samples. Loss so far: 2.3615.  MRR so far: 0.5568 
11 (valid): Batch     5 (has 200 samples). Processed 1000 samples. Loss so far: 2.5153.  MRR so far: 0.5323 
11 (valid): Batch     6 (has 200 samples). Processed 1200 samples. Loss so far: 2.8651.  MRR so far: 0.4921 
11 (valid): Batch     7 (has 200 samples). Processed 1400 samples. Loss so far: 2.7642.  MRR so far: 0.5085 
11 (valid): Batch     8 (has 200 samples). Processed 1600 samples. Loss so far: 2.7468.  MRR so far: 0.5091 
11 (valid): Batch     9 (has 200 samples). Processed 1800 samples. Loss so far: 2.7153.  MRR so far: 0.5134 
11 (valid): Batch    10 (has 200 samples). Processed 2000 samples. Loss so far: 2.7024.  MRR so far: 0.5131 
11 (valid): Batch    11 (has 200 samples). Processed 2200 samples. Loss so far: 2.7061.  MRR so far: 0.5125 
11 (valid): Batch    12 (has 200 samples). Processed 2400 samples. Loss so far: 2.6665.  MRR so far: 0.5183 
11 (valid): Batch    13 (has 200 samples). Processed 2600 samples. Loss so far: 2.6986.  MRR so far: 0.5157 
11 (valid): Batch    14 (has 200 samples). Processed 2800 samples. Loss so far: 2.6975.  MRR so far: 0.5166 
11 (valid): Batch    15 (has 200 samples). Processed 3000 samples. Loss so far: 2.7363.  MRR so far: 0.5118 
11 (valid): Batch    16 (has 200 samples). Processed 3200 samples. Loss so far: 2.7226.  MRR so far: 0.5137 
11 (valid): Batch    17 (has 200 samples). Processed 3400 samples. Loss so far: 2.7146.  MRR so far: 0.5153 
11 (valid): Batch    18 (has 200 samples). Processed 3600 samples. Loss so far: 2.7491.  MRR so far: 0.5115 
11 (valid): Batch    19 (has 200 samples). Processed 3800 samples. Loss so far: 2.7468.  MRR so far: 0.5108 
11 (valid): Batch    20 (has 200 samples). Processed 4000 samples. Loss so far: 2.7470.  MRR so far: 0.5097 
11 (valid): Batch    21 (has 200 samples). Processed 4200 samples. Loss so far: 2.7783.  MRR so far: 0.5070 
11 (valid): Batch    22 (has 200 samples). Processed 4400 samples. Loss so far: 2.7725.  MRR so far: 0.5086 
11 (valid): Batch    23 (has 200 samples). Processed 4600 samples. Loss so far: 2.7606.  MRR so far: 0.5096 
11 (valid): Batch    24 (has 200 samples). Processed 4800 samples. Loss so far: 2.7733.  MRR so far: 0.5069 
11 (valid): Batch    25 (has 200 samples). Processed 5000 samples. Loss so far: 2.8067.  MRR so far: 0.5030 
11 (valid): Batch    26 (has 200 samples). Processed 5200 samples. Loss so far: 2.7878.  MRR so far: 0.5054 
11 (valid): Batch    27 (has 200 samples). Processed 5400 samples. Loss so far: 2.7869.  MRR so far: 0.5054 
11 (valid): Batch    28 (has 200 samples). Processed 5600 samples. Loss so far: 2.8128.  MRR so far: 0.5003 
11 (valid): Batch    29 (has 200 samples). Processed 5800 samples. Loss so far: 2.8420.  MRR so far: 0.4959 
11 (valid): Batch    30 (has 200 samples). Processed 6000 samples. Loss so far: 2.8311.  MRR so far: 0.4981 
11 (valid): Batch    31 (has 200 samples). Processed 6200 samples. Loss so far: 2.8291.  MRR so far: 0.4978 
11 (valid): Batch    32 (has 200 samples). Processed 6400 samples. Loss so far: 2.8190.  MRR so far: 0.4993 
11 (valid): Batch    33 (has 200 samples). Processed 6600 samples. Loss so far: 2.8345.  MRR so far: 0.4980 
11 (valid): Batch    34 (has 200 samples). Processed 6800 samples. Loss so far: 2.8100.  MRR so far: 0.5011 
11 (valid): Batch    35 (has 200 samples). Processed 7000 samples. Loss so far: 2.7998.  MRR so far: 0.5026 
11 (valid): Batch    36 (has 200 samples). Processed 7200 samples. Loss so far: 2.7841.  MRR so far: 0.5052 
11 (valid): Batch    37 (has 200 samples). Processed 7400 samples. Loss so far: 2.7836.  MRR so far: 0.5052 
11 (valid): Batch    38 (has 200 samples). Processed 7600 samples. Loss so far: 2.7906.  MRR so far: 0.5044 
11 (valid): Batch    39 (has 200 samples). Processed 7800 samples. Loss so far: 2.8042.  MRR so far: 0.5020 
11 (valid): Batch    40 (has 200 samples). Processed 8000 samples. Loss so far: 2.8095.  MRR so far: 0.5013 
11 (valid): Batch    41 (has 200 samples). Processed 8200 samples. Loss so far: 2.8119.  MRR so far: 0.5008 
11 (valid): Batch    42 (has 200 samples). Processed 8400 samples. Loss so far: 2.7931.  MRR so far: 0.5038 
11 (valid): Batch    43 (has 200 samples). Processed 8600 samples. Loss so far: 2.7902.  MRR so far: 0.5043 
11 (valid): Batch    44 (has 200 samples). Processed 8800 samples. Loss so far: 2.7918.  MRR so far: 0.5041 
11 (valid): Batch    45 (has 200 samples). Processed 9000 samples. Loss so far: 2.7948.  MRR so far: 0.5044 
11 (valid): Batch    46 (has 200 samples). Processed 9200 samples. Loss so far: 2.7991.  MRR so far: 0.5031 
11 (valid): Batch    47 (has 200 samples). Processed 9400 samples. Loss so far: 2.8023.  MRR so far: 0.5028 
11 (valid): Batch    48 (has 200 samples). Processed 9600 samples. Loss so far: 2.8052.  MRR so far: 0.5020 
11 (valid): Batch    49 (has 200 samples). Processed 9800 samples. Loss so far: 2.8261.  MRR so far: 0.4986 
11 (valid): Batch    50 (has 200 samples). Processed 10000 samples. Loss so far: 2.8538.  MRR so far: 0.4944 
11 (valid): Batch    51 (has 200 samples). Processed 10200 samples. Loss so far: 2.8511.  MRR so far: 0.4946 
11 (valid): Batch    52 (has 200 samples). Processed 10400 samples. Loss so far: 2.8531.  MRR so far: 0.4949 
11 (valid): Batch    53 (has 200 samples). Processed 10600 samples. Loss so far: 2.8617.  MRR so far: 0.4928 
11 (valid): Batch    54 (has 200 samples). Processed 10800 samples. Loss so far: 2.8791.  MRR so far: 0.4897 
11 (valid): Batch    55 (has 200 samples). Processed 11000 samples. Loss so far: 2.8967.  MRR so far: 0.4876 
11 (valid): Batch    56 (has 200 samples). Processed 11200 samples. Loss so far: 2.9021.  MRR so far: 0.4871 
11 (valid): Batch    57 (has 200 samples). Processed 11400 samples. Loss so far: 2.9030.  MRR so far: 0.4873 
11 (valid): Batch    58 (has 200 samples). Processed 11600 samples. Loss so far: 2.9257.  MRR so far: 0.4843 
11 (valid): Batch    59 (has 200 samples). Processed 11800 samples. Loss so far: 2.9270.  MRR so far: 0.4841 
11 (valid): Batch    60 (has 200 samples). Processed 12000 samples. Loss so far: 2.9342.  MRR so far: 0.4836 
11 (valid): Batch    61 (has 200 samples). Processed 12200 samples. Loss so far: 2.9443.  MRR so far: 0.4819 
11 (valid): Batch    62 (has 200 samples). Processed 12400 samples. Loss so far: 2.9565.  MRR so far: 0.4798 
11 (valid): Batch    63 (has 200 samples). Processed 12600 samples. Loss so far: 2.9494.  MRR so far: 0.4810 
11 (valid): Batch    64 (has 200 samples). Processed 12800 samples. Loss so far: 2.9631.  MRR so far: 0.4785 
11 (valid): Batch    65 (has 200 samples). Processed 13000 samples. Loss so far: 2.9674.  MRR so far: 0.4779 
11 (valid): Batch    66 (has 200 samples). Processed 13200 samples. Loss so far: 2.9685.  MRR so far: 0.4778 
11 (valid): Batch    67 (has 200 samples). Processed 13400 samples. Loss so far: 2.9789.  MRR so far: 0.4769 
11 (valid): Batch    68 (has 200 samples). Processed 13600 samples. Loss so far: 2.9794.  MRR so far: 0.4765 
11 (valid): Batch    69 (has 200 samples). Processed 13800 samples. Loss so far: 2.9754.  MRR so far: 0.4766 
11 (valid): Batch    70 (has 200 samples). Processed 14000 samples. Loss so far: 2.9731.  MRR so far: 0.4767 
11 (valid): Batch    71 (has 200 samples). Processed 14200 samples. Loss so far: 2.9822.  MRR so far: 0.4751 
11 (valid): Batch    72 (has 200 samples). Processed 14400 samples. Loss so far: 2.9755.  MRR so far: 0.4756 
11 (valid): Batch    73 (has 200 samples). Processed 14600 samples. Loss so far: 2.9747.  MRR so far: 0.4758 
11 (valid): Batch    74 (has 200 samples). Processed 14800 samples. Loss so far: 2.9666.  MRR so far: 0.4770 
11 (valid): Batch    75 (has 200 samples). Processed 15000 samples. Loss so far: 2.9757.  MRR so far: 0.4758 
11 (valid): Batch    76 (has 200 samples). Processed 15200 samples. Loss so far: 2.9775.  MRR so far: 0.4756 
11 (valid): Batch    77 (has 200 samples). Processed 15400 samples. Loss so far: 2.9813.  MRR so far: 0.4745 
11 (valid): Batch    78 (has 200 samples). Processed 15600 samples. Loss so far: 2.9830.  MRR so far: 0.4739 
11 (valid): Batch    79 (has 200 samples). Processed 15800 samples. Loss so far: 2.9905.  MRR so far: 0.4726 
11 (valid): Batch    80 (has 200 samples). Processed 16000 samples. Loss so far: 3.0173.  MRR so far: 0.4688 
11 (valid): Batch    81 (has 200 samples). Processed 16200 samples. Loss so far: 3.0520.  MRR so far: 0.4645 
11 (valid): Batch    82 (has 200 samples). Processed 16400 samples. Loss so far: 3.0599.  MRR so far: 0.4630 
11 (valid): Batch    83 (has 200 samples). Processed 16600 samples. Loss so far: 3.0639.  MRR so far: 0.4625 
11 (valid): Batch    84 (has 200 samples). Processed 16800 samples. Loss so far: 3.0691.  MRR so far: 0.4616 
11 (valid): Batch    85 (has 200 samples). Processed 17000 samples. Loss so far: 3.0723.  MRR so far: 0.4614 
11 (valid): Batch    86 (has 200 samples). Processed 17200 samples. Loss so far: 3.0691.  MRR so far: 0.4615 
11 (valid): Batch    87 (has 200 samples). Processed 17400 samples. Loss so far: 3.0919.  MRR so far: 0.4589 
11 (valid): Batch    88 (has 200 samples). Processed 17600 samples. Loss so far: 3.0917.  MRR so far: 0.4587 
11 (valid): Batch    89 (has 200 samples). Processed 17800 samples. Loss so far: 3.0887.  MRR so far: 0.4592 
11 (valid): Batch    90 (has 200 samples). Processed 18000 samples. Loss so far: 3.1092.  MRR so far: 0.4561 
11 (valid): Batch    91 (has 200 samples). Processed 18200 samples. Loss so far: 3.1391.  MRR so far: 0.4515 
11 (valid): Batch    92 (has 200 samples). Processed 18400 samples. Loss so far: 3.1650.  MRR so far: 0.4482 
11 (valid): Batch    93 (has 200 samples). Processed 18600 samples. Loss so far: 3.1630.  MRR so far: 0.4486 
11 (valid): Batch    94 (has 200 samples). Processed 18800 samples. Loss so far: 3.1624.  MRR so far: 0.4488 
11 (valid): Batch    95 (has 200 samples). Processed 19000 samples. Loss so far: 3.1692.  MRR so far: 0.4480 
11 (valid): Batch    96 (has 200 samples). Processed 19200 samples. Loss so far: 3.1660.  MRR so far: 0.4486 
11 (valid): Batch    97 (has 200 samples). Processed 19400 samples. Loss so far: 3.1729.  MRR so far: 0.4478 
11 (valid): Batch    98 (has 200 samples). Processed 19600 samples. Loss so far: 3.1779.  MRR so far: 0.4468 
11 (valid): Batch    99 (has 200 samples). Processed 19800 samples. Loss so far: 3.1967.  MRR so far: 0.4445 
11 (valid): Batch   100 (has 200 samples). Processed 20000 samples. Loss so far: 3.1978.  MRR so far: 0.4443 
11 (valid): Batch   101 (has 200 samples). Processed 20200 samples. Loss so far: 3.1949.  MRR so far: 0.4443 
11 (valid): Batch   102 (has 200 samples). Processed 20400 samples. Loss so far: 3.1932.  MRR so far: 0.4444 
11 (valid): Batch   103 (has 200 samples). Processed 20600 samples. Loss so far: 3.1864.  MRR so far: 0.4453 
11 (valid): Batch   104 (has 200 samples). Processed 20800 samples. Loss so far: 3.1881.  MRR so far: 0.4453 
11 (valid): Batch   105 (has 200 samples). Processed 21000 samples. Loss so far: 3.1855.  MRR so far: 0.4457 
11 (valid): Batch   106 (has 200 samples). Processed 21200 samples. Loss so far: 3.1818.  MRR so far: 0.4460 
11 (valid): Batch   107 (has 200 samples). Processed 21400 samples. Loss so far: 3.1841.  MRR so far: 0.4458 
11 (valid): Batch   108 (has 200 samples). Processed 21600 samples. Loss so far: 3.1811.  MRR so far: 0.4462 
11 (valid): Batch   109 (has 200 samples). Processed 21800 samples. Loss so far: 3.1806.  MRR so far: 0.4463 
11 (valid): Batch   110 (has 200 samples). Processed 22000 samples. Loss so far: 3.1816.  MRR so far: 0.4464 
11 (valid): Batch   111 (has 200 samples). Processed 22200 samples. Loss so far: 3.1796.  MRR so far: 0.4467 
11 (valid): Batch   112 (has 200 samples). Processed 22400 samples. Loss so far: 3.1842.  MRR so far: 0.4457 
  Epoch 11 (valid) took 6.12s [processed 3689 samples/second]
 Validation:  Loss: 3.190802 | MRR: 0.444851

Fewer examples found than stated in the paper

The paper says that there are 503,502 examples available for Python, but when I download the Python data, I get 457,461 examples combining the 14 train files, the test file, and the valid file. I used the whole corpus (of size 1.1M) to find the examples with a non-empty 'docstring' field and did end up with the reported number of 503,502, though. I assume ~46k examples have been filtered out but cannot seem to find why.

Missing annoy module

I tried to execute the predict.py script but I have got the following error:
Traceback (most recent call last):
File "predict.py", line 56, in
from annoy import AnnoyIndex

ModuleNotFoundError: No module named 'annoy'

what version of Annoy module should I use?

wandb custom submission error

I'm currently trying to submit a custom model. After entering the run, evaluating the ndcg score, and writing a brief note on how we approached the results, I get the following error

wandb_submission_error

Now, when I click refresh, I am redirected to the submission page again. When trying to re-submit, I get the following error: Invalid CSV format. Please upload a well-formatted CSV file.

How can I get the annotated code?

Hello, Mr. author, I'd like to do some L2R research with the CodeSearchNet code. I wonder if I can get the data that you've already annotated, as mentioned in the "annotation statistics" section of your paper. Thanks a lot!

Comments are not removed in code_tokens

This is a data from java_test_0.jsonl file. There are still comments in code_tokens. I highlight them using black.

"repo": "ReactiveX/RxJava", "path": "src/main/java/io/reactivex/Observable.java", "func_name": "Observable.skipLast", "original_string": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "language": "java", "code": "@CheckReturnValue\n @SchedulerSupport(SchedulerSupport.CUSTOM)\n public final Observable skipLast(long time, TimeUnit unit, Scheduler scheduler, boolean delayError, int bufferSize) {\n ObjectHelper.requireNonNull(unit, "unit is null");\n ObjectHelper.requireNonNull(scheduler, "scheduler is null");\n ObjectHelper.verifyPositive(bufferSize, "bufferSize");\n // the internal buffer holds pairs of (timestamp, value) so double the default buffer size\n int s = bufferSize << 1;\n return RxJavaPlugins.onAssembly(new ObservableSkipLastTimed(this, time, unit, scheduler, s, delayError));\n }", "code_tokens": ["@", "CheckReturnValue", "@", "SchedulerSupport", "(", "SchedulerSupport", ".", "CUSTOM", ")", "public", "final", "Observable", "<", "T", ">", "skipLast", "(", "long", "time", ",", "TimeUnit", "unit", ",", "Scheduler", "scheduler", ",", "boolean", "delayError", ",", "int", "bufferSize", ")", "{", "ObjectHelper", ".", "requireNonNull", "(", "unit", ",", ""unit is null"", ")", ";", "ObjectHelper", ".", "requireNonNull", "(", "scheduler", ",", ""scheduler is null"", ")", ";", "ObjectHelper", ".", "verifyPositive", "(", "bufferSize", ",", ""bufferSize"", ")", ";", "// the internal buffer holds pairs of (timestamp, value) so double the default buffer size", "int", "s", "=", "bufferSize", "<<", "1", ";", "return", "RxJavaPlugins", ".", "onAssembly", "(", "new", "ObservableSkipLastTimed", "<", "T", ">", "(", "this", ",", "time", ",", "unit", ",", "scheduler", ",", "s", ",", "delayError", ")", ")", ";", "}"]

Preprocessing of docstrings can be improved

Hello,

first thanks for the challenge, the code and the dataset! Really cool stuff that you're doing and I want to work on this task. :)

I've read the Contribution Guidelines and know that you will not change any of the preprocessing code, but nevertheless I want to discuss the preprocessing of the docstrings here in case someone wants to produce a similar dataset (or maybe v3 ;) ) .

Current preprocessing of docstrings

I read your code and it seems that this is the way you preprocess the docstrings:

  1. You extract the documentation from the method and strip c-style delimiters
    def strip_c_style_comment_delimiters(comment: str) -> str:
  2. Then you extract the relevant part of the docstring (which acts as a summary) using the following heuristics:
    2.1 If \n\n is found you take the part before
    2.2 otherwise take the part before the first @
    2.3. otherwise take the full docstring
    def get_docstring_summary(docstring: str) -> str:
        """Get the first lines of the documentation comment up to the empty lines."""
        if '\n\n' in docstring:
            return docstring.split('\n\n')[0]
        elif '@' in docstring:
            return docstring[:docstring.find('@')]  # This usually is the start of a JavaDoc-style @param comment.
        return docstring
  3. Finally you tokenize the extracted summary using the following regex (a short usage example follows this list):
    DOCSTRING_REGEX_TOKENIZER = re.compile(r"[^\s,'\"`.():\[\]=*;>{\}+-/\\]+|\\+|\.+|\(\)|{\}|\[\]|\(+|\)+|:+|\[+|\]+|{+|\}+|=+|\*+|;+|>+|\++|-+|/+")
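
As mentioned above, a quick way to see what this tokenizer does, using the regex exactly as quoted (the sample docstring is made up):

import re

DOCSTRING_REGEX_TOKENIZER = re.compile(r"[^\s,'\"`.():\[\]=*;>{\}+-/\\]+|\\+|\.+|\(\)|{\}|\[\]|\(+|\)+|:+|\[+|\]+|{+|\}+|=+|\*+|;+|>+|\++|-+|/+")

sample = "Set {@link ServletRegistrationBean}s that the filter will be registered against. See <p>docs</p>."
print(DOCSTRING_REGEX_TOKENIZER.findall(sample))
# Punctuation, javadoc markup and HTML fragments are kept as their own tokens
# rather than being removed, which is what the analysis below is about.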

Problems with this pipeline

This way of preprocessing produces a couple of results that are probably not wanted and could be improved.

Extracting the summary

Compare the first 12 lines of the tokenized docstrings of the java train set to the raw ones

Bind indexed elements to the supplied collection .
Set {
Add {
Set servlet names that the filter will be registered against . This will replace any previously specified servlet names .
Add servlet names for the filter .
Set the URL patterns that the filter will be registered against . This will replace any previously specified URL patterns .
Add URL patterns as defined in the Servlet specification that the filter will be registered against .
Convenience method to {
Configure registration settings . Subclasses can override this method to perform additional configuration if required .
Create a nested {
Create a nested {
Create a nested {

As you can see 6/12 are basically not usable.

**Bind indexed elements to the supplied collection.** @param name the name of the property to bind @param target the target bindable @param elementBinder the binder to use for elements @param aggregateType the aggregate type, may be a collection or an array @param elementType the element type @param result the destination for results
**Set** {@link **ServletRegistrationBean**}**s that the filter will be registered against.** @param servletRegistrationBeans the Servlet registration beans
**Add** {@link **ServletRegistrationBean**}**s for the filter.** @param servletRegistrationBeans the servlet registration beans to add @see #setServletRegistrationBeans
**Set servlet names that the filter will be registered against. This will replace any previously specified servlet names.** @param servletNames the servlet names @see #setServletRegistrationBeans @see #setUrlPatterns
**Add servlet names for the filter.** @param servletNames the servlet names to add
**Set the URL patterns that the filter will be registered against. This will replace any previously specified URL patterns.** @param urlPatterns the URL patterns @see #setServletRegistrationBeans @see #setServletNames
**Add URL patterns, as defined in the Servlet specification, that the filter will be registered against.** @param urlPatterns the URL patterns
**Convenience method to** {@link **#setDispatcherTypes(EnumSet) set dispatcher types**} **using the specified elements.** @param first the first dispatcher type @param rest additional dispatcher types
**Configure registration settings. Subclasses can override this method to perform additional configuration if required.** @param registration the registration
**Create a nested** {@link **DependencyCustomizer**} **that only applies if any of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if all of the specified class names are not on the class path.** @param classNames the class names to test @return a nested {@link DependencyCustomizer}
**Create a nested** {@link **DependencyCustomizer**} **that only applies if the specified paths are on the class path.** @param paths the paths to test @return a nested {@link DependencyCustomizer}

However, the relevant information is in the raw docstrings (I added the ** to highlight relevant passages). Simply using the part before the first @ produces pretty bad results (at least in Java), as it's common practice to highlight code blocks or links with javadoc-tags.
Possible solution: taking everything before the first param (or maybe @param) as the summary and afterwards removing javadoc-tags (maybe keeping the tokens inside).

Cleaning

The preprocessing does not include any cleaning. This manifests in docstrings that contain html-tags (which are commonly found in javadoc), as well as URLs (which afterwards get pretty verbose tokenized). See these two samples from java

Determine if a uri is in origin - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .
Determine if a uri is in asterisk - form according to <a href = https : // tools . ietf . org / html / rfc7230#section - 5 . 3 > rfc7230 5 . 3< / a > .

Some stats

>>> wc -l java.train.comment 
454436 java.train.comment
>>> grep -E '<p >|<p >' java.train.comment | wc -l
42750

At least 10% of the tokenized java docstrings still contain html tags.

>>> grep '{ @' java.train.comment | wc -l
44500

Another 10% still contain javadoc.

>>> grep "{ @inheritDoc }" java.train.comment | wc -l
1685

2k consist only of a single javadoc tag indicating that the doc was inherited.

Many of the golang docstrings contain URLs, which are not very useful in the tokenized version the regex produces (see above in java).

>>> wc -l go.train.comment 
317822 go.train.comment
>>> grep -E 'http :|https :' go.train.comment | wc -l
19753

~6% contain URLs (starting with http :)

>>> grep -E "autogenerated|auto generated" go.train.comment | wc -l
4850

Around 5k auto generated methods

>>> grep "/ *" go.train.comment | wc -l
33620

10% still contain c-style comment delimiters.

Tokenization

Any specific reason you keep punctuation symbols like ., ,, -, /, :, <, >, *, =, @, (, ) as tokens? Is it to keep code in the docstrings?

Summary

I really think better cleaning and language-dependent preprocessing would produce higher-quality docstrings. At least for Java, removing javadoc and HTML could be beneficial, as well as using everything before the first param as a summary (maybe in combination with the first-paragraph \n\n heuristic).
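
As a rough illustration of the direction suggested here (this is not the repository's pipeline, and the exact tag handling is a design choice), a Javadoc-aware cleanup could look like the sketch below: keep the text before the first block tag, unwrap inline tags, and strip HTML.

import re

def clean_javadoc(docstring: str) -> str:
    """Heuristic Javadoc cleanup for building a docstring summary."""
    # 1. Keep everything before the first block tag (@param, @return, @see, ...).
    summary = re.split(r"\n\s*@\w+", docstring, maxsplit=1)[0]
    # 2. Unwrap inline tags, keeping their content: "{@link Foo#bar}" -> "Foo#bar".
    summary = re.sub(r"\{@\w+\s+([^}]*)\}", r"\1", summary)
    summary = re.sub(r"\{@\w+\}", "", summary)      # e.g. a bare {@inheritDoc}
    # 3. Drop HTML tags such as <p> or <a href=...>.
    summary = re.sub(r"<[^>]+>", " ", summary)
    return " ".join(summary.split())

doc = ("Set {@link ServletRegistrationBean}s that the filter will be registered against.\n"
       "<p>This will replace any previously specified servlet names.</p>\n"
       "@param servletRegistrationBeans the Servlet registration beans")
print(clean_javadoc(doc))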

W&B Default Screen - Wasted Real Estate

Consider opening the Auto-Visualizations by default, right now - when you navigate to a project it opens with the below view. Seems like wasted real-estate

image

W&B When You Tag A Run, Doesn't Appear Until Refresh

See animation below, perhaps something is broken in javascript

label_problem

I do get the following error in the console: Error while trying to use the following icon from the Manifest: https://app.wandb.ai/favicon.png (Resource size is not correct - typo in the Manifest?)

Missing code to build files *_dedupe_definitions_v2.pkl

I have noticed the usage of the files *_dedupe_definitions_v2.pkl when using predict.py
However, I cannot find the code to build the files *_dedupe_definitions_v2.pkl
how should I build those files?
what is the purpose of those files?

thanks

script/setup failed

Thanks for your cool project!
I followed the quickstart instructions, but it seems that I can't pip install tensorflow_gpu in the Docker container because of a network problem. Here is the error information:
Step 13/21 : RUN pip --no-cache-dir install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
---> Running in 10ecd6cca56f
Collecting tensorflow-gpu==1.12.0 from https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl
Downloading https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.12.0-cp36-cp36m-linux_x86_64.whl (281.7MB)
Requirement already satisfied: six>=1.10.0 in /usr/lib/python3/dist-packages (from tensorflow-gpu==1.12.0) (1.10.0)
Collecting termcolor>=1.1.0 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/8a/48/a76be51647d0eb9f10e2a4511bf3ffb8cc1e6b14e9e4fab46173aa79f981/termcolor-1.1.0.tar.gz
Collecting keras-preprocessing>=1.0.5 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/28/6a/8c1f62c37212d9fc441a7e26736df51ce6f0e38455816445471f10da4f0a/Keras_Preprocessing-1.1.0-py2.py3-none-any.whl (41kB)
Collecting keras-applications>=1.0.6 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/71/e3/19762fdfc62877ae9102edf6342d71b28fbfd9dea3d2f96a882ce099b03f/Keras_Applications-1.0.8-py3-none-any.whl (50kB)
Collecting numpy>=1.13.3 (from tensorflow-gpu==1.12.0)
Downloading https://files.pythonhosted.org/packages/e5/e6/c3fdc53aed9fa19d6ff3abf97dfad768ae3afce1b7431f7500000816bda5/numpy-1.17.2-cp36-cp36m-manylinux1_x86_64.whl (20.4MB)

ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 397, in _error_catcher
yield
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 479, in read
data = self._fp.read(amt)
File "/usr/lib/python3.6/http/client.py", line 449, in read
n = self.readinto(b)
File "/usr/lib/python3.6/http/client.py", line 493, in readinto
n = self.fp.readinto(b)
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.6/ssl.py", line 874, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
socket.timeout: The read operation timed out

PHP tokenization of UTF-8 is broken

I've tried to use php_dedupe_definitions_v2.pkl for my own project and found many functions with broken tokenization. For example, searching for functions with empty ('') tokens turns up more than 8,000 of them. Then, if we look for all 1-letter tokens, we get tons of 1-letter UTF-8 tokens, which should be impossible.
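
For reference, the kind of check described can be sketched like this. The record layout of the dedupe pickle is an assumption here (a list of dicts with a token list under something like 'function_tokens'); print one record first to confirm the actual keys:

import pickle

with open("php_dedupe_definitions_v2.pkl", "rb") as f:
    definitions = pickle.load(f)

print(len(definitions))
print(definitions[0])  # inspect the real keys before trusting the counts below

# Assumed key name: 'function_tokens'.
empty_tokens = sum(1 for d in definitions if "" in d.get("function_tokens", []))
non_ascii_single = sum(1 for d in definitions
                       for t in d.get("function_tokens", [])
                       if len(t) == 1 and ord(t) > 127)
print("records containing empty tokens:", empty_tokens)
print("single-character non-ASCII tokens:", non_ascii_single)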

How to submit custom models?

How do I submit custom models only containing a model_predictions.csv file as an output of the custom model? It seems like the CI tests fail on custom model submissions and thus the PR is closed (see #184)?

Error while processing a single Python file.

In CodeSearchNet/function_parser/function_parser/demo.ipynb, I kept everything the same until the third cell and then I ran processor.process_single_file(py_file_path). Here py_file_path contains the complete path of the .py file that I want to process.

After executing the above line I got the following error :
unhashable type: 'tree_sitter.Node' in file function_parser/function_parser/parsers/language_parser.py.

Am I missing something?

How to deconstruct code into tokens to extract functions and comments?

I want to make a code search corpus. I have collected a lots of GitHub repositories. Now I need to deconstruct code into tokens to extract functions and comments. You describe in the paper CodeSearchNet Challenge Evaluating the State of Semantic Code Search: We then tokenize all Go, Java, JavaScript, Python, PHP and Ruby functions (or methods) using TreeSitter — GitHub’s universal parser — and, where available, their respective documentation text using a heuristic regular expression.

I can extract functions in Python, but they don't have comments. How do you extract functions together with their comments? Can you share your code?

script to request and download model NDCG

Since the relevance_annotations.csv is not available, a script to upload model_predictions.csv and download a model statistics file (NDCG, MRR, for example) would be great.

Is the MRR calculation reasonable now?

CodeSearchNet is a very good task. I think this will greatly promote the development of this field.
But I have some questions, as follows:

  1. The MRR and NDCG performances are inconsistent. The model has a high MRR value on the test sets, but a low score on the leaderboard. Why is that?
  2. According to my understanding, the calculation of MRR is based on the order within a batch. However, the examples in the batch are random, and the batch size affects the calculation of the MRR. Is this appropriate? (See the toy simulation after this list.)
  3. If I want to train a model, how should I evaluate the model and select the best model during training? Is there a better way?
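
To illustrate the second point, here is a toy simulation (random ranks, nothing model-specific) showing how even a chance-level scorer's in-batch MRR depends on the number of candidates it is ranked against:

import numpy as np

rng = np.random.default_rng(0)

def random_mrr(num_candidates, trials=20000):
    """Expected MRR when the correct snippet lands at a uniformly random rank."""
    ranks = rng.integers(1, num_candidates + 1, size=trials)
    return float(np.mean(1.0 / ranks))

for num_candidates in (10, 100, 200, 1000):
    print(num_candidates, round(random_mrr(num_candidates), 4))
# Smaller batches make the ranking task easier, so MRR values are only
# comparable when computed against the same number of distractors.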

Request for a smaller dataset for researchers with lesser resources

Thank you for making this amazing problem statement public, along with a very comprehensive dataset!

Can a relatively smaller dataset (a subset) of it be made available for independent developers/researchers who might try running this on their personal machines?

This will open up the problem for a larger audience and may bring in some innovative solutions!

999 distractor snippets

Thank You for CodeSearchNet. A quick question: if we wish to replicate Table 3 (of your paper) where can we find the 999 distractor snippets and how did you form those 999 snippets?

Where can I find the pre-trained model

I saw that one of the goals of this project is to "Open source code for a range of baseline models, along with pre-trained weights".
So is there a pre-trained model? I couldn't find it here.

question about NDCG calculation

I've noticed that the official calculation of NDCG is here.

def ndcg(predictions: Dict[str, List[str]], relevance_scores: Dict[str, Dict[str, float]],

On this basis the original paper has reported the NDCG of six languages on code search tasks.

NDCG

While I re-implement a baseline search model based on MLP. And I calculate the MRR MAP and NDCG metrics by myself.

The MRR is 0.5128467211800546, MAP is 0.19741363623638755 and NDCG is 0.6274463943748803.

Both MRR and MAP seem great, but the NDCG is nearly 3 times higher than the results reported in the original paper.

I think this may not be the power of my baseline model; there must be something wrong with my NDCG implementation.

Here's the function I used for calculation:

import numpy as np

def iDCG(true_list: list, topk=-1):

    true_descend_list = np.sort(true_list)[::-1]
    
    idcg_list = [(np.power(2,label)-1)/(np.log2(num+1)) for num,label in enumerate(true_descend_list,start=1)]
    if not topk==-1:
        idcg_list = idcg_list[:topk]
    idcg = np.sum(idcg_list)
    return idcg

def DCG(true_list,pred_list,topk=-1):

    pred_descend_order = np.argsort(pred_list)[::-1]
    
    true_descend_list = [true_list[i] for i in pred_descend_order]
    
    dcg_list = [(np.power(2,label)-1)/(np.log2(num+1)) for num,label in enumerate(true_descend_list,start=1)]
    
    if not topk==-1:
        dcg_list = dcg_list[:topk]
    
    dcg = np.sum(dcg_list)
    return dcg

def NDCG(true_rank_dict,pred_rank_dict,topk=-1):
    
    ndcg_lst = []
    
    for qid in pred_rank_dict:
        temp_pred = pred_rank_dict[qid]
        temp_true = true_rank_dict[qid]
        
        idcg = iDCG(true_list=temp_true,topk=topk)
        
        dcg = DCG(true_list=temp_true,pred_list=temp_pred,topk=topk)
        
        ndcg = dcg / idcg if not idcg == 0 else 0
        
        ndcg_lst.append(ndcg)
        
    return np.average(ndcg_lst)

I'm confused about how the original paper calculates the NDCG metric, especially how to choose the threshold K of NDCG@K, which is not specified in the paper.

Pls help.

Sorry, I still don't understand the whole train-test workflow after reading all the instructions.

Greetings, I'm a PhD student working in IR/SE. I'm very happy to see CSNet project and thank you for your great work.

I'm working on software representation and code retrieval/search. I'm familiar with IR datasets, but forgive me, I don't quite understand how the CSNet dataset provides data annotations.

I've got all six languages from AWS, and I've found the queries here https://github.com/github/CodeSearchNet/blob/master/resources/queries.csv in the resources folder.

I explored the data in a pandas.DataFrame and see that all six languages have been divided into train/test/valid partitions.

But I still cannot find the relation between the queries and the code data. In other words, no query text has related source code data associated with it.

I'm new to the W&B platform. I wonder whether there are gold labels for training a model offline, or whether I strictly need to write a dataloader or function for accepting W&B's online label data.

Error when executing docker run

I have tried to execute the scripts to run the dockers images such as
docker run -v $(pwd):/home/dev preprocessing

However, I have got the following:
standard_init_linux.go:219: exec user process caused: no such file or directory

Additionally, there is no script to run the docker-cpu.Dockerfile
how can I run that docker?

I have tried something like
docker run --net=host -v $(pwd):/home/dev csnet:cpu bash
but it doesn't return anything; I can see that the instance goes down by observing the syslog file

I am using an ubuntu 18.04 virtual machine with python 3.6.5

Extract descriptions from `@return` doc comments if method has no summary?

e.g. some short methods may contain a description in the return tag, but not a description of the method itself. (to avoid redundancy).

Doing this would extract more methods, but they may be of lower quality if incorrectly parsed or automatically generated. I'd expect a description such as @return bool STARTDESCRIPTION true if this is a float to be extracted (I'm not familiar with how the data representation works)

  • Some libraries/applications are strict about always having a summary in their coding standards, others aren't.
  • e.g. I've seen @return the description without a type for php
  • Could try to annotate the fact that the fallback was used and this is a non-official summary
    /**
     * @return bool true if this is a float
     */
    public function isFloat()

It would be nice to account for types such as `@return HasTemplate<string, stdClass>`, etc. Making sure that any `<([{` in the first token are matched up may be useful as a basic approach (and giving up if that fails), as in the sketch below. (There's no official standard, and different static analyzers have their own extensions to the type syntax.)
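
Purely as an illustration of that bracket-matching heuristic (none of this exists in the current pipeline; the function name, regex, and behaviour on typeless tags are all assumptions), a sketch in Python:

import re

# Hypothetical fallback: if a method has no summary line, recover a description
# from its `@return` tag. The type token may contain nested brackets such as
# `HasTemplate<string, stdClass>`, so the description only starts at the first
# whitespace outside any <([{...}])>. Tags with no type at all (e.g.
# `@return the description`) would lose their first word under this heuristic.
OPENING = "<([{"
CLOSING = {">": "<", ")": "(", "]": "[", "}": "{"}

def return_tag_description(docblock):
    match = re.search(r"@return\s+(.*)", docblock)
    if match is None:
        return None
    # drop a trailing `*/` if the tag sits on the last line of the docblock
    rest = re.sub(r"\*/\s*$", "", match.group(1)).strip()
    stack = []
    split_at = None
    for i, ch in enumerate(rest):
        if ch in OPENING:
            stack.append(ch)
        elif ch in CLOSING:
            if not stack or stack[-1] != CLOSING[ch]:
                return None  # unbalanced type expression: give up
            stack.pop()
        elif ch.isspace() and not stack:
            split_at = i
            break
    if split_at is None:
        return None  # the tag only contains a type, no description
    return rest[split_at:].strip() or None

# return_tag_description("/** @return bool true if this is a float */")
#   -> "true if this is a float"
# return_tag_description("/** @return HasTemplate<string, stdClass> */")
#   -> None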

An example implementation for PHP is https://github.com/phan/phan/blob/2.2.12/src/Phan/Language/Element/MarkupDescription.php - examples of what it generates as descriptions are https://github.com/phan/phan/blob/2.2.12/tests/Phan/Language/Element/MarkupDescriptionTest.php#L132-L155

A minor Java tokenization utf-related issue

This may not be something very important or worth fixing immediately, but there may be a small bug in Java function tokenization.

At least one function in the dataset has code_tokens that do not include a { token.


Quick inspection with

# jdf is the Java partition loaded into a pandas DataFrame
with pd.option_context('display.max_colwidth', -1):
    display(jdf.loc[jdf['url'] == 'https://github.com/jbehave/jbehave-core/blob/bdd6a6199528df3c35087e72d4644870655b23e6/examples/i18n/src/main/java/org/jbehave/examples/trader/i18n/steps/DeSteps.java#L22-L25'][['code', 'code_tokens']])

shows tokens like `tring` and `ymbol` for this code:

@Given("ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold")
public void aStock(@Named("sümbol") String symbol, @Named("threshold") double threshold) { ...

code_tokens looks like this

[@, Given, (, "ich habe eine Aktion mit dem Symbol $sümbol und eine Schwelle von $threshold"), , public, void, aStock, (, @, Named, (, "sümbol"), , tring , ymbol,, , N, amed(, ", threshold"), ...]

I'm not very familiar with the extraction pipeline codebase, but the fact that tree-sitter seems to identify the token locations correctly (see the attached screenshot) makes me think that JavaParser.get_definition(), which does some index math, may be worth a closer inspection.

tokens_str in Encoder

I have been trying to implement a "custom" encoder within this codebase, and I was wondering how to get access to the raw tokens (in string form).

What I have tried so far:

  • Setting up the tokens_str placeholder:
self.placeholders['tokens_str'] = \
            tf.placeholder(tf.string,
                           shape=[None, self.get_hyper('max_num_tokens')],
                           name='tokens_str')

Once I have those, I am trying to pipe the tokens into the tf_hub Elmo module, as follows:

seq_tokens = self.placeholders['tokens_str']
seq_tokens_lengths = self.placeholders['tokens_lengths']

# ## DEBUGGING: OUTPUT SHAPES
# print("Sequence Tokens Shape: %s" % seq_tokens.shape)
# print("Sequence Tokens Lengths: %s" % seq_tokens_lengths)

## pull the ELMo module from TensorFlow Hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=is_train)
token_embeddings = elmo(
    {
        "tokens": seq_tokens,
        "sequence_len": seq_tokens_lengths
    },
    signature='tokens',
    as_dict=True
)['elmo']  ## [batch_size, max_length, 1024 or 512]

I can see from my debugging statements that seq_tokens has shape (?, 200), which is expected, and that seq_tokens_lengths has shape (?,), which is also expected.

The error I get is len(seq_lens) != input.dims(0), (1000 vs. 0), and it comes from the ELMo model. I'm guessing it is caused by seq_tokens not actually being fed into the ELMo model.
Any help is appreciated!
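
In case it helps anyone hitting the same error: a minimal, standalone sketch of the TF Hub ELMo "tokens" signature outside of this codebase (assuming TF 1.x and the public tfhub module), where both placeholders are explicitly fed. The `len(seq_lens) != input.dims(0), (1000 vs. 0)` message is consistent with the tokens tensor arriving with a batch dimension of 0, e.g. because `tokens_str` was never added to the feed dict:

import tensorflow as tf
import tensorflow_hub as hub

# token rows must all be padded to the same length, e.g. with "" strings
max_num_tokens = 5
tokens_str = tf.placeholder(tf.string, shape=[None, max_num_tokens], name='tokens_str')
tokens_lengths = tf.placeholder(tf.int32, shape=[None], name='tokens_lengths')

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)
token_embeddings = elmo(
    {"tokens": tokens_str, "sequence_len": tokens_lengths},
    signature='tokens',
    as_dict=True
)['elmo']  # [batch_size, max_num_tokens, 1024]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    feed_dict = {
        tokens_str: [["def", "foo", "(", ")", ":"]],  # one already-padded example
        tokens_lengths: [5],
    }
    print(sess.run(token_embeddings, feed_dict=feed_dict).shape)  # (1, 5, 1024)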

Request for pre-fine-tuning self-att weights

Hi

I would like to obtain the self-attention model weights from before any fine-tuning was done.
Does this link from the leaderboard contain such a weights file? If it is the fine-tuned one, could you make the base self-att model for Python available?

My intention is to pass these weights to a model benefiting from contextual word embeddings in PyTorch. Any information about the structure of the weights file would be helpful.

Thank you

Carlos

About the test set and automatic evaluation

The train and valid sets are used for training, and the codebase is used for evaluating the model, so what is the test set for? Deep Code Search relies on manual evaluation; is there a way to do automatic evaluation?

Also, in the Java dataset some descriptions are empty and some are not in English (e.g. they are in German).

The function processor.process_single_file() fails to produce output.

Hi, I want to use the parser to process my own data. First I wanted to try a few single files, but the function processor.process_single_file() doesn't output anything (it returns just an empty list). How can I debug this? I have used the provided Docker image to set up my environment.

Support Dart and Flutter

Flutter is a UI DSL currently pushed by Google. It uses the Dart programming language.

Is there any chance that you could create a Dart/Flutter dataset just like the Java, Python, PHP, Go, JS, and Ruby ones?

Thanks for your work!

Generating Pypi module for function_parser

Hi,

Are there any plans to export the function_parser library in this repo as a proper PyPI package that people could easily install? It looks awesome and I bet a ton of SE people could make use of it, especially if it were a bit easier to set up. I'm more than happy to help with a PR to do it and, if need be, with finishing any final touches it needs (I understand how research tools can be :)).

BTW, I love the research and that you uploaded all your data and code. I've used it numerous times in my research 🤓.

Calculating NDCG for a Keras model during run-time.

Hi

I am trying to build a Keras model from scratch using only the Python dataset for a start, but I am confused as to how NDCG can be calculated during training or even testing.

As I understand it, calculating NDCG requires a ground truth of which documents are relevant to a query, in ranked order, together with the model's predictions for that query. But in the dataset, each code block comes with its own query (docstring) and there is no ranking of documents per query.

I am using the same network architecture as proposed in this repository, but with only Python as the language input for now; the final layer is a 2D matrix (with softmax applied) of shape (BS, BS), where each cell holds the relevance score for a (query, code) pair. As ground truth, I prepared a 2D matrix with 1 on the diagonal and 0 elsewhere.

But how can I calculate NDCG in this scenario, where only one document is relevant to each query, instead of having a graded relevance list?
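
For what it's worth, my understanding is that the baselines in this repository report MRR over in-batch distractors during training and testing, and NDCG is only computed later against the separate human relevance annotations. With a (BS, BS) score matrix whose diagonal marks the correct (query, code) pairs, a minimal NumPy sketch of that MRR computation (the function name and toy scores are just for illustration) looks like this:

import numpy as np

def batch_mrr(scores):
    """Mean Reciprocal Rank for a (BS, BS) score matrix whose diagonal entries
    are the scores of the correct (query, code) pairs and whose off-diagonal
    entries act as distractors."""
    correct = np.expand_dims(np.diag(scores), axis=-1)  # (BS, 1)
    # rank of the correct snippet = 1 + number of distractors scoring higher
    ranks = 1 + np.sum(scores > correct, axis=-1)       # (BS,)
    return float(np.mean(1.0 / ranks))

# toy batch of 3 queries: the correct snippet wins twice and comes second once
scores = np.array([[0.9, 0.1, 0.2],
                   [0.8, 0.3, 0.1],
                   [0.0, 0.2, 0.7]])
print(batch_mrr(scores))  # (1 + 1/2 + 1) / 3 ≈ 0.833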

GPU Docker image cannot be built

When building the GPU Docker image, the following error occurs:

W: The repository 'http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu xenial Release' does not have a Release file.
E: Failed to fetch http://ppa.launchpad.net/jonathonf/python-3.6/ubuntu/dists/xenial/main/binary-amd64/Packages 403 Forbidden
E: Some index files failed to download. They have been ignored, or old ones used instead.

Seems like the PPA for Python 3.6 was removed. See: https://launchpad.net/~jonathonf/+archive/ubuntu/python-3.6

RELEVANCE_ANNOTATIONS_CSV_PATH file for running evaluation

Hi,

I wanted to run an evaluation using the NDCG score as done in the paper.

Where is the RELEVANCE_ANNOTATIONS_CSV_PATH file for the 99 queries, mentioned in the README, that is needed to run src/relevanceeval.py?

I just want to test my results.

Thanks

Just wanted to leave a note of appreciation. I've been looking for parsers that would deconstruct code into tokens easily (specifically for Ruby code; I took a look at parser, which only produces an AST and would have required more processing) so that I could figure out potential call sites and relations between test files and source code.

I'm looking to prioritize tests that should be run based on which functions are added/removed/updated, similar to what's found in the paper The art of testing less without sacrificing quality.

(I also found that Facebook has done something like this and they have a paper on it)
