
danielzuegner / code-transformer

161 stars · 9 watchers · 30 forks · 2.71 MB

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".

Home Page: https://www.in.tum.de/daml/code-transformer/

License: MIT License

Python 83.30% Jupyter Notebook 2.96% Shell 2.12% C# 3.26% Java 8.36%
deep-learning machine-learning ml4code transformers iclr2021

code-transformer's People

Contributors: danielzuegner, tobias-kirschstein


code-transformer's Issues

Data is not available

Hi! There is something wrong with the data.

(screenshot showing "Internal Server Error.")

How can I get the data and checkpoint? Thanks!

Error when preprocessing java-medium

Dear Authors,

Thank you very much for your work! I used the scripts for preprocessing the code2seq-style Java datasets, intending to later test the method name prediction model on the java-medium dataset, but ran into a strange error. Everything was fine with the stage 1 scripts, but after running the stage 2 script (specifically, python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml java-medium train), the process failed after a couple of hours (with batch_size: 1000 and num_processes: 8) with the following error:

struct.error: 'i' format requires -2147483648 <= number <= 2147483647

The detailed traceback follows:


Traceback (most recent calls WITHOUT Sacred internals):
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent calls WITHOUT Sacred internals):
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 340, in main
    Preprocess2Container().run()
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 312, in run
    for batch in dataset_slice)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Unfortunately, I wasn't able to pinpoint the procedure that causes this error. So my questions are: why does this error appear on java-medium (but not on smaller datasets), and how can I resolve it (at least, which particular line may produce it, so that I could perhaps catch an exception around some problematic data)?
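For what it's worth, the traceback points at Python's multiprocessing layer rather than the preprocessing logic itself: every pickled worker result is sent back with a 4-byte signed length header, so a single result larger than 2 GiB overflows exactly this struct.pack("!i", n) call. Below is a minimal, self-contained sketch (not the repository's code) that reproduces the limit, plus a speculative mitigation: lowering batch_size in preprocess-2.yaml so that each returned batch pickles to well under 2 GiB.

import struct

# multiprocessing.Connection.send_bytes() prefixes every payload with a
# 4-byte signed length header; anything past 2**31 - 1 bytes overflows it.
payload_size = 2**31  # one byte past the 2 GiB limit
try:
    struct.pack("!i", payload_size)
except struct.error as err:
    print(err)  # 'i' format requires -2147483648 <= number <= 2147483647

# Speculative mitigation (untested): shrink the data each worker returns,
# e.g. batch_size: 250 instead of 1000 in preprocess-2.yaml, so that no
# single pickled batch exceeds the 2 GiB header limit on java-medium.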

Specify Python version

Greetings,

I am writing a replication paper on the CodeTransformer for the data mining class at my university.
Unfortunately, I cannot find any specification of the Python version that was used for the experiments.
Inferring from torch==1.4.0 in the requirements.txt file, I figured that Python 3.7 might have been the version used, as this is the most recent Python version that supports torch 1.4.0 (judging by the "Programming Languages" section at https://pypi.org/project/torch/1.4.0/).
I would be very grateful if you could specify the Python version used, even though the release of your paper dates back a while.

Thanks and best regards,
Philipp
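(As a side note, a tiny sanity check like the following could be dropped at the top of a training script to record the interpreter and torch versions actually in use; the Python 3.7 expectation below is only the inference from requirements.txt, not a version confirmed by the authors.)

import sys
import torch

# Print and (optionally) assert the environment; 3.7 / 1.4.x are assumptions
# inferred from requirements.txt, not versions confirmed by the authors.
print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)
assert sys.version_info[:2] == (3, 7), "expected Python 3.7 (unconfirmed assumption)"
assert torch.__version__.startswith("1.4"), "expected torch 1.4.x from requirements.txt"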

Missing 'Experiment' Class Definition

Hi Authors,

Thanks for open-sourcing this awesome code!

In experiment.py there is a "from sacred import Experiment", but I couldn't find its definition anywhere, inside or outside code_transformer.utils.sacred. Afterwards the code defines ex = Experiment(base_dir='../../', interactive=False) and uses ex in quite a few places, too.

May I ask whether this is something critical, and where I can find it?

Thank you so much!

Regards,
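For context, the Experiment class here comes from the external Sacred package (pip install sacred), which is why it is not defined anywhere in the repository itself. A minimal sketch of how Sacred's Experiment is typically used (the experiment name and config parameter below are illustrative, not taken from the repo):

from sacred import Experiment

# Sacred's Experiment manages configuration and run bookkeeping.
ex = Experiment('demo', base_dir='.', interactive=False)

@ex.config
def config():
    learning_rate = 0.001  # hypothetical config value

@ex.automain
def run(learning_rate):
    # Sacred injects config values into captured functions by name.
    print(f"training with learning_rate={learning_rate}")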

Number of training steps needed?

Thanks for releasing this amazing repo! The documentation is thorough and extremely helpful!

I didn't find the number of training steps or epochs needed in Appendix A.6 of the paper. I am running
python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml (I changed the number of layers in the YAML file from 1 to 3, following the appendix of the paper) for over 2 days on a single GPU.

I have run 600k steps, and the F1 score in TensorBoard (I guess this is the F1 score averaged over the 4 programming languages?) is around 0.27 (the micro F1 is 0.33). The numbers are still a bit off from Table 2. I wonder whether I should just train longer or whether something is wrong with my training.
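(On the micro vs. averaged F1 distinction mentioned above: the snippet below is a generic, synthetic illustration of how micro and macro averaging can diverge; it is not the repository's evaluation code and the numbers have no relation to Table 2.)

import numpy as np
from sklearn.metrics import f1_score

# Synthetic multi-class labels; the classes merely stand in for prediction targets.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# Micro-F1 pools all true/false positives and negatives before computing F1;
# macro-F1 averages the per-class F1 scores, so small classes carry more weight.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))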

Using code-transformer on C++ code

Hello, I'm currently working on clustering competitive programming code by the algorithm used, and I would like to know whether I could use your model to predict method names. I couldn't find anything C++-related.

Thanks! Have a good day!

TypeError: 'float' object cannot be interpreted as an integer on modern PyTorch

When running on PyTorch 1.10.2+cu102, I get the following error:

Traceback (most recent call last):
  File "/code_transformer/preprocessing/graph/transform.py", line 30, in __call__
    distance_matrix = distance_metric(adj)
  File "/code_transformer/preprocessing/graph/distances.py", line 76, in __call__
    sp_length = all_pairs_shortest_paths(G=G)
  File "/code_transformer/preprocessing/graph/alg.py", line 45, in all_pairs_shortest_paths
    values = torch.tensor([(dct[0], key, value) for dct in sps for key, value in dct[1].items()],
TypeError: 'float' object cannot be interpreted as an integer
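(One speculative guess, not a confirmed diagnosis: on newer library versions the shortest-path lengths may arrive as Python floats where an integer is expected further down. A standalone sketch of casting them explicitly before building the tensor, using plain networkx rather than the repository's all_pairs_shortest_paths, would look roughly like this.)

import networkx as nx
import torch

# Toy graph standing in for an AST; the repository operates on much larger
# graphs, so this is only a shape-compatible illustration.
G = nx.path_graph(5)
sps = list(nx.all_pairs_shortest_path_length(G))

# Build (source, target, length) triples and force the lengths to int, so
# float path lengths from newer networkx/scipy versions cannot leak through.
values = torch.tensor(
    [(src, dst, int(length)) for src, lengths in sps for dst, length in lengths.items()],
    dtype=torch.long,
)
print(values.shape)  # (number_of_node_pairs, 3)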

PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch

Hi,
Preprocessing runs fine without errors for some code snippets, but it throws the exception "PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch" for a few other snippets.
Can anyone tell me how to overcome this error?

The code used for preprocessing is from the provided interactive_prediction.ipynb notebook:

preprocessor = CTStage1Preprocessor(code_snippet_language, allow_empty_methods=True)
stage1_sample = preprocessor.process([("", "", code_snippet)], 0)

Code embeddings

First of all, thanks for publishing such great work.

In the example notebook you show how to use the "Query Stream Embedding of the masked method name token in the final encoder layer" as a "meaningful embedding of the provided AST/Source code pair". Typically I would use the embedding of the [CLS] token. Is a [CLS] token added under the hood when I use the processing from the notebook, or should I add it myself?
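(For reference, here is a generic PyTorch sketch of the two pooling conventions being contrasted, with purely illustrative shapes and token positions; it is not CodeTransformer-specific and says nothing about whether the notebook adds a [CLS] token.)

import torch

# Pretend encoder output: (batch, seq_len, d_model)
batch, seq_len, d_model = 2, 16, 8
hidden_states = torch.randn(batch, seq_len, d_model)

# Convention 1 (BERT-style): a [CLS] token is prepended to the input and its
# final hidden state at position 0 serves as the sequence embedding.
cls_embedding = hidden_states[:, 0, :]

# Convention 2 (as described in the notebook): take the final-layer embedding
# of the masked method-name token, wherever it sits in the sequence.
masked_position = 5  # illustrative index of the masked method-name token
masked_embedding = hidden_states[:, masked_position, :]

print(cls_embedding.shape, masked_embedding.shape)  # both (batch, d_model)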

Where to obtain and how to use the binary files for javaparser and javamethodextractor

Dear Authors,

Thank you so much for open-sourcing this great work. I am very new to this; may I ask how to obtain java-parser-1.0-SNAPSHOT.jar and JavaMethodExtractor-1.0.0-SNAPSHOT.jar? They are listed as required files under CODE_TRANSFORMER_BINARY_PATH, alongside semantic.

The java_to_ast method requires these files. Without them, the interactive_prediction notebook cannot run either.

I tried to download and use https://jar-download.com/artifacts/com.google.code.javaparser/javaparser/1.0.8/source-code but couldn't make it work; I got the error 'no main manifest attribute, in ./javaparser-1.0.8.jar\n'.

Help will be greatly appreciated!
