
danielzuegner / code-transformer

161 stars · 9 watchers · 30 forks · 2.71 MB

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".

Home Page: https://www.in.tum.de/daml/code-transformer/

License: MIT License

Python 83.30% Jupyter Notebook 2.96% Shell 2.12% C# 3.26% Java 8.36%
deep-learning machine-learning ml4code transformers iclr2021

code-transformer's People

Contributors: danielzuegner, tobias-kirschstein


code-transformer's Issues

Data is not available

Hi! There is something wrong with the data.

(screenshot showing "Internal Server Error.")

How can I get the data and checkpoint? Thanks!

Error when preprocessing java-medium

Dear Authors,

Thank you very much for your work! I used the scripts for preprocessing the code2seq-style Java datasets, intending to later test the method name prediction model on the java-medium dataset, but ran into a strange error. Everything was fine with the stage 1 scripts, but after running the stage 2 script (specifically, python -m scripts.run-preprocessing code_transformer/experiments/preprocessing/preprocess-2.yaml java-medium train), the process failed after a couple of hours (with batch_size: 1000 and num_processes: 8) with the following error:

struct.error: 'i' format requires -2147483648 <= number <= 2147483647

The detailed traceback follows:


Traceback (most recent calls WITHOUT Sacred internals):
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 344, in _sendback_result
    exception=exception))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/externals/loky/backend/queues.py", line 240, in put
    self._writer.send_bytes(obj)
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
"""

The above exception was the direct cause of the following exception:

Traceback (most recent calls WITHOUT Sacred internals):
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 340, in main
    Preprocess2Container().run()
  File "code_transformer/experiments/preprocessing/preprocess-2.py", line 312, in run
    for batch in dataset_slice)
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 1017, in __call__
    self.retrieve()
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/parallel.py", line 909, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/.local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 562, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 435, in result
    return self.__get_result()
  File "/usr/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Unfortunately, I wasn't able to pinpoint the procedure that causes this error. So my questions are: why does this error appear on java-medium (but not on smaller datasets), and how can I resolve it (at least, which particular line may produce it, so that I could perhaps catch an exception around some problematic data)?
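For what it's worth, the traceback points at Python's multiprocessing layer rather than the preprocessing logic itself: every pickled worker result is sent back with a 4-byte signed length header, so a single result larger than 2 GiB overflows exactly this struct.pack("!i", n) call. Below is a minimal, self-contained sketch (not the repository's code) that reproduces the limit, plus a speculative mitigation: lowering batch_size in preprocess-2.yaml so that each returned batch pickles to well under 2 GiB.

import struct

# multiprocessing.Connection.send_bytes() prefixes every payload with a
# 4-byte signed length header; anything past 2**31 - 1 bytes overflows it.
payload_size = 2**31  # one byte past the 2 GiB limit
try:
    struct.pack("!i", payload_size)
except struct.error as err:
    print(err)  # 'i' format requires -2147483648 <= number <= 2147483647

# Speculative mitigation (untested): shrink the data each worker returns,
# e.g. batch_size: 250 instead of 1000 in preprocess-2.yaml, so that no
# single pickled batch exceeds the 2 GiB header limit on java-medium.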

Specify Python version

Greetings,

I am writing a replication paper on the CodeTransformer for the data mining class at my university.
Unfortunately, I cannot find any specification of the Python version that was used for the experiments.
Inferring from torch==1.4.0 in the requirements.txt file, I figured that Python 3.7 might have been the version used, as this is the most recent Python version that supports torch 1.4.0 (judging by the "Programming Languages" section at https://pypi.org/project/torch/1.4.0/).
I would be very grateful if you could specify the Python version used, even though the release of your paper dates back a while.

Thanks and best regards,
Philipp
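(As a side note, a tiny sanity check like the following could be dropped at the top of a training script to record the interpreter and torch versions actually in use; the Python 3.7 expectation below is only the inference from requirements.txt, not a version confirmed by the authors.)

import sys
import torch

# Print and (optionally) assert the environment; 3.7 / 1.4.x are assumptions
# inferred from requirements.txt, not versions confirmed by the authors.
print("Python:", sys.version.split()[0])
print("torch:", torch.__version__)
assert sys.version_info[:2] == (3, 7), "expected Python 3.7 (unconfirmed assumption)"
assert torch.__version__.startswith("1.4"), "expected torch 1.4.x from requirements.txt"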

Missing 'Experiment' Class Definition

Hi Authors,

Thanks for open-sourcing this awesome code!

In experiment.py there is a "from sacred import Experiment", but I couldn't find its definition anywhere, inside or outside code_transformer.utils.sacred. Afterwards the code defines ex = Experiment(base_dir='../../', interactive=False) and uses ex in quite a few places, too.

May I ask whether this is something critical, and where I can find it?

Thank you so much!

Regards,
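For context, the Experiment class here comes from the external Sacred package (pip install sacred), which is why it is not defined anywhere in the repository itself. A minimal sketch of how Sacred's Experiment is typically used (the experiment name and config parameter below are illustrative, not taken from the repo):

from sacred import Experiment

# Sacred's Experiment manages configuration and run bookkeeping.
ex = Experiment('demo', base_dir='.', interactive=False)

@ex.config
def config():
    learning_rate = 0.001  # hypothetical config value

@ex.automain
def run(learning_rate):
    # Sacred injects config values into captured functions by name.
    print(f"training with learning_rate={learning_rate}")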

Number of training steps needed?

Thanks for releasing this amazing repo! The documentation is thorough and extremely helpful!

I didn't find the number of training steps or epochs needed in Appendix A.6 of the paper. I am running
python -m scripts.run-experiment code_transformer/experiments/code_transformer/code_summarization.yaml (I changed the number of layers in the YAML file from 1 to 3, following the appendix of the paper) for over 2 days on a single GPU.

I have run 600k steps, and the F1 score in TensorBoard (I guess this is the F1 score averaged over the 4 programming languages?) is around 0.27 (the micro F1 is 0.33). The numbers are still a bit off from Table 2. I wonder whether I should just train longer or whether something is wrong with my training.
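(On the micro vs. averaged F1 distinction mentioned above: the snippet below is a generic, synthetic illustration of how micro and macro averaging can diverge; it is not the repository's evaluation code and the numbers have no relation to Table 2.)

import numpy as np
from sklearn.metrics import f1_score

# Synthetic multi-class labels; the classes merely stand in for prediction targets.
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 2, 2, 2])

# Micro-F1 pools all true/false positives and negatives before computing F1;
# macro-F1 averages the per-class F1 scores, so small classes carry more weight.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))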

Using code-transformer on C++ code

Hello, I'm currently working on clustering competitive programming code by the algorithm used, and I would like to know whether I could use your model to predict method names. I couldn't find anything C++-related.

Thanks! Have a good day!

TypeError: 'float' object cannot be interpreted as an integer on modern PyTorch

When running on PyTorch 1.10.2+cu102, I get the following error:

Traceback (most recent call last):
  File "/code_transformer/preprocessing/graph/transform.py", line 30, in __call__
    distance_matrix = distance_metric(adj)
  File "/code_transformer/preprocessing/graph/distances.py", line 76, in __call__
    sp_length = all_pairs_shortest_paths(G=G)
  File "/code_transformer/preprocessing/graph/alg.py", line 45, in all_pairs_shortest_paths
    values = torch.tensor([(dct[0], key, value) for dct in sps for key, value in dct[1].items()],
TypeError: 'float' object cannot be interpreted as an integer
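(One speculative guess, not a confirmed diagnosis: on newer library versions the shortest-path lengths may arrive as Python floats where an integer is expected further down. A standalone sketch of casting them explicitly before building the tensor, using plain networkx rather than the repository's all_pairs_shortest_paths, would look roughly like this.)

import networkx as nx
import torch

# Toy graph standing in for an AST; the repository operates on much larger
# graphs, so this is only a shape-compatible illustration.
G = nx.path_graph(5)
sps = list(nx.all_pairs_shortest_path_length(G))

# Build (source, target, length) triples and force the lengths to int, so
# float path lengths from newer networkx/scipy versions cannot leak through.
values = torch.tensor(
    [(src, dst, int(length)) for src, lengths in sps for dst, length in lengths.items()],
    dtype=torch.long,
)
print(values.shape)  # (number_of_node_pairs, 3)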

PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch

Hi,
Preprocessing runs fine without errors for some code snippets, but it throws the exception "PreprocessingException: Error processing batch 0: No snippets left after filtering step. Skipping batch" for a few other snippets.
Can anyone tell me how to overcome this error?

The code used for preprocessing is from the provided interactive_prediction.ipynb notebook:

preprocessor = CTStage1Preprocessor(code_snippet_language, allow_empty_methods=True)
stage1_sample = preprocessor.process([("", "", code_snippet)], 0)

Code embeddings

First of all, thanks for publishing such great work.

In the example notebook you show how to use the "Query Stream Embedding of the masked method name token in the final encoder layer" as a "meaningful embedding of the provided AST/Source code pair". Typically I would use the embedding of the [CLS] token. Is a [CLS] token added under the hood when I use the processing from the notebook, or should I add it myself?
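(For reference, here is a generic PyTorch sketch of the two pooling conventions being contrasted, with purely illustrative shapes and token positions; it is not CodeTransformer-specific and says nothing about whether the notebook adds a [CLS] token.)

import torch

# Pretend encoder output: (batch, seq_len, d_model)
batch, seq_len, d_model = 2, 16, 8
hidden_states = torch.randn(batch, seq_len, d_model)

# Convention 1 (BERT-style): a [CLS] token is prepended to the input and its
# final hidden state at position 0 serves as the sequence embedding.
cls_embedding = hidden_states[:, 0, :]

# Convention 2 (as described in the notebook): take the final-layer embedding
# of the masked method-name token, wherever it sits in the sequence.
masked_position = 5  # illustrative index of the masked method-name token
masked_embedding = hidden_states[:, masked_position, :]

print(cls_embedding.shape, masked_embedding.shape)  # both (batch, d_model)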

Where to obtain and how to use the binary files for javaparser and javamethodextractor

Dear Authors,

Thank you so much for open-sourcing this great work. I am very new to this; may I ask how to obtain java-parser-1.0-SNAPSHOT.jar and JavaMethodExtractor-1.0.0-SNAPSHOT.jar? They are listed as required files under CODE_TRANSFORMER_BINARY_PATH, alongside semantic.

The java_to_ast method requires these files. Without them, the interactive_prediction notebook cannot run either.

I tried to download and use https://jar-download.com/artifacts/com.google.code.javaparser/javaparser/1.0.8/source-code but couldn't make it work; I got the error 'no main manifest attribute, in ./javaparser-1.0.8.jar\n'.

Help will be greatly appreciated!
