thudm / codegeex

CodeGeeX: An Open Multilingual Code Generation Model (KDD 2023)

Home Page: https://codegeex.cn

License: Apache License 2.0

Python 94.65% C++ 1.99% Dockerfile 0.20% Shell 2.91% Java 0.26%
code-generation pretrained-models tools

codegeex's Introduction

🏠 Homepage | 📖 Blog | 🪧 DEMO | 🤖 Download Model | 📄 Paper | 🌐 中文

🛠 VS Code, JetBrains, Cloud Studio supported | 👋 Join our Discord, Slack, Telegram, WeChat

🌟 CodeGeeX2 has been released: more powerful, faster, and more lightweight.

CodeGeeX: A Multilingual Code Generation Model

We introduce CodeGeeX, a large-scale multilingual code generation model with 13 billion parameters, pre-trained on a large code corpus of more than 20 programming languages. As of June 22, 2022, CodeGeeX has been trained on more than 850 billion tokens on a cluster of 1,536 Ascend 910 AI Processors. CodeGeeX has several unique features:

  • Multilingual Code Generation: CodeGeeX performs well at generating executable programs in several mainstream programming languages, including Python, C++, Java, JavaScript, Go, etc. DEMO
  • Crosslingual Code Translation: CodeGeeX supports the translation of code snippets between different languages. With a single click, CodeGeeX can transform a program into any expected language with high accuracy. DEMO
  • Customizable Programming Assistant: CodeGeeX is available in the VS Code extension marketplace for free. It supports code completion, explanation, summarization, and more, empowering users with a better coding experience. VS Code Extension
  • Open-Source and Cross-Platform: All code and model weights are publicly available for research purposes. CodeGeeX supports both Ascend and NVIDIA platforms and can run inference on a single Ascend 910, NVIDIA V100, or A100. Apply Model Weights

HumanEval-X for Realistic Multilingual Benchmarking. To help standardize the evaluation of multilingual code generation and translation, we develop and release the HumanEval-X benchmark. HumanEval-X is a new multilingual benchmark that contains 820 human-crafted coding problems in 5 programming languages (Python, C++, Java, JavaScript, and Go); each problem is accompanied by tests and solutions. Usage 🤗 Available in HuggingFace
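
For instance, the benchmark can be loaded with the HuggingFace datasets library. A minimal sketch, assuming the dataset is published under the THUDM/humaneval-x identifier with one configuration per language:

from datasets import load_dataset

# One configuration per language: python, cpp, java, js, go (assumed names).
problems = load_dataset("THUDM/humaneval-x", "python", split="test")

sample = problems[0]
print(sample["task_id"])       # e.g., "Python/0"
print(sample["prompt"][:200])  # declaration + docstring used as model input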

CodeGeeX achieves the highest average performance compared with other open-source multilingual baselines.

News

  • 🌟 2023-07-24: CodeGeeX2 has been released: more powerful, faster, and more lightweight. It supports 100+ languages and many new features.

  • 2023-05-16: The CodeGeeX paper has been accepted by KDD 2023, Long Beach, and will be presented at the conference.

  • 2023-03-30: The CodeGeeX paper is now available on arXiv.

  • 2023-02-14: CodeGeeX now supports Cloud Studio, a fantastic web IDE from Tencent. Click on the badge at the top of this page to quickly launch an environment to test CodeGeeX.

  • 2023-02-13: Many thanks to the OneFlow team for adding a OneFlow backend for CodeGeeX inference (even faster than FasterTransformer under FP16!). Check more details here.

  • 2023-02: We are hosting the CodeGeeX "Coding With AI" Hackathon: design cool applications based on CodeGeeX and win prizes (an RTX 4090, a DJI drone, etc.)!

  • 2022-12-31: We release the FasterTransformer version of CodeGeeX in codegeex-fastertransformer. The INT8-accelerated version reaches an average speed of under 15ms/token. Happy new year to everyone!

  • 2022-12-13: We release the source code of CodeGeeX VS Code extension in codegeex-vscode-extension. Follow QuickStart to start development.

  • 2022-12-11: CodeGeeX is now available for JetBrains IDEs (IntelliJ IDEA, PyCharm, GoLand, CLion, etc.); download it here.

  • 2022-12-04: We release the source code for quantization (reducing GPU memory usage from 27GB to 15GB) and model parallelism (making it possible to run on multiple GPUs with less than 8GB of GPU memory each).

  • 2022-09-30: We release the cross-platform source code and model weights for both Ascend and NVIDIA platforms.

Getting Started

CodeGeeX was initially implemented in MindSpore and trained on Ascend 910 AI Processors. We provide a PyTorch-compatible version based on Megatron-LM to facilitate usage on GPU platforms.

Installation

Python 3.7+ / CUDA 11+ / PyTorch 1.10+ / DeepSpeed 0.6+ are required. Install the codegeex package via:

git clone [email protected]:THUDM/CodeGeeX.git
cd CodeGeeX
pip install -e .

Or use the CodeGeeX Docker image to quickly set up the environment (with nvidia-docker installed):

docker pull codegeex/codegeex:latest
# To enable GPU support, specify device IDs via --gpus
docker run --gpus '"device=0,1"' -it --ipc=host --name=codegeex codegeex/codegeex

Model Weights

Apply for and download the model weights through this link. You will receive a urls.txt file by email containing temporary download links. We recommend using aria2 to download them via the following command (please make sure you have enough disk space for the checkpoint, ~26GB):

aria2c -x 16 -s 16 -j 4 --continue=true -i urls.txt 

Run the following command to get the full model weights:

cat codegeex_13b.tar.gz.* > codegeex_13b.tar.gz
tar xvf codegeex_13b.tar.gz

Inference on GPUs

Try generating your first program with CodeGeeX. First, specify the path to the model weights in configs/codegeex_13b.sh. Second, write the prompt (a natural-language description or code snippet) into a file, e.g., tests/test_prompt.txt, then run the following script:

# On a single GPU (with more than 27GB of GPU memory)
bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt

# With quantization (with more than 15GB of GPU memory)
bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt

# On multiple GPUs (with more than 6GB of GPU memory each; the checkpoint must first be converted into MP_SIZE partitions)
bash ./scripts/convert_ckpt_parallel.sh <LOAD_CKPT_PATH> <SAVE_CKPT_PATH> <MP_SIZE>
bash ./scripts/test_inference_parallel.sh <MP_SIZE> ./tests/test_prompt.txt

VS Code and JetBrains Extension Guidance

Based on CodeGeeX, we have also developed free extensions for VS Code and JetBrains IDEs, with more to come.

For VS Code, search for "codegeex" in the Marketplace or install it here. Detailed instructions can be found in the VS Code Extension Guidance. For developers, we have also released the source code in codegeex-vscode-extension; please follow the QuickStart to start development.

For JetBrains IDEs, search for "codegeex" in Plugins or install it here. Make sure your IDE version is 2021.1 or later. CodeGeeX now supports IntelliJ IDEA, PyCharm, GoLand, CLion, Android Studio, AppCode, Aqua, DataSpell, DataGrip, Rider, RubyMine, and WebStorm.

CodeGeeX: Architecture, Code Corpus, and Implementation

Architecture: CodeGeeX is a large-scale pre-trained programming language model based on transformers. It is a left-to-right autoregressive decoder that takes code and natural language as input and predicts the probability of the next token. CodeGeeX contains 40 transformer layers with a hidden size of 5,120 for the self-attention blocks and 20,480 for the feed-forward layers, bringing its total size to 13 billion parameters. It supports a maximum sequence length of 2,048.
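
As a rough sanity check, these hyperparameters account for the 13B figure. A back-of-envelope sketch (the exact count also depends on biases and other components not detailed here):

# Rough parameter count from the stated architecture (approximate, not exact).
n_layers, hidden, ffn, vocab = 40, 5120, 20480, 50400

attention = 4 * hidden * hidden               # Q, K, V and output projections
feed_forward = 2 * hidden * ffn               # up- and down-projections
per_layer = attention + feed_forward          # ~315M parameters per layer

total = n_layers * per_layer + vocab * hidden # plus token embeddings
print(f"~{total / 1e9:.1f}B parameters")      # ~12.8B, consistent with ~13B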

Left: the proportion of programming languages in CodeGeeX's training data. Right: the plot of training loss against the training steps of CodeGeeX.

Code Corpus: Our training data contains two parts. The first part comes from open-source code datasets, The Pile and CodeParrot. The Pile contains a code subset collected from public GitHub repositories with more than 100 stars, from which we select code in 23 popular programming languages. The second part is supplementary data scraped directly from public GitHub repositories that do not appear in the previous datasets, covering Python, Java, and C++. To obtain data of potentially higher quality, we choose repositories with at least one star and a size smaller than 10MB. A file is filtered out if it 1) has more than 100 characters per line on average, 2) is automatically generated, 3) has an alphabetic-character ratio below 40%, or 4) is bigger than 100KB or smaller than 1KB. To help the model distinguish between languages, we add a language-specific prefix at the beginning of each segment in the form [Comment sign] language: [LANG], e.g., # language: Python. For tokenization, we use the same tokenizer as GPT-2 and process whitespaces as extra tokens, resulting in a vocabulary of 50,400 tokens. In total, the code corpus covers 23 programming languages with 158.7B tokens.
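
The file-level rules are easy to express in code. A minimal sketch of the filter and the language prefix described above (illustrative only; the function names and comment-sign mapping are ours, not taken from the released pipeline, and the auto-generated-file check is omitted):

# Hypothetical sketch of the file filters and language tag described above.
COMMENT_SIGN = {"Python": "#", "Java": "//", "C++": "//"}

def keep_file(text: str) -> bool:
    lines = text.splitlines()
    if not lines or not (1_000 <= len(text) <= 100_000):  # 1KB..100KB size bounds
        return False
    if len(text) / len(lines) > 100:                      # average line length
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / len(text)
    return alpha_ratio >= 0.40                            # alphabetic-character ratio

def add_language_prefix(text: str, lang: str) -> str:
    # e.g., "# language: Python" for a Python file
    return f"{COMMENT_SIGN[lang]} language: {lang}\n{text}"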

Training: We implement CodeGeeX in MindSpore 1.7 and train it on 1,536 Ascend 910 AI Processors (32GB). The model weights are stored in FP16, except that we use FP32 for layer-norm and softmax for higher precision and stability. The entire model consumes about 27GB of memory. To increase training efficiency, we adopt 8-way model parallelism together with 192-way data parallelism, with the ZeRO-2 optimizer enabled. The micro-batch size is 16 and the global batch size reaches 3,072. Moreover, we adopt techniques to further boost training efficiency, including element-wise operator fusion, fast GELU activation, matrix multiplication dimension optimization, etc. The entire training process took nearly two months, from April 18 to June 22, 2022, during which 850B tokens were consumed, i.e., 5+ epochs.
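
These figures are internally consistent, as the following quick arithmetic check shows (our own sketch, not code from the training setup):

# Quick consistency check of the stated training configuration.
model_parallel, data_parallel, micro_batch = 8, 192, 16

devices = model_parallel * data_parallel    # 8 * 192 = 1,536 Ascend 910s
global_batch = data_parallel * micro_batch  # 192 * 16 = 3,072

tokens_trained, corpus_tokens = 850e9, 158.7e9
print(devices, global_batch, round(tokens_trained / corpus_tokens, 1))
# -> 1536 3072 5.4, i.e., 5+ epochs over the 158.7B-token corpus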

HumanEval-X: A new benchmark for Multilingual Program Synthesis

To better evaluate the multilingual ability of code generation models, we propose a new benchmark, HumanEval-X. While previous works evaluate multilingual program synthesis using semantic similarity (e.g., CodeBLEU), which is often misleading, HumanEval-X evaluates the functional correctness of the generated programs. HumanEval-X consists of 820 high-quality, human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks.

An illustration of the tasks supported by HumanEval-X. Declarations, docstrings, and solutions are marked in red, green, and blue, respectively. Code generation uses the declaration and docstring as input to generate the solution. Code translation uses the declarations in both languages and translates the solution in the source language into one in the target language.

In HumanEval-X, every sample in each language contains a declaration, docstring, and solution, which can be combined in various ways to support different downstream tasks, including generation, translation, summarization, etc. We currently focus on two tasks: code generation and code translation. For code generation, the model uses the declaration and docstring as input to generate the solution. For code translation, the model uses the declarations in both languages and the solution in the source language as input to generate a solution in the target language. We remove the description during code translation to prevent the model from directly solving the problem. For both tasks, we use the unbiased pass@k metric proposed in Codex: $\text{pass}@k := \mathbb{E}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$, with $n=200$ and $k \in \{1, 10, 100\}$.
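
The estimator can be computed numerically as in the Codex paper. A minimal, numerically stable sketch (the helper name is ours), where n is the number of samples per problem and c the number of correct ones:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), evaluated stably."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-sized draw contains a pass
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for a problem, 37 of them correct, estimate pass@10.
print(pass_at_k(n=200, c=37, k=10))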

Multilingual Code Generation

Left: detailed pass@k (k=1,10,100) performance on the code generation task for the five languages in HumanEval-X. Right: the average performance across all languages for each model. CodeGeeX achieves the highest average performance compared with InCoder-6.7B, CodeGen-Multi-6B, and CodeGen-Multi-16B.

We compare CodeGeeX with two other open-source code generation models, InCoder (from Meta) and CodeGen (from Salesforce). Specifically, InCoder-6.7B, CodeGen-Multi-6B, and CodeGen-Multi-16B are considered. CodeGeeX significantly outperforms the smaller-scale models (by 7.5%~16.3%) and is competitive with the larger CodeGen-Multi-16B (average performance 54.76% vs. 54.39%). CodeGeeX achieves the best average performance across languages.

Crosslingual Code Translation

Results on the HumanEval-X code translation task. The best performance for each language is bolded.

We also evaluate translation performance across different programming languages. We test the zero-shot performance of CodeGeeX, as well as the fine-tuned CodeGeeX-13B-FT (fine-tuned on the training set of the code translation task in XLCoST; since Go is absent from the original set, we add a small Go set). The results indicate that models have language preferences, e.g., CodeGeeX is good at translating other languages into Python and C++, while CodeGen-Multi-16B is better at translating into JavaScript and Go; this is probably due to differences in the language distribution of the training corpora. Among the 20 translation pairs, we also observe that the performance of A-to-B and B-to-A is consistently negatively correlated, which might indicate that current models are still not capable of learning all languages equally well.

How to use HumanEval-X and contribute to it?

For more details on how to use HumanEval-X, please see usage. We warmly welcome the community to contribute to HumanEval-X by adding more problems or extending it to other languages; please check the standard format of HumanEval-X and submit a pull request.

Please let us know if you have any comments or suggestions via [email protected].

Examples of Generation

Acknowledgement

This project is supported by the National Science Foundation for Distinguished Young Scholars (No. 61825602).

Lead Contributors

Qinkai Zheng (Tsinghua KEG), Xiao Xia (Tsinghua KEG), Xu Zou (Tsinghua KEG)

Contributors

Tsinghua KEG---The Knowledge Engineering Group at Tsinghua: Aohan Zeng, Wendi Zheng, Lilong Xue

Zhilin Yang's Group at Tsinghua IIIS: Yifeng Liu, Yanru Chen, Yichen Xu (BUPT, work was done when visiting Tsinghua)

Peng Cheng Laboratory: Qingyu Chen, Zhongqi Li, Gaojun Fan

Zhipu.AI: Yufei Xue, Shan Wang, Jiecai Shan, Haohan Jiang, Lu Liu, Xuan Xue, Peng Zhang

Ascend and Mindspore Team: Yifan Yao, Teng Su, Qihui Deng, Bin Zhou

Data Annotations

Ruijie Cheng (Tsinghua), Peinan Yu (Tsinghua), Jingyao Zhang (Zhipu.AI), Bowen Huang (Zhipu.AI), Shaoyu Wang (Zhipu.AI)

Advisors

Zhilin Yang (Tsinghua IIIS), Yuxiao Dong (Tsinghua KEG), Wenguang Chen (Tsinghua PACMAN), Jie Tang (Tsinghua KEG)

Computation Sponsors

Peng Cheng Laboratory

Zhipu.AI---an AI startup that aims to teach machines to think like humans

Project Leader

Jie Tang (Tsinghua KEG & BAAI)

License

Our code is licensed under the Apache-2.0 license. Our model weights are licensed under the Model License.

Citation

If you find our work useful, please cite:

@misc{zheng2023codegeex,
      title={CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X}, 
      author={Qinkai Zheng and Xiao Xia and Xu Zou and Yuxiao Dong and Shan Wang and Yufei Xue and Zihan Wang and Lei Shen and Andi Wang and Yang Li and Teng Su and Zhilin Yang and Jie Tang},
      year={2023},
      eprint={2303.17568},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}


codegeex's Issues

Unable to generate code in Visual Studio Code. SSL Error

When trying to generate code I receive this error:

There was an error sending the request
Error: write EPROTO 32950664:error:100000f7:SSL routines:OPENSSL_internal:WRONG_VERSION_NUMBER:../../third_party/boringssl/src/ssl/tls_record.cc:242:

Mainframe programming languages?

Hi, fantastic work.

Would it be possible to integrate COBOL and other mainframe-related programming languages, such as JCL? That would be super useful, and the existing codebase is huge!

This applies both to original development and to the translation feature.

Could the keybinding be changed?

The Tab key is too important for writing code. Sometimes when I want to align code, I accidentally accept a completion I don't want. Could the Tab key be changed to a less commonly used one, for example by letting users customize the keybinding?

PyTorch Int8 inference time cost

Hi, I run inference in INT8 precision on an A100 machine (CUDA 11.0). I can see that CodeGeeX uses only 15GB of GPU memory, and GPU utilization is full. I feed one 1024-token sequence at a time (without reusing the cache), and the speed is about 280ms per token. Is this normal, or have I misconfigured something? According to the comparison table in the codegeex-fastertransformer project, this setup should be around 75ms.

Recommended configuration for local deployment

I run CodeGeeX locally on an NVIDIA RTX 5000 (16GB VRAM). The program runs, but very slowly: generating a function of about 20 lines from a comment takes one to two minutes.
Is there a recommended configuration for local deployment? I would like generating a typical function from a comment to finish in about 10 seconds. Thank you.

A usability-breaking conflict: IntelliSense and CodeGeeX seem unable to coexist

Most of the time, we rely on IntelliSense (e.g., Pylance) to correctly complete class methods and attributes, suggest the names of existing variables and class members, and avoid capitalization or memory errors. In my setup, IntelliSense suggestions and CodeGeeX completions cannot appear at the same time, and their Tab keybindings conflict; since the former is more important, I have had to disable CodeGeeX.

I don't know whether this is a problem with my own environment. My IntelliSense suggestions and CodeGeeX completions only ever appear one at a time, or alternate, so when I press Tab I cannot tell whose completion will be accepted.

Many thanks to the team from Tsinghua for providing CodeGeeX for free. I would appreciate it if you could consider this suggestion.

About the output of the inference example

Hi, I tried the Inference on GPUs example, i.e., the input in tests/test_prompt.txt. The translated code in the output looks fine, but the Chinese explanation that follows is quite strange: some of it is unrelated to the code, and some of it repeats itself. Is this behavior expected? And is there a way to stop the model from generating the trailing Chinese explanation?

Here are some results from running the same input several times:
Result 1

================================= Generated code:

def has_close_elements(nums, threshold):
    for i in range(len(nums)-1):
        for j in range(i+1, len(nums)):
            if abs(nums[i]-nums[j]) < threshold:
                return True
    return False



### Source code analysis

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the number of elements in the array, and the second parameter is the threshold.

In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the�
In Python, the `has_close_elements` function has two parameters: the first parameter is an array and the second parameter is a threshold; the first parameter is the�
{1024: [107.01575442299963]}
1024 107.01575442299963
Generation finished.

Result 2

================================= Generated code:

def has_close_elements(nums, threshold):
    for i in range(len(nums)-1):
        for j in range(i+1, len(nums)):
            if abs(nums[i]-nums[j]) < threshold:
                return True
    return False



### Source code analysis

1. Check whether the array contains an element whose absolute value is less than threshold; if so, return True; otherwise return False.
2. Using a Python `for` loop, iterate over every element of the array; if its absolute value is less than threshold, return True; otherwise return False.

### Complexity analysis

Traversing the array has a time complexity of $$O(n)$$; the Python `for` loop used has a time complexity of $$O(n)$$.

## Two Sum

[LeetCode (Chinese)](https://leetcode-cn.com/problems/two-sum/)

[LeetCode (English)](https://leetcode.com/problems/two-sum/)

Given an integer array nums and a target value target, find the two integers in the array that add up to the target and return their array indices.

You may assume that each input corresponds to exactly one answer. However, you may not use the same element of the array twice.

**Example 1**:


Input: nums = [2,7,11,15], target = 9
Output: [0,1]
Explanation: 2 + 7 equals the target 9, so index0 = 0, index1 = 1


**Example 2**:


Input: nums = [3,2,4], target = 6
Output: [1,2]


**Example 3**:


Input: nums = [3,3], target = 6
Output: [0,1]


### Solution

#### **Dynamic programming**

The ** of dynamic programming is to maintain an array `dp`, where `dp[i]` denotes the element at index `i` of the�
{1024: [109.34215570299966]}
1024 109.34215570299966
Generation finished.

Result 3

================================= Generated code:

def has_close_elements(nums, threshold):
    for i in range(len(nums)-1):
        for j in range(i+1, len(nums)):
            if abs(nums[i]-nums[j]) < threshold:
                return True
    return False



### Source code analysis

1. It uses `for` loops; in the loops, `i` and `j` are `i+1` and `j-1`, and the values of `i` and `j` are both the indices of `i` and `j`.
2. It uses `abs(nums[i]-nums[j]) < threshold` to check whether the difference between `nums[i]` and `nums[j]` is less than `threshold`.

### Complexity analysis

Time complexity $$O(n^2)$$, space complexity $$O(1)$$.

## References

- [How to check whether two numbers are equal in Python](http://www.cnblogs.com/grandyang/p/4378306.html)
- [How to check whether two numbers are equal in Python](http://www.cnblogs.com/grandyang/p/4378306.html)
<|endoftext|>
{1024: [53.19243339499917]}
1024 53.19243339499917
Generation finished.

The VS Code extension does not work

Version: 1.74.2 (user setup)
Electron: 19.1.8
Chromium: 102.0.5005.167
Node.js: 16.14.2
V8: 10.2.154.15-electron.0
OS: Windows_NT x64 10.0.19044
Sandboxed: Yes
After installing v1.0.9 of the extension, it does not work at all.

Model fine-tuning and offline deployment

Hi, is there any guidance on fine-tuning CodeGeeX on one's own data?
Also, the current VS Code extension cannot be used in an offline environment. Can CodeGeeX be deployed locally offline and called from the VS Code extension? Is there any guidance for this?
Thanks.

Need to Fix Unit Test Code for Some JavaScript Problems of HumanEval-X

Case 1: Missing unit test function call

Problems 32, 119, and 151 have no unit test function call in their "test" code.
For example, the "test" code of Problem 32 is as follows. (Note that the testfindZero function is never called.)

const testfindZero = () => {
  const getRandomIntInclusive = (min = 0, max = 9) => {
    min = Math.ceil(min)
    max = Math.floor(max)
    return Math.floor(Math.random() * (max - min + 1)) + min
  }

  for (let i = 0; i < 100; i++) {
    let ncoeff = 2 * getRandomIntInclusive(1, 4);
    let coeffs = [];
    for (let j = 0; j < ncoeff; j++) {
      let coeff = getRandomIntInclusive(-10, 10);
      if (coeff === 0)
        coeff = 1;
      coeffs.push(coeff);
    }
    let solution = findZero(coeffs);
    console.assert(Math.abs(poly(coeffs, solution)) < 1e-4);
  }
}

Problems are always regarded as passed if their unit test function is never called.
Please add the unit test function call at the end of the "test" code for problems 32, 119, and 151.

Case 2: Typo in the unit test

Problem 112 has a typo in its unit test code.
Currently, the "test" code of problem 112 is as follows. (Note the position of the third closing parenthesis in each assertion.)

const testReverseDelete = () => {
  console.assert(JSON.stringify(reverseDelete('abcde', 'ae'))) ===
    JSON.stringify(['bcd', false])
  console.assert(JSON.stringify(reverseDelete('abcdef', 'b'))) ===
    JSON.stringify(['acdef', false])
  console.assert(JSON.stringify(reverseDelete('abcdedcba', 'ab'))) ===
    JSON.stringify(['cdedc', true])
  console.assert(JSON.stringify(reverseDelete('dwik', 'w'))) ===
    JSON.stringify(['dik', false])
  console.assert(JSON.stringify(reverseDelete('a', 'a'))) ===
    JSON.stringify(['', true])
  console.assert(JSON.stringify(reverseDelete('abcdedcba', ''))) ===
    JSON.stringify(['abcdedcba', true])
  console.assert(JSON.stringify(reverseDelete('abcdedcba', 'v'))) ===
    JSON.stringify(['abcdedcba', true])
  console.assert(JSON.stringify(reverseDelete('vabba', 'v'))) ===
    JSON.stringify(['abba', true])
  console.assert(JSON.stringify(reverseDelete('mamma', 'mia'))) ===
    JSON.stringify(['', true])
}
testReverseDelete()

Problem 112 is always regarded as passed due to the misplaced closing parenthesis.
The third closing parenthesis should be moved to the end of each assertion in the unit test code.

Case 3: There is no tuple in JavaScript

Problems 107, 136, and 155 use Python tuple syntax in their unit tests.
However, JavaScript has no tuple syntax, which leads to unexpected behavior (e.g., all unit tests fail).
For example, the "test" code of problem 107 is as follows.

const testEvenOddPalindrome = () => {
  console.assert(
    JSON.stringify(evenOddPalindrome(123)) === JSON.stringify((8, 13))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(12)) === JSON.stringify((4, 6))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(3)) === JSON.stringify((1, 2))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(63)) === JSON.stringify((6, 8))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(25)) === JSON.stringify((5, 6))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(19)) === JSON.stringify((4, 6))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(9)) === JSON.stringify((4, 5))
  )
  console.assert(
    JSON.stringify(evenOddPalindrome(1)) === JSON.stringify((0, 1))
  )
}

testEvenOddPalindrome()

Arrays (the JavaScript counterpart of lists) need to be used instead of tuples.

Errors when translating long C++ code to Python. For example, scanf was not converted.

Errors occurred when I was using this: https://models.aminer.cn/codegeex/codeTranslator

  1. scanf was not converted.
  2. Some constant values are also incorrect.
  3. It can't handle the conditional operator `?:`.

Issues 1 and 3 did not occur when the input was short, but for issue 2 the model still generated redundant information. Perhaps performance is limited when dealing with long code?

When dealing with long code

Input:

#include <iostream>
#include <cstdio>

using namespace std;

const long long maxn=1097152;// maximum n
const long long maxx=2097152;// maximum xi

struct E// element in the queue
{
	long long val;// value
	long long num;// count
};

long long n,T;
long long x[maxn]={};// 1-indexed
long long m[maxn]={};// 1-indexed
long long p,q;

long long bucket[maxx]={};// bucket[j] counts the k's whose value is j
long long sum_bucket[maxx]={};// prefix sums of bucket

E Queap[maxn]={};// queue simulated with an array
long long front=0,back=0;// half-open interval [front, back)
long long size=0;// queue size
void push_back(long long val)// enqueue
{
	Queap[back].val=val;
	Queap[back].num=1;
	while(back>front && Queap[back-1].val<=Queap[back].val)// merge while the previous node's val is no greater than the current node's val
	{
		Queap[back-1].val=Queap[back].val;// replace with the larger value
		Queap[back-1].num+=Queap[back].num;// add up the counts
		back--;
	}
	back++;
	size++;
}

long long get_front()// get the front element
{
	return Queap[front].val;
}

void pop_front()// pop the front element
{
	if(--Queap[front].num==0)
		front++;
	size--;
}

int main()
{
	scanf("%lld",&n);
	for(int i=1;i<=n;i++)
		scanf("%lld",&x[i]);
	for(int i=1;i<=n;i++)
		scanf("%lld",&m[i]);
	for(int i=1;i<=n;i++)
	{
		while(size>m[i])
			pop_front();
		if(size)
			bucket[get_front()]++;// put k directly into the bucket, where k=get_front();
		else
			bucket[0]++;// put k directly into the bucket, where k=0;
		push_back(x[i]);
	}
	for(int i=1;i<maxx;i++)// prefix sums of bucket: sum_bucket[j]=bucket[0]+...+bucket[j-1], the number of k satisfying 0<=k<j
		sum_bucket[i]=sum_bucket[i-1]+bucket[i-1];
	scanf("%lld",&T);
	for(int i=1;i<=T;i++)
	{
		scanf("%lld%lld",&p,&q);
		printf("%lld %lld\n",(p>=maxx?sum_bucket[maxx-1]:sum_bucket[p]),(q>=maxx?sum_bucket[maxx-1]:sum_bucket[q])-(p>=maxx?sum_bucket[maxx-1]:sum_bucket[p]));// low risk: bucket[0]+...+bucket[p-1]=sum_bucket[p]; medium risk: bucket[p]+...+bucket[q-1]=sum_bucket[q]-sum_bucket[p]
	}
	return 0;
}

Output:

maxn = 10 ** 9 + 7
maxx = 10 ** 9 + 7
from collections import defaultdict
def push_back(val):
    global Queap, front, back, size
    Queap[back].val = val
    Queap[back].num = 1
    while(back > front and Queap[back - 1].val <= Queap[back].val):
        Queap[back - 1].val = Queap[back].val
        Queap[back - 1].num += Queap[back].num
        back -= 1
    back += 1
    size += 1
def get_front():
    return Queap[front].val
def pop_front():
    if(Queap[front].num == 0):
        front += 1
    size -= 1
if __name__ == "__main__":
    n, T = 0, 0
    x, m, p, q = [], [], 0, 0
    Queap, front, back, size = [], 0, 0, 0
    bucket, sum_bucket = [], [0 for i in range(maxx)]
    push_back(0)
    for i in range(1, n + 1):
        x.append(0)
        m.append(0)
    for i in range(1, n + 1):
        while(size > m[i]):
            pop_front()
        if(size):
            bucket[get_front()] += 1
        else:
            bucket[0] += 1
        push_back(x[i])
    for i in range(1, maxx):
        sum_bucket[i] = sum_bucket[i - 1] + bucket[i - 1]
    for i in range(1, T + 1):
        p = 0
        q = 0
        while p < maxx and bucket[p]!= 0:
            p += 1
        while q < maxx and bucket[q]!= 0:
            q += 1
        print(sum_bucket[p] + sum_bucket[q] - (p < maxx and sum_bucket[p] or 0), end = " ")
        print(sum_bucket[p] + sum_bucket[q])

Manual Correction:

maxn = 1097152
maxx = 2097152

class E():
    val=0
    num=0

def push_back(val):
    global Queap, front, back, size
    Queap[back].val = val
    Queap[back].num = 1
    while(back > front and Queap[back - 1].val <= Queap[back].val):
        Queap[back - 1].val = Queap[back].val
        Queap[back - 1].num += Queap[back].num
        back -= 1
    back += 1
    size += 1
def get_front():
    global Queap, front, back, size
    return Queap[front].val
def pop_front():
    global Queap, front, back, size
    if(Queap[front].num == 0):
        front += 1
    size -= 1
if __name__ == "__main__":
    n, T = 0, 0
    n = int(input())
    x, m, p, q = [], [], 0, 0
    Queap, front, back, size = [E() for i in range(maxx)], 0, 0, 0
    bucket, sum_bucket = [0 for i in range(maxx)], [0 for i in range(maxx)]
    x = [0] + list(map(int,input().split(' ')))
    m = [0] + list(map(int,input().split(' ')))
    for i in range(1, n + 1):
        while(size > m[i]):
            pop_front()
        if(size):
            bucket[get_front()] += 1
        else:
            bucket[0] += 1
        push_back(x[i])
    for i in range(1, maxx):
        sum_bucket[i] = sum_bucket[i - 1] + bucket[i - 1]
    T = int(input())
    for i in range(1, T + 1):
        p, q = list(map(int,input().split(' ')))
        # printf("%lld %lld\n",(p>=maxx?sum_bucket[maxx-1]:sum_bucket[p]),(q>=maxx?sum_bucket[maxx-1]:sum_bucket[q])-(p>=maxx?sum_bucket[maxx-1]:sum_bucket[p]));
        print(sum_bucket[maxx-1] if p>=maxx else sum_bucket[p], end = " ")
        print((sum_bucket[maxx-1] if q>=maxx else sum_bucket[q])-(sum_bucket[maxx-1] if p>=maxx else sum_bucket[p]))

Testing data:

input:
7
1 2 3 4 1000 100 20
1 2 3 3 3 4 3
2
3 5
100 200

output:
3 2
5 0

When dealing with issue 2 with shorter code:

Input:

const long long maxn=1097152;
const long long maxx=2097152;

Output:

import sys
def main():
    n=2
    x=2
    while(x<10**7):
        n+=1
        x=n*(n-1)*(n-2)*(n-3)*(n-4)*(n-5)*(n-6)*(n-7)*(n-8)*(n-9)
    print(n)
main()


Error:timeout of 120000ms exceeded

For no obvious reason, this error appears: after ordinary, normal use, once the extension has been idle for a while, it reports:

Error: timeout of 120000ms exceeded

Source: CodeGeeX (extension)

Hardware requirements?

You mention "NVIDIA V100 or A100", but can a newer consumer RTX 3080 card work? How much VRAM do the models need?

Missing menu entries at Jetbrains IDE?

Should there be CodeGeeX entries in several menus/windows of the JetBrains IDEs?
E.g., PHPStorm > Main Menu > Tools.

I can't find any way to switch between the three advertised modes.

Add Extension to OpenVSX

I'm a fan of Gitpod. For this extension to be usable in the Gitpod VS Code editor, it needs to be available on OpenVSX.
Could you please publish this extension to OpenVSX?

Any option to limit CodeGeeX generation area?

For example, I have a method and want CodeGeeX to fill it with code, but it generates a lot of other methods I don't need. Is there any option to limit the CodeGeeX generation area to (for example) the current method?

How to inference on multi-gpu?

I tried to run inference on an A30, but an error occurred: RuntimeError: CUDA out of memory. How can I run inference across multiple cards?

Please restrict when Ctrl+Enter triggers

Hi, the VS Code extension currently binds its command to Ctrl+Enter, which conflicts with the commit shortcut in the Git panel. The keybinding configuration should use a `when` clause to restrict when the shortcut triggers.

Cannot load the tokenizer

Running the following command (from README) results in an error:

bash ./scripts/test_inference.sh <GPU_ID> ./tests/test_prompt.txt

The error:

Loading tokenizer ...
Traceback (most recent call last):
  File "./tests/test_inference.py", line 203, in <module>
    main()
  File "./tests/test_inference.py", line 124, in main
    tokenizer = CodeGeeXTokenizer(
  File "./codegeex/tokenizer/tokenizer.py", line 43, in __init__
    self.tokenizer = tokenizer if tokenizer is not None else AutoTokenizer.from_pretrained(tokenizer_path)
  File ".../lib/python3.9/site-packages/transformers/models/auto/tokenization_auto.py", line 619, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File ".../lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    return cls._from_pretrained(
  File ".../lib/python3.9/site-packages/transformers/tokenization_utils_base.py", line 1932, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File ".../lib/python3.9/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 138, in __init__
    super().__init__(
  File ".../lib/python3.9/site-packages/transformers/tokenization_utils_fast.py", line 114, in __init__
    fast_tokenizer = convert_slow_tokenizer(slow_tokenizer)
  File ".../lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 1162, in convert_slow_tokenizer
    return converter_class(transformer_tokenizer).converted()
  File ".../lib/python3.9/site-packages/transformers/convert_slow_tokenizer.py", line 273, in converted
    BPE(
Exception: Error while initializing BPE: Token `Ġhealthier` out of vocabulary

Running the following command directly results in the same error:

from transformers import AutoTokenizer
AutoTokenizer.from_pretrained('./codegeex/tokenizer')

It seems that merges.txt is not consistent with the vocabulary.
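
One way to verify this hypothesis is to check that every merge in merges.txt produces a token present in vocab.json. A sketch, assuming the standard GPT-2-style vocab.json/merges.txt layout in ./codegeex/tokenizer:

import json

# Hypothetical check: every BPE merge should yield a token in the vocabulary.
with open("./codegeex/tokenizer/vocab.json") as f:
    vocab = set(json.load(f))

with open("./codegeex/tokenizer/merges.txt") as f:
    merges = [line.split() for line in f
              if line.strip() and not line.startswith("#")]

missing = [a + b for a, b in merges if a + b not in vocab]
print(f"{len(missing)} merge results missing from vocab.json, e.g. {missing[:3]}")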

Versions:

Ubuntu 18.04.6 LTS
Python 3.9.13
torch 1.13.0
transformers 4.24.0
tokenizers 0.13.2

The generated comments are strange

I wrote the following code:

from typing import List

def is_even(numbers: List[int]) -> List[bool]:
    '''
    Take True for even numbers and False for odd numbers, return them as a list.
    Write three call examples in the comment.

And got:

from typing import List

def is_even(numbers: List[int]) -> List[bool]:
    '''
    Take True for even numbers and False for odd numbers, return them as a list.
    Write three call examples in the comment.

    >>> is_even([2])
    [True]
    >>> is_even([1, 3, 5, 7, 9])
    [True, False, False, True]
    >>> is_even([])
    [False]
    '''

    even_numbers = []

    for number in numbers:
        if number % 2 == 0:
            even_numbers.append(True)
        else:
            even_numbers.append(False)

    return even_numbers

Could not read from remote repository.

Cloning into 'CodeGeeX'...
kex_exchange_identification: Connection closed by remote host
Connection closed by 20.205.243.166 port 22
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

single 3090 seems OOM

3900X (24-core) + 32GB RAM + RTX 3090.
Trying to run inference using this script fails: after logging "Building CodeGeeX model", the Python process gets killed, which looks like a typical out-of-memory issue.

# With quantization (with more than 15GB RAM)
bash ./scripts/test_inference_quantized.sh <GPU_ID> ./tests/test_prompt.txt

So I switched to codegeex-fastertransformer, but it still OOMs:

Traceback (most recent call last):
  File "api.py", line 105, in <module>
    if not codegeex.load(ckpt_path=args.ckpt_path):
  File "/workspace/codegeex-fastertransformer/examples/pytorch/codegeex/utils/codegeex.py", line 413, in load
    self.cuda()
  File "/workspace/codegeex-fastertransformer/examples/pytorch/codegeex/utils/codegeex.py", line 430, in cuda
    self.weights._map(lambda w: w.contiguous().cuda(self.device))
  File "/workspace/codegeex-fastertransformer/examples/pytorch/codegeex/utils/codegeex.py", line 177, in _map
    w[i] = func(w[i])
  File "/workspace/codegeex-fastertransformer/examples/pytorch/codegeex/utils/codegeex.py", line 430, in <lambda>
    self.weights._map(lambda w: w.contiguous().cuda(self.device))
RuntimeError: CUDA out of memory. Tried to allocate 200.00 MiB (GPU 0; 24.00 GiB total capacity; 23.11 GiB already allocated; 0 bytes free; 23.11 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Publish CodeGeeX inference server code?

Hi,

Thanks for open-sourcing the VS Code extension and the model files. Is there any way you could release the inference server code as well? I'd like to host the model myself, but the VS Code extension seems to call a number of API (POST) endpoints and I don't know how they're implemented (so that I could do something similar).

Thanks

Dependency issue?

I tried to follow the installation guide but kept running into problems.

The first issue was that a dependency requires Python 3.8, despite the README specifying "Python 3.7+". So I installed Python 3.8.

Then it complained about a few missing packages, including "apex.multi_tensor_apply". I looked it up; it seems to be part of apex. I followed its installation guide and installed it successfully (I think).

When I ran bash ./scripts/test_inference.sh 0 ./tests/test_prompt.txt, it complained that amp_C was built for Python 3.7 instead of the Python 3.8 I was using.

I then noticed the Dockerfile and tried building it, but apex still seemed to be missing, and the build failed because CUDA was somehow missing.

So do you have a working container environment? And can you share it?
