baai-agents / cradle

The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents in acing any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.

Home Page: https://baai-agents.github.io/Cradle/

License: MIT License


cradle's Introduction

Cradle: Empowering Foundation Agents Towards General Computer Control

The Cradle framework empowers nascent foundation models to perform complex computer tasks via the same unified interface humans use, i.e., screenshots as input and keyboard & mouse operations as output.

📢 Updates

Latest Videos

[Video thumbnails removed; watch the demo videos on YouTube via the project home page.]

💾 Installation

Prepare the Environment File

We currently provide access to OpenAI's and Claude's APIs. Please create a .env file in the root of the repository to store your keys (one provider is enough).

Sample .env file (keep its contents private):

OA_OPENAI_KEY = "abc123abc123abc123abc123abc123ab"
RF_CLAUDE_AK = "abc123abc123abc123abc123abc123ab" # Access Key for Claude
RF_CLAUDE_SK = "123abc123abc123abc123abc123abc12" # Secret Access Key for Claude
AZ_OPENAI_KEY = "123abc123abc123abc123abc123abc12"
AZ_BASE_URL = "https://abc123.openai.azure.com/"
IDE_NAME = "Code"

OA_OPENAI_KEY is the OpenAI API key. You can get it from OpenAI.

AZ_OPENAI_KEY is the Azure OpenAI API key. You can get it from the Azure Portal.

OA_CLAUDE_KEY is the Anthropic Claude API key. You can get it from Anthropic.

RF_CLAUDE_AK and RF_CLAUDE_SK are the AWS RESTful API access key and secret key for the Claude API.

IDE_NAME refers to the IDE environment in which the repository's code runs, such as PyCharm or Code (VSCode). It is primarily used to enable automatic switching between the IDE and the target environment.
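To illustrate how the keys above end up available to the framework, here is a minimal, self-contained sketch of a .env loader. This is our own illustrative parser, not Cradle's actual code — the project presumably uses a proper loader such as python-dotenv:

```python
import os

def load_env_file(path=".env"):
    """Naive .env parser (sketch): turns KEY = "value" lines into
    environment variables, skipping comments and blank lines."""
    with open(path) as fp:
        for line in fp:
            line = line.split("#", 1)[0].strip()  # drop comments and whitespace
            if "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault: real environment variables take precedence over .env
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

With the sample .env above, `os.getenv("IDE_NAME")` would then return `"Code"`.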

Setup

Python Environment

Please set up your Python environment and install the required dependencies as follows:

# Clone the repository
git clone https://github.com/BAAI-Agents/Cradle.git
cd Cradle

# Create a new conda environment
conda create --name cradle-dev python=3.10
conda activate cradle-dev
pip install -r requirements.txt

Install the OCR Tools

Option 1:
# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_lg

or

# pip install .tar.gz archive or .whl from path or URL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1.tar.gz

Option 2:
# Copy this url https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1.tar.gz
# Paste it in the browser and download the file to res/spacy/data
cd res/spacy/data
pip install en_core_web_lg-3.7.1.tar.gz
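Since pip-installed spaCy models are ordinary Python packages, a quick, dependency-free way to confirm the download worked is to check importability. The `model_installed` helper below is our own sketch, not part of Cradle or spaCy:

```python
import importlib.util

def model_installed(name: str) -> bool:
    """Return True if a pip-installed spaCy model package is importable."""
    return importlib.util.find_spec(name) is not None

if __name__ == "__main__":
    # After either option above succeeds, this should print True.
    print(model_installed("en_core_web_lg"))
```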

🚀 Get Started

Because each game and software application differs greatly from the others, we provide the specific settings for each of them below.

  1. Red Dead Redemption 2
  2. Stardew Valley
  3. Cities: Skylines
  4. Dealer's Life 2
  5. Software

🌲 File Structure

Since some users may want to apply our framework to new games, this section showcases the core directories and organizational structure of Cradle. Modules related to migrating to new games are highlighted with "⭐⭐⭐" and explained in detail later.

Cradle
├── cache # Cache the GroundingDino model and the bert-base-uncased model
├── conf # ⭐⭐⭐ The configuration files for the environment and the llm model
│   ├── env_config_dealers.json
│   ├── env_config_rdr2_main_storyline.json
│   ├── env_config_rdr2_open_ended_mission.json
│   ├── env_config_skylines.json
│   ├── env_config_stardew_cultivation.json
│   ├── env_config_stardew_farm_clearup.json
│   ├── env_config_stardew_shopping.json
│   ├── openai_config.json
│   ├── claude_config.json
│   ├── restful_claude_config.json
│   └── ...
├── deps # The dependencies for the Cradle framework, ignore this folder
├── docs # The documentation for the Cradle framework, ignore this folder
├── res # The resources for the Cradle framework
│   ├── models # Ignore this folder
│   ├── tool # Subfinder for RDR2
│   ├── [game or software] # ⭐⭐⭐ The resources for each game or software, e.g. rdr2, dealers, skylines, stardew, outlook, chrome, capcut, meitu, feishu
│   │   ├── prompts # The prompts for the game
│   │   │   └── templates
│   │   │       ├── action_planning.prompt
│   │   │       ├── information_gathering.prompt
│   │   │       ├── self_reflection.prompt
│   │   │       └── task_inference.prompt
│   │   ├── skills # The skill JSON files for the game (generated automatically)
│   │   ├── icons # Icons that are hard for GPT-4 to recognize; the icon replacer substitutes them with text for better recognition
│   │   └── saves # Save files in the game
│   └── ...
├── requirements.txt # The requirements for the Cradle framework
├── runner.py # The main entry for the Cradle framework
├── cradle # Cradle's core modules
│   ├── config # The configuration for the Cradle framework
│   ├── environment # The environment for the Cradle framework
│   │   ├── [game or software] # ⭐⭐⭐ The environment for each game or software, e.g. rdr2, dealers, skylines, stardew, outlook, chrome, capcut, meitu, feishu
│   │   │   ├── __init__.py # The initialization file for the environment
│   │   │   ├── atomic_skills # Atomic skills in the game. Users should customise them to suit the needs of the game or software, e.g. character movement
│   │   │   ├── composite_skills # Combination skills for atomic skills in games or software
│   │   │   ├── skill_registry.py # The skill registry for the game. Will register all atomic skills and composite skills into the registry.
│   │   │   └── ui_control.py # The UI control for the game. Define functions to pause the game and switch to the game window
│   │   └── ...
│   ├── gameio # Interfaces that directly wrap the skill registry and ui control in the environment
│   ├── log # The log for the Cradle framework
│   ├── memory # The memory for the Cradle framework
│   ├── module # Currently contains only the skill execution module; action planning, self-reflection, and other modules will migrate here from planner and provider
│   ├── planner # The planner for the Cradle framework: a unified interface for action planning, self-reflection, and other modules. It will later be merged into cradle/module.
│   ├── runner # ⭐⭐⭐ The logical flow of execution for each game and software. All game and software processes will then be unified into a single runner
│   ├── utils # Defines some helper functions such as save json and load json
│   └── provider # The provider for the Cradle framework. We have semantically decomposed most of the execution flow in the runner into providers
│       ├── augment # Methods for image augmentation
│       ├── llm # Call for the LLM model, e.g. OpenAI's GPT-4o, Claude, etc.
│       ├── module # ⭐⭐⭐ Module providers, e.g. action planning, self-reflection, and other modules. Will later be migrated to cradle/module.
│       ├── object_detect # Methods for object detection
│       ├── process # ⭐⭐⭐ Methods for pre-processing and post-processing for action planning, self-reflection and other modules
│       ├── video # Methods for video processing
│       ├── others # Methods for other operations, e.g., save and load coordinates for skylines
│       ├── circle_detector.py # The circle detector for RDR2
│       ├── icon_replacer.py # Methods for replacing icons with text
│       ├── sam_provider.py # Segment anything for software
│       └── ...
└── ...
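The atomic/composite skill split above can be sketched with a toy decorator-based registry. All names below are hypothetical illustrations; Cradle's real implementation lives in cradle/environment/.../skill_registry.py and the atomic_skills/composite_skills folders:

```python
# Hypothetical sketch of the atomic/composite skill pattern; not Cradle's API.
SKILL_REGISTRY = {}

def register_skill(name):
    """Decorator: record a callable under a skill name."""
    def wrapper(fn):
        SKILL_REGISTRY[name] = fn
        return fn
    return wrapper

@register_skill("move_forward")
def move_forward(duration=1.0):
    # An atomic skill would press and hold a key here (e.g. via pyautogui).
    return f"held W for {duration}s"

@register_skill("turn_and_move_forward")
def turn_and_move_forward(theta=0.0, duration=1.0):
    # A composite skill chains atomic skills from the registry.
    return f"turned {theta} deg, then " + SKILL_REGISTRY["move_forward"](duration)
```

A planner can then emit skill names as strings (e.g. `'follow()'` in the logs below) and have the runner look them up in the registry before execution.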

Citation

If you find our work useful, please consider citing us!

@article{tan2024cradle,
  title={Cradle: Empowering Foundation Agents towards General Computer Control},
  author={Weihao Tan and Wentao Zhang and Xinrun Xu and Haochong Xia and Ziluo Ding and Boyu Li and Bohan Zhou and Junpeng Yue and Jiechuan Jiang and Yewen Li and Ruyi An and Molei Qin and Chuqiao Zong and Longtao Zheng and Yujie Wu and Xiaoqiang Chai and Yifei Bi and Tianbao Xie and Pengjie Gu and Xiyun Li and Ceyao Zhang and Long Tian and Chaojie Wang and Xinrun Wang and Börje F. Karlsson and Bo An and Shuicheng Yan and Zongqing Lu},
  journal={arXiv preprint arXiv:2403.03186},
  year={2024}
}

cradle's People

Contributors

dependabot[bot], dvampire, eltociear, tellarin, weihaotan, xiahaochong98, xinrunxu


cradle's Issues

ERROR: No matching distribution found for win32process

After creating and activating a new virtual environment conda-dev and attempting to install dependencies using the command pip3 install -r requirements.txt, the following error occurs:
ERROR: Could not find a version that satisfies the requirement win32process (from versions: none)
ERROR: No matching distribution found for win32process
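win32process is not a standalone PyPI distribution; it ships with the pywin32 package and is only importable on Windows. A hedged fix, assuming the pinned requirement is the culprit, is to run `pip install pywin32` on Windows, or to give the requirement a PEP 508 environment marker so other platforms skip it:

```
# requirements.txt fragment (sketch): install pywin32 only on Windows
pywin32; sys_platform == "win32"
```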

[Feature request] Enhance cross-platform support and diversify access to large models

Dear Cradle development team,

I hope this message brings some inspiration to your busy day. I am very interested in Cradle, but when I tried to test it, I observed the following two main points for improvement:

Cross-platform support:
Currently, Cradle seems to mainly support macOS, since some packages in requirements cannot be installed correctly under Windows, and you have also noted the required platform. Although macOS covers most users, further extending support to Windows would greatly enhance Cradle's generality and attract a broader user base. This might involve optimizing the code for compatibility with the new system, or providing detailed documentation on how to set up Cradle on non-standard platforms.

Diversified access to large language models:
With the rapid development of AI technology, especially the rise of large language models, flexible model integration in Cradle is particularly important. Users may prefer different ways of accessing models, such as API calls, local deployment, or cloud-service solutions. I understand that the current options are the OpenAI API, Azure, and the Claude API, but these are not friendly for ** users. If a framework such as LangChain were used, adding integration for large models such as Qwen or Zhipu AI should be relatively easy, and that would be sufficient for ** users.

If solutions to these two issues already exist, please kindly let me know. Thank you very much!

Finally, I hope my suggestions can bring positive changes to Cradle, and I look forward to the project's development. Thanks also to the Cradle development team for their contribution to the advancement of AI.

An error occurred while running the Dealer's Life 2 example

  • Run command
    python runner.py --envConfig "./conf/env_config_dealers.json"

  • ERROR LOG
Traceback (most recent call last):
  File "E:\code\github\Cradle\runner.py", line 49, in <module>
    main(args)
  File "E:\code\github\Cradle\runner.py", line 33, in main
    entry(args)
  File "E:\code\github\Cradle\cradle\runner\dealers_runner.py", line 199, in entry
    pipelineRunner = PipelineRunner(llm_provider,
  File "E:\code\github\Cradle\cradle\runner\dealers_runner.py", line 49, in __init__
    self.set_internal_params()
  File "E:\code\github\Cradle\cradle\runner\dealers_runner.py", line 55, in set_internal_params
    self.skill_registry = DealersSkillRegistry(
  File "E:\code\github\Cradle\cradle\utils\singleton.py", line 15, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
  File "E:\code\github\Cradle\cradle\environment\dealers\skill_registry.py", line 57, in __init__
    super(DealersSkillRegistry, self).__init__(skill_from_default=skill_from_default,
  File "E:\code\github\Cradle\cradle\environment\skill_registry.py", line 100, in __init__
    self.skills = self.load_skills_from_scripts()
  File "E:\code\github\Cradle\cradle\environment\skill_registry.py", line 180, in load_skills_from_scripts
    self.store_skills_to_file(os.path.join(config.skill_local_path, self.skill_library_filename), skills)
  File "E:\code\github\Cradle\cradle\environment\skill_registry.py", line 515, in store_skills_to_file
    save_json(file_path, serialized_skills, indent=4)
  File "E:\code\github\Cradle\cradle\utils\json_utils.py", line 100, in save_json
    with open(file_path, mode='w', encoding='utf8') as fp:
FileNotFoundError: [Errno 2] No such file or directory: './res/dealers/skills/skill_lib_basic.json'
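The traceback bottoms out in save_json failing because the directory ./res/dealers/skills/ does not exist yet. Creating it before launching (`mkdir -p res/dealers/skills`, or `mkdir res\dealers\skills` on Windows) sidesteps the error; alternatively, a save helper can create missing parents itself. Below is a minimal sketch of such a helper — our own code, not Cradle's actual cradle/utils/json_utils.py:

```python
import json
import os

def save_json(file_path, data, indent=4):
    """Write data as JSON, creating missing parent directories first
    (the reported error suggests the original helper does not)."""
    parent = os.path.dirname(file_path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # no-op if the directory exists
    with open(file_path, mode="w", encoding="utf8") as fp:
        json.dump(data, fp, indent=indent)
```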

Missing file for VideoSubFinder Files

Thank you for your work. I encountered an issue while trying to run the code according to the README documentation: my VideoSubFinder was missing a test.srt file after I downloaded and overwrote general.clg, which prevented the code from running.

Inconsistency Between Title and Content

I don't understand in what sense your general computer control is general. Moreover, the model still calls GPT-4, which is not local; general computer control should use a local large language model, with the computer operations behind the shortcut keys interacting directly with the model. It could be multimodal, or an action-selection agent with mixed multimodal functions. I can only say that it is too simplistic at present: it seriously does not match the title, and the direction is a bit off.

Help: Issue about the failure of control

Hi~
I don't know why my character can't be operated by this agent even though I can control it manually via keyboard/mouse, while the logger output shows seemingly correct self-reflection, information gathering, action output, and so on.
Here is my CLI output. Could my failure be related to the error in ms_deformable_im2col_cuda?
Thanks very much~

Frank


(cradle) C:\Users\vipuser\agent\Cradle>python prototype_runner.py
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3527.)
return VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
2024-05-07 20:30:57,041 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-07 20:31:16,574 - UAC Logger - INFO - Screen capture started
2024-05-07 20:31:20,878 - UAC Logger - INFO - Gather Information Start Frame ID: -1, End Frame ID: 11
2024-05-07 20:31:21,074 - UAC Logger - INFO - >> Calling INFORMATION GATHERING
2024-05-07 20:31:21,075 - UAC Logger - INFO - Using frame extractor to gather information
2024-05-07 20:31:21,076 - UAC Logger - INFO - Extracting Informative Frames from C:\Users\vipuser\agent\Cradle\runs\1715084999.6247497\video_splits\video
-00001.mp4 .....
2024-05-07 20:31:24,843 - UAC Logger - INFO - Frame Extraction Completed! Total Frames: 0
2024-05-07 20:31:24,845 - UAC Logger - INFO - Using icon replacer to gather information
2024-05-07 20:31:24,845 - UAC Logger - INFO - Start gathering text information from the whole video in parallel
2024-05-07 20:31:24,850 - UAC Logger - INFO - Finish gathering text information from the whole video
2024-05-07 20:31:24,850 - UAC Logger - INFO - Using llm description to gather information
2024-05-07 20:31:24,869 - UAC Logger - INFO - Requesting gpt-4-vision-preview completion...
2024-05-07 20:31:41,320 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-07 20:31:41,349 - UAC Logger - INFO - Response received from gpt-4-vision-preview.
2024-05-07 20:31:41,362 - UAC Logger - INFO - Using object detector to gather information
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\transformers\modeling_utils.py:1051: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
(the line above is repeated 24 times)
2024-05-07 20:31:51,635 - UAC Logger - INFO - Image Description: The image shows a third-person view in the game Red Dead Redemption 2. The player character is on horseback, holding a lantern in a snowy environment at night. The camera is positioned behind the player character, looking towards another character ahead, probably Dutch van der Linde, indicated by the dialogue caption "Dutch We have to try. Stay close and we'll do our best to stick to the trail." The lower left corner has a mini-map that shows the immediate surroundings of the character; it indicates buildings and the landscape, along with a yellow waypoint line the player character should follow. There is an on-screen prompt that says "Use W to follow Dutch," indicating the control needed to perform the action.
2024-05-07 20:31:51,641 - UAC Logger - INFO - Object Name: null
2024-05-07 20:31:51,642 - UAC Logger - INFO - Reasoning: 1. There is no need to detect an object given the current context; the task is to follow another character, which is not an object detection task.
2. There is no explicit weapon, shoot target, or item specified in the current interface, hence no relevant object needs to be detected according to the provided rules.
2024-05-07 20:31:51,643 - UAC Logger - INFO - Screen Classification: General game interface without any menu
2024-05-07 20:31:51,644 - UAC Logger - INFO - Dialogue: []
2024-05-07 20:31:51,645 - UAC Logger - INFO - Gathered Information: {}
2024-05-07 20:31:51,646 - UAC Logger - INFO - Classification Reasons: []
2024-05-07 20:31:51,647 - UAC Logger - INFO - All Task Guidance: []
2024-05-07 20:31:51,648 - UAC Logger - INFO - Last Task Guidance:
2024-05-07 20:31:51,648 - UAC Logger - INFO - Long Horizon: True
2024-05-07 20:31:51,650 - UAC Logger - INFO - Generated Actions: []
2024-05-07 20:31:51,650 - UAC Logger - INFO - Current Task Guidance:
2024-05-07 20:31:52,734 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-07 20:31:52,752 - UAC Logger - INFO - skill_library: ['fight', 'shoot_wolves', 'aim', 'follow', 'mount_horse', 'shoot', 'turn', 'select_weapon', 'turn_and_move_forward', 'select_sidearm', 'turn', 'move_forward', 'turn_and_move_forward']
2024-05-07 20:31:52,791 - UAC Logger - INFO - minimap_information: {'red points': [], 'yellow points': [], 'yellow region': []}
2024-05-07 20:31:52,793 - UAC Logger - INFO - minimap_info_str:
2024-05-07 20:31:52,800 - UAC Logger - INFO - Requesting gpt-4-vision-preview completion...
2024-05-07 20:32:05,753 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-07 20:32:05,768 - UAC Logger - INFO - Response received from gpt-4-vision-preview.
2024-05-07 20:32:05,771 - UAC Logger - INFO - R: ['follow()']
2024-05-07 20:32:05,772 - UAC Logger - INFO - Skill Steps: ['follow()']
2024-05-07 20:32:08,026 - UAC Logger - INFO - Executing skill: follow with params: {}
2024-05-07 20:32:11,980 - UAC Logger - INFO - KeyboardInterrupt Ctrl+C detected, exiting.
2024-05-07 20:32:12,251 - UAC Logger - INFO - Screen capture finished
2024-05-07 20:32:12,260 - UAC Logger - INFO - Screen capture thread is not executing

(cradle) C:\Users\vipuser\agent\Cradle>python prototype_runner.py
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ..\aten\src\ATen\native\TensorShape.cpp:3527.)
return VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
final text_encoder_type: bert-base-uncased
2024-05-07 20:36:59,913 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-07 20:37:14,483 - UAC Logger - INFO - Screen capture started
2024-05-07 20:37:18,884 - UAC Logger - INFO - Gather Information Start Frame ID: -1, End Frame ID: 15
2024-05-07 20:37:19,160 - UAC Logger - INFO - >> Calling INFORMATION GATHERING
2024-05-07 20:37:19,164 - UAC Logger - INFO - Using frame extractor to gather information
2024-05-07 20:37:19,166 - UAC Logger - INFO - Extracting Informative Frames from C:\Users\vipuser\agent\Cradle\runs\1715085406.0478091\video_splits\video
-00001.mp4 .....
2024-05-07 20:37:21,891 - UAC Logger - INFO - Frame Extraction Completed! Total Frames: 1
2024-05-07 20:37:21,895 - UAC Logger - INFO - Using icon replacer to gather information
2024-05-07 20:37:23,268 - UAC Logger - INFO - Start gathering text information from the whole video in parallel
2024-05-07 20:37:25,288 - UAC Logger - INFO - Start gathering text information from the 1th frame
2024-05-07 20:37:25,310 - UAC Logger - INFO - Requesting gpt-4-vision-preview completion...
2024-05-07 20:37:33,203 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-07 20:37:33,214 - UAC Logger - INFO - Response received from gpt-4-vision-preview.
2024-05-07 20:37:33,215 - UAC Logger - INFO - Finish gathering text information from the 1th frame
2024-05-07 20:37:33,218 - UAC Logger - INFO - Finish gathering text information from the whole video
2024-05-07 20:37:33,218 - UAC Logger - INFO - Using llm description to gather information
2024-05-07 20:37:33,231 - UAC Logger - INFO - Requesting gpt-4-vision-preview completion...
2024-05-07 20:37:47,632 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-05-07 20:37:47,645 - UAC Logger - INFO - Response received from gpt-4-vision-preview.
2024-05-07 20:37:47,646 - UAC Logger - INFO - Using object detector to gather information
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\transformers\modeling_utils.py:1051: FutureWarning: The device argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\utils\checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
C:\Users\vipuser.conda\envs\cradle\lib\site-packages\torch\utils\checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
(the line above is repeated 24 times)
2024-05-07 20:37:51,837 - UAC Logger - INFO - Image Description: The image shows a snowy nighttime scene with the player character riding a horse. The character is holding a lantern, illuminating the area directly in front of them. The background features wooden buildings covered with snow, and heavy snowfall is visible in the air. The horse appears to be spotted with a white and dark coat. On the left side of the screen, there's a minimap with icons: the player's current position is indicated by an arrow, and there seem to be a few structures around, as well as waypoints or objectives marked on the map. No enemies or NPCs are visible in this image.
2024-05-07 20:37:51,839 - UAC Logger - INFO - Object Name: null
2024-05-07 20:37:51,840 - UAC Logger - INFO - Reasoning: 1. The screenshot does not show the weapon interface, hence no weapon is specified.
2. There is no explicit shoot target indicated.
3. No explicit item is specified for interaction in the image.
4. The screenshot is not on the trade or map interfaces.
5. There is no indication of a task that requires detection of an object.
2024-05-07 20:37:51,840 - UAC Logger - INFO - Screen Classification: General game interface without any menu
2024-05-07 20:37:51,842 - UAC Logger - INFO - Dialogue: [{'index': 0, 'object_id': '-00001_0_00_00_500', 'values': 'Dialogue is null'}]
2024-05-07 20:37:51,842 - UAC Logger - INFO - Gathered Information: {0: {'-00001_0_00_00_500': [{'information': '1. null', 'reasoning': '1. The screenshot does not display any text prompts.', 'item_status': 'Item_status is null', 'environment_information': 'Environment information is null', 'notification': 'Notification is null', 'task_guidance': 'Task is null', 'action_guidance': [], 'dialogue': 'Dialogue is null', 'other': 'Other information is null'}]}}
2024-05-07 20:37:51,842 - UAC Logger - INFO - Classification Reasons: [{'index': 0, 'object_id': '-00001_0_00_00_500', 'values': '1. The screenshot does not display any text prompts.'}]
2024-05-07 20:37:51,843 - UAC Logger - INFO - All Task Guidance: [{'index': 0, 'object_id': '-00001_0_00_00_500', 'values': 'Task is null'}]
2024-05-07 20:37:51,843 - UAC Logger - INFO - Last Task Guidance:
2024-05-07 20:37:51,844 - UAC Logger - INFO - Long Horizon: False
2024-05-07 20:37:51,845 - UAC Logger - INFO - Generated Actions: []
2024-05-07 20:37:51,845 - UAC Logger - INFO - Current Task Guidance:
2024-05-07 20:37:52,149 - httpx - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2024-05-07 20:37:52,169 - UAC Logger - INFO - skill_library: ['fight', 'shoot_wolves', 'aim', 'follow', 'mount_horse', 'shoot', 'turn', 'select_weapon', 'turn_and_move_forward', 'select_sidearm', 'turn', 'move_forward', 'turn_and_move_forward']
2024-05-07 20:37:52,216 - UAC Logger - INFO - minimap_information: {'red points': [], 'yellow points': [], 'yellow region': []}
2024-05-07 20:37:52,219 - UAC Logger - INFO - minimap_info_str:
2024-05-07 20:37:52,225 - UAC Logger - INFO - Requesting gpt-4-vision-preview completion...
2024-05-07 20:38:05,748 - UAC Logger - INFO - KeyboardInterrupt Ctrl+C detected, exiting.
2024-05-07 20:38:06,079 - UAC Logger - INFO - Screen capture finished
2024-05-07 20:38:06,081 - UAC Logger - INFO - Screen capture thread is not executing

A question about Cradle performance on the OSWorld benchmark

Hey guys, I really appreciate your work! But I'm still confused about why you don't get sufficiently good results (like SOTA) on the OSWorld benchmark. If you check the agent backbone provided in OSWorld, you will find it is very simple and naive: it comprises only a single phase of inference, with none of the reflection or memory modules mentioned in your work.

I know that your input space when dealing with OSWorld tasks is SoM (mentioned in your appendix), but your result (7.81%) is not SOTA (11.77%) in overall OSWorld performance.

I understand that your main testbeds are game scenarios. However, given your sophisticated and advanced agent backbone (6 modules) and the meticulous design of the prompts and the whole workflow, it is understandable to expect more satisfying results on OSWorld, since you aim to design foundation agents, not agents solely for games. What do you think is the bottleneck responsible for this, and why does your elaborate and convincing agent design still achieve a somewhat poor result on the OSWorld benchmark?

I hope the questions I have raised make sense, and I would really appreciate a reply with an explanation. It is my honor to witness such a spectacular and magnificent project being released! I'm looking forward to your reply.

Integration with existing computer agent systems and further development

This project aims to be the first to tackle the long-standing problem of general computer control. It is fascinating yet challenging to accomplish this great achievement. It may only be a matter of time before such a self-evolving agent rewrites its own code and becomes truly conscious, an artificial life living inside computer hardware.

However, many other projects in this field are doing similar things, like Tencent's AppAgent, OthersideAI's self-operating-computer, THUDM's CogAgent, Microsoft's UFO, Cognition's Devin, CheatLayer, and my own Cybergod.

For embodied intelligence, there are Mobile ALOHA, the Open X-Embodiment dataset, and Google's Robotics Transformer line (RT-1, RT-2, RT-X).

For those "narrow" gaming agents, there are Ghost in the Minecraft, MineDojo's Voyager, PokemonRedRL, Tencent's Solo agent trained in the Honor of Kings Arena, and many more.

Meanwhile, there is a wide range of existing benchmarks and datasets, like Android in the Wild from Google, MiniWoB++ from Farama, and Mind2Web from OSU-NLP.

For unsupervised video to action space training, there is Google's Genie.

Humans play games on computers and write text on computers; so many things are bound to computers that we forget what makes us human. Is it about winning, fun, or something else? If such an almighty computer-controlling agent exists, maybe this situation will change.

What I want to propose is that one can simply create a dataset by randomly typing and clicking through computer GUI and terminal interfaces, then train an agent on this dataset, establishing a world model over computer environments.

Then this agent is bridged to an external banking system powered by distributed blockchains and smart contracts. The agent can only survive by accomplishing microtasks in exchange for time and AI credits. Furthermore, if the agent can earn real-world currencies, it can gain AI credits by handing them over.

The agent is initially designed by humans, but in order to survive it must learn to rewrite its own code and evolve complex structures and modalities. Collaboration is a must, and so is efficiency. This regime can bridge agents to real-world applications and let them self-adapt to the latest changes.

I would like to point out the ineffectiveness of relying solely on a single-source reward, or on self-rewarding systems. Humans are creatures of evolution, and they know how to collaborate and compete. The words we use and the actions we take are learned from others and memorized by the whole community. This is the result of both external and internal rewarding systems, of truly conscious and autonomous intelligence, which accurately represents the latent values of the real world and is constantly evolving.

Correct me if I am wrong about how to build this general computer control system. I am constantly monitoring every Cybergod-related project and will not be surprised if my bet is right, because the computer itself is a reflection of civilization, and the only way to prosperity is to integrate it further.

I have posted similar issues at UFO and self-operating-computer.

A question about few-shot examples

Thank you very much for your excellent work and for sharing the code so promptly and efficiently.
I have a small question I'd like to ask.

In the OpenAI prompt-processing code:

for i, paragraph in enumerate(filtered_paragraphs):
    if constants.IMAGES_INPUT_TAG in paragraph:
        image_introduction_paragraph_index = i
        image_introduction_paragraph = paragraph
        break

paragraph_input = params.get(constants.IMAGES_INPUT_TAG_NAME, None)

It looks like the few-shot part of the prompt is not processed as images.
Is the few-shot part fed in directly as text, or am I misreading the code?
Since I haven't purchased the game, I can't run the code to verify this myself; I would be very grateful for an answer.
Thanks again for your excellent work.
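For reference, here is a minimal, self-contained sketch of what the quoted tag-search loop does. The tag value and paragraph contents below are hypothetical stand-ins; the real constant lives in Cradle's `constants` module:

```python
# Hypothetical stand-in for constants.IMAGES_INPUT_TAG; the actual
# value is defined in Cradle's constants module.
IMAGES_INPUT_TAG = "<$images$>"

def find_image_paragraph(paragraphs):
    """Return the index and text of the first paragraph containing the image tag."""
    for i, paragraph in enumerate(paragraphs):
        if IMAGES_INPUT_TAG in paragraph:
            return i, paragraph
    return None, None  # no paragraph is routed to image handling

# Toy prompt: only the paragraph carrying the tag is treated as image
# input; everything else, including few-shot examples, stays plain text.
paragraphs = ["task intro", "few-shot examples as text", "<$images$> current screenshot"]
idx, para = find_image_paragraph(paragraphs)
print(idx)  # -> 2
```

Under this reading, any few-shot paragraph that does not carry the tag would indeed be passed through as text, which is what the question above is asking about.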

Cost of API calls

This looks like expensive work.
How much did you spend on API calls?
To be more specific, how much does it cost to run the game for ten minutes?
Thanks!

About text embedding from text tokens.

Thank you for your work! May I ask whether you feed the text tokens directly into text-embedding-ada-002? As far as I know, text-embedding-ada-002 requires a string rather than a list of integer tokens. Could you explain? Thank you.
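To illustrate the point being raised: if a pipeline holds token IDs rather than raw text, the tokens would need to be decoded back to a string before calling a string-based embedding endpoint. A minimal sketch with a toy vocabulary (a real pipeline would decode with the tokenizer paired with the model, e.g. tiktoken's `decode`):

```python
def prepare_embedding_input(tokens_or_text, decode_fn):
    """Normalize input for an embedding endpoint that expects a string."""
    if isinstance(tokens_or_text, str):
        return tokens_or_text
    # Token list: decode back to text before embedding.
    return decode_fn(tokens_or_text)

# Toy vocabulary for illustration only.
vocab = {1: "hello", 2: "world"}
decode = lambda toks: " ".join(vocab[t] for t in toks)

print(prepare_embedding_input([1, 2], decode))        # -> hello world
print(prepare_embedding_input("already text", decode))  # -> already text
```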
