x-plug / mobileagent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Home Page: https://arxiv.org/abs/2406.01014

License: MIT License

Python 100.00%
agent gpt4v mllm mobile-agents multimodal multimodal-large-language-models multimodal-agent android app gui

mobileagent's People

Contributors

aptsunny, auhowielau, eltociear, jingxuanchen916, junyangwang0410, kx-kexi, xhyandwyy, zhangxi1997, zhiyuan8

mobileagent's Issues

ImportError

Hi,
I get the following error message:

ImportError: cannot import name '_datasets_server' from 'datasets.utils'

PS:
Name: datasets
Version: 2.19.2

Any ideas how to fix this?

Thank you.
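One possibility (an assumption based on the error message, not a confirmed fix): some dependency in the modelscope/Mobile-Agent chain imports the private datasets.utils._datasets_server module, which the installed datasets 2.19.2 no longer provides. Downgrading datasets to an earlier 2.x release (or updating the package that does the import) may resolve it; the exact versions need trial. A small check of what is actually installed:

# Diagnostic only: report the installed datasets version and whether the
# private module that the traceback complains about is importable.
import importlib
from importlib.metadata import version

print("datasets version:", version("datasets"))
try:
    importlib.import_module("datasets.utils._datasets_server")
    print("datasets.utils._datasets_server is importable")
except ImportError as exc:
    print("not importable:", exc)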

Why do I get lots of 429 network errors while running, and how can I deal with them?

Action: click icon (three vertical dots, top right, center)
One round of operations:
ACT: tap (993,599)
One request was made:
Network Error:
<Response [429]>
Network Error:
<Response [429]>
(the same "Network Error: <Response [429]>" pair repeats several more times)

First, I know that 429 means too many requests in a short time.
But I want to know: am I the only one running into this error? And how can I deal with it?
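A 429 response is rate limiting by the API provider, so besides slowing down the request rate (or raising the account's quota), the usual workaround is to retry with exponential backoff. A minimal sketch, assuming api_url, headers, and data are the same values passed to requests.post in MobileAgent/api.py:

import time
import requests

def post_with_backoff(api_url, headers, data, max_retries=5, base_delay=2.0):
    # POST to the chat endpoint, backing off whenever the server answers 429.
    for attempt in range(max_retries):
        res = requests.post(api_url, headers=headers, json=data)
        if res.status_code != 429:
            return res
        wait = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
        print(f"Got 429, retrying in {wait:.0f}s (attempt {attempt + 1})")
        time.sleep(wait)
    return res  # still 429 after max_retries; let the caller decide what to do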

When will Chinese support be available?

Amazing! Thanks for sharing. My phone really can operate automatically. But when will Chinese scenarios be supported? Text recognition in Chinese apps is not accurate.

Cannot find the screenshot.jpg file in the screenshot directory.

2024-06-07 11:28:54,386 - modelscope - INFO - loading model done
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:286 in │
│ │
│ 283 │ iter += 1 │
│ 284 │ if iter == 1: │
│ 285 │ │ screenshot_file = "./screenshot/screenshot.jpg" │
│ ❱ 286 │ │ perception_infos, width, height = get_perception_infos(adb_path, screenshot_file │
│ 287 │ │ shutil.rmtree(temp_file) │
│ 288 │ │ os.mkdir(temp_file) │
│ 289 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:175 in get_perception_infos │
│ │
│ 172 │
│ 173 │
│ 174 def get_perception_infos(adb_path, screenshot_file): │
│ ❱ 175 │ get_screenshot(adb_path) │
│ 176 │ │
│ 177 │ width, height = Image.open(screenshot_file).size │
│ 178 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\MobileAgent\controller.py:49 in get_screenshot │
│ │
│ 46 │ subprocess.run(command, capture_output=True, text=True, shell=True) │
│ 47 │ image_path = "./screenshot/screenshot.png" │
│ 48 │ save_path = "./screenshot/screenshot.jpg" │
│ ❱ 49 │ image = Image.open(image_path) │
│ 50 │ image.convert("RGB").save(save_path, "JPEG") │
│ 51 │ os.remove(image_path) │
│ 52 │
│ │
│ C:\Users\lxc\AppData\Local\Programs\Python\Python310\lib\site-packages\PIL\Image.py:3277 in open │
│ │
│ 3274 │ │ filename = os.path.realpath(os.fspath(fp)) │
│ 3275 │ │
│ 3276 │ if filename: │
│ ❱ 3277 │ │ fp = builtins.open(filename, "rb") │
│ 3278 │ │ exclusive_fp = True │
│ 3279 │ │
│ 3280 │ try: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory:
'D:\github-app\MobileAgent\Mobile-Agent-v2\screenshot\screenshot.png'

I looked at the code and it does try to read this file, but I am not sure what this behavior means. My guess is that the project needs a screenshot of the phone screen before GPT-4o can work, and only then can it proceed. Is that right?
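To answer the question: yes. The traceback shows that get_perception_infos() calls get_screenshot(adb_path), which captures a screenshot via adb and converts it to screenshot.jpg before anything is sent to GPT-4o, so the agent needs a fresh screen capture at every perception step. Because controller.py runs the adb command with capture_output=True, a failing adb call is silent and only surfaces later as this FileNotFoundError. A small diagnostic sketch (not repository code; adb_path is whatever you configured in run.py) to verify that adb can actually capture and pull a screenshot:

import os
import subprocess

adb_path = "adb"  # replace with the adb path configured in run.py
os.makedirs("./screenshot", exist_ok=True)

# Capture on the device, pull it to ./screenshot/, then remove the device copy.
subprocess.run(f"{adb_path} shell screencap -p /sdcard/screenshot.png", shell=True, check=True)
subprocess.run(f"{adb_path} pull /sdcard/screenshot.png ./screenshot/screenshot.png", shell=True, check=True)
subprocess.run(f"{adb_path} shell rm /sdcard/screenshot.png", shell=True, check=True)

print("exists:", os.path.exists("./screenshot/screenshot.png"))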

MobileAgent with GPT-4o

I am testing some features with MobileAgent:

gpt-4o
python 3.10
openai 1.30
base_url: https://api.nextapi.fun/v1/chat/completions

After running, I found that the returned result is empty; the request merely succeeds:

res = requests.post(api_url, headers=headers, json=data)
print('================= res-start: =============')
print(res.status_code)
if res.status_code == 200:
    print('Request succeeded:', res.text)
    print('content:', res.content)
print('================= res-end: ==============')
res = res.json()['choices'][0]['message']['content']

Output:

================= res-start: =============
<Response [200]>
200
Request succeeded: json{}

content: b'json\n{}\n'
================= res-end: ==============

Steps

Execution result

It should return the JSON content correctly, but it returns an empty JSON instead.
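One defensive change (an assumption about where to look, not a fix for the relay itself): the relay answered 200 but with the literal body json{} instead of an OpenAI-style payload, which points at the relay or model rather than at Mobile-Agent. Validating the payload before indexing into ['choices'][0]['message']['content'] at least makes the real failure visible:

import requests

def extract_content(res):
    # Return the assistant message, or raise with the raw body for debugging.
    res.raise_for_status()
    try:
        payload = res.json()
    except ValueError:
        raise RuntimeError(f"Relay did not return JSON; raw body: {res.text!r}")
    if not payload.get("choices"):
        raise RuntimeError(f"Relay returned no choices; raw body: {res.text!r}")
    return payload["choices"][0]["message"]["content"]

# res = requests.post(api_url, headers=headers, json=data)
# content = extract_content(res)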

TypeError: annotate() got an unexpected keyword argument 'labels'

Could you please take a look at what is causing the error below? Python version: 3.9.13, OS: Windows 10.
Traceback (most recent call last):
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 286, in
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 190, in get_perception_infos
coordinates = det(screenshot_file, "icon", groundingdino_model)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\MobileAgent\icon_localization.py", line 45, in det
result = groundingdino_model(inputs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 220, in call
output = self._process_single(input, *args, **kwargs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 255, in _process_single
out = self.forward(out, **forward_params)
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 35, in forward
return self.model(inputs,**forward_params)
File "D:\Env\Python\lib\site-packages\modelscope\models\base\base_torch_model.py", line 36, in call
return self.postprocess(self.forward(*args, **kwargs))
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 66, in forward
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\groundingdino\util\inference.py", line 97, in annotate
annotated_frame = box_annotator.annotate(
File "D:\Env\Python\lib\site-packages\supervision\utils\conversion.py", line 23, in wrapper
return annotate_func(self, scene, *args, **kwargs)
TypeError: annotate() got an unexpected keyword argument 'labels'

A question about the memory unit

Looking at the code, it seems that the planning agent and the reflection agent do not take the memory as input?
Also, your demo runs very fast, but when I actually call the API, each step takes around 7-8 seconds. Is that expected?

MobileAgent V2 error message

Traceback (most recent call last):
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 297, in
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 230, in get_perception_infos
icon_map = generate_api(images, prompt)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 127, in generate_api
response = future.result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 438, in result
return self.__get_result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 109, in process_image
print("process_image->text", response.text)
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 59, in getattr
return self[attr]
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 15, in getitem
return super().getitem(key)
KeyError: 'text'
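A guess at the failure mode (not a confirmed diagnosis): DashScope response objects raise KeyError when an attribute such as text is missing, which typically happens when the call itself did not succeed (quota, auth, or model-name problems). Checking the response status before printing usually surfaces the service's own error message. A minimal sketch, assuming response is whatever the DashScope call in generate_api() returned:

from http import HTTPStatus

def check_dashscope_response(response):
    # Raise with the service's own error message instead of a bare KeyError.
    if response.status_code != HTTPStatus.OK:
        raise RuntimeError(f"DashScope call failed: {response.code} - {response.message}")
    return response.output  # inspect this structure before indexing into it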

Is the GPT-4o API_url required?

Hello, I see that in v2's run.py you need to fill in the GPT-4o API_url and token. Are these two parameters required? If I use Qwen's qwen_api, do I no longer need the GPT-4o one?

Code error

in_coordinate, out_coordinate = det(image, "icon", groundingdino_model)
This call unpacks two return values, but the function only returns one. Is this a bug? The call is at line 149 of Mobile-Agent/run.py, and the function is defined as:
def det(input_image_path, caption, groundingdino_model, box_threshold=0.05, text_threshold=0.5):
    image = Image.open(input_image_path)
    size = image.size

    caption = caption.lower()
    caption = caption.strip()
    if not caption.endswith('.'):
        caption = caption + '.'

    inputs = {
        'IMAGE_PATH': input_image_path,
        'TEXT_PROMPT': caption,
        'BOX_TRESHOLD': box_threshold,
        'TEXT_TRESHOLD': text_threshold
    }

    result = groundingdino_model(inputs)
    print(result)
    boxes_filt = result['boxes']

    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
        boxes_filt[i][2:] += boxes_filt[i][:2]

    boxes_filt = boxes_filt.cpu().int().tolist()
    filtered_boxes = remove_boxes(boxes_filt, size)  # [:9]
    coordinates = []
    for box in filtered_boxes:
        coordinates.append([box[0], box[1], box[2], box[3]])

    return coordinates

Memory issue

Is there any problem with the memory part? The variable insight here seems to be an empty string all the time. What is the purpose of this variable?

if memory_switch:
    prompt_memory = get_memory_prompt(insight)
    chat_action = add_response("user", prompt_memory, chat_action)
    output_memory = inference_chat(chat_action, 'gpt-4o', API_url, token)
    chat_action = add_response("assistant", output_memory, chat_action)
    status = "#" * 50 + " Memory " + "#" * 50
    print(status)
    print(output_memory)
    print('#' * len(status))
    output_memory = output_memory.split("### Important content ###")[-1].split("\n\n")[0].strip() + "\n"
    if "None" not in output_memory and output_memory not in memory:
        memory += output_memory

Mobile-Agent-v2 can't type even when ADB Keyboard is activated

Hi, thanks for the kind open-sourcing. I found an issue when running my experiments, and I am wondering whether it is something wrong on my side or a potential corner case for the code, so I would like to discuss it here.

I found that even when my ADB Keyboard is activated, the keyboard variable still shows as False, which affects the get_action_prompt() function. This causes the agent to perceive that the keyboard is not activated, preventing the agent from choosing the Type action. Below is an example of the issue:

Unable to Type. You cannot use the action "Type" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.

I then tried to debug and found the related code:

if iter == 1:
    keyboard = False
    for perception_info in perception_infos:
        if perception_info['coordinates'][1] < 0.95 * height:
            continue
        if 'ADB Keyboard' in perception_info['text']:
            keyboard = True
            break

Based on this code, there are two reasons why the agent cannot type on my side:

  1. Line 284: The keyboard variable can only be switched to True in the first iteration. However, in my case (which might differ for different Android phones), my agent can only observe the ADB Keyboard {ON} entry once it can input something (e.g., the search box is already focused), which is almost impossible in the first iteration. Therefore, the keyboard variable is always False for the agent.
  2. Line 292: The switch might be skipped if the condition is not satisfied. In my case (due to the phone I am using, a Google Pixel 8 Pro), the location where ADB Keyboard {ON} appears is too high on the screen to satisfy the condition. When I make the threshold smaller (e.g., 0.8), the issue is fixed. (A sketch of both adjustments follows below.)
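A minimal sketch of the two adjustments described above (my reading of the report, not code from the repository): re-check the keyboard state on every iteration instead of only when iter == 1, and relax the 0.95 * height threshold so that ADB Keyboard {ON} is still detected when it appears higher on the screen:

KEYBOARD_Y_THRESHOLD = 0.8  # 0.8 instead of 0.95 worked on the reporter's Pixel 8 Pro

def detect_adb_keyboard(perception_infos, height, threshold=KEYBOARD_Y_THRESHOLD):
    # True if any perception entry in the lower part of the screen mentions the ADB Keyboard.
    for info in perception_infos:
        if info['coordinates'][1] < threshold * height:
            continue
        if 'ADB Keyboard' in info['text']:
            return True
    return False

# called once per iteration, not only when iter == 1:
# keyboard = detect_adb_keyboard(perception_infos, height)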

Though this issue might be a rare case, I would greatly appreciate it if you could share some comments about it.

Many thanks :D

Qwen API

Hello. Thank you for your great research.
I want to use Mobile-Agent-v2, and I am wondering whether the Qwen API is essential for using it.
Since I am an international user, it seems impossible for me to obtain a Qwen API key.
Could you let me know if this means I cannot use this model?

I know that version 1 could be used with just the GPT API, but I would like to know whether version 2 can be used without the Qwen API.

Looking for technical experts to build screen-AI products on top of Mobile Agent

We believe the screen-AI space will have a lot of room to grow. We have identified user pain-point scenarios and the company is already profitable. We are now inviting interested engineers to join us and build a new product based on Mobile Agent. If interested, contact us directly on WeChat: douyinzhangfen69

Some risky code

Hi team,

this is excellent work, but I found some risky code that breaks execution:

  1. Here the image is in RGBA format, so the following needs to be added:
    https://github.com/X-PLUG/MobileAgent/blob/main/MobileAgent/crop.py#L83-L84
    if cropped_image.mode == 'RGBA':
        cropped_image = cropped_image.convert('RGB')
  2. res is not defined in the exception handler
    https://github.com/X-PLUG/MobileAgent/blob/main/MobileAgent/api.py#L30-L31
    change it to:
        try:
            res = requests.post(api_url, headers=headers, json=data)
            res = res.json()['choices'][0]['message']['content']
        except Exception as e:
            print(f"Network Error: {e}")

Am I able to contribute to this repo, too? I am a software engineer from Google: https://www.linkedin.com/in/zack-z-li

How to improve the execution speed of OCR, grounding-dino, and chatgpt-4o models to transition mobile-agent from laboratory research to engineering use?

  1. I replaced the original grounding-dino model with a GPU-supported version, reducing the time required from about 7 seconds to just 0.2 seconds. For more details on the GPU version of grounding-dino, please refer to: https://github.com/IDEA-Research/GroundingDINO (a sketch of this setup follows below).
  2. For the OCR model, is there a similarly faster GPU-supported version? Currently, each OCR operation takes approximately 3 seconds.
  3. For calling chatgpt-4o, do you have any suggestions for improving its execution speed? At present, each call takes approximately 6-7 seconds.

Looking forward to your response.
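Regarding point 1, a hedged sketch of loading the IDEA-Research GroundingDINO package directly on GPU (the config and checkpoint paths are placeholders from that repository's README, this bypasses the modelscope pipeline that Mobile-Agent ships with, and the thresholds mirror the 0.05 / 0.5 used in det()):

import torch
from groundingdino.util.inference import load_model, load_image, predict

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",              # downloaded checkpoint
    device=device,
)

image_source, image = load_image("./screenshot/screenshot.jpg")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="icon.",
    box_threshold=0.05,
    text_threshold=0.5,
    device=device,
)
print(boxes.shape, phrases)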

AttributeError: module 'tensorflow' has no attribute '__version__'

Following the README, I got "zsh: illegal hardware instruction".
Then, following another issue, I changed the TensorFlow version to tensorflow-macos==2.9.
Running again produces the error below:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/utils/import_utils.py", line 451, in _get_module
return importlib.import_module('.' + module_name, self.__name__)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/pipelines/cv/ocr_utils/ops.py", line 16, in <module>
if tf.__version__ >= '2.0':
AttributeError: module 'tensorflow' has no attribute '__version__'
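A quick sanity check (an assumption about the cause, not a confirmed fix): this error usually means the TensorFlow installation is broken or shadowed, for example an empty tensorflow namespace package left behind by a previous install alongside tensorflow-macos. Checking what actually gets imported narrows it down:

import tensorflow as tf

# Where the package is loaded from, and whether it exposes a version;
# None here points at a broken or partial install rather than at modelscope.
print(tf.__file__)
print(getattr(tf, "__version__", None))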
