x-plug / mobileagent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

Home Page: https://arxiv.org/abs/2406.01014

License: MIT License

Python 100.00%
agent gpt4v mllm mobile-agents multimodal multimodal-large-language-models multimodal-agent android app gui

mobileagent's People

Contributors

aptsunny, auhowielau, eltociear, jingxuanchen916, junyangwang0410, kx-kexi, xhyandwyy, zhangxi1997, zhiyuan8

mobileagent's Issues

ImportError

Hi,
I get the following error message:

ImportError: cannot import name '_datasets_server' from 'datasets.utils'

PS:
Name: datasets
Version: 2.19.2

Any ideas how to fix this?

Thank you.
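One possibility (an assumption based on the error message, not a confirmed fix): some dependency in the modelscope/Mobile-Agent chain imports the private datasets.utils._datasets_server module, which the installed datasets 2.19.2 no longer provides. Downgrading datasets to an earlier 2.x release (or updating the package that does the import) may resolve it; the exact versions need trial. A small check of what is actually installed:

# Diagnostic only: report the installed datasets version and whether the
# private module that the traceback complains about is importable.
import importlib
from importlib.metadata import version

print("datasets version:", version("datasets"))
try:
    importlib.import_module("datasets.utils._datasets_server")
    print("datasets.utils._datasets_server is importable")
except ImportError as exc:
    print("not importable:", exc)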

Why do I get lots of 429 network errors while running, and how can I deal with them?

Action: click icon (three vertical dots, top right, center)
One round of operations:
ACT: tap (993,599)
One request was made:
Network Error:
<Response [429]>
Network Error:
<Response [429]>
(the same "Network Error: <Response [429]>" pair repeats several more times)

First, I know that 429 means too many requests in a short time.
But I want to know: am I the only one running into this error? And how can I deal with it?
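A 429 response is rate limiting by the API provider, so besides slowing down the request rate (or raising the account's quota), the usual workaround is to retry with exponential backoff. A minimal sketch, assuming api_url, headers, and data are the same values passed to requests.post in MobileAgent/api.py:

import time
import requests

def post_with_backoff(api_url, headers, data, max_retries=5, base_delay=2.0):
    # POST to the chat endpoint, backing off whenever the server answers 429.
    for attempt in range(max_retries):
        res = requests.post(api_url, headers=headers, json=data)
        if res.status_code != 429:
            return res
        wait = base_delay * (2 ** attempt)  # 2s, 4s, 8s, ...
        print(f"Got 429, retrying in {wait:.0f}s (attempt {attempt + 1})")
        time.sleep(wait)
    return res  # still 429 after max_retries; let the caller decide what to do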

When will Chinese support be available?

Amazing! Thanks for sharing. My phone really can operate automatically. But when will Chinese scenarios be supported? Text recognition in Chinese apps is not accurate.

Cannot find the screenshot.jpg file in the screenshot directory.

2024-06-07 11:28:54,386 - modelscope - INFO - loading model done
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:286 in │
│ │
│ 283 │ iter += 1 │
│ 284 │ if iter == 1: │
│ 285 │ │ screenshot_file = "./screenshot/screenshot.jpg" │
│ ❱ 286 │ │ perception_infos, width, height = get_perception_infos(adb_path, screenshot_file │
│ 287 │ │ shutil.rmtree(temp_file) │
│ 288 │ │ os.mkdir(temp_file) │
│ 289 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:175 in get_perception_infos │
│ │
│ 172 │
│ 173 │
│ 174 def get_perception_infos(adb_path, screenshot_file): │
│ ❱ 175 │ get_screenshot(adb_path) │
│ 176 │ │
│ 177 │ width, height = Image.open(screenshot_file).size │
│ 178 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\MobileAgent\controller.py:49 in get_screenshot │
│ │
│ 46 │ subprocess.run(command, capture_output=True, text=True, shell=True) │
│ 47 │ image_path = "./screenshot/screenshot.png" │
│ 48 │ save_path = "./screenshot/screenshot.jpg" │
│ ❱ 49 │ image = Image.open(image_path) │
│ 50 │ image.convert("RGB").save(save_path, "JPEG") │
│ 51 │ os.remove(image_path) │
│ 52 │
│ │
│ C:\Users\lxc\AppData\Local\Programs\Python\Python310\lib\site-packages\PIL\Image.py:3277 in open │
│ │
│ 3274 │ │ filename = os.path.realpath(os.fspath(fp)) │
│ 3275 │ │
│ 3276 │ if filename: │
│ ❱ 3277 │ │ fp = builtins.open(filename, "rb") │
│ 3278 │ │ exclusive_fp = True │
│ 3279 │ │
│ 3280 │ try: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory:
'D:\github-app\MobileAgent\Mobile-Agent-v2\screenshot\screenshot.png'

I looked at the code and it does try to read this file, but I am not sure what this behavior means. My guess is that the project needs a screenshot of the phone screen before GPT-4o can work, and only then can it proceed. Is that right?
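To answer the question: yes. The traceback shows that get_perception_infos() calls get_screenshot(adb_path), which captures a screenshot via adb and converts it to screenshot.jpg before anything is sent to GPT-4o, so the agent needs a fresh screen capture at every perception step. Because controller.py runs the adb command with capture_output=True, a failing adb call is silent and only surfaces later as this FileNotFoundError. A small diagnostic sketch (not repository code; adb_path is whatever you configured in run.py) to verify that adb can actually capture and pull a screenshot:

import os
import subprocess

adb_path = "adb"  # replace with the adb path configured in run.py
os.makedirs("./screenshot", exist_ok=True)

# Capture on the device, pull it to ./screenshot/, then remove the device copy.
subprocess.run(f"{adb_path} shell screencap -p /sdcard/screenshot.png", shell=True, check=True)
subprocess.run(f"{adb_path} pull /sdcard/screenshot.png ./screenshot/screenshot.png", shell=True, check=True)
subprocess.run(f"{adb_path} shell rm /sdcard/screenshot.png", shell=True, check=True)

print("exists:", os.path.exists("./screenshot/screenshot.png"))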

MobileAgent with GPT-4o

I am testing some features with MobileAgent:

gpt-4o
python 3.10
openai 1.30
base_url: https://api.nextapi.fun/v1/chat/completions

After running, I found that the returned result is empty; the request merely succeeds:

res = requests.post(api_url, headers=headers, json=data)
print('================= res-start: =============')
print(res.status_code)
if res.status_code == 200:
    print('Request succeeded:', res.text)
    print('content:', res.content)
print('================= res-end: ==============')
res = res.json()['choices'][0]['message']['content']

Output:

================= res-start: =============
<Response [200]>
200
Request succeeded: json{}

content: b'json\n{}\n'
================= res-end: ==============

Steps

Execution result

It should return the JSON content correctly, but it returns an empty JSON instead.
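One defensive change (an assumption about where to look, not a fix for the relay itself): the relay answered 200 but with the literal body json{} instead of an OpenAI-style payload, which points at the relay or model rather than at Mobile-Agent. Validating the payload before indexing into ['choices'][0]['message']['content'] at least makes the real failure visible:

import requests

def extract_content(res):
    # Return the assistant message, or raise with the raw body for debugging.
    res.raise_for_status()
    try:
        payload = res.json()
    except ValueError:
        raise RuntimeError(f"Relay did not return JSON; raw body: {res.text!r}")
    if not payload.get("choices"):
        raise RuntimeError(f"Relay returned no choices; raw body: {res.text!r}")
    return payload["choices"][0]["message"]["content"]

# res = requests.post(api_url, headers=headers, json=data)
# content = extract_content(res)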

TypeError: annotate() got an unexpected keyword argument 'labels'

Could you please take a look at what is causing the error below? Python version: 3.9.13, OS: Windows 10.
Traceback (most recent call last):
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 286, in
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 190, in get_perception_infos
coordinates = det(screenshot_file, "icon", groundingdino_model)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\MobileAgent\icon_localization.py", line 45, in det
result = groundingdino_model(inputs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 220, in call
output = self._process_single(input, *args, **kwargs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 255, in _process_single
out = self.forward(out, **forward_params)
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 35, in forward
return self.model(inputs,**forward_params)
File "D:\Env\Python\lib\site-packages\modelscope\models\base\base_torch_model.py", line 36, in call
return self.postprocess(self.forward(*args, **kwargs))
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 66, in forward
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
File "C:\Users\xiaomi.cache\modelscope\modelscope_modules\GroundingDINO\groundingdino\util\inference.py", line 97, in annotate
annotated_frame = box_annotator.annotate(
File "D:\Env\Python\lib\site-packages\supervision\utils\conversion.py", line 23, in wrapper
return annotate_func(self, scene, *args, **kwargs)
TypeError: annotate() got an unexpected keyword argument 'labels'

A question about the memory unit

Looking at the code, it seems that the planning agent and the reflection agent do not take the memory as input?
Also, your demo runs very fast, but when I actually call the API, each step takes around 7-8 seconds. Is that expected?

MobileAgent V2 error message

Traceback (most recent call last):
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 297, in
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 230, in get_perception_infos
icon_map = generate_api(images, prompt)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 127, in generate_api
response = future.result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 438, in result
return self.__get_result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 109, in process_image
print("process_image->text", response.text)
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 59, in getattr
return self[attr]
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 15, in getitem
return super().getitem(key)
KeyError: 'text'
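A guess at the failure mode (not a confirmed diagnosis): DashScope response objects raise KeyError when an attribute such as text is missing, which typically happens when the call itself did not succeed (quota, auth, or model-name problems). Checking the response status before printing usually surfaces the service's own error message. A minimal sketch, assuming response is whatever the DashScope call in generate_api() returned:

from http import HTTPStatus

def check_dashscope_response(response):
    # Raise with the service's own error message instead of a bare KeyError.
    if response.status_code != HTTPStatus.OK:
        raise RuntimeError(f"DashScope call failed: {response.code} - {response.message}")
    return response.output  # inspect this structure before indexing into it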

Is the GPT-4o API_url required?

Hello, I see that in v2's run.py you need to fill in the GPT-4o API_url and token. Are these two parameters required? If I use Qwen's qwen_api, do I no longer need the GPT-4o one?

Code error

in_coordinate, out_coordinate = det(image, "icon", groundingdino_model)
This call unpacks two return values, but the function only returns one. Is this a bug? The call is at line 149 of Mobile-Agent/run.py, and the function is defined as:
def det(input_image_path, caption, groundingdino_model, box_threshold=0.05, text_threshold=0.5):
    image = Image.open(input_image_path)
    size = image.size

    caption = caption.lower()
    caption = caption.strip()
    if not caption.endswith('.'):
        caption = caption + '.'

    inputs = {
        'IMAGE_PATH': input_image_path,
        'TEXT_PROMPT': caption,
        'BOX_TRESHOLD': box_threshold,
        'TEXT_TRESHOLD': text_threshold
    }

    result = groundingdino_model(inputs)
    print(result)
    boxes_filt = result['boxes']

    H, W = size[1], size[0]
    for i in range(boxes_filt.size(0)):
        boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
        boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
        boxes_filt[i][2:] += boxes_filt[i][:2]

    boxes_filt = boxes_filt.cpu().int().tolist()
    filtered_boxes = remove_boxes(boxes_filt, size)  # [:9]
    coordinates = []
    for box in filtered_boxes:
        coordinates.append([box[0], box[1], box[2], box[3]])

    return coordinates

Memory issue

Is there any problem with the memory part? The variable insight here seems to be an empty string all the time. What is the purpose of this variable?

if memory_switch:
    prompt_memory = get_memory_prompt(insight)
    chat_action = add_response("user", prompt_memory, chat_action)
    output_memory = inference_chat(chat_action, 'gpt-4o', API_url, token)
    chat_action = add_response("assistant", output_memory, chat_action)
    status = "#" * 50 + " Memory " + "#" * 50
    print(status)
    print(output_memory)
    print('#' * len(status))
    output_memory = output_memory.split("### Important content ###")[-1].split("\n\n")[0].strip() + "\n"
    if "None" not in output_memory and output_memory not in memory:
        memory += output_memory

Mobile-Agent-v2 can't type even when ADB Keyboard is activated

Hi, thanks for the kind open-sourcing. I found an issue when running my experiments, and I am wondering whether it is something wrong on my side or a potential corner case for the code, so I would like to discuss it here.

I found that even when my ADB Keyboard is activated, the keyboard variable still shows as False, which affects the get_action_prompt() function. This causes the agent to perceive that the keyboard is not activated, preventing the agent from choosing the Type action. Below is an example of the issue:

Unable to Type. You cannot use the action "Type" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.

I then tried to debug and found the related code:

if iter == 1:
    keyboard = False
    for perception_info in perception_infos:
        if perception_info['coordinates'][1] < 0.95 * height:
            continue
        if 'ADB Keyboard' in perception_info['text']:
            keyboard = True
            break

Based on this code, there are two reasons why the agent cannot type on my side:

  1. Line 284: The keyboard variable can only be switched to True in the first iteration. However, in my case (which might differ for different Android phones), my agent can only observe the ADB Keyboard {ON} entry once it can input something (e.g., the search box is already focused), which is almost impossible in the first iteration. Therefore, the keyboard variable is always False for the agent.
  2. Line 292: The switch might be skipped if the condition is not satisfied. In my case (due to the phone I am using, a Google Pixel 8 Pro), the location where ADB Keyboard {ON} appears is too high on the screen to satisfy the condition. When I make the threshold smaller (e.g., 0.8), the issue is fixed. (A sketch of both adjustments follows below.)
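A minimal sketch of the two adjustments described above (my reading of the report, not code from the repository): re-check the keyboard state on every iteration instead of only when iter == 1, and relax the 0.95 * height threshold so that ADB Keyboard {ON} is still detected when it appears higher on the screen:

KEYBOARD_Y_THRESHOLD = 0.8  # 0.8 instead of 0.95 worked on the reporter's Pixel 8 Pro

def detect_adb_keyboard(perception_infos, height, threshold=KEYBOARD_Y_THRESHOLD):
    # True if any perception entry in the lower part of the screen mentions the ADB Keyboard.
    for info in perception_infos:
        if info['coordinates'][1] < threshold * height:
            continue
        if 'ADB Keyboard' in info['text']:
            return True
    return False

# called once per iteration, not only when iter == 1:
# keyboard = detect_adb_keyboard(perception_infos, height)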

Though this issue might be a rare case, I would greatly appreciate it if you could share some comments about it.

Many thanks :D

Qwen API

Hello. Thank you for your great research.
I want to use Mobile-Agent-v2, and I am wondering whether the Qwen API is essential for using it.
Since I am an international user, it seems impossible for me to obtain a Qwen API key.
Could you let me know if this means I cannot use this model?

I know that version 1 could be used with just the GPT API, but I would like to know whether version 2 can be used without the Qwen API.

Looking for technical experts to build screen-AI products on top of Mobile Agent

We believe the screen-AI space will have a lot of room to grow. We have identified user pain-point scenarios and the company is already profitable. We are now inviting interested engineers to join us and build a new product based on Mobile Agent. If interested, contact us directly on WeChat: douyinzhangfen69

Some risky code

Hi team,

this is excellent work, but I found some risky code that breaks execution:

  1. Here the image is in RGBA format, so the following needs to be added:
    https://github.com/X-PLUG/MobileAgent/blob/main/MobileAgent/crop.py#L83-L84
    if cropped_image.mode == 'RGBA':
        cropped_image = cropped_image.convert('RGB')
  2. res is not defined in the exception handler
    https://github.com/X-PLUG/MobileAgent/blob/main/MobileAgent/api.py#L30-L31
    change it to:
        try:
            res = requests.post(api_url, headers=headers, json=data)
            res = res.json()['choices'][0]['message']['content']
        except Exception as e:
            print(f"Network Error: {e}")

Am I able to contribute to this repo, too? I am a software engineer from Google: https://www.linkedin.com/in/zack-z-li

How to improve the execution speed of OCR, grounding-dino, and chatgpt-4o models to transition mobile-agent from laboratory research to engineering use?

  1. I replaced the original grounding-dino model with a GPU-supported version, reducing the time required from about 7 seconds to just 0.2 seconds. For more details on the GPU version of grounding-dino, please refer to: https://github.com/IDEA-Research/GroundingDINO (a sketch of this setup follows below).
  2. For the OCR model, is there a similarly faster GPU-supported version? Currently, each OCR operation takes approximately 3 seconds.
  3. For calling chatgpt-4o, do you have any suggestions for improving its execution speed? At present, each call takes approximately 6-7 seconds.

Looking forward to your response.
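Regarding point 1, a hedged sketch of loading the IDEA-Research GroundingDINO package directly on GPU (the config and checkpoint paths are placeholders from that repository's README, this bypasses the modelscope pipeline that Mobile-Agent ships with, and the thresholds mirror the 0.05 / 0.5 used in det()):

import torch
from groundingdino.util.inference import load_model, load_image, predict

device = "cuda" if torch.cuda.is_available() else "cpu"
model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",              # downloaded checkpoint
    device=device,
)

image_source, image = load_image("./screenshot/screenshot.jpg")
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="icon.",
    box_threshold=0.05,
    text_threshold=0.5,
    device=device,
)
print(boxes.shape, phrases)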

AttributeError: module 'tensorflow' has no attribute '__version__'

Following the README, I got "zsh: illegal hardware instruction".
Then, following another issue, I changed the TensorFlow version to tensorflow-macos==2.9.
Running again produces the error below:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/utils/import_utils.py", line 451, in _get_module
return importlib.import_module('.' + module_name, self.__name__)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/pipelines/cv/ocr_utils/ops.py", line 16, in <module>
if tf.__version__ >= '2.0':
AttributeError: module 'tensorflow' has no attribute '__version__'
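A quick sanity check (an assumption about the cause, not a confirmed fix): this error usually means the TensorFlow installation is broken or shadowed, for example an empty tensorflow namespace package left behind by a previous install alongside tensorflow-macos. Checking what actually gets imported narrows it down:

import tensorflow as tf

# Where the package is loaded from, and whether it exposes a version;
# None here points at a broken or partial install rather than at modelscope.
print(tf.__file__)
print(getattr(tf, "__version__", None))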
