x-plug / mobileagent
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
Home Page: https://arxiv.org/abs/2406.01014
License: MIT License
Hi,
I get the following error message:
ImportError: cannot import name '_datasets_server' from 'datasets.utils'
PS:
Name: datasets
Version: 2.19.2
Any ideas how to fix this?
Thank you.
Action: click icon (three vertical dots, top right, center)
One round of operations:
ACT: tap (993,599)
A single request was made:
Network Error:
<Response [429]>
(the same 429 error repeats for every subsequent request)
First, I know that HTTP 429 means too many requests in a short period.
But I want to know: am I the only one encountering this error, and how can I deal with it?
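Since HTTP 429 just means the server is rate-limiting the client, the standard remedy is to slow down and retry with exponential backoff. A minimal stdlib sketch of that pattern (`RateLimitError` and `call_with_backoff` are hypothetical names, not part of this repo; the real request function would raise when the API answers 429):

```python
import time
import random

class RateLimitError(Exception):
    """Raised when the server responds with HTTP 429."""

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry `send_request` with exponential backoff on rate-limit errors.

    `send_request` is any zero-argument callable that raises
    RateLimitError when the API answers 429.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: 1s, 2s, 4s, ... plus noise.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```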
Amazing! Thanks for sharing. My phone really can operate automatically. But when will Chinese scenarios be supported? Text recognition in Chinese apps is not accurate.
2024-06-07 11:28:54,386 - modelscope - INFO - loading model done
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:286 in <module> │
│ │
│ 283 │ iter += 1 │
│ 284 │ if iter == 1: │
│ 285 │ │ screenshot_file = "./screenshot/screenshot.jpg" │
│ ❱ 286 │ │ perception_infos, width, height = get_perception_infos(adb_path, screenshot_file │
│ 287 │ │ shutil.rmtree(temp_file) │
│ 288 │ │ os.mkdir(temp_file) │
│ 289 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\run.py:175 in get_perception_infos │
│ │
│ 172 │
│ 173 │
│ 174 def get_perception_infos(adb_path, screenshot_file): │
│ ❱ 175 │ get_screenshot(adb_path) │
│ 176 │ │
│ 177 │ width, height = Image.open(screenshot_file).size │
│ 178 │
│ │
│ D:\github-app\MobileAgent\Mobile-Agent-v2\MobileAgent\controller.py:49 in get_screenshot │
│ │
│ 46 │ subprocess.run(command, capture_output=True, text=True, shell=True) │
│ 47 │ image_path = "./screenshot/screenshot.png" │
│ 48 │ save_path = "./screenshot/screenshot.jpg" │
│ ❱ 49 │ image = Image.open(image_path) │
│ 50 │ image.convert("RGB").save(save_path, "JPEG") │
│ 51 │ os.remove(image_path) │
│ 52 │
│ │
│ C:\Users\lxc\AppData\Local\Programs\Python\Python310\lib\site-packages\PIL\Image.py:3277 in open │
│ │
│ 3274 │ │ filename = os.path.realpath(os.fspath(fp)) │
│ 3275 │ │
│ 3276 │ if filename: │
│ ❱ 3277 │ │ fp = builtins.open(filename, "rb") │
│ 3278 │ │ exclusive_fp = True │
│ 3279 │ │
│ 3280 │ try: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
FileNotFoundError: [Errno 2] No such file or directory:
'D:\github-app\MobileAgent\Mobile-Agent-v2\screenshot\screenshot.png'
I looked at the code, and it does indeed try to find this file, but I'm not sure what this behavior is for. My guess is that before GPT-4o can do its work, the project needs to obtain a screenshot of the phone screen, and only then can it continue. Is that right?
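Yes, that is the usual flow: the agent captures the current screen over ADB first, so the vision models have something to perceive, and only then does the loop continue. A hedged sketch of how such a capture step can be assembled (this only builds the command strings; the repo's actual get_screenshot may differ in details, though `screencap -p` followed by `pull` is standard adb usage):

```python
import os

def build_screenshot_commands(adb_path, local_dir="./screenshot"):
    """Return the two shell commands that capture and fetch a screenshot.

    `screencap -p` writes a PNG on the device; `pull` copies it to the host,
    where the pipeline later converts it to JPEG.
    """
    remote = "/sdcard/screenshot.png"
    local = os.path.join(local_dir, "screenshot.png")
    capture = f"{adb_path} shell screencap -p {remote}"
    pull = f"{adb_path} pull {remote} {local}"
    return capture, pull
```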
How do I invoke MobileAgent-v3?
Here is how v2 is invoked:
https://github.com/modelscope/modelscope-agent/blob/master/apps/mobile_agent/run.py
Is there demo code like the run.py above for v3?
Looking forward to the open-sourcing of MobileEval 😀
As the title says: with both Python 3.9.2 and 3.7.15, installing the dependencies fails with errors from different components.
Hi, I am wondering whether it should be text = re.search(r"\((.*?)\)", action).group(1) instead of response?
MobileAgent/Mobile-Agent/run.py
Line 179 in e93345e
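For what it's worth, the suggested expression does extract the text inside the first pair of parentheses, which matches action strings like the tap commands seen above. A quick self-contained check:

```python
import re

def extract_parameter(action):
    """Pull the content of the first (...) group out of an action string."""
    match = re.search(r"\((.*?)\)", action)
    return match.group(1) if match else None

# extract_parameter("tap (993,599)") yields "993,599";
# an action without parentheses yields None.
```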
Testing some features with MobileAgent:
gpt-4o
python 3.10
openai 1.30
base_url: https://api.nextapi.fun/v1/chat/completions
After running it, the returned result is empty; the request itself merely succeeds:
res = requests.post(api_url, headers=headers, json=data)
print('================= res-start: =============')
print(res.status_code)
if res.status_code == 200:
print('请求成功:', res.text)
print('content:', res.content)
print('================= res-end: ==============')
res = res.json()['choices'][0]['message']['content']
================= res-start: =============
<Response [200]>
200
请求成功: json{}
content: b'json\n{}\n'
================= res-end: ==============
Steps
Result
Valid JSON content should be returned, but an empty JSON object is returned instead.
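Until the relay endpoint is fixed, the failure mode can at least be surfaced clearly: validate the payload before indexing into choices, so a bare {} (or the literal json{} text seen above) raises a descriptive error instead of propagating empty content. A stdlib sketch (the function name is illustrative):

```python
import json

def extract_chat_content(raw_text):
    """Parse an OpenAI-style chat response and return the message content.

    Raises ValueError with the raw body when the payload is not the
    expected {"choices": [{"message": {"content": ...}}]} shape.
    """
    # Some relays prefix the body with a stray "json" marker; strip it.
    cleaned = raw_text.strip()
    if cleaned.startswith("json"):
        cleaned = cleaned[len("json"):].strip()
    try:
        payload = json.loads(cleaned)
    except json.JSONDecodeError:
        raise ValueError(f"Response is not valid JSON: {raw_text!r}")
    choices = payload.get("choices")
    if not choices:
        raise ValueError(f"Response has no choices: {raw_text!r}")
    return choices[0]["message"]["content"]
```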
Could you add support for running on an emulator?
It is too expensive to train a reinforcement learning model to do this. Can I use an agent to achieve the same effect?
Could you take a look at the cause of the error below? Python version 3.9.13, OS: Windows 10.
Traceback (most recent call last):
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 286, in <module>
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\run.py", line 190, in get_perception_infos
coordinates = det(screenshot_file, "icon", groundingdino_model)
File "D:\Project\script\MobileAgent-main\Mobile-Agent-v2\MobileAgent\icon_localization.py", line 45, in det
result = groundingdino_model(inputs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 220, in __call__
output = self._process_single(input, *args, **kwargs)
File "D:\Env\Python\lib\site-packages\modelscope\pipelines\base.py", line 255, in _process_single
out = self.forward(out, **forward_params)
File "C:\Users\xiaomi\.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 35, in forward
return self.model(inputs,**forward_params)
File "D:\Env\Python\lib\site-packages\modelscope\models\base\base_torch_model.py", line 36, in __call__
return self.postprocess(self.forward(*args, **kwargs))
File "C:\Users\xiaomi\.cache\modelscope\modelscope_modules\GroundingDINO\ms_wrapper.py", line 66, in forward
annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
File "C:\Users\xiaomi\.cache\modelscope\modelscope_modules\GroundingDINO\groundingdino\util\inference.py", line 97, in annotate
annotated_frame = box_annotator.annotate(
File "D:\Env\Python\lib\site-packages\supervision\utils\conversion.py", line 23, in wrapper
return annotate_func(self, scene, *args, **kwargs)
TypeError: annotate() got an unexpected keyword argument 'labels'
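This error usually indicates a supervision release whose annotate method no longer accepts a labels keyword. Pinning an older supervision version is one route; another is a compatibility wrapper that retries without the offending argument, sketched here against a stand-in annotate callable (all names are illustrative, not from the repo):

```python
def annotate_compat(annotate_func, scene, detections, labels=None):
    """Call an annotate function, dropping `labels` if unsupported.

    Some annotator versions removed the `labels` keyword; falling back
    keeps the pipeline running, just without label text drawn.
    """
    try:
        return annotate_func(scene=scene, detections=detections, labels=labels)
    except TypeError:
        # Signature changed: retry without labels.
        return annotate_func(scene=scene, detections=detections)
```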
Chinese-made large models have better Chinese support. Are there plans to integrate other large models?
From the code, it looks like the planning agent and the reflection agent do not take memory as input?
Also, your demo runs very fast, but when I actually call the API, each step takes around 7-8 seconds?
Has anyone encountered this error in a Python 3.10 environment?
Interesting Work!
Traceback (most recent call last):
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 297, in <module>
perception_infos, width, height = get_perception_infos(adb_path, screenshot_file)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 230, in get_perception_infos
icon_map = generate_api(images, prompt)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 127, in generate_api
response = future.result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 438, in result
return self.__get_result()
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
raise self._exception
File "/Users/changbao.nie/.pyenv/versions/3.10.0/lib/python3.10/concurrent/futures/thread.py", line 52, in run
result = self.fn(*self.args, **self.kwargs)
File "/Users/changbao.nie/Desktop/MobileAgent/Mobile-Agent-v2/run.py", line 109, in process_image
print("process_image->text", response.text)
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 59, in __getattr__
return self[attr]
File "/Users/changbao.nie/.pyenv/versions/myenv/lib/python3.10/site-packages/dashscope/api_entities/dashscope_response.py", line 15, in __getitem__
return super().__getitem__(key)
KeyError: 'text'
Hello, I see that in v2's run.py we need to fill in the GPT-4o API_url and token. Are these two parameters required? If I use Qwen's qwen_api, can I do without GPT-4o?
You said that only Android and HarmonyOS are supported, presumably because XML files are used, but I don't see where XML files are used anywhere in the v2 code.
in_coordinate, out_coordinate = det(image, "icon", groundingdino_model)
This call unpacks two return values, but the function only returns one. Is this a bug? The code is at line 149 of Mobile-Agent/run.py:
def det(input_image_path, caption, groundingdino_model, box_threshold=0.05, text_threshold=0.5):
image = Image.open(input_image_path)
size = image.size
caption = caption.lower()
caption = caption.strip()
if not caption.endswith('.'):
caption = caption + '.'
inputs = {
'IMAGE_PATH': input_image_path,
'TEXT_PROMPT': caption,
'BOX_TRESHOLD': box_threshold,
'TEXT_TRESHOLD': text_threshold
}
result = groundingdino_model(inputs)
print(result)
boxes_filt = result['boxes']
H, W = size[1], size[0]
for i in range(boxes_filt.size(0)):
boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
boxes_filt[i][2:] += boxes_filt[i][:2]
boxes_filt = boxes_filt.cpu().int().tolist()
filtered_boxes = remove_boxes(boxes_filt, size) # [:9]
coordinates = []
for box in filtered_boxes:
coordinates.append([box[0], box[1], box[2], box[3]])
return coordinates
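For context on what the loop in the middle of det does: it converts each box from the model's normalized center format (cx, cy, w, h) into pixel corner format (x1, y1, x2, y2). The same arithmetic in plain Python, with made-up sample values:

```python
def center_to_corners(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = (box[0] * width, box[1] * height,
                    box[2] * width, box[3] * height)
    x1, y1 = cx - w / 2, cy - h / 2  # shift center to top-left corner
    x2, y2 = x1 + w, y1 + h          # add size to get bottom-right corner
    return [int(x1), int(y1), int(x2), int(y2)]

# A centered box covering 20% of a 1000x1000 screen maps to
# pixel corners [400, 400, 600, 600].
```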
As shown in AppAgent
https://github.com/mnotgod96/AppAgent/blob/main/scripts/task_executor.py#L204-L206
they use last_act in their prompt, which makes it easier to detect a dead loop so that another solution can be tried.
They also use an exploration phase to improve the task-execution phase.
Could these improve accuracy?
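The last_act idea can be approximated outside the prompt as well: keep a short window of recent actions and flag when the same one repeats, then force the agent to reconsider. A hypothetical stdlib sketch:

```python
from collections import deque

def make_loop_detector(window=3):
    """Return a checker that flags when the last `window` actions are identical."""
    history = deque(maxlen=window)

    def record(action):
        # Append the latest action; report True once the window is full
        # and contains only one distinct action (a likely dead loop).
        history.append(action)
        return len(history) == window and len(set(history)) == 1

    return record
```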
Just wondering: why not use USB debugging at that point? Also, could it ever run natively on the device?
I get the following error message:
ModuleNotFoundError: OCRDetectionPipeline: No module named 'tf_keras.legacy_tf_layers'
Any ideas how to fix this?
Thank you.
Is there any problem with the memory part? The variable insight here seems to always be an empty string. What is the purpose of this variable?
if memory_switch:
prompt_memory = get_memory_prompt(insight)
chat_action = add_response("user", prompt_memory, chat_action)
output_memory = inference_chat(chat_action, 'gpt-4o', API_url, token)
chat_action = add_response("assistant", output_memory, chat_action)
status = "#" * 50 + " Memory " + "#" * 50
print(status)
print(output_memory)
print('#' * len(status))
output_memory = output_memory.split("### Important content ###")[-1].split("\n\n")[0].strip() + "\n"
if "None" not in output_memory and output_memory not in memory:
memory += output_memory
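As the snippet shows, what actually lands in memory comes entirely from the model reply; insight only shapes the prompt. The parsing step can be isolated like this for inspection (function name and sample strings are illustrative):

```python
def parse_memory(output_memory, memory):
    """Extract the '### Important content ###' section of a model reply
    and append it to memory unless the model answered None or the
    section is already stored."""
    section = (output_memory.split("### Important content ###")[-1]
               .split("\n\n")[0].strip() + "\n")
    if "None" not in section and section not in memory:
        memory += section
    return memory
```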
Hi, thanks for kindly open-sourcing this. I found an issue when running my experiments, and I am wondering whether it is something wrong on my side or a potential corner case in the code, so I would like to discuss it here.
I found that even when my ADB Keyboard is activated, the keyboard
variable still shows as False, which affects the get_action_prompt()
function. This causes the agent to perceive that the keyboard is not activated, preventing the agent from choosing the Type action. Below is an example of the issue:
Unable to Type. You cannot use the action "Type" because the keyboard has not been activated. If you want to type, please first activate the keyboard by tapping on the input box on the screen.
I then tried to debug and found the related code:
MobileAgent/Mobile-Agent-v2/run.py
Line 284 in 35a2264
MobileAgent/Mobile-Agent-v2/run.py
Lines 290 to 296 in 35a2264
Based on this code, the reason the agent cannot type on my side is that the keyboard
variable can only be switched to True in the first iteration. However, in my case (which might differ across Android phones), the agent can only observe ADB Keyboard {ON} when it can actually input something (e.g., the search box is already focused), which is almost impossible in the first iteration. Therefore, the keyboard
variable is always False for the agent. Though this issue might be a rare case, I would greatly appreciate it if you could share some comments about it.
Many thanks :D
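For anyone debugging the same symptom, the keyboard state can also be probed directly over ADB instead of inferred from first-iteration perception: `adb shell dumpsys input_method` reports whether the soft input is shown. A hedged stdlib sketch that only parses such output (mInputShown is a commonly seen field, but names can vary across Android versions):

```python
def keyboard_active(dumpsys_output):
    """Heuristically decide whether the soft keyboard is up, based on the
    text of `adb shell dumpsys input_method` (field names vary by device)."""
    for line in dumpsys_output.splitlines():
        if "mInputShown" in line:
            # Normalize whitespace so "mInputShown = true" also matches.
            return "mInputShown=true" in line.replace(" ", "")
    return False
```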
Hello. Thank you for your great research.
I want to use mobileagentv2, and I'm wondering if the qwen API is essential for using it.
Since i'm an international, it seems impossible to obtain the qwen API.
Could you let me know if this means I cannot use this model?
I know that version 1 could be used with just the GPT API, but I want to know if version 2 cannot be used without the qwen API.
We believe the on-screen AI space will have great potential in the future. We have already identified user pain-point scenarios, and the company is profitable. We are now inviting interested engineers to join us and build a new product based on Mobile Agent. If interested, contact us directly on WeChat: douyinzhangfen69
Hi team,
this is excellent work, but I found some risky code that can break execution:
if cropped_image.mode == 'RGBA':
cropped_image = cropped_image.convert('RGB')
try:
res = requests.post(api_url, headers=headers, json=data)
res = res.json()['choices'][0]['message']['content']
except Exception as e:
print(f"Network Error: {e}")
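Agreed: catching every exception and only printing it means the caller later treats an unassigned or raw Response value as if it were parsed content. A safer shape is to validate and raise, so the agent loop fails loudly instead of continuing on bad data (function name is illustrative, not from the repo):

```python
def parse_chat_response(status_code, body):
    """Validate an OpenAI-style chat payload instead of silently printing.

    Raises RuntimeError on a non-200 status and ValueError on a malformed
    body, so failures surface at the call site with a clear message.
    """
    if status_code != 200:
        raise RuntimeError(f"API returned HTTP {status_code}")
    try:
        return body["choices"][0]["message"]["content"]
    except (KeyError, IndexError, TypeError) as e:
        raise ValueError(f"Unexpected response shape: {body!r}") from e
```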
Am I able to contribute to this repo, too? I am a software engineer from Google : https://www.linkedin.com/in/zack-z-li
How can the execution speed of the OCR, Grounding DINO, and GPT-4o models be improved, so that Mobile-Agent can move from laboratory research to engineering use?
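One practical lever is to overlap the independent perception calls (OCR, icon detection, icon captioning) rather than running them serially; the v2 code already uses a thread pool for captioning, judging by the concurrent.futures frames in the traceback above. A generic sketch with stand-in workloads:

```python
from concurrent.futures import ThreadPoolExecutor

def run_perception(tasks, max_workers=4):
    """Run independent perception callables concurrently, preserving order.

    `tasks` is a list of zero-argument callables (e.g. one OCR call and
    several icon-caption calls); results come back in submission order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in futures]
```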
As the title says.
Following the README, I got zsh: illegal hardware instruction.
Then, following another issue, I changed the TensorFlow version to tensorflow-macos==2.9.
Running again gives this error:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/utils/import_utils.py", line 451, in _get_module
return importlib.import_module('.' + module_name, self.__name__)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/modelscope/pipelines/cv/ocr_utils/ops.py", line 16, in <module>
if tf.__version__ >= '2.0':
AttributeError: module 'tensorflow' has no attribute '__version__'
https://github.com/X-PLUG/MobileAgent/blob/main/Mobile-Agent-v2/run.py#L183
I did a quick search: after the text center coordinates are drawn onto the screenshot, no other part of the pipeline seems to read that output image. Is it just for debugging?