- Open-AutoGLM Phone Agent
- Skill by
- ara.so
- — Daily 2026 Skills collection.
- Open-AutoGLM is an open-source AI phone agent framework that enables natural language control of Android, HarmonyOS NEXT, and iOS devices. It uses the AutoGLM vision-language model (9B parameters) to perceive screen content and execute multi-step tasks like "open Meituan and search for nearby hot pot restaurants."
- Architecture Overview
- User Natural Language → AutoGLM VLM → Screen Perception → ADB/HDC/WebDriverAgent → Device Actions
- Model
-
- AutoGLM-Phone-9B (Chinese-optimized) or AutoGLM-Phone-9B-Multilingual
- Device control
-
- ADB (Android), HDC (HarmonyOS NEXT), WebDriverAgent (iOS)
- Model serving
-
- vLLM or SGLang (self-hosted) or BigModel/ModelScope API
- Input
- Screenshot + task description → Output: structured action commands Installation Prerequisites Python 3.10+ ADB installed and in PATH (Android) or HDC (HarmonyOS) or WebDriverAgent (iOS) Android device with Developer Mode + USB Debugging enabled ADB Keyboard APK installed on Android device (for text input) Install the framework git clone https://github.com/zai-org/Open-AutoGLM.git cd Open-AutoGLM pip install -r requirements.txt pip install -e . Verify ADB connection
Android
adb devices
Expected: emulator-5554 device
HarmonyOS NEXT
hdc list targets
Expected: 7001005458323933328a01bce01c2500
Model Deployment Options Option A: Third-party API (Recommended for quick start) BigModel (ZhipuAI) export BIGMODEL_API_KEY = "your-bigmodel-api-key" python main.py \ --base-url https://open.bigmodel.cn/api/paas/v4 \ --model "autoglm-phone" \ --apikey $BIGMODEL_API_KEY \ "打开美团搜索附近的火锅店" ModelScope export MODELSCOPE_API_KEY = "your-modelscope-api-key" python main.py \ --base-url https://api-inference.modelscope.cn/v1 \ --model "ZhipuAI/AutoGLM-Phone-9B" \ --apikey $MODELSCOPE_API_KEY \ "open Meituan and find nearby hotpot" Option B: Self-hosted with vLLM
Install vLLM (or use official Docker: docker pull vllm/vllm-openai:v0.12.0)
pip install vllm
Start model server (strictly follow these parameters)
python3 -m vllm.entrypoints.openai.api_server \ --served-model-name autoglm-phone-9b \ --allowed-local-media-path / \ --mm-encoder-tp-mode data \ --mm_processor_cache_type shm \ --mm_processor_kwargs '{"max_pixels":5000000}' \ --max-model-len 25480 \ --chat-template-content-format string \ --limit-mm-per-prompt '{"image":10}' \ --model zai-org/AutoGLM-Phone-9B \ --port 8000 Option C: Self-hosted with SGLang
Install SGLang or use: docker pull lmsysorg/sglang:v0.5.6.post1
Inside container: pip install nvidia-cudnn-cu12==9.16.0.29
python3
-m
sglang.launch_server
\
--model-path zai-org/AutoGLM-Phone-9B
\
--served-model-name autoglm-phone-9b
\
--context-length
25480
\
--mm-enable-dp-encoder
\
--mm-process-config
'{"image":{"max_pixels":5000000}}'
\
--port
8000
Verify deployment
python scripts/check_deployment_cn.py
\
--base-url http://localhost:8000/v1
\
--model
autoglm-phone-9b
Expected output includes a
Android device (default)
python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b \ "打开小红书搜索美食"
HarmonyOS device
python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b \ --device-type hdc \ "打开设置查看WiFi"
Multilingual model for English apps
python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b-multilingual \ "Open Instagram and search for travel photos" Key CLI parameters Parameter Description Default --base-url Model service endpoint Required --model Model name on server Required --apikey API key for third-party services None --device-type adb (Android) or hdc (HarmonyOS) adb --device-id Specific device serial number Auto-detect Python API Usage Basic agent invocation from phone_agent import PhoneAgent from phone_agent . config import AgentConfig config = AgentConfig ( base_url = "http://localhost:8000/v1" , model = "autoglm-phone-9b" , device_type = "adb" ,
or "hdc" for HarmonyOS
) agent = PhoneAgent ( config )
Run a task
result
agent . run ( "打开淘宝搜索蓝牙耳机" ) print ( result ) Custom task with device selection from phone_agent import PhoneAgent from phone_agent . config import AgentConfig import os config = AgentConfig ( base_url = os . environ [ "MODEL_BASE_URL" ] , model = os . environ [ "MODEL_NAME" ] , apikey = os . environ . get ( "MODEL_API_KEY" ) , device_type = "adb" , device_id = "emulator-5554" ,
specific device
) agent = PhoneAgent ( config )
Task with sensitive operation confirmation
result
agent . run ( "在京东购买最便宜的蓝牙耳机" , confirm_sensitive = True
prompt user before purchase actions
) Direct model API call (for testing/integration) import openai import base64 import os from pathlib import Path client = openai . OpenAI ( base_url = os . environ [ "MODEL_BASE_URL" ] , api_key = os . environ . get ( "MODEL_API_KEY" , "dummy" ) , )
Load screenshot
screenshot_path
"screenshot.png" with open ( screenshot_path , "rb" ) as f : image_b64 = base64 . b64encode ( f . read ( ) ) . decode ( ) response = client . chat . completions . create ( model = "autoglm-phone-9b" , messages = [ { "role" : "user" , "content" : [ { "type" : "image_url" , "image_url" : { "url" : f"data:image/png;base64, { image_b64 } " } , } , { "type" : "text" , "text" : "Task: 搜索附近的咖啡店\nCurrent step: Navigate to search" , } , ] , } ] , ) print ( response . choices [ 0 ] . message . content )
Output format: ... \ndo(action="...", ...)
Parsing model action output import re def parse_action ( model_output : str ) -
dict : """Parse AutoGLM model output into structured action."""
Extract answer block
answer_match
re
.
search
(
r'
Parse do() call
Format: do(action="ActionName", param1="value1", param2="value2")
action_match
re . search ( r'do(action="([^"]+)"(.*?))' , answer , re . DOTALL ) if not action_match : return { "action" : "unknown" , "raw" : answer } action_name = action_match . group ( 1 ) params_str = action_match . group ( 2 )
Parse parameters
params
{ } for param_match in re . finditer ( r'(\w+)="([^"]*)"' , params_str ) : params [ param_match . group ( 1 ) ] = param_match . group ( 2 ) return { "action" : action_name , ** params }
Example usage
output
'
{"action": "Launch", "app": "京东"}
ADB Device Control Patterns Common ADB operations used by the agent import subprocess def take_screenshot ( device_id : str = None ) -
bytes : """Capture current device screen.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "exec-out" , "screencap" , "-p" ] ) result = subprocess . run ( cmd , capture_output = True ) return result . stdout def send_tap ( x : int , y : int , device_id : str = None ) : """Tap at screen coordinates.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "tap" , str ( x ) , str ( y ) ] ) subprocess . run ( cmd ) def send_text_adb_keyboard ( text : str , device_id : str = None ) : """Send text via ADB Keyboard (must be installed and enabled).""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] )
Enable ADB keyboard first
cmd_enable
cmd + [ "shell" , "ime" , "set" , "com.android.adbkeyboard/.AdbIME" ] subprocess . run ( cmd_enable )
Send text
cmd_text
cmd + [ "shell" , "am" , "broadcast" , "-a" , "ADB_INPUT_TEXT" , "--es" , "msg" , text ] subprocess . run ( cmd_text ) def swipe ( x1 : int , y1 : int , x2 : int , y2 : int , duration_ms : int = 300 , device_id : str = None ) : """Swipe gesture on screen.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "swipe" , str ( x1 ) , str ( y1 ) , str ( x2 ) , str ( y2 ) , str ( duration_ms ) ] ) subprocess . run ( cmd ) def press_back ( device_id : str = None ) : """Press Android back button.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "keyevent" , "KEYCODE_BACK" ] ) subprocess . run ( cmd ) def launch_app ( package_name : str , device_id : str = None ) : """Launch app by package name.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "monkey" , "-p" , package_name , "-c" , "android.intent.category.LAUNCHER" , "1" ] ) subprocess . run ( cmd ) Midscene.js Integration For JavaScript/TypeScript automation using AutoGLM: // .env configuration // MIDSCENE_MODEL_NAME=autoglm-phone // MIDSCENE_OPENAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4 // MIDSCENE_OPENAI_API_KEY=your-api-key import { AndroidAgent } from "@midscene/android" ; const agent = new AndroidAgent ( ) ; await agent . aiAction ( "打开微信发送消息给张三" ) ; await agent . aiQuery ( "当前页面显示的消息内容是什么?" ) ; Remote ADB (WiFi Debugging)
Connect device via USB first, then enable TCP/IP mode
adb tcpip 5555
Get device IP address
adb shell ip addr show wlan0
Connect wirelessly (disconnect USB after this)
adb connect 192.168 .1.100:5555
Verify connection
adb devices
192.168.1.100:5555 device
Use with agent
python main.py \ --base-url http://model-server:8000/v1 \ --model autoglm-phone-9b \ --device-id "192.168.1.100:5555" \ "打开支付宝查看余额" Common Action Types The AutoGLM model outputs structured actions: Action Description Example Launch Open an app do(action="Launch", app="微信") Tap Tap screen element do(action="Tap", element="搜索框") Type Input text do(action="Type", text="火锅") Swipe Scroll/swipe do(action="Swipe", direction="up") Back Press back button do(action="Back") Home Go to home screen do(action="Home") Finish Task complete do(action="Finish", result="已完成搜索") Model Selection Guide Model Use Case Languages AutoGLM-Phone-9B Chinese apps (WeChat, Taobao, Meituan) Chinese-optimized AutoGLM-Phone-9B-Multilingual International apps, mixed content Chinese + English + others HuggingFace: zai-org/AutoGLM-Phone-9B / zai-org/AutoGLM-Phone-9B-Multilingual ModelScope: ZhipuAI/AutoGLM-Phone-9B / ZhipuAI/AutoGLM-Phone-9B-Multilingual Environment Variables Reference
Model service
export MODEL_BASE_URL = "http://localhost:8000/v1" export MODEL_NAME = "autoglm-phone-9b" export MODEL_API_KEY = ""
Required for BigModel/ModelScope APIs
BigModel API
export BIGMODEL_API_KEY = "" export BIGMODEL_BASE_URL = "https://open.bigmodel.cn/api/paas/v4"
ModelScope API
export MODELSCOPE_API_KEY = "" export MODELSCOPE_BASE_URL = "https://api-inference.modelscope.cn/v1"
Device configuration
export ADB_DEVICE_ID = ""
Leave empty for auto-detect
export HDC_DEVICE_ID = ""
HarmonyOS device ID
- Troubleshooting
- Model output is garbled or very short chain-of-thought
- Cause
-
- Incorrect vLLM/SGLang startup parameters.
- Fix
-
- Ensure
- --chat-template-content-format string
- (vLLM) and
- --mm-process-config
- with
- max_pixels:5000000
- are set. Check transformers version compatibility.
- adb devices
- shows no devices
- Fix
- :
- Verify USB cable supports data transfer (not charge-only)
- Accept "Allow USB debugging" dialog on phone
- Try
- adb kill-server && adb start-server
- Some devices require reboot after enabling developer options
- Text input not working on Android
- Fix
-
- ADB Keyboard must be installed AND enabled:
- adb shell ime
- enable
- com.android.adbkeyboard/.AdbIME
- adb shell ime
- set
- com.android.adbkeyboard/.AdbIME
- Agent stuck in a loop
- Cause
-
- Model cannot identify a path to complete the task.
- Fix
-
- The framework includes sensitive operation confirmation — ensure
- confirm_sensitive=True
- for purchase/delete tasks. For login/CAPTCHA screens, the agent supports human takeover.
- vLLM CUDA out of memory
- Fix
-
- AutoGLM-Phone-9B requires ~20GB VRAM. Use
- --tensor-parallel-size 2
- for multi-GPU, or use the API service instead.
- Connection refused to model server
- Fix
- Check firewall rules. For remote server:
Test connectivity
curl http://YOUR_SERVER_IP:8000/v1/models
Should return model list JSON
- HDC device not recognized (HarmonyOS)
- Fix
- HarmonyOS NEXT (not earlier versions) is required. Enable developer mode in Settings → About → Version Number (tap 10 times rapidly). iOS Setup For iPhone automation, see the dedicated setup guide:
After configuring WebDriverAgent per docs/ios_setup/ios_setup.md
python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b-multilingual \ --device-type ios \ "Open Maps and navigate to Central Park"