Open-AutoGLM Phone Agent

Skill by

ara.so

— Daily 2026 Skills collection.

Open-AutoGLM is an open-source AI phone agent framework that enables natural language control of Android, HarmonyOS NEXT, and iOS devices. It uses the AutoGLM vision-language model (9B parameters) to perceive screen content and execute multi-step tasks like "open Meituan and search for nearby hot pot restaurants."

Architecture Overview

User Natural Language → AutoGLM VLM → Screen Perception → ADB/HDC/WebDriverAgent → Device Actions

Model

AutoGLM-Phone-9B (Chinese-optimized) or AutoGLM-Phone-9B-Multilingual

Device control

ADB (Android), HDC (HarmonyOS NEXT), WebDriverAgent (iOS)

Model serving

vLLM or SGLang (self-hosted) or BigModel/ModelScope API
Input: Screenshot + task description → Output: structured action commands Installation Prerequisites Python 3.10+ ADB installed and in PATH (Android) or HDC (HarmonyOS) or WebDriverAgent (iOS) Android device with Developer Mode + USB Debugging enabled ADB Keyboard APK installed on Android device (for text input) Install the framework git clone https://github.com/zai-org/Open-AutoGLM.git cd Open-AutoGLM pip install -r requirements.txt pip install -e . Verify ADB connection

Android

adb devices

Expected: emulator-5554 device

HarmonyOS NEXT

hdc list targets

Expected: 7001005458323933328a01bce01c2500

Model Deployment Options Option A: Third-party API (Recommended for quick start) BigModel (ZhipuAI) export BIGMODEL_API_KEY = "your-bigmodel-api-key" python main.py \ --base-url https://open.bigmodel.cn/api/paas/v4 \ --model "autoglm-phone" \ --apikey $BIGMODEL_API_KEY \ "打开美团搜索附近的火锅店" ModelScope export MODELSCOPE_API_KEY = "your-modelscope-api-key" python main.py \ --base-url https://api-inference.modelscope.cn/v1 \ --model "ZhipuAI/AutoGLM-Phone-9B" \ --apikey $MODELSCOPE_API_KEY \ "open Meituan and find nearby hotpot" Option B: Self-hosted with vLLM

Install vLLM (or use official Docker: docker pull vllm/vllm-openai:v0.12.0)

pip install vllm

Start model server (strictly follow these parameters)

python3 -m vllm.entrypoints.openai.api_server \ --served-model-name autoglm-phone-9b \ --allowed-local-media-path / \ --mm-encoder-tp-mode data \ --mm_processor_cache_type shm \ --mm_processor_kwargs '{"max_pixels":5000000}' \ --max-model-len 25480 \ --chat-template-content-format string \ --limit-mm-per-prompt '{"image":10}' \ --model zai-org/AutoGLM-Phone-9B \ --port 8000 Option C: Self-hosted with SGLang

Install SGLang or use: docker pull lmsysorg/sglang:v0.5.6.post1

Inside container: pip install nvidia-cudnn-cu12==9.16.0.29

python3 -m sglang.launch_server \ --model-path zai-org/AutoGLM-Phone-9B \ --served-model-name autoglm-phone-9b \ --context-length 25480 \ --mm-enable-dp-encoder \ --mm-process-config '{"image":{"max_pixels":5000000}}' \ --port 8000 Verify deployment python scripts/check_deployment_cn.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b Expected output includes a ... block followed by do(action="Launch", app="...") . If the chain-of-thought is very short or garbled, the model deployment has failed. Running the Agent Basic CLI usage

Android device (default)

python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b \ "打开小红书搜索美食"

HarmonyOS device

python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b \ --device-type hdc \ "打开设置查看WiFi"

Multilingual model for English apps

python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b-multilingual \ "Open Instagram and search for travel photos" Key CLI parameters Parameter Description Default --base-url Model service endpoint Required --model Model name on server Required --apikey API key for third-party services None --device-type adb (Android) or hdc (HarmonyOS) adb --device-id Specific device serial number Auto-detect Python API Usage Basic agent invocation from phone_agent import PhoneAgent from phone_agent . config import AgentConfig config = AgentConfig ( base_url = "http://localhost:8000/v1" , model = "autoglm-phone-9b" , device_type = "adb" ,

or "hdc" for HarmonyOS

) agent = PhoneAgent ( config )

Run a task

result

agent . run ( "打开淘宝搜索蓝牙耳机" ) print ( result ) Custom task with device selection from phone_agent import PhoneAgent from phone_agent . config import AgentConfig import os config = AgentConfig ( base_url = os . environ [ "MODEL_BASE_URL" ] , model = os . environ [ "MODEL_NAME" ] , apikey = os . environ . get ( "MODEL_API_KEY" ) , device_type = "adb" , device_id = "emulator-5554" ,

specific device

) agent = PhoneAgent ( config )

Task with sensitive operation confirmation

result

agent . run ( "在京东购买最便宜的蓝牙耳机" , confirm_sensitive = True

prompt user before purchase actions

) Direct model API call (for testing/integration) import openai import base64 import os from pathlib import Path client = openai . OpenAI ( base_url = os . environ [ "MODEL_BASE_URL" ] , api_key = os . environ . get ( "MODEL_API_KEY" , "dummy" ) , )

Load screenshot

screenshot_path

"screenshot.png" with open ( screenshot_path , "rb" ) as f : image_b64 = base64 . b64encode ( f . read ( ) ) . decode ( ) response = client . chat . completions . create ( model = "autoglm-phone-9b" , messages = [ { "role" : "user" , "content" : [ { "type" : "image_url" , "image_url" : { "url" : f"data:image/png;base64, { image_b64 } " } , } , { "type" : "text" , "text" : "Task: 搜索附近的咖啡店\nCurrent step: Navigate to search" , } , ] , } ] , ) print ( response . choices [ 0 ] . message . content )

Output format: ...\ndo(action="...", ...)

Parsing model action output import re def parse_action ( model_output : str ) -

dict : """Parse AutoGLM model output into structured action."""

Extract answer block

answer_match

re . search ( r'(.*?)(?:|$)' , model_output , re . DOTALL ) if not answer_match : return { "action" : "unknown" } answer = answer_match . group ( 1 ) . strip ( )

Parse do() call

Format: do(action="ActionName", param1="value1", param2="value2")

action_match

re . search ( r'do(action="([^"]+)"(.*?))' , answer , re . DOTALL ) if not action_match : return { "action" : "unknown" , "raw" : answer } action_name = action_match . group ( 1 ) params_str = action_match . group ( 2 )

Parse parameters

params

{ } for param_match in re . finditer ( r'(\w+)="([^"]*)"' , params_str ) : params [ param_match . group ( 1 ) ] = param_match . group ( 2 ) return { "action" : action_name , ** params }

Example usage

output

'需要启动京东\ndo(action="Launch", app="京东")' action = parse_action ( output )

{"action": "Launch", "app": "京东"}

ADB Device Control Patterns Common ADB operations used by the agent import subprocess def take_screenshot ( device_id : str = None ) -

bytes : """Capture current device screen.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "exec-out" , "screencap" , "-p" ] ) result = subprocess . run ( cmd , capture_output = True ) return result . stdout def send_tap ( x : int , y : int , device_id : str = None ) : """Tap at screen coordinates.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "tap" , str ( x ) , str ( y ) ] ) subprocess . run ( cmd ) def send_text_adb_keyboard ( text : str , device_id : str = None ) : """Send text via ADB Keyboard (must be installed and enabled).""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] )

Enable ADB keyboard first

cmd_enable

cmd + [ "shell" , "ime" , "set" , "com.android.adbkeyboard/.AdbIME" ] subprocess . run ( cmd_enable )

Send text

cmd_text

cmd + [ "shell" , "am" , "broadcast" , "-a" , "ADB_INPUT_TEXT" , "--es" , "msg" , text ] subprocess . run ( cmd_text ) def swipe ( x1 : int , y1 : int , x2 : int , y2 : int , duration_ms : int = 300 , device_id : str = None ) : """Swipe gesture on screen.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "swipe" , str ( x1 ) , str ( y1 ) , str ( x2 ) , str ( y2 ) , str ( duration_ms ) ] ) subprocess . run ( cmd ) def press_back ( device_id : str = None ) : """Press Android back button.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "input" , "keyevent" , "KEYCODE_BACK" ] ) subprocess . run ( cmd ) def launch_app ( package_name : str , device_id : str = None ) : """Launch app by package name.""" cmd = [ "adb" ] if device_id : cmd . extend ( [ "-s" , device_id ] ) cmd . extend ( [ "shell" , "monkey" , "-p" , package_name , "-c" , "android.intent.category.LAUNCHER" , "1" ] ) subprocess . run ( cmd ) Midscene.js Integration For JavaScript/TypeScript automation using AutoGLM: // .env configuration // MIDSCENE_MODEL_NAME=autoglm-phone // MIDSCENE_OPENAI_BASE_URL=https://open.bigmodel.cn/api/paas/v4 // MIDSCENE_OPENAI_API_KEY=your-api-key import { AndroidAgent } from "@midscene/android" ; const agent = new AndroidAgent ( ) ; await agent . aiAction ( "打开微信发送消息给张三" ) ; await agent . aiQuery ( "当前页面显示的消息内容是什么？" ) ; Remote ADB (WiFi Debugging)

Connect device via USB first, then enable TCP/IP mode

adb tcpip 5555

Get device IP address

adb shell ip addr show wlan0

Connect wirelessly (disconnect USB after this)

adb connect 192.168 .1.100:5555

Verify connection

adb devices

192.168.1.100:5555 device

Use with agent

python main.py \ --base-url http://model-server:8000/v1 \ --model autoglm-phone-9b \ --device-id "192.168.1.100:5555" \ "打开支付宝查看余额" Common Action Types The AutoGLM model outputs structured actions: Action Description Example Launch Open an app do(action="Launch", app="微信") Tap Tap screen element do(action="Tap", element="搜索框") Type Input text do(action="Type", text="火锅") Swipe Scroll/swipe do(action="Swipe", direction="up") Back Press back button do(action="Back") Home Go to home screen do(action="Home") Finish Task complete do(action="Finish", result="已完成搜索") Model Selection Guide Model Use Case Languages AutoGLM-Phone-9B Chinese apps (WeChat, Taobao, Meituan) Chinese-optimized AutoGLM-Phone-9B-Multilingual International apps, mixed content Chinese + English + others HuggingFace: zai-org/AutoGLM-Phone-9B / zai-org/AutoGLM-Phone-9B-Multilingual ModelScope: ZhipuAI/AutoGLM-Phone-9B / ZhipuAI/AutoGLM-Phone-9B-Multilingual Environment Variables Reference

Model service

export MODEL_BASE_URL = "http://localhost:8000/v1" export MODEL_NAME = "autoglm-phone-9b" export MODEL_API_KEY = ""

Required for BigModel/ModelScope APIs

BigModel API

export BIGMODEL_API_KEY = "" export BIGMODEL_BASE_URL = "https://open.bigmodel.cn/api/paas/v4"

ModelScope API

export MODELSCOPE_API_KEY = "" export MODELSCOPE_BASE_URL = "https://api-inference.modelscope.cn/v1"

Device configuration

export ADB_DEVICE_ID = ""

Leave empty for auto-detect

export HDC_DEVICE_ID = ""

HarmonyOS device ID

Troubleshooting

Model output is garbled or very short chain-of-thought

Cause

Incorrect vLLM/SGLang startup parameters.

Fix

Ensure

--chat-template-content-format string

(vLLM) and

--mm-process-config

with

max_pixels:5000000

are set. Check transformers version compatibility.

adb devices

shows no devices

Fix

:

Verify USB cable supports data transfer (not charge-only)

Accept "Allow USB debugging" dialog on phone

Try

adb kill-server && adb start-server

Some devices require reboot after enabling developer options

Text input not working on Android

Fix

ADB Keyboard must be installed AND enabled:

adb shell ime

enable

com.android.adbkeyboard/.AdbIME

adb shell ime

set

com.android.adbkeyboard/.AdbIME

Agent stuck in a loop

Cause

Model cannot identify a path to complete the task.

Fix

The framework includes sensitive operation confirmation — ensure

confirm_sensitive=True

for purchase/delete tasks. For login/CAPTCHA screens, the agent supports human takeover.

vLLM CUDA out of memory

Fix

AutoGLM-Phone-9B requires ~20GB VRAM. Use
--tensor-parallel-size 2
for multi-GPU, or use the API service instead.
Connection refused to model server
Fix: Check firewall rules. For remote server:

Test connectivity

curl http://YOUR_SERVER_IP:8000/v1/models

Should return model list JSON

HDC device not recognized (HarmonyOS)
Fix: HarmonyOS NEXT (not earlier versions) is required. Enable developer mode in Settings → About → Version Number (tap 10 times rapidly). iOS Setup For iPhone automation, see the dedicated setup guide:

After configuring WebDriverAgent per docs/ios_setup/ios_setup.md

python main.py \ --base-url http://localhost:8000/v1 \ --model autoglm-phone-9b-multilingual \ --device-type ios \ "Open Maps and navigate to Central Park"

安装

Android

Expected: emulator-5554 device

HarmonyOS NEXT

Expected: 7001005458323933328a01bce01c2500

Install vLLM (or use official Docker: docker pull vllm/vllm-openai:v0.12.0)

Start model server (strictly follow these parameters)

Install SGLang or use: docker pull lmsysorg/sglang:v0.5.6.post1

Inside container: pip install nvidia-cudnn-cu12==9.16.0.29

Android device (default)

HarmonyOS device

Multilingual model for English apps

or "hdc" for HarmonyOS

Run a task

result

specific device

Task with sensitive operation confirmation

result

prompt user before purchase actions

Load screenshot

screenshot_path

Output format: ...\ndo(action="...", ...)

Extract answer block

answer_match

Parse do() call

Format: do(action="ActionName", param1="value1", param2="value2")

action_match

Parse parameters

params

Example usage

output

{"action": "Launch", "app": "京东"}

Enable ADB keyboard first

cmd_enable

Send text

cmd_text

Connect device via USB first, then enable TCP/IP mode

Get device IP address

Connect wirelessly (disconnect USB after this)

Verify connection

192.168.1.100:5555 device

Use with agent

Model service

Required for BigModel/ModelScope APIs

BigModel API

ModelScope API

Device configuration

Leave empty for auto-detect

HarmonyOS device ID

Test connectivity

Should return model list JSON

After configuring WebDriverAgent per docs/ios_setup/ios_setup.md