Blog

Long-form posts on how Handsets works under the hood.

An Android UI dump for LLMs (10× fewer tokens, same actions)

When an LLM agent drives an Android device, the loop looks like:

  1. Take a screenshot (or screen description)
  2. Decide what to tap
  3. Tap it
  4. Repeat

For step 1, the canonical answer is uiautomator dump — the XML hierarchy that uiautomator2, Appium, and most Android automation tools return. On the Settings home screen of an emulator I just measured, that XML is 22.3 KB / 5,762 GPT-4 tokens.

The same screen rendered through Handsets' hs ui -i is 3.3 KB / 729 tokens — about 8× fewer tokens, and 10–13× on simpler screens. The agent's decision quality doesn't change.

Here's how we got there, and why every byte that disappeared is a byte the LLM didn't need.


Three screens, two formats

screen             uiautomator dump (XML)   hs ui -i             ratio (tokens)
Launcher home      12.0 KB / 3,153 tok      1.1 KB / 246 tok     12.8×
Settings home      22.3 KB / 5,762 tok      3.3 KB / 729 tok     7.9×
Settings → Apps    15.2 KB / 4,050 tok      0.9 KB / 320 tok     12.7×

Token counts are from tiktoken with the GPT-4 encoding; reproducer at the bottom. The ratio is bigger on screens where the layout tree is deeper than the labeled content, and smaller on screens like Settings home where almost every label is a real TextView with a real id.

A typical agent loop step now carries 250–750 tokens of UI dump instead of 3k–6k. Across a 50-step trajectory that's roughly 12k–36k tokens of UI dump rather than 160k–290k, an order of magnitude less context, which is real money once you're paying per token.
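
The ratios in the table and the per-trajectory cost can be rechecked with a few lines of arithmetic. This is only a sketch: it uses the token counts reported above, and it assumes (for illustration) that every one of the 50 steps carries a dump of the same size, which a real trajectory would not.

```python
# Token counts copied from the table above: (uiautomator XML, hs ui -i).
SCREENS = {
    "Launcher home": (3153, 246),
    "Settings home": (5762, 729),
    "Settings -> Apps": (4050, 320),
}


def token_ratio(xml_tokens: int, table_tokens: int) -> float:
    """How many times fewer tokens the flat table uses."""
    return xml_tokens / table_tokens


def trajectory_tokens(tokens_per_step: int, steps: int = 50) -> int:
    """UI-dump tokens over a whole trajectory, assuming a constant dump size."""
    return tokens_per_step * steps


for name, (xml_tok, hs_tok) in SCREENS.items():
    print(
        f"{name:18s} {token_ratio(xml_tok, hs_tok):5.1f}x  "
        f"{trajectory_tokens(xml_tok):>7,d} -> {trajectory_tokens(hs_tok):>6,d} tokens over 50 steps"
    )
```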

The XML you start with

The first ~1.2 KB of uiautomator dump from the launcher home — which covers exactly the outer three layout nodes of the tree:

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>
<hierarchy rotation="0">
  <node index="0" text="" resource-id="" class="android.widget.FrameLayout"
        package="com.google.android.apps.nexuslauncher" content-desc=""
        checkable="false" checked="false" clickable="false" enabled="true"
        focusable="false" focused="false" scrollable="false"
        long-clickable="false" password="false" selected="false"
        bounds="[0,0][1440,3120]">
    <node index="0" text="" resource-id="" class="android.widget.LinearLayout"
          package="com.google.android.apps.nexuslauncher" content-desc=""
          checkable="false" ... >
      <node index="0" text="" resource-id="android:id/content"
            class="android.widget.FrameLayout" ... >

These three nodes are 100% noise for an agent. They have no text, no content description, and no interactive flags. They exist because Android renders surfaces by nesting FrameLayout inside LinearLayout inside FrameLayout. A clickable="false" attribute is a string the LLM has to read in order to learn that nothing is happening here.

The rest of the dump is the same pattern, deeper. By the time the XML reaches an actual tappable widget — say, a TextView with text="Phone" — it has accumulated a dozen ancestors and about 600 bytes of structural padding.
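
One way to see how much of a dump is padding is to count the nodes that carry neither a label nor an action flag. The sketch below uses Python's standard ElementTree on a shortened, hypothetical sample modeled on the excerpt above (it is not a real dump), and the choice of four action flags is an assumption. It also ignores the class-based exceptions for input widgets, so treat the count as a rough estimate.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample modeled on the excerpt above (not a real dump).
SAMPLE = """<hierarchy rotation="0">
  <node text="" content-desc="" class="android.widget.FrameLayout"
        clickable="false" long-clickable="false" scrollable="false" checkable="false">
    <node text="" content-desc="" class="android.widget.LinearLayout"
          clickable="false" long-clickable="false" scrollable="false" checkable="false">
      <node text="Phone" content-desc="" class="android.widget.TextView"
            clickable="true" long-clickable="true" scrollable="false" checkable="false"/>
    </node>
  </node>
</hierarchy>"""

# Assumption: these are the flags that make a node actionable.
ACTION_FLAGS = ("clickable", "long-clickable", "scrollable", "checkable")


def is_noise(node: ET.Element) -> bool:
    """True when a node has no label and none of the action flags set."""
    has_label = bool(node.get("text") or node.get("content-desc"))
    has_action = any(node.get(flag) == "true" for flag in ACTION_FLAGS)
    return not (has_label or has_action)


def noise_report(xml_text: str) -> tuple[int, int]:
    """Return (noise node count, total node count) for a dump."""
    nodes = list(ET.fromstring(xml_text).iter("node"))
    return sum(is_noise(n) for n in nodes), len(nodes)


noise, total = noise_report(SAMPLE)
print(f"{noise} of {total} nodes carry nothing an agent can act on")
```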

The flat table you want

Here is hs ui -i for the entire launcher home screen:

@(720,383)    long        ViewPager #smartspace_card_pager       desc="At a glance"
@(279,374)    click       TextView #date                         "Fri, May 22"
@(555,2063)   click,long  TextView                               "Gmail"
@(884,2063)   click,long  TextView                               "Photos"
@(1213,2063)  click,long  TextView                               "YouTube"
@(720,1590)               View                                   desc="Home"
@(226,2546)   click,long  TextView                               "Phone"
@(555,2546)   click,long  TextView                               "Messages"
@(884,2546)   click,long  TextView                               "Chrome"
@(1213,2546)  click,long  TextView                               "YouTube"
@(720,2862)   click,long  FrameLayout #search_container_hotseat  desc="Google search"
@(218,2862)   click       ImageView #g_icon                      desc="Google app"
@(1054,2862)  click       ImageView #mic_icon                    desc="Voice search"
@(1222,2862)  click       ImageButton #lens_icon                 desc="Google Lens"

Fourteen lines, 246 tokens. Every line is a thing the agent can decide about. Every line has a coordinate to feed to tap, the action tags, and the label to match against. No closing tags, no namespace prefixes, no attributes whose value is "false".

The four columns, left to right:

  1. Center coordinates: @(x,y). What you tap. Not the bounds rectangle.
  2. Behavior tags: click, long, scroll, check, checked, password. What this widget responds to. Only the positive flags appear.
  3. Class + id: short forms. android.widget.Button collapses to Button; com.android.settings:id/title collapses to #title.
  4. Label: "text" or desc="content-description". The accessibility-curated string a human (and the LLM) actually reads.
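
If code rather than a model needs to read a row, the four columns can be picked apart with a regular expression. This is a rough sketch, not the parser Handsets uses: the grammar is inferred from the sample rows above, and the names Row, ROW_RE and parse_row are hypothetical. For programmatic use, hs ui --json is the intended route.

```python
import re
from typing import NamedTuple, Optional


class Row(NamedTuple):
    x: int
    y: int
    tags: tuple
    cls: str
    res_id: Optional[str]
    text: Optional[str]
    desc: Optional[str]


# Grammar inferred from the sample rows; the real format may have cases this misses.
ROW_RE = re.compile(
    r'@\((\d+),(\d+)\)\s+'               # center coordinates
    r'(?:([a-z,]+)\s+)?'                 # optional behavior tags, comma-separated
    r'([A-Z]\w*)'                        # short class name
    r'(?:\s+#(\w+))?'                    # optional short id
    r'(?:\s+(?:desc="(.*)"|"(.*)"))?'    # optional label: desc="..." or "text"
    r'\s*'
)


def parse_row(line: str) -> Row:
    """Split one table row into its columns; raises ValueError if it does not fit."""
    m = ROW_RE.fullmatch(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    x, y, tags, cls, res_id, desc, text = m.groups()
    tag_tuple = tuple(tags.split(",")) if tags else ()
    return Row(int(x), int(y), tag_tuple, cls, res_id, text, desc)


row = parse_row('@(226,2546)   click,long  TextView    "Phone"')
print(row.x, row.y, row.tags, row.text)
```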

What we threw away

Six categories of node and attribute disappeared between the XML and the table.

1. Empty layout containers. A FrameLayout / LinearLayout / ConstraintLayout with no text, no content-description, and no clickable/scrollable flag is a structural artifact of Android's renderer. Children carry the labels; the parent's onClick (if any) bubbles up when you tap a child's coords. We drop the entire chain of empty layout ancestors and keep the labeled descendants.

2. Attributes whose value is the default. checkable="false", enabled="true", focused="false". XML serialises every attribute regardless of value, so a <node> with one interesting property still carries 14 strings of negation. We only emit positive flags, and only the ones that change behavior.

3. Bounds rectangles. bounds="[180,810][900,910]" is four numbers. The agent never needs the rectangle — it taps a point. We compute the center once and store two numbers.

4. Class fully-qualified names. android.widget.Button → Button. The package prefix is informational; the leaf is the type.

5. Long id paths. com.android.settings:id/dashboard_tile_pref_title → #dashboard_tile_pref_title. The package and namespace separator are always the same, so they're free to drop.

6. Decorative View nodes with no labels. A bare <node class="View" /> with no text, no desc, no flags appears in many Material layouts as a divider or background. They don't accept input and they don't have anything to read.

The rule across all six: drop fields the LLM cannot act on, and drop default values that the format makes you write anyway.
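
Several of these reductions are mechanical string transforms. The following is a minimal sketch of how they could look, not the code in ui_dump.rs: the function names are hypothetical, the attribute-to-tag mapping is inferred from the tag names listed earlier, and integer floor division for the center is an assumption.

```python
import re

BOUNDS_RE = re.compile(r"\[(\d+),(\d+)\]\[(\d+),(\d+)\]")

# Assumed mapping from XML boolean attributes to the tag names listed earlier.
FLAG_TAGS = {
    "clickable": "click",
    "long-clickable": "long",
    "scrollable": "scroll",
    "checkable": "check",
    "checked": "checked",
    "password": "password",
}


def center(bounds: str) -> tuple[int, int]:
    """Turn a bounds string like "[180,810][900,910]" into its center point."""
    m = BOUNDS_RE.fullmatch(bounds)
    if m is None:
        raise ValueError(f"bad bounds: {bounds!r}")
    left, top, right, bottom = map(int, m.groups())
    return ((left + right) // 2, (top + bottom) // 2)


def short_class(fqn: str) -> str:
    """Keep only the leaf of a fully-qualified class name."""
    return fqn.rsplit(".", 1)[-1]


def short_id(resource_id: str) -> str:
    """Collapse "pkg:id/name" to "#name"; an empty id stays empty."""
    return "#" + resource_id.rsplit("/", 1)[-1] if resource_id else ""


def positive_tags(attrs: dict) -> list[str]:
    """Emit a tag only for flags that are set; default "false" values vanish."""
    return [tag for attr, tag in FLAG_TAGS.items() if attrs.get(attr) == "true"]


print(center("[180,810][900,910]"), short_class("android.widget.Button"),
      short_id("com.android.settings:id/title"))
```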

What we kept that surprises people

Inherent input widgets, even when empty. An EditText with text="" and desc="" carries no information for a reader, but for an agent it's a target — "this is where I'd type my email." So EditText, Button, Switch, CheckBox, Spinner, SeekBar, WebView, and friends always show up, labeled or not. The filter (from ui_dump.rs) is:

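// Hypothetical minimal node type so the excerpt compiles on its own;
// the real struct in ui_dump.rs is not shown in this post.
struct Node<'a> { text: &'a str, desc: &'a str, class_short: &'a str }

fn is_interactive(node: &Node<'_>) -> bool {
    // Any labeled node is kept.
    if !node.text.is_empty() || !node.desc.is_empty() { return true; }
    // Inherent input widgets are kept even when unlabeled.
    matches!(node.class_short,
        "EditText" | "Button" | "ImageButton" | "Switch" | "CheckBox"
        | "RadioButton" | "ToggleButton" | "Spinner" | "SeekBar"
        | "RatingBar" | "WebView"
        | "AutoCompleteTextView" | "MultiAutoCompleteTextView"
        | "DatePicker" | "TimePicker" | "NumberPicker")
}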

If you forget about empty inputs, the agent can dump the screen and report "no email field here" when in fact the field is sitting there waiting to be focused. We learned that the hard way.

Why a table, not JSON

Both are valid serialisations of the same data. JSON has the advantage of structure that parsers expect; a table carries the same values without paying JSON's structural overhead on every row. For the same Settings home screen:

variant             bytes     tokens
hs ui --json        20,777    5,353
hs ui -i (table)    3,339     729

A 7× difference in tokens, almost entirely from JSON's per-row repetition: { "coords": [...], "tags": [...], "class": "...", "id": "...", "text": "..." }. Tokenizers handle column-aligned text very efficiently: the column header is implicit, so the per-row cost is just the values.

For tool output that an LLM reads one screen at a time — i.e. you don't need a parser, you need the model to understand it — tables beat JSON. (For programmatic consumption by other code, hs ui --json is right there.)
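
The per-row overhead is easy to see on a single row. The sketch below serialises one launcher row both ways and compares byte lengths. It counts bytes, not tokens, so it only illustrates the direction of the effect. The JSON keys follow the shape quoted above, but the actual hs ui --json output may differ, and to_table_line is a hypothetical helper.

```python
import json

# One row from the launcher table, as structured data (key names assumed).
row = {
    "coords": [226, 2546],
    "tags": ["click", "long"],
    "class": "TextView",
    "id": "",
    "text": "Phone",
}


def to_table_line(r: dict) -> str:
    """Render a row in the flat style: values only, no key names."""
    x, y = r["coords"]
    tags = ",".join(r["tags"])
    res_id = f" #{r['id']}" if r["id"] else ""
    return f'@({x},{y})  {tags}  {r["class"]}{res_id}  "{r["text"]}"'


as_json = json.dumps(row)
as_table = to_table_line(row)

print(f"json:  {len(as_json.encode('utf-8'))} bytes")
print(f"table: {len(as_table.encode('utf-8'))} bytes")
```

Every JSON row repeats the five key names and their punctuation; the table row pays only for the values.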

Generalising the lesson

If you build any tool that feeds an LLM, the playbook is roughly:

  1. Drop fields the model can't act on. Anything the LLM reads but never references in its reply is pure tax. Grep the model's actual responses for which keys it cites; the rest is removable.
  2. Drop default values. A serialisation that emits enabled="true" on every node is paying for a non-decision on every node.
  3. Pre-compute the thing the model would compute anyway. Center coords instead of bounds rectangles. Short names instead of FQNs. Behavior tags instead of ten boolean attributes.
  4. Prefer tabular to nested when the data is regular. Structural overhead of JSON / XML compresses badly through a tokenizer.
  5. Keep labels first-class. A user-facing string ("Continue", "Sign in") is what the model picks. Make it the cheapest column to read.

Most of this isn't Android-specific. It applies to any "give the LLM the state of the world" tool — search result lists, database rows, filesystem trees. The savings compound: at 10× fewer tokens, you fit 10× more screens of trajectory into the context, or pay 10× less per loop step.
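
The first item of the playbook, checking which fields the model actually cites, can be approximated with a substring search over logged replies. This is a deliberately crude sketch: cited_fields is a hypothetical helper, the records and responses are made up for illustration, and plain substring matching will over-count short or common values.

```python
def cited_fields(records: list[dict], responses: list[str]) -> dict[str, int]:
    """Per field, count the responses that quote at least one value of that field."""
    fields = {field for rec in records for field in rec}
    counts = {}
    for field in fields:
        # Collect the non-empty values this field takes across all records.
        values = {str(rec[field]) for rec in records if rec.get(field) not in ("", None)}
        counts[field] = sum(any(v in resp for v in values) for resp in responses)
    return counts


# Made-up example data, for illustration only.
records = [
    {"text": "Sign in", "center": "(540,860)", "enabled": "true", "package": "com.example.app"},
]
responses = ['tap (540,860) "Sign in"', "tap (540,860)"]

for field, n in sorted(cited_fields(records, responses).items()):
    print(f"{field:8s} cited in {n} of {len(responses)} responses")
```

Fields that are never cited across a representative set of replies are the candidates for removal.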

Caveats

  • hs ui -i is for action selection, not forensic UI debugging. If you need every node — including layout containers, because you're diagnosing why a button isn't tappable — use hs ui --xml or hs ui --xml --all.
  • The filter is conservative: a non-clickable TextView with text still shows up, because labels next to the actual button are how the model often refers to it ("the field under 'Email'"). Dropping label-only nodes shrinks the dump further but breaks selectors like hs find 'EditText:below(TextView[text=Email])'.
  • Filing a bug report against an LLM agent loop usually requires the full XML at the moment of failure. hs ui --xml exists for exactly that.
  • uiautomator dump and the Handsets daemon both hold the system's UiAutomation connection exclusively, so to capture both formats of the same screen you have to hs drop, run uiautomator dump, then hs use again. This is a real annoyance, not a benchmark gotcha.

Reproducing

# Set up
curl -fsSL https://raw.githubusercontent.com/elliotgao2/handsets/main/install.sh | bash
hs use

# Canonical XML — the hs daemon must release UiAutomation first
hs drop
adb shell uiautomator dump /sdcard/u.xml && adb pull /sdcard/u.xml /tmp/u.xml

# hs format
hs use
hs ui -i > /tmp/hs.txt

# Bytes
wc -c /tmp/u.xml /tmp/hs.txt

# Tokens (pip install tiktoken)
python3 - <<'PY'
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
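for p in ("/tmp/u.xml", "/tmp/hs.txt"):
    with open(p, encoding="utf-8") as f:
        t = f.read()
    # Count encoded bytes, not characters, so the figure matches wc -c.
    n_bytes = len(t.encode("utf-8"))
    print(f"{p}: {n_bytes:>8d} bytes  {len(enc.encode(t)):>6d} tokens")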
PY

If your screen produces a materially different ratio, I'd be curious to see it. An app with a deep custom layout tree should produce a bigger ratio, not smaller; if it's smaller, the filter probably has a hole.


Handsets is a CLI for driving Android devices, built for LLM agents and shell scripts. hs ui -i is one verb; the rest are designed under the same constraint. MIT.

Why adb screencap is slow

adb shell screencap -p (or its binary-safe sibling adb exec-out screencap -p, which the measurements below use) is the default way to grab an Android screen from a host machine. On a stock Android 15 emulator at 1440×3120, it takes a median of 2.12 seconds per call, with individual calls ranging from 660 ms to over 3 seconds.

That's fine for a debugging screenshot. It's a disaster if you're a UI automation loop or an LLM agent that takes a screenshot between every step.

The same screen, captured through Handsets at the agent-loop defaults (hs do 'screenshot', JPEG q80, 768 px long edge), takes 12 ms, about 170× faster. But "170× faster" is a fragile claim if the comparison isn't honest. Here's where the time actually goes, and what Handsets does about it.


The numbers

All measurements: Android 15 emulator (sdk_gphone64_arm64, 1440×3120, SDK 35). Each variant warmed (3 calls) then sampled 20 times. Wall-clock end-to-end from the host's perspective.

variant                                    median (ms)   p10 (ms)   p90 (ms)   output
adb exec-out screencap -p > x.png          2122          661        3007       1973 KB
adb exec-out screencap > x.raw             675           664        683        17.5 MB
hs see x.png (PNG, full res)               584           569        594        1978 KB
hs see x.jpg (JPEG q80, full)              24            23         25         138 KB
hs do 'screenshot' (JPEG q80, 768 long)    12            11         14         22 KB

The first row is the apples-to-apples adb baseline most people compare against. The last row is what an agent loop actually uses. The middle rows let us isolate where each saving comes from.

A few observations even before any explanation:

  • screencap -p has a huge variance: about 4.5× between p10 (661 ms) and p90 (3007 ms).
  • screencap without -p (raw RGBA) is consistent at ~675 ms even though it ships 17.5 MB instead of 2 MB. Whatever causes the variance is not transport bandwidth.
  • All the hs paths are tight: p10 to p90 spans 25 ms on the PNG path and 2–3 ms on the JPEG paths. Whatever's happening in the warm daemon is deterministic.
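
For anyone collecting their own samples, the median / p10 / p90 columns can be produced with the standard library. This is a sketch under assumptions: summarize is a hypothetical helper, and statistics.quantiles uses its default "exclusive" interpolation, which may not match how the table above was computed.

```python
import statistics


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Median, p10, p90 and the p90/p10 spread of wall-clock samples in milliseconds."""
    deciles = statistics.quantiles(samples_ms, n=10)  # 9 cut points: p10 ... p90
    p10, p90 = deciles[0], deciles[-1]
    return {
        "median": statistics.median(samples_ms),
        "p10": p10,
        "p90": p90,
        "spread": p90 / p10,
    }


# Spread for the first table row, straight from its p10 and p90 columns.
print(f"screencap -p spread: {3007 / 661:.2f}x")

# Made-up samples, only to show the call shape.
stats = summarize([float(v) for v in range(1, 21)])
print(stats)
```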

Where does adb screencap -p spend its time?

To break it apart, run screencap on the device only — capturing to a file rather than piping to host stdout — so we can see what's actually happening down there.

phase                                                 median
screencap /sdcard/x.raw (capture + raw write)         75 ms
screencap -p /sdcard/x.png (capture + PNG encode)     624 ms
adb pull /sdcard/x.raw (17.5 MB transport)            60 ms
adb pull /sdcard/x.png (2 MB transport)               25 ms

An on-device capture takes about 75 ms — that's the cost of SurfaceFlinger producing a frame. Adding -p adds ~549 ms of pure PNG encoding on the device's CPU. Transport is in the noise.

In other words, almost 90% of adb screencap -p's on-device time is PNG encoding — single-threaded zlib-style compression on the slowest CPU in the system, for an image you're probably going to delete in five seconds.
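
The split follows directly from the two on-device medians. A small sketch of the arithmetic, which attributes the whole difference between the two captures to PNG encoding; that is an approximation, since the raw path also writes a much larger file.

```python
# Median on-device timings from the table above, in milliseconds.
CAPTURE_RAW_MS = 75    # screencap /sdcard/x.raw
CAPTURE_PNG_MS = 624   # screencap -p /sdcard/x.png

# Everything the -p flag adds is treated as PNG encoding time.
png_encode_ms = CAPTURE_PNG_MS - CAPTURE_RAW_MS
png_share = png_encode_ms / CAPTURE_PNG_MS

print(f"PNG encode: {png_encode_ms} ms ({png_share:.0%} of on-device time)")
```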

That accounts for the 660-ms best case. The other ~1500 ms of variance in adb exec-out screencap -p shows up on top of that, and is a combination of:

  1. screencap being spawned cold each call (fresh process, fresh SurfaceFlinger client connection).
  2. adb exec-out piping the PNG in chunks through several buffering layers (the on-device adbd, the USB or TCP transport, the host adb server, your shell's stdout).
  3. The emulator's VM doing whatever else it does.

On a physical device the variance is smaller but the PNG floor doesn't move.

Three reasons it's slow

1. PNG encoding is wildly mismatched to the use case. PNG is a great archive format. It's a terrible "snapshot a frame for an LLM" format. JPEG q80 of the same image is ~14× smaller and, through the same warm mirror, about 24× faster end to end (24 ms vs 584 ms in the table above). Skia's JPEG path uses libjpeg-turbo with NEON SIMD; the PNG path is software zlib with no comparable acceleration. No agent loop cares about lossless screenshots; they care about "what's on the screen right now."

2. screencap is a fresh process every call. The binary starts, opens the SurfaceFlinger client, captures, encodes, exits. For one screenshot that's fine. For an agent doing 20 screenshots a minute, you're paying process startup and SurfaceFlinger handshake every single time.

3. adb exec-out pipes through several layers that don't love 2 MB. The PNG comes out of screencap in one fwrite, gets chunked by adbd into TCP/USB frames, goes through your host adb server, and arrives at your shell as stdout. Each hop has buffering that under load doesn't combine well. This is the variance source: best case 660 ms, worst 3 s.

What Handsets does differently

Three things, each of which fixes one of the problems above.

1. A warm VirtualDisplay mirror in a long-running daemon

hs use spawns a small JVM process on the device under app_process (shell UID, hidden-API restrictions lifted). The daemon creates a VirtualDisplay that mirrors the default display into an ImageReader at a configurable resolution, and keeps it open between commands.

When you call hs see x.jpg, the latest frame is already sitting in memory. There's no SurfaceFlinger snapshot to wait for, no screencap process to start. The daemon acquires the most recent Image from the ImageReader (already produced asynchronously by the listener thread on the previous frame), JPEG-encodes it, and ships the bytes back.

The relevant detail in the mirror code is that the listener thread does the expensive copyPixelsFromBuffer — the GPU-fence-blocking call — without holding the capture lock, then briefly takes the lock just to swap pointers. Capture threads only ever read the most recent fully-written bitmap and never wait on GPU work. A first call at a new resolution pays a one-time ~50 ms cost to create the mirror; the cache holds the four most-recent sizes.
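
The locking pattern described above (do the slow copy outside the lock, hold the lock only to swap a reference) can be sketched in a few lines. This is a Python stand-in for illustration, not the daemon's Java code: LatestFrame, publish and capture are hypothetical names, and copying a bytearray stands in for copyPixelsFromBuffer.

```python
import threading


class LatestFrame:
    """Holds the most recent fully written frame; readers never wait on the copy."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None

    def publish(self, source_buffer: bytearray) -> None:
        # Expensive step (stand-in for copyPixelsFromBuffer) runs WITHOUT the lock.
        frame = bytes(source_buffer)
        # The lock is held only for the reference swap.
        with self._lock:
            self._latest = frame

    def capture(self):
        # Readers hold the lock just long enough to read the reference.
        with self._lock:
            return self._latest


mirror = LatestFrame()
listener = threading.Thread(target=mirror.publish, args=(bytearray(b"frame-1"),))
listener.start()
listener.join()
print(mirror.capture())
```

Because a reader only ever sees a reference to a finished copy, a slow frame delays the next publish but never a capture.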

2. JPEG is the default; PNG is opt-in

hs see x.jpg is JPEG. hs see x.png is PNG. The file extension picks the format. Agents get JPEG by default because we know what they're going to do with it: ship it to a model. Debugging screenshots can ask for PNG.

This single change accounts for most of the win: 24 ms (hs see x.jpg) vs 584 ms (hs see x.png). Both go through the same warm mirror at full resolution; the only difference is the encoder.

3. Default to 768-long-edge for the agent loop

Most LLM agents don't need 1440×3120 pixels. They need "enough to see the screen." The raw wire command screenshot (without max=1) defaults to a 768-long-edge JPEG, which is 22 KB on disk and 12 ms end-to-end.

The downscale happens inside the mirror itself: the VirtualDisplay is created at the output resolution, so we're not allocating a 1440×3120 bitmap just to throw about 94% of its pixels away. Bigger sizes have their own warm mirror cached separately (hs see x.jpg triggers the full-res mirror), so you can mix and match without paying for the largest one every time.
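
To get a feel for how much smaller the default frame is, here is a sketch of the long-edge arithmetic. fit_long_edge is a hypothetical helper; rounding to the nearest pixel and never upscaling are assumptions, and the daemon may round differently.

```python
def fit_long_edge(width: int, height: int, long_edge: int = 768) -> tuple[int, int]:
    """Scale (width, height) so the longer side equals long_edge, keeping aspect ratio."""
    scale = long_edge / max(width, height)
    if scale >= 1:
        return (width, height)  # assumption: never upscale
    return (max(1, round(width * scale)), max(1, round(height * scale)))


full_w, full_h = 1440, 3120
w, h = fit_long_edge(full_w, full_h)
kept = (w * h) / (full_w * full_h)

print(f"{full_w}x{full_h} -> {w}x{h}, {kept:.1%} of the original pixels")
```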

Layer-by-layer wins

Adding up the changes:

change                                   saves
Warm VirtualDisplay vs cold screencap    ~75 ms per call (skips the capture)
JPEG q80 vs PNG                          ~550 ms per call (encode dominates)
TCP forward vs exec-out pipe             ~1500 ms when adb is in a bad mood (variance)
768-long-edge default                    another ~10 ms (smaller encode + smaller transport)

The first three together get you to hs see x.jpg's 24 ms. The last shaves the agent default to 12 ms.

When this matters (and when it doesn't)

Matters if you're:

  • An LLM agent that screenshots after every action (typical loop: act, screenshot, dump UI, decide, repeat).
  • A test framework that wants to record a frame every X ms.
  • A monitoring system polling for visual changes.
  • Anything where screenshot latency is on the user-perceived path.

Probably doesn't matter if you're:

  • Taking one screenshot a day for a bug report.
  • Recording a video — hs see (the bare GUI viewer) uses MediaCodec H.264 streaming via H264Streamer.java, a separate path.
  • Working over a slow remote adb tcpip link where the wire, not the encode, is the bottleneck.

For the agent case specifically: a 12-ms screenshot lets you treat screenshots as free relative to the rest of the loop (UI dump is ~150 ms, an LLM round-trip is seconds). Two-second screenshots make screenshotting the dominant cost, and you start skipping them — at which point your agent gets flakier.

Caveats

  • These numbers are from an emulator. Physical devices have lower variance on the adb exec-out path (typically 600–900 ms instead of 2–3 s) and faster on-device JPEG encoding. The relative ordering doesn't change.
  • The first call after a resolution change pays a one-time ~50 ms cost to create the new VirtualDisplay mirror.
  • If a foreground window has FLAG_SECURE, both adb and Handsets produce an all-black frame. Handsets detects this and returns a named error pointing at the offending window instead of silently handing you a black image.
  • The daemon runs as the shell UID via app_process, with hidden-API restrictions lifted. That's necessary because createVirtualDisplay rejects the system Context's op-package on Android 14+; we have to forge a com.android.shell package context. The init comments in Screenshot.java explain the three-tier strategy.

Reproducing

# Set up
curl -fsSL https://raw.githubusercontent.com/elliotgao2/handsets/main/install.sh | bash
hs use

# Headline benchmark
python3 - <<'PY'
import statistics, subprocess, time
def t(cmd):
    for _ in range(3): subprocess.call(cmd, shell=True, stdout=subprocess.DEVNULL)
    s = []
    for _ in range(20):
        t0 = time.perf_counter_ns()
        subprocess.call(cmd, shell=True, stdout=subprocess.DEVNULL)
        s.append((time.perf_counter_ns() - t0) / 1e6)
    s.sort()
    return statistics.median(s)
print(f"adb -p:     {t('adb exec-out screencap -p > /tmp/a.png'):.1f} ms")
print(f"hs jpg:     {t('hs see /tmp/h.jpg'):.1f} ms")
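# Kept out of the f-string: backslashes inside f-string expressions fail before Python 3.12.
default_cmd = "hs do 'screenshot' > /tmp/h2.jpg"
print(f"hs default: {t(default_cmd):.1f} ms")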
PY

If you reproduce something materially different, open an issue with your device model and Android version — the numbers above are honest but I'm curious where they hold and where they don't.

