Vision Framework
Detect text, faces, barcodes, objects, and body poses in images and video using
on-device computer vision. Patterns target iOS 26+ with Swift 6.2,
backward-compatible where noted.
See
references/vision-requests.md
for complete code patterns and
references/visionkit-scanner.md
for DataScannerViewController integration.
Contents
Two API Generations
Request Pattern (Modern API)
Text Recognition (OCR)
Face Detection
Barcode Detection
Document Scanning (iOS 26+)
Image Segmentation
Object Tracking
Other Request Types
Core ML Integration
VisionKit: DataScannerViewController
Common Mistakes
Review Checklist
References
Two API Generations
Vision has two distinct API layers. Prefer the modern API for new code.
Aspect
Modern (iOS 18+)
Legacy
Pattern
let result = try await request.perform(on: image)
VNImageRequestHandler
+ completion handler
Request types
Swift types — structs and classes (
RecognizeTextRequest
,
DetectFaceRectanglesRequest
)
ObjC classes (
VNRecognizeTextRequest
,
VNDetectFaceRectanglesRequest
)
Concurrency
Native async/await
Completion handlers or synchronous
perform
Observations
Typed return values
Cast
results
from
[Any]
Availability
iOS 18+ / macOS 15+
iOS 11+
The modern API uses the
ImageProcessingRequest
protocol. Each request type
has a
perform(on:orientation:)
method that accepts
CGImage
,
CIImage
,
CVPixelBuffer
,
CMSampleBuffer
,
Data
, or
URL
. Most requests are
structs; stateful requests for video tracking (e.g.,
TrackObjectRequest
,
TrackRectangleRequest
,
DetectTrajectoriesRequest
) are final classes.
Request Pattern (Modern API)
All modern Vision requests follow the same pattern: create a request struct,
call
perform(on:)
, and handle the typed result.
import
Vision
func
recognizeText
(
in
image
:
CGImage
)
async
throws
->
[
String
]
{
var
request
=
RecognizeTextRequest
(
)
request
.
recognitionLevel
=
.
accurate
request
.
recognitionLanguages
=
[
Locale
.
Language
(
identifier
:
"en-US"
)
]
let
observations
=
try
await
request
.
perform
(
on
:
image
)
return
observations
.
compactMap
{
observation
in
observation
.
topCandidates
(
1
)
.
first
?
.
string
}
}
Legacy Pattern (Pre-iOS 18)
Use
VNImageRequestHandler
with completion-based requests when targeting
older deployment versions.
import
Vision
func
recognizeTextLegacy
(
in
image
:
CGImage
)
throws
->
[
String
]
{
var
recognized
:
[
String
]
=
[
]
let
request
=
VNRecognizeTextRequest
{
request
,
error
in
guard
let
observations
=
request
.
results
as
?
[
VNRecognizedTextObservation
]
else
{
return
}
recognized
=
observations
.
compactMap
{
$0
.
topCandidates
(
1
)
.
first
?
.
string
}
}
request
.
recognitionLevel
=
.
accurate
let
handler
=
VNImageRequestHandler
(
cgImage
:
image
)
try
handler
.
perform
(
[
request
]
)
return
recognized
}
Text Recognition (OCR)
Modern: RecognizeTextRequest (iOS 18+)
var
request
=
RecognizeTextRequest
(
)
request
.
recognitionLevel
=
.
accurate
// .fast for real-time
request
.
recognitionLanguages
=
[
Locale
.
Language
(
identifier
:
"en-US"
)
,
Locale
.
Language
(
identifier
:
"fr-FR"
)
,
]
request
.
usesLanguageCorrection
=
true
request
.
customWords
=
[
"SwiftUI"
,
"Xcode"
]
// domain-specific terms
let
observations
=
try
await
request
.
perform
(
on
:
cgImage
)
for
observation
in
observations
{
guard
let
candidate
=
observation
.
topCandidates
(
1
)
.
first
else
{
continue
}
let
text
=
candidate
.
string
let
confidence
=
candidate
.
confidence
// 0.0 ... 1.0
let
bounds
=
observation
.
boundingBox
// normalized coordinates
}
Legacy: VNRecognizeTextRequest
let
request
=
VNRecognizeTextRequest
(
)
request
.
recognitionLevel
=
.
accurate
request
.
recognitionLanguages
=
[
"en-US"
,
"fr-FR"
]
request
.
usesLanguageCorrection
=
true
Key differences:
Modern API uses
Locale.Language
for languages; legacy
uses string identifiers. Both support
.accurate
(best quality) and
.fast
(real-time suitable) recognition levels.
Face Detection
Detect face rectangles, landmarks (eyes, nose, mouth), and capture quality.
// Modern API
let
faceRequest
=
DetectFaceRectanglesRequest
(
)
let
faces
=
try
await
faceRequest
.
perform
(
on
:
cgImage
)
for
face
in
faces
{
let
boundingBox
=
face
.
boundingBox
// normalized CGRect
let
roll
=
face
.
roll
// Measurement
let
yaw
=
face
.
yaw
// Measurement
}
// Landmarks (eyes, nose, mouth contours)
var
landmarkRequest
=
DetectFaceLandmarksRequest
(
)
let
landmarkFaces
=
try
await
landmarkRequest
.
perform
(
on
:
cgImage
)
for
face
in
landmarkFaces
{
let
landmarks
=
face
.
landmarks
let
leftEye
=
landmarks
?
.
leftEye
?
.
normalizedPoints
let
nose
=
landmarks
?
.
nose
?
.
normalizedPoints
}
Coordinate System
Vision uses a normalized coordinate system with origin at the bottom-left.
Convert to UIKit (top-left origin) before display:
func
convertToUIKit
(
_
rect
:
CGRect
,
imageHeight
:
CGFloat
)
->
CGRect
{
CGRect
(
x
:
rect
.
origin
.
x
,
y
:
imageHeight
-
rect
.
origin
.
y
-
rect
.
height
,
width
:
rect
.
width
,
height
:
rect
.
height
)
}
Barcode Detection
Detect 1D and 2D barcodes including QR codes.
var
request
=
DetectBarcodesRequest
(
)
request
.
symbologies
=
[
.
qr
,
.
ean13
,
.
code128
,
.
pdf417
]
let
barcodes
=
try
await
request
.
perform
(
on
:
cgImage
)
for
barcode
in
barcodes
{
let
payload
=
barcode
.
payloadString
// decoded content
let
symbology
=
barcode
.
symbology
// .qr, .ean13, etc.
let
bounds
=
barcode
.
boundingBox
// normalized rect
}
Common symbologies:
.qr
,
.aztec
,
.pdf417
,
.dataMatrix
,
.ean8
,
.ean13
,
.code39
,
.code128
,
.upce
,
.itf14
.
Document Scanning (iOS 26+)
RecognizeDocumentsRequest
provides structured document reading with layout
understanding beyond basic OCR. Returns
DocumentObservation
objects with a
nested
Container
structure for paragraphs, tables, lists, and barcodes.
var
request
=
RecognizeDocumentsRequest
(
)
let
documents
=
try
await
request
.
perform
(
on
:
cgImage
)
for
observation
in
documents
{
let
container
=
observation
.
document
// Full text content
let
fullText
=
container
.
text
// Structured access to paragraphs
for
paragraph
in
container
.
paragraphs
{
let
paragraphText
=
paragraph
.
text
}
// Tables and lists
for
table
in
container
.
tables
{
/ structured table data /
}
for
list
in
container
.
lists
{
/ structured list data /
}
// Embedded barcodes detected within the document
for
barcode
in
container
.
barcodes
{
/ barcode data /
}
// Document title if detected
if
let
title
=
container
.
title
{
print
(
title
)
}
}
For simpler document camera scanning, use VisionKit's
VNDocumentCameraViewController
which provides a full-screen camera UI with
auto-capture, perspective correction, and multi-page scanning.
Image Segmentation
Modern: GeneratePersonSegmentationRequest (iOS 18+)
var
request
=
GeneratePersonSegmentationRequest
(
)
request
.
qualityLevel
=
.
accurate
// .balanced, .fast
let
mask
=
try
await
request
.
perform
(
on
:
cgImage
)
// mask is a PersonSegmentationObservation with a pixelBuffer property
let
maskBuffer
=
mask
.
pixelBuffer
// Apply mask using Core Image: CIFilter.blendWithMask()
Legacy: VNGeneratePersonSegmentationRequest
let
request
=
VNGeneratePersonSegmentationRequest
(
)
request
.
qualityLevel
=
.
accurate
// .balanced, .fast
request
.
outputPixelFormat
=
kCVPixelFormatType_OneComponent8
let
handler
=
VNImageRequestHandler
(
cgImage
:
cgImage
)
try
handler
.
perform
(
[
request
]
)
guard
let
mask
=
request
.
results
?
.
first
?
.
pixelBuffer
else
{
return
}
// Apply mask using Core Image: CIFilter.blendWithMask()
Quality levels:
.accurate
-- best quality, slowest (~1s), full resolution
.balanced
-- good quality, moderate speed (~100ms), 960x540
.fast
-- lowest quality, fastest (~10ms), 256x144, suitable for real-time
Instance Segmentation (iOS 18+)
Separate masks per person for individual effects.
// Modern API (iOS 18+)
let
request
=
GeneratePersonInstanceMaskRequest
(
)
let
observation
=
try
await
request
.
perform
(
on
:
cgImage
)
let
indices
=
observation
.
allInstances
for
index
in
indices
{
let
mask
=
try
observation
.
generateMask
(
forInstances
:
IndexSet
(
integer
:
index
)
)
// mask is a CVPixelBuffer with only this person visible
}
// Legacy API (iOS 17+)
let
request
=
VNGeneratePersonInstanceMaskRequest
(
)
let
handler
=
VNImageRequestHandler
(
cgImage
:
cgImage
)
try
handler
.
perform
(
[
request
]
)
guard
let
result
=
request
.
results
?
.
first
else
{
return
}
let
indices
=
result
.
allInstances
for
index
in
indices
{
let
instanceMask
=
try
result
.
generateMaskedImage
(
ofInstances
:
IndexSet
(
integer
:
index
)
,
from
:
handler
,
croppedToInstancesExtent
:
false
)
}
See
references/vision-requests.md
for mask composition and Core Image filter
integration patterns.
Object Tracking
Modern: TrackObjectRequest (iOS 18+)
TrackObjectRequest
is a stateful request that maintains tracking context
across frames. Conforms to both
ImageProcessingRequest
and
StatefulRequest
.
// Initialize with a detected object's bounding box
let
initialObservation
=
DetectedObjectObservation
(
boundingBox
:
detectedRect
)
var
request
=
TrackObjectRequest
(
observation
:
initialObservation
)
request
.
trackingLevel
=
.
accurate
// For each video frame:
let
results
=
try
await
request
.
perform
(
on
:
pixelBuffer
)
if
let
tracked
=
results
.
first
{
let
updatedBounds
=
tracked
.
boundingBox
let
confidence
=
tracked
.
confidence
}
Legacy: VNTrackObjectRequest
let
trackRequest
=
VNTrackObjectRequest
(
detectedObjectObservation
:
initialObservation
)
trackRequest
.
trackingLevel
=
.
accurate
let
sequenceHandler
=
VNSequenceRequestHandler
(
)
// For each frame:
try
sequenceHandler
.
perform
(
[
trackRequest
]
,
on
:
pixelBuffer
)
if
let
result
=
trackRequest
.
results
?
.
first
{
let
updatedBounds
=
result
.
boundingBox
trackRequest
.
inputObservation
=
result
}
Other Request Types
Vision provides additional requests covered in
references/vision-requests.md
:
Request
Purpose
ClassifyImageRequest
Classify scene content (outdoor, food, animal, etc.)
GenerateAttentionBasedSaliencyImageRequest
Heat map of where viewers focus attention
GenerateObjectnessBasedSaliencyImageRequest
Heat map of object-like regions
GenerateForegroundInstanceMaskRequest
Foreground object segmentation (not person-specific)
DetectRectanglesRequest
Detect rectangular shapes (documents, cards, screens)
DetectHorizonRequest
Detect horizon angle for auto-leveling photos
DetectHumanBodyPoseRequest
Detect body joints (shoulders, elbows, knees)
DetectHumanBodyPose3DRequest
3D human body pose estimation
DetectHumanHandPoseRequest
Detect hand joints and finger positions
DetectAnimalBodyPoseRequest
Detect animal body joint positions
DetectFaceCaptureQualityRequest
Face capture quality scoring (0–1) for photo selection
TrackRectangleRequest
Track rectangular objects across video frames
TrackOpticalFlowRequest
Optical flow between video frames
DetectTrajectoriesRequest
Detect object trajectories in video
All modern request types above are iOS 18+ / macOS 15+.
Core ML Integration
Run custom Core ML models through Vision for automatic image preprocessing
(resizing, normalization, color space conversion).
// Modern API (iOS 18+)
let
model
=
try
MLModel
(
contentsOf
:
modelURL
)
let
request
=
CoreMLRequest
(
model
:
.
init
(
model
)
)
let
results
=
try
await
request
.
perform
(
on
:
cgImage
)
// Classification model
if
let
classification
=
results
.
first
as
?
ClassificationObservation
{
let
label
=
classification
.
identifier
let
confidence
=
classification
.
confidence
}
// Legacy API
let
vnModel
=
try
VNCoreMLModel
(
for
:
model
)
let
request
=
VNCoreMLRequest
(
model
:
vnModel
)
{
request
,
error
in
guard
let
results
=
request
.
results
as
?
[
VNClassificationObservation
]
else
{
return
}
let
topResult
=
results
.
first
}
let
handler
=
VNImageRequestHandler
(
cgImage
:
cgImage
)
try
handler
.
perform
(
[
request
]
)
For model conversion and optimization, see the
coreml
skill.
VisionKit: DataScannerViewController
DataScannerViewController
provides a full-screen live camera scanner for text
and barcodes. See
references/visionkit-scanner.md
for complete patterns.
Quick Start
import
VisionKit
// Check availability (requires A12+ chip and camera)
guard
DataScannerViewController
.
isSupported
,
DataScannerViewController
.
isAvailable
else
{
return
}
let
scanner
=
DataScannerViewController
(
recognizedDataTypes
:
[
.
text
(
languages
:
[
"en"
]
)
,
.
barcode
(
symbologies
:
[
.
qr
,
.
ean13
]
)
]
,
qualityLevel
:
.
balanced
,
recognizesMultipleItems
:
true
,
isHighFrameRateTrackingEnabled
:
true
,
isHighlightingEnabled
:
true
)
scanner
.
delegate
=
self
present
(
scanner
,
animated
:
true
)
{
try
?
scanner
.
startScanning
(
)
}
SwiftUI Integration
Wrap
DataScannerViewController
in
UIViewControllerRepresentable
. See
references/visionkit-scanner.md
for the full implementation.
Common Mistakes
DON'T:
Use the legacy
VNImageRequestHandler
API for new iOS 18+ projects.
DO:
Use modern struct-based requests with
perform(on:)
and async/await.
Why:
Modern API provides type safety, better Swift concurrency support, and cleaner error handling.
DON'T:
Forget to convert normalized coordinates before drawing bounding boxes.
DO:
Use
VNImageRectForNormalizedRect(::_:)
or manual conversion from bottom-left origin to UIKit top-left origin.
Why:
Vision uses normalized coordinates (0...1) with bottom-left origin; UIKit uses points with top-left origin.
DON'T:
Run Vision requests on the main thread.
DO:
Perform requests on a background thread or use async/await from a detached task.
Why:
Image analysis is CPU/GPU-intensive and blocks the UI if run on the main actor.
DON'T:
Use
.accurate
recognition level for real-time camera feeds.
DO:
Use
.fast
for live video,
.accurate
for still images or offline processing.
Why:
Accurate recognition is too slow for 30fps video; fast recognition trades quality for speed.
DON'T:
Ignore the
confidence
score on observations.
DO:
Filter results by confidence threshold (e.g., > 0.5) appropriate for your use case.
Why:
Low-confidence results are often incorrect and degrade user experience.
DON'T:
Create a new
VNImageRequestHandler
for each frame when tracking objects.
DO:
Use
VNSequenceRequestHandler
for video frame sequences.
Why:
Sequence handler maintains temporal context for tracking; per-frame handlers lose state.
DON'T:
Request all barcode symbologies when you only need QR codes.
DO:
Specify only the symbologies you need in the request.
Why:
Fewer symbologies means faster detection and fewer false positives.
DON'T:
Assume
DataScannerViewController
is available on all devices.
DO:
Check both
isSupported
(hardware) and
isAvailable
(user permissions) before presenting.
Why:
Requires A12+ chip;
isAvailable
also checks camera access authorization.
Review Checklist
Uses modern Vision API (iOS 18+) unless targeting older deployments
Vision requests run off the main thread (async/await or background queue)
Normalized coordinates converted before UI display
Confidence threshold applied to filter low-quality observations
Recognition level matches use case (
.fast
for video,
.accurate
for stills)
Language hints set for text recognition when input language is known
Barcode symbologies limited to only those needed
DataScannerViewController
availability checked before presentation
Camera usage description (
NSCameraUsageDescription
) in Info.plist for VisionKit
Person segmentation quality level appropriate for use case
VNSequenceRequestHandler
used for video frame tracking (not per-frame handler)
Error handling covers request failures and empty results
References
Vision request patterns:
references/vision-requests.md
VisionKit scanner integration:
references/visionkit-scanner.md
Apple docs:
Vision
|
VisionKit
|
RecognizeTextRequest
|
DataScannerViewController