Vision with Core ML

WWDC 2018

Posted by Den on August 21, 2018 · 10 mins read

The Storyline

Create an app that helps shoppers identify items

  • Train a custom classifier
  • Build an iOS app
  • Keep an eye on pitfalls

Our Training Regimen

  • Take pictures
  • Sort into folders — the folder names are used as labels
  • How much data do I need?
    - Minimum of 10 images per category, but more is better
    - Highly imbalanced datasets are hard to train
  • Augmentation adds some robustness on top, but it doesn’t replace variety
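
The regimen above can be tried in a few lines of Create ML in a macOS playground. This is a minimal sketch, assuming the pictures are already sorted into label-named subfolders. The `Training` and `Testing` paths and the `ItemClassifier.mlmodel` file name are placeholders, not taken from the session.

```swift
import CreateML
import Foundation

// Hypothetical locations; each subfolder name becomes a label.
let trainingDir = URL(fileURLWithPath: "/path/to/Training")
let testingDir = URL(fileURLWithPath: "/path/to/Testing")

do {
    // Train an image classifier with the default parameters.
    let classifier = try MLImageClassifier(trainingData: .labeledDirectories(at: trainingDir))

    // Check how the model does on pictures it has not seen during training.
    let evaluation = classifier.evaluation(on: .labeledDirectories(at: testingDir))
    print("Evaluation error: \(evaluation.classificationError)")

    // Save the model so it can be dragged into the Xcode project.
    try classifier.write(to: URL(fileURLWithPath: "/path/to/ItemClassifier.mlmodel"))
} catch {
    print("Training failed: \(error)")
}
```

If the evaluation error is high for one label, that category most likely needs more, and more varied, pictures.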

Under the Hood

Vision FeaturePrint.Scene

Vision Framework's FeaturePrint for Image Classification

  • Available through ImageClassifier training in Create ML
  • Trained on a very large dataset
  • Capable of predicting over 1000 categories
  • Powers user facing features in Photos
  • Continuous improvement (You might want to retrain in the future)
  • Already on device 
    - Smaller disk footprint for your custom model
  • Optimized for Apple devices

Refining the App

Only classify when needed!

  • Don’t run expensive tasks when not needed
  • Am I holding still?
  • Using registration
    - Cheap and fast
    - Camera holds still
    - Subject is not moving
  • VNTranslationalImageRegistrationRequest
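
The "Am I holding still?" decision can be sketched without any Vision types: keep a short history of the frame-to-frame translations that registration reports, and call the scene stable once the accumulated drift stays small. The `SceneStabilityTracker` type, the history length of 15 frames and the 20-pixel limit are assumptions chosen for illustration.

```swift
// Keeps a short history of frame-to-frame translations and reports
// whether the camera has been holding still.
struct SceneStabilityTracker {
    // Assumed tuning values; adjust them for your own app.
    var maximumHistoryLength = 15
    var maximumTotalDistance = 20.0   // pixels

    private var history: [(dx: Double, dy: Double)] = []

    init() {}

    // Record the translation between the previous and the current frame.
    mutating func record(dx: Double, dy: Double) {
        history.append((dx: dx, dy: dy))
        if history.count > maximumHistoryLength {
            history.removeFirst()
        }
    }

    // Stable once the history is full and the accumulated drift is small.
    var isSceneStable: Bool {
        guard history.count == maximumHistoryLength else { return false }
        let totalX = history.reduce(0.0) { $0 + $1.dx }
        let totalY = history.reduce(0.0) { $0 + $1.dy }
        return abs(totalX) + abs(totalY) < maximumTotalDistance
    }

    // Forget the history, for example after a result has been shown.
    mutating func reset() {
        history.removeAll()
    }
}
```

Feed `record(dx:dy:)` with the `tx` and `ty` values of each alignment transform and only start the expensive classification when `isSceneStable` is true.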

Always have a backup plan

  • Classifications can be wrong
  • Even when confidence is high, plan for it
  • Alternative identification
    - Barcode reading
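
One way to wire in such a backup plan is to look at the results of both requests after they have run on the same frame. The sketch below is an assumption about how that could look: the `Identification` enum, the `identify` function, the policy of letting a decoded barcode win, and the 0.9 threshold are all illustrative choices, not something the session prescribes.

```swift
import Vision

enum Identification {
    case barcode(payload: String)
    case classification(label: String, confidence: VNConfidence)
    case unknown
}

// Call after both requests have been performed on the same frame.
func identify(barcodeRequest: VNDetectBarcodesRequest,
              classificationRequest: VNCoreMLRequest,
              minimumConfidence: VNConfidence = 0.9) -> Identification {
    // Assumed policy: a decoded barcode is exact, so it wins over the classifier.
    if let barcodes = barcodeRequest.results as? [VNBarcodeObservation],
       let payload = barcodes.compactMap({ $0.payloadStringValue }).first {
        return .barcode(payload: payload)
    }
    // Otherwise accept the best label, but only if it is confident enough.
    if let labels = classificationRequest.results as? [VNClassificationObservation],
       let best = labels.max(by: { $0.confidence < $1.confidence }),
       best.confidence >= minimumConfidence {
        return .classification(label: best.identifier, confidence: best.confidence)
    }
    return .unknown
}
```

The `.unknown` case is the important one: it gives the UI a place to ask the user to scan a barcode instead of showing a wrong label.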


Sample Link

  • Using Registration for Scene Stability
  • Use the VNSequenceRequestHandler with VNTranslationalImageRegistrationRequest
  • Compare against previous frame 
    try sequenceRequestHandler.perform([request], on: previousBuffer!)
  • Registration is returned as pixels in the alignmentObservation.alignmentTransform
  • Analyze only when scene is stable
  • Create a VNImageRequestHandler for the current frame and pass in the orientation
  • Perform Barcode and Image Classification together
    try imageRequestHandler.perform([barcodeDetection, imageClassification])
  • Manage your buffers
  • Some Vision requests can take longer
  • Perform longer tasks asynchronously
  • Do not queue up more buffers than the camera can provide
    - We only operate with a one deep queue in this example
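
The steps above can be sketched as a single class that is fed one camera frame at a time. This is not the sample's code. `FrameAnalyzer`, the queue label, the lock-based "one deep queue" and the 20-pixel stability rule are assumptions made for illustration. The classification request is passed in, because it depends on your own model.

```swift
import Foundation
import ImageIO
import Vision

final class FrameAnalyzer {
    private let sequenceRequestHandler = VNSequenceRequestHandler()
    private let analysisQueue = DispatchQueue(label: "example.vision.analysis") // hypothetical label
    private let lock = NSLock()
    private var isAnalyzing = false          // the "one deep queue"
    private var previousBuffer: CVPixelBuffer?

    private let barcodeDetection = VNDetectBarcodesRequest()
    private let imageClassification: VNCoreMLRequest

    // Assumed stability rule: under 20 pixels of translation between two frames.
    var isStable: (CGAffineTransform) -> Bool = { abs($0.tx) + abs($0.ty) < 20 }

    init(imageClassification: VNCoreMLRequest) {
        self.imageClassification = imageClassification
    }

    // Call once per camera frame, for example from the capture delegate.
    func process(_ buffer: CVPixelBuffer, orientation: CGImagePropertyOrientation) {
        defer { previousBuffer = buffer }
        guard let previous = previousBuffer else { return }

        // Register the current frame against the previous one.
        let registration = VNTranslationalImageRegistrationRequest(targetedCVPixelBuffer: buffer)
        do {
            try sequenceRequestHandler.perform([registration], on: previous)
        } catch {
            print("Registration failed: \(error)")
            return
        }
        guard let alignment = registration.results?.first as? VNImageTranslationAlignmentObservation,
              isStable(alignment.alignmentTransform) else { return }

        // Drop this frame if the previous analysis is still running.
        guard startAnalysisIfIdle() else { return }

        analysisQueue.async {
            defer { self.finishAnalysis() }
            let handler = VNImageRequestHandler(cvPixelBuffer: buffer,
                                                orientation: orientation,
                                                options: [:])
            do {
                // Barcode detection and classification run together on the same frame.
                try handler.perform([self.barcodeDetection, self.imageClassification])
            } catch {
                print("Analysis failed: \(error)")
            }
        }
    }

    private func startAnalysisIfIdle() -> Bool {
        lock.lock(); defer { lock.unlock() }
        guard !isAnalyzing else { return false }
        isAnalyzing = true
        return true
    }

    private func finishAnalysis() {
        lock.lock(); defer { lock.unlock() }
        isAnalyzing = false
    }
}
```

Dropping frames while an analysis is in flight keeps the app from holding on to more buffers than the camera can provide.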

1. Take photos

2. Make a Machine Learning Model

Using the storyboard

3. Settings

4. Run

Why not use just Core ML?

  • Vision does all the scaling and color conversion for you
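
As a rough illustration, wrapping a Core ML model in a Vision request could look like the sketch below. The `makeClassificationRequest` helper and its `onResult` callback are hypothetical names. `mlModel` stands in for the `model` property of whatever class Xcode generates from your own .mlmodel file.

```swift
import CoreML
import Vision

// `mlModel` is the MLModel behind your generated model class (assumption:
// you pass in something like `YourGeneratedClassifier().model`).
func makeClassificationRequest(for mlModel: MLModel,
                               onResult: @escaping ([VNClassificationObservation]) -> Void) throws -> VNCoreMLRequest {
    let visionModel = try VNCoreMLModel(for: mlModel)
    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        // Classifier models report their results as classification observations.
        let observations = request.results as? [VNClassificationObservation] ?? []
        onResult(observations)
    }
    // Vision crops and scales the input to the size the model expects.
    request.imageCropAndScaleOption = .centerCrop
    return request
}
```

The request can then be handed to a `VNImageRequestHandler` created from a `CVPixelBuffer`, `CGImage` or file URL, with no manual resizing or pixel format conversion.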

Object Recognition

  • YOLO (You Only Look Once)
  • Fast Object Detection and Classification
    - Label and Bounding Box
    - Finds multiple and different objects
  • Train for custom objects
    - Training is more involved than ImageClassifier


Sample Link


  • Result of a VNCoreMLRequest
  • New observation subclass VNRecognizedObjectObservation
  • YOLO based models made easy
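
Reading the label and bounding box from these observations might look like the following sketch. `DetectedItem` and `detectedItems(from:imageWidth:imageHeight:)` are hypothetical names. The request is assumed to be a `VNCoreMLRequest` backed by an object detection model.

```swift
import CoreGraphics
import Vision

struct DetectedItem {
    let label: String
    let confidence: VNConfidence
    let rect: CGRect   // image pixel coordinates, origin at the lower left
}

func detectedItems(from request: VNRequest, imageWidth: Int, imageHeight: Int) -> [DetectedItem] {
    guard let observations = request.results as? [VNRecognizedObjectObservation] else { return [] }
    return observations.compactMap { observation in
        // Each observation carries a list of labels; take the most confident one.
        guard let best = observation.labels.max(by: { $0.confidence < $1.confidence }) else {
            return nil
        }
        // boundingBox is normalized (0...1); scale it to the image size.
        let rect = VNImageRectForNormalizedRect(observation.boundingBox, imageWidth, imageHeight)
        return DetectedItem(label: best.identifier, confidence: best.confidence, rect: rect)
    }
}
```

Because Vision's origin is at the lower left, the rectangle usually needs a vertical flip before it is drawn in a UIKit view.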


Tracking is faster and smoother than re-detection!

  • Use tracking to follow a detected object
  • Tracking is a lighter algorithm
  • Applies temporal smoothing
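
A minimal sketch of following a detected object from frame to frame, assuming the first observation comes from an earlier detection request. `ObjectTracker`, the `.fast` tracking level and the 0.3 confidence cutoff are illustrative choices.

```swift
import CoreVideo
import Vision

final class ObjectTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var lastObservation: VNDetectedObjectObservation

    // Seed the tracker with an observation from a detection request.
    init(initialObservation: VNDetectedObjectObservation) {
        lastObservation = initialObservation
    }

    // Returns the updated normalized bounding box, or nil if tracking was lost.
    func track(in buffer: CVPixelBuffer, minimumConfidence: VNConfidence = 0.3) -> CGRect? {
        let request = VNTrackObjectRequest(detectedObjectObservation: lastObservation)
        request.trackingLevel = .fast
        do {
            try sequenceHandler.perform([request], on: buffer)
        } catch {
            return nil
        }
        guard let updated = request.results?.first as? VNDetectedObjectObservation,
              updated.confidence >= minimumConfidence else {
            return nil
        }
        // Feed the new observation into the next frame's request.
        lastObservation = updated
        return updated.boundingBox
    }
}
```

When `track(in:)` returns nil, a reasonable reaction is to run detection again and create a fresh tracker from the new observation.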

Image Orientation

  • Not all algorithms are orientation agnostic
  • Images are not always upright
    - EXIF orientation defines what is upright
    - When using a URL as input Vision reads the EXIF orientation from file
  • Live from a capture feed
    - Orientation has to be inferred from UIDevice.current.orientation
    - Needs to be mapped to a CGImagePropertyOrientation
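
The mapping could be sketched as below. The exact values depend on which camera is used and how the capture connection is configured. This version assumes the back camera delivering frames in its native landscape orientation, and the function name is hypothetical.

```swift
import ImageIO
import UIKit

// One possible mapping, assuming the back camera and an unrotated capture
// connection. The front camera or a rotated connection needs different values.
func imageOrientation(for deviceOrientation: UIDeviceOrientation) -> CGImagePropertyOrientation {
    switch deviceOrientation {
    case .portrait:
        return .right
    case .portraitUpsideDown:
        return .left
    case .landscapeLeft:      // home button on the right
        return .up
    case .landscapeRight:     // home button on the left
        return .down
    default:                  // faceUp, faceDown, unknown: fall back to portrait
        return .right
    }
}
```

A typical call site would be `imageOrientation(for: UIDevice.current.orientation)`, with the result passed to the `VNImageRequestHandler` initializer.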

Vision Coordinate System

Confidence Score

  • A lot of algorithms can express how certain they are about their results
  • Confidence is expressed on a scale from 0 to 1.0
  • The scale is not uniform across request types

Confidence Score Conclusion

  • Does 1.0 mean it’s certainly correct?
    - It fulfilled the criteria of the algorithm but our perception can differ
  • Where the threshold is depends on the use case
    - Labeling requires high confidence — Observe how your classifier behaves
    - Search might want to include lower confidence scores as they are probable
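
To make the use-case dependence concrete, here is a small sketch that filters the same candidates with two different thresholds. The `Candidate` and `UseCase` types and the cutoffs 0.9 and 0.3 are assumptions for illustration. Suitable values have to come from observing your own classifier.

```swift
struct Candidate: Equatable {
    let label: String
    let confidence: Float
}

enum UseCase {
    case labeling
    case search

    // Assumed thresholds; tune them by observing how your classifier behaves.
    var minimumConfidence: Float {
        switch self {
        case .labeling: return 0.9
        case .search: return 0.3
        }
    }
}

// Keeps the candidates that are confident enough for the use case,
// most confident first.
func accepted(_ candidates: [Candidate], for useCase: UseCase) -> [Candidate] {
    return candidates
        .filter { $0.confidence >= useCase.minimumConfidence }
        .sorted { $0.confidence > $1.confidence }
}
```

With these cutoffs, a candidate at 0.4 confidence would show up in search results but would never be used as a label.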