Metal for Game Developers

WWDC 2018

Posted by Den on September 03, 2018 · 21 mins read

Harnessing Parallelism

  • Scalable multi-threaded encoding is key
  • Metal makes multi-threaded CPU command generation easy and fast
  • Metal automatically parallelizes GPU tasks

Multi-Threaded Rendering

MTLCommandBuffer


With MTLParallelRenderCommandEncoder
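A minimal sketch of multi-threaded encoding with MTLParallelRenderCommandEncoder, assuming a command queue and render pass descriptor already exist; the DispatchQueue fan-out is just an illustration:

```swift
import Metal
import Dispatch

// Minimal sketch: one render pass, encoded by several threads at once.
func encodeInParallel(commandQueue: MTLCommandQueue,
                      renderPassDescriptor: MTLRenderPassDescriptor,
                      passes: [(MTLRenderCommandEncoder) -> Void]) {
    guard let commandBuffer = commandQueue.makeCommandBuffer(),
          let parallelEncoder = commandBuffer.makeParallelRenderCommandEncoder(
              descriptor: renderPassDescriptor) else { return }

    // Create one sub-encoder per worker up front: Metal replays them in
    // creation order, no matter which thread finishes encoding first.
    let encoders = passes.map { _ in parallelEncoder.makeRenderCommandEncoder()! }

    DispatchQueue.concurrentPerform(iterations: passes.count) { index in
        passes[index](encoders[index])   // issue this worker's draw calls
        encoders[index].endEncoding()
    }

    parallelEncoder.endEncoding()
    commandBuffer.commit()
}
```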

GPU Parallelism

Taking Explicit Control

  • Metal offers more direct approaches for even less overhead
  • Disable Metal’s automatic reference counting (see the sketch below)
  • Allocate resources cheaply using heaps
  • Control GPU parallelism with fences and events
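A minimal sketch of opting out of command-buffer reference counting, assuming the app itself keeps every referenced resource alive until the GPU is done with it:

```swift
import Metal

// Skip Metal's per-command-buffer retain/release of resources. Only safe if
// the caller guarantees the lifetime of everything the command buffer touches.
func makeLightweightCommandBuffer(queue: MTLCommandQueue) -> MTLCommandBuffer? {
    return queue.makeCommandBufferWithUnretainedReferences()
}
```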

Resource Heaps

MTLHeap

  • Control time of memory allocation
  • Fast reallocation and aliasing of resources
  • Cheaper resource binding
  • Simple API (see the sketch below)
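A minimal sketch of sub-allocating from a heap; the heap size, buffer length, and texture format are placeholders:

```swift
import Metal

func makeTransientResources(device: MTLDevice) -> (MTLBuffer, MTLTexture)? {
    // One up-front allocation; sub-allocations out of it are cheap.
    let heapDescriptor = MTLHeapDescriptor()
    heapDescriptor.storageMode = .private
    heapDescriptor.size = 64 * 1024 * 1024
    guard let heap = device.makeHeap(descriptor: heapDescriptor) else { return nil }

    let textureDescriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .rgba16Float, width: 1024, height: 1024, mipmapped: false)
    textureDescriptor.storageMode = .private
    textureDescriptor.usage = [.renderTarget, .shaderRead]

    guard let buffer = heap.makeBuffer(length: 1 << 20, options: .storageModePrivate),
          let texture = heap.makeTexture(descriptor: textureDescriptor) else { return nil }

    // Call makeAliasable() on a resource once it is no longer needed so later
    // allocations can reuse (alias) its memory.
    return (buffer, texture)
}
```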

Without dependency tracking

MTLFence and MTLEvent

Explicit execution order
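A minimal sketch of both primitives, assuming a compute pass produces data that a later render pass consumes; the encoders and signal value are placeholders:

```swift
import Metal

func expressOrdering(device: MTLDevice,
                     commandBuffer: MTLCommandBuffer,
                     producer: MTLComputeCommandEncoder,
                     consumer: MTLRenderCommandEncoder) {
    // MTLFence: order passes within one command queue.
    let fence = device.makeFence()!
    producer.updateFence(fence)                    // signalled when the compute pass is done
    consumer.waitForFence(fence, before: .vertex)  // vertex work waits for the compute results

    // MTLEvent: order work across command buffers (or queues).
    let event = device.makeEvent()!
    commandBuffer.encodeSignalEvent(event, value: 1)
    // A later command buffer calls encodeWaitForEvent(event, value: 1) and will
    // not start executing until this signal is reached.
}
```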

Building GPU-Driven Pipelines

  • Games are moving more and more logic onto the GPU
    - Efficient processing of large datasets
    - Growing scene graph complexity
  • Metal 2 enables you to move the entire render loop to the GPU
    - Argument Buffers — offload parameter management 
    - Indirect Command Buffers — offload rendering loop

Argument Buffers
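A hedged host-side sketch of packing a material’s textures and constants into one argument buffer so the render loop binds a single buffer per material; the function name fragmentShader and the indices are assumptions:

```swift
import Metal

func makeMaterialArgumentBuffer(device: MTLDevice,
                                library: MTLLibrary,
                                albedo: MTLTexture,
                                constants: MTLBuffer) -> MTLBuffer? {
    guard let function = library.makeFunction(name: "fragmentShader") else { return nil }

    // The encoder knows the layout of the argument struct declared in the shader.
    let encoder = function.makeArgumentEncoder(bufferIndex: 0)
    guard let argumentBuffer = device.makeBuffer(length: encoder.encodedLength,
                                                 options: []) else { return nil }

    encoder.setArgumentBuffer(argumentBuffer, offset: 0)
    encoder.setTexture(albedo, index: 0)
    encoder.setBuffer(constants, offset: 0, index: 1)
    return argumentBuffer
}

// At draw time, bind once and declare the indirectly referenced resources:
//   renderEncoder.setFragmentBuffer(argumentBuffer, offset: 0, index: 0)
//   renderEncoder.useResource(albedo, usage: .read)
```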

Indirect Command Buffers (ICB)

  • Allow GPU to build draw calls
    - Massively parallel generation of commands
  • Reuse ICB in multiple frames
    - Modify contents
  • Remove expensive CPU and GPU synchronization

Game Rendering Loop

GPU-Driven Rendering Loop

Encoding Drawcall Example

Executing ICB Example
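As a hedged stand-in for the two examples above, a host-side sketch that allocates an ICB the GPU can fill with draw calls and then executes it; the limits are placeholders and the compute kernel that encodes the draws is not shown:

```swift
import Metal

func makeAndExecuteIndirectCommands(device: MTLDevice,
                                    renderEncoder: MTLRenderCommandEncoder,
                                    maxDrawCount: Int) {
    let descriptor = MTLIndirectCommandBufferDescriptor()
    descriptor.commandTypes = [.drawIndexed]      // what the GPU is allowed to encode
    descriptor.inheritBuffers = false
    descriptor.maxVertexBufferBindCount = 4
    descriptor.maxFragmentBufferBindCount = 4

    guard let icb = device.makeIndirectCommandBuffer(descriptor: descriptor,
                                                     maxCommandCount: maxDrawCount,
                                                     options: []) else { return }

    // ...a compute pass (not shown) culls geometry chunks and encodes one
    // indexed draw per visible chunk into `icb`...

    // Execute every slot; slots the GPU left empty are skipped.
    renderEncoder.executeCommandsInBuffer(icb, range: 0..<maxDrawCount)
}
```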

Split the geometry into chunks of a few hundred triangles and analyze those chunks in a separate compute kernel

Tile-Based Deferred Rendering (TBDR)

  • High performance, low power
  • Tightly integrated with Metal
  • A11 Bionic takes TBDR to the next level

Metal 2 Features on A11

Programmable Blending

  • Metal provides read-write access to pixels in tile memory
    - Implement custom blend operations
  • Combine passes that read/write the same pixel
    - Eliminates system memory bandwidth between passes
    - Optimizes deferred shading

Deferred Shading

Imageblocks

  • Flexible per-pixel storage in tile memory
    - Layout declared in shading language
  • Change pixel layout within a pass
    - Merge render passes with different layouts

Deferred Shading

Tile Shading

  • New programmable stage in the render pass
    - Dispatch a configurable threadgroup per tile
  • Interleave render and compute processing
    - Read rendering results directly from tile memory (see the sketch below)
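A minimal host-side sketch of adding a tile dispatch to a render pass; the tile function name cull_lights_per_tile is an assumption:

```swift
import Metal

func encodeTileDispatch(device: MTLDevice,
                        library: MTLLibrary,
                        renderEncoder: MTLRenderCommandEncoder,
                        colorFormat: MTLPixelFormat) throws {
    let tileDescriptor = MTLTileRenderPipelineDescriptor()
    tileDescriptor.label = "Per-tile light culling"
    tileDescriptor.tileFunction = library.makeFunction(name: "cull_lights_per_tile")!
    tileDescriptor.colorAttachments[0].pixelFormat = colorFormat
    tileDescriptor.threadgroupSizeMatchesTileSize = true

    let tilePipeline = try device.makeRenderPipelineState(tileDescriptor: tileDescriptor,
                                                          options: [],
                                                          reflection: nil)

    // Interleaved with normal draws: the tile shader reads the rendering
    // results already sitting in tile memory.
    renderEncoder.setRenderPipelineState(tilePipeline)
    renderEncoder.dispatchThreadsPerTile(MTLSize(width: renderEncoder.tileWidth,
                                                 height: renderEncoder.tileHeight,
                                                 depth: 1))
}
```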

Tiled Forward Shading

Persistent Threadgroup Memory

  • Threadgroup memory available to render passes
    - Available to fragment and tile shaders
    - Buffer contents persist for the lifetime of the tile
  • Share data across pixels
    - Communicate tile-scoped data between draws and tile dispatches (see the sketch below)
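A minimal sketch of reserving that tile-scoped memory; the 16 KB size and threadgroup index 0 are assumptions that must match the shader-side declaration:

```swift
import Metal

func bindPersistentTileMemory(renderEncoder: MTLRenderCommandEncoder) {
    // Memory persists for the lifetime of each tile and is visible to fragment
    // shaders and tile dispatches in the same render pass.
    renderEncoder.setThreadgroupMemoryLength(16 * 1024,  // bytes per tile
                                             offset: 0,
                                             index: 0)   // [[threadgroup(0)]] in the shaders
}
```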

Tile Forward Shading

Deferred Shading and Multi-Layer Alpha Blending

Changing imageblock layouts within a pass requires tile shading

Multi-Sample Color Coverage Control

  • MSAA is efficient on A-series GPUs
    - Samples stored in tile memory for fast blending and resolves
  • MSAA is more efficient on A11
    - Track unique colors in a pixel
  • Tile shading gives control over color coverage
    - Resolve in-place, in fast tile memory

Multi-Sample Rendering with Transparent Particles

Without Color Coverage Control
With Tile Shading and Color Coverage Control

Fortnite: Battle Royale

Shipping an Unreal Engine 4 console game on iOS with Metal

Technical Challenges

  • One map, larger than 6 km²
  • Time-of-day, destruction, player-built structures
  • 100 players
  • 50,000+ replicating actors
  • Crossplay with console and desktop players

One Game, All Platforms

  • Crossplay means we are limited in our ability to scale down the game
  • If it affects gameplay, we can’t change it
  • If a player can hide behind an object, we must render it

Metal

  • Draw call performance allows us to render complex scenes with thousands of dynamic objects
  • Access to hardware features, e.g. programmable blending, allows for big wins on the GPU
  • Feature set lets us support artist authored materials, physically based rendering, dynamic lighting and shadows, GPU particle simulation

Rendering Features Used on iOS

  • Movable directional light with cascaded shadow maps
  • Movable skylight
  • Physically based materials
  • HDR and Tonemapping
  • GPU particle simulation
  • Artist authored materials with vertex animation

Scalability

  • Scalability across platforms
    - Minimum LOD for meshes
    - Character LODs, animation update rates, etc.
  • Define three performance buckets and assign devices based on performance
    - Low (iPhone 6s, iPad Air 2, iPad mini 4)
    - Mid (iPhone 7, iPad Pro)
    - High (iPhone 8, iPhone X, iPad Pro 2nd Gen)

Resolution

  • Backbuffer resolution is what the UI renders at, determined by contentScaleFactor
  • 3D resolution is separate, upscaled before rendering UI (adds some cost)
  • Preferred scaling the backbuffer; only scaled 3D separately on iPhone 6s, iPad Air 2, and iPad (5th gen)
  • Tuned per-device (see the sketch below)
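A minimal sketch of the two scaling options above; the 0.75 factor is a placeholder, not one of Fortnite’s tuned values:

```swift
import UIKit
import MetalKit

func configureResolution(view: MTKView, device: MTLDevice) -> MTLTexture? {
    // Option 1: scale the backbuffer itself; UI and 3D both render at this size.
    view.contentScaleFactor = UIScreen.main.nativeScale * 0.75

    // Option 2: keep the backbuffer at native scale and render the 3D scene to a
    // smaller offscreen target that is upscaled before the UI is drawn on top.
    let drawableSize = view.drawableSize
    let descriptor = MTLTextureDescriptor.texture2DDescriptor(
        pixelFormat: .bgra8Unorm,
        width: max(1, Int(drawableSize.width * 0.75)),
        height: max(1, Int(drawableSize.height * 0.75)),
        mipmapped: false)
    descriptor.usage = [.renderTarget, .shaderRead]
    descriptor.storageMode = .private
    return device.makeTexture(descriptor: descriptor)
}
```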

Shadows

Dynamic shadows have a large impact on both GPU and CPU performance

Foliage

Grass and foliage scale; both density and cull distance

Memory

  • Memory doesn’t always correlate with performance
    - iPhone 8 has less physical memory than iPhone 7+ but is faster
  • Low memory devices
    - No foliage, no shadows, 16k max GPU particles, reduced pool for cosmetics, reduced texture memory pool
  • High memory devices
    - Foliage, shadows, 64k max GPU particles, increased pool for cosmetics, increased texture memory pool

Memory Optimizations

Framerate Targets

  • 30fps at the highest visual fidelity possible
  • However, maxing out the CPU and GPU generates heat, which causes the device to downclock
  • We also want to conserve battery life
  • So we target 60fps for the environment but vsync at 30fps

Performance Tracking

  • Capture environment performance at a selection of POIs over time
  • Daily 100 player playtests to capture dynamic performance
    - Key performance stats tracked
    - Instrumented profiles gathered from one device
    - Replays saved off for later investigation

Threading

Draw Calls

  • Draw calls were our main performance bottleneck
  • Metal’s performance really helped here: 3–4x faster than OpenGL, which allowed us to ship without more aggressive work to reduce draw calls
  • Pulled in cull distance on decorative objects
  • Added HLODs to combine draw calls

Hierarchical LOD

  • Combine draw calls for a group of meshes
  • Allow us to render the entire map while skydiving
  • Even on the ground, POIs are visible from up to 2km away
  • Added mid-range HLODs to further reduce draw calls in complex scenes like Tilted Towers

Pipeline State Objects

  • Minimize how many are created at runtime to prevent hitching
  • Follow best practices
    - Compile functions offline
    - Build library offline
    - Group functions into a single library you can ship with your game
  • Ideally create all of the PSOs you need at load time
  • What if the set of PSOs you might need is large?
    - Artist authored shaders x Lighting scenarios x RT formats x MSAA x Stencil state (e.g. LOD dithering) x Input layout x Scalability level x and more
  • Minimize permutations where you can!
    - Sometimes a dynamic branch is fine
  • Identify the most common subset you are likely to need and create those at load (see the sketch after this list)
  • We use automation to run the game and gather PSOs used by device across the map, load cosmetics, fire weapons, etc.
  • We also store PSOs created during daily playtests
  • Not perfect, but…
    - Number of PSOs created during gameplay is in the single digits on average
    - We only create a small subset of the permutation matrix on load
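A hedged sketch of that load-time strategy with an offline-compiled library and a prewarmed cache; the key fields and helper names are assumptions, not Unreal Engine’s actual implementation:

```swift
import Metal

struct PSOKey: Hashable {
    let vertexFunction: String
    let fragmentFunction: String
    let colorFormat: MTLPixelFormat
    let sampleCount: Int
}

final class PipelineCache {
    private var pipelines: [PSOKey: MTLRenderPipelineState] = [:]
    private let device: MTLDevice
    private let library: MTLLibrary

    init(device: MTLDevice) {
        self.device = device
        self.library = device.makeDefaultLibrary()!   // precompiled, shipped metallib
    }

    // Called at load time with every key recorded by automation and playtests.
    func prewarm(keys: [PSOKey]) {
        for key in keys { _ = pipeline(for: key) }
    }

    // Falls back to runtime creation only for the rare uncaptured permutation.
    func pipeline(for key: PSOKey) -> MTLRenderPipelineState? {
        if let cached = pipelines[key] { return cached }
        let descriptor = MTLRenderPipelineDescriptor()
        descriptor.vertexFunction = library.makeFunction(name: key.vertexFunction)
        descriptor.fragmentFunction = library.makeFunction(name: key.fragmentFunction)
        descriptor.colorAttachments[0].pixelFormat = key.colorFormat
        descriptor.sampleCount = key.sampleCount
        let pipeline = try? device.makeRenderPipelineState(descriptor: descriptor)
        pipelines[key] = pipeline
        return pipeline
    }
}
```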

Resource Allocation

  • Creating resources on the fly can hitch due to streaming or creation of dynamic objects
  • Treat resource allocation like memory allocation and use the same strategies to minimize “malloc” and “free”
  • We use a binned allocation strategy for smaller allocations (see the sketch below)
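A minimal sketch of a binned buffer allocator along those lines; bin sizes and the storage mode are placeholders, and the caller must not recycle a buffer the GPU is still using:

```swift
import Metal

final class BinnedBufferAllocator {
    private let device: MTLDevice
    private var freeLists: [Int: [MTLBuffer]] = [:]   // size class -> reusable buffers

    init(device: MTLDevice) { self.device = device }

    // Round each request up to a power-of-two size class, starting at 256 bytes.
    private func sizeClass(for length: Int) -> Int {
        var size = 256
        while size < length { size <<= 1 }
        return size
    }

    func allocate(length: Int) -> MTLBuffer? {
        let bin = sizeClass(for: length)
        if let recycled = freeLists[bin]?.popLast() { return recycled }   // no "malloc"
        return device.makeBuffer(length: bin, options: .storageModeShared)
    }

    // Recycle instead of releasing; only call once the GPU has finished with it.
    func free(_ buffer: MTLBuffer) {
        freeLists[buffer.length, default: []].append(buffer)
    }
}
```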

Programmable Blending

  • Optimization to reduce the number of resolves and restores for features that need to read the depth buffer, e.g. decals and soft particle blending
  • During forward pass, write linear depth to alpha channel (we render to FP16 render target)
  • During decal and translucent passes, if needed, read [[color(0)]].a as linear depth
  • MSAA resolve happens before postprocessing and tonemapping
  • Bilinear filtering of HDR values can lead to very aliased edges, e.g. dark ceiling in shadow against a bright sky

Solution

Use programmable blending for the pre-tonemap pass to avoid resolving the MSAA color buffer to memory!

  • Pre-tonemap before resolve
  • Perform the normal MSAA resolve
  • The first postprocessing pass reverses the “pre-tonemap”

Parallel Rendering

  • On macOS we generate command buffers in parallel
    - Not using parallel command encoders right now; separate command buffers were easier to integrate into our existing parallel rendering architecture
    - Parallel command encoders would be needed on iOS
  • Experiment with parallel rendering on iOS (see the sketch below)
    - For example, rendering thread on fast core versus four encoding threads on efficient cores
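A minimal sketch of the separate-command-buffers approach, assuming each pass closure creates and ends its own encoders; the worker fan-out is an illustration:

```swift
import Metal
import Dispatch

func encodeFrameInParallel(queue: MTLCommandQueue,
                           passes: [(MTLCommandBuffer) -> Void]) {
    let commandBuffers = passes.map { _ in queue.makeCommandBuffer()! }

    // enqueue() fixes each buffer's position in the queue's execution order
    // before any encoding starts on the worker threads.
    commandBuffers.forEach { $0.enqueue() }

    DispatchQueue.concurrentPerform(iterations: passes.count) { index in
        passes[index](commandBuffers[index])   // encode this pass on a worker thread
        commandBuffers[index].commit()
    }
}
```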

Metal Heaps

  • Use MTLHeap instead of buffer sub-allocation
  • Would also reduce hitches due to texture streaming
  • Requires precise fences, in particular knowing which stages read which resources
    - Full barriers between passes underutilize the GPU
    - Requires some rework of our rendering architecture to make this bulletproof