Metal for Game Developers

WWDC 2018

Metal for Game Developers

WWDC 2018

Harnessing Parallelism

Scalable multi-threaded encoding is key
Metal makes multi-threaded CPU command generation easy an fast
Metal automatically parallelized GPU tasks

Multi-Threaded Rendering

MTLCommandBuffer

With MTLParallerRenderCommandEncoder

GPU Parallelism

Taking Explicit Control

Metal offers more direct approaches for even less overhead
Disable Metal’s automatic reference counting
Allocate resource cheaply using heaps
Control GPU parallelism with fences and events

Resource Heaps

MLTHeap

Control time of memory allocation
Fast reallocation and aliasing of resources
Cheaper resource binding
Simple API

Without dependency tracking

MTLFence and MTLEvent

Explicit execution order

Building GPU-Driven Pipelines

Game are moving more and more logic onto GPU
- Efficient processing of large datasets
- Growing scene graph complexity
Metal 2 enables you to move entire render loop to the GPU
- Argument Buffers — offload parameter management
- Indirect Command Buffers — offload rendering loop

Argument Buffers

Indirect Command Buffers (ICB)

Allow GPU to build draw calls
- Massively parallel generation of commands
Reuse ICB in multiple frames
- Modify contents
Remove expensive CPU and GPU synchronization

Game Rendering Loop

GPU Driven rendering loop

Encoding Drawcall Example

Executing ICB Example

Split the geometry into chunk of a few hundred triangles and analyze those chunks in a separate compute kernel

Tile-Based Deferred Rendering (TBDR)

High performance, low power
Tightly integrated with Metal
A11 Bionic takes TBDR to the next level

Metal 2 Features on A11

Programmable Blending

Metal provides read-write access to pixels in tile memory
- Implement custom blend operations
Combine passes that read/write the same pixel
- Eliminates system memory bandwidth between passes
- Optimizes deferred shading

Deferred Shading

Imageblocks

Flexible per-pixel storage in tile memory
- Layout declared in shading language
Change pixel layout within a pass
- Merge render passes with different layouts

Deferred Shading

Tile Shading

New programmable stage in the render pass
-Dispatch a configurable threadgroup per tile
Interleave render and compute processing
- Read rendering results directly from tile memory

Tiled Forward Shading

Persistent Threadgroup Memory

Threadgroup memory available to render passes
- Available to fragment and tile shaders
- Buffer contents persistent for the lifetime of tile
Share data across pixels
- Communicate tile-scoped data between draws and tile dispatches

Tile Foraward Shading

Deferred Shading and Multi-Layer Alpha Blending

With changing Imageblocks requires Tile Shading

Multi-Sample Color Coverage Control

MSAA is efficient on A-series GPUs
- Samples stored in tile memory for fast blending and resolves
MSAA is more efficient on A11
- Track unique colors in a pixel
Tile shading gives control over color overage
- Resolve in-place, in fast tile memory

Multi-Sample Rendering with Transparent Particles

With Tile Shading and Color Coverage Control

Fortnite: Battle Royale

Shipping an Unreal Engine 4 console game on iOS with Metal

- Technical Chanllenges

one map, larger than 6km²
Time-of-day, destruction, player-built structures
100 players
50,000+ replicating actors
Crossplay with console and desktop players

One Game, All Platforms

Crossplay means we are limited in out ability to scale down the game
If it affects gameplay, we can’t change it
If a player can hide behind an object, we must render it

Metal

Draw call performance allows us to render complex scenes with 1000 of dynamic objects
Access to hardware features, e.g. programmable blending, allows for big wins on the GPU
Feature set lets us support artist authored materials, physically based rendering, dynamic lighting and shadows, GPU particle simulation

Rendering Features Used on iOS

Movable directional light with cascaded shadow maps
Movable skylight
Physically based matrials
HDR and Tonemapping
GPU particle simulation
Artist authored materials with vertex animation

Scalability

Scalability across platforms
- Minimum LOD for meshes
- Character LODs, animation update rates, etc.
Define 3 performance buckets, assigned devices based on performance
- Low (iPhone 6s, iPad Air 2, iPad min 4)
- Mid (iPhone 7, iPad Pro)
- High (iPhone 8, iPhone X, iPad Pro 2nd Gen)

Resolution

Backbuffer resolution is what the UI renders at, determined by contentScaleFactor
3D resolution is separate, upscaled before rendering UI (adds some cost)
Preferred scaling backbuffer, only scaled 3D separately on iPhone 6s, iPad Air 2, and iPad (5th gen)
Tuned per-device

Shadows

Dynamic shadows have a large impact on both GPU and CPU performance

Foliage

Grass and foliage scale; both density and cull distance

Memory

Memory doesn’t always correlate with performace
- iPhone 8 has less physical memory than iPhone 7+ but is faster
Low memory devices
- No foliage, no shadows, 16k max GPU particles, reduced pool for cosmetics, reduced texture memory pool
High memory devices
- Foliage, shadows, 64k max GPU particles, increased pool for cosmetics, increased texture memory pool

Memory Optimizations

Framerate Targets

30fps at the highest visual fidelity possible
However Maxing out the CPU and GPU generates heat which causes the device to downclock.
We also want to conserve battery life
So we target 60fps for the environment but vsync at 30 fps

Performance Tracking

Capture environment performance at a selection of POIs over time
Daily 100 player playtests to capture dynamic performance
- Key performance stats tracked
- Instrumented profiles gathered from one device
- Replays saved off for later investigation

Threading

Draw Calls

Draw calls were our main performance bottleneck
Metal’s performance really helped here, 3–4x faster than OpenGL, allowed us to ship without more aggressive work to reduce draw calls
Pulled in cull distance on decorative objects
Added HLODs to combine draw calls

Hierarchical LOD

Combine draw calls for a group of meshes
Allow us to render the entire map while skydiving
Even on the ground, POIs are visible from up to 2km away
Added mid-range HLODs to further reduce draw calls in complex scenes like Tilted Towers

Pipeline State Objects

Minimize how many are created at runtime to prevent hitching
Follow best practives
- Compile functions offline
- Build library offline
- Group functions into a single library you can ship with your game
Ideally create all of the PSOs you need at load time
What if the set of PSOs you might need is large?
- Artist authored shaders x Lighting scenarios x RT formats x MSAA x Stencil state (e.g. LOD dithering) x Input layout x Scalability level x and more
Minimize permutations where you can!
- Sometimes a dynamic branch is fine
Identity the most common subset you are likely to need and create those at load
We use automation to run the game and gather PSOs used by device across the map, load cosmetics, fire weapons, etc.
We also store PSOs created during daily playtests
Not perfect, but…
- Number of PSOs created during gameplay is in the single digits on average
- We only create a small subset of the permutation matrix on load

Resource Allocation

Creating resources on the fly can hitch due to streaming or creation of dynamic objects
Treat resource allocation like memory allocation and use the same strategies to minimize “malloc” and “free”
We use a binned allocation strategy for smaller allocations

Programmable Blending

Optimization to reduce the number of resolves and restores for features that need to read the depth buffer, e.g. decals and soft particle blending
During forward pass, write linear depth to alpha channel (we render to FP16 render target)
During decal and translucent passes, if needed, read [[ color(0) ]] .a as linear depth
MSAA resolve happens before postprocessing and tonemapping
Bilinear filtering of HDR values can lead to very aliased edges, e.g. dark ceiling in shadow against a bright sky

Solution

Use programmable blending for the pre- tonemap pass to avoid resolving the MSAA color buffer to memory!

Pre-tonemap before resolve
Perform the normal MSAA resolve
The first postprocessing pass reverses the “pre-tonemap”

Parallel Rendering

On macOS we generate command buffers in parallel
- Not using parallel command encoders right now, separate command buffers easier to integrate in existing parallel rendering architecture
- Parallel command encoders would be needed on iOS
Experiment with parallel rendering on iOS
- For example, rendering thread on fast core versus four encoding threads on efficient cores

Metal Heaps

Use MTLHeap instead of buffer sub-allocation
Would also reduce hitches due to texture streaming
Requires precise fences, particularly which stages read which resources
- Full barriers between passes underutilize the GPU
- Requires some rework of our rendering architecture to make this
bulletproof

← Previous Post Next Post →