
DREAMPlace Evolution — 1.0 · 2.0 · 3.0
v1.0 — DAC 2019
v2.0 — CSTIC 2020
v3.0 — ICCAD 2020
DAC 2019 · Lin, Pan, Ren, Khailany

DREAMPlace 1.0
GPU Placement via Deep Learning Toolkit

The original paper's pitch: cast chip placement as neural-network training, run it on a GPU with PyTorch, and get a 30–40× global-placement speedup essentially for free.

Venue: DAC 2019
Framework: PyTorch (GPU+CPU)
GP speedup: >30×
DP: NTUplace3 (CPU)
Benchmark: ISPD 2005
The Big Idea

A chip netlist has millions of logic cells that need physical coordinates. Finding good ones is NP-hard. DREAMPlace's breakthrough: the math of placement is identical to the math of training a neural network.

Cell (x,y) positions ↔ model weights. Wirelength + density cost ↔ loss function. Gradient ↔ gradient. Nesterov optimizer ↔ Nesterov optimizer. Every GPU kernel PyTorch uses for deep learning now runs placement instead.

Result: a 1-million-cell chip placed in under one minute on an NVIDIA V100, with no quality loss compared to multi-threaded CPU placers.
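The analogy can be made concrete with a deliberately tiny sketch. This is not DREAMPlace's code: the wirelength here is a quadratic proxy rather than WAW, the density term is a pairwise Gaussian repulsion rather than an electrostatic field, and every constant is invented for illustration — but the shape is the same: positions are the parameters, the loss is wirelength plus λ·density, and gradient descent does the placing.

```python
import numpy as np

# Toy "placement as training" sketch (NOT DREAMPlace's kernels).
rng = np.random.default_rng(0)
pos = 0.5 + rng.normal(0.0, 0.05, size=(4, 2))   # 4 cells near die center
nets = [(0, 1), (1, 2), (2, 3)]                  # 2-pin nets for simplicity

def loss_and_grad(pos, lam=0.5, sigma2=0.02):
    grad = np.zeros_like(pos)
    wl = 0.0
    for a, b in nets:                 # quadratic wirelength proxy
        d = pos[a] - pos[b]
        wl += float(d @ d)
        grad[a] += 2 * d
        grad[b] -= 2 * d
    dens = 0.0                        # Gaussian pairwise repulsion
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = pos[i] - pos[j]
            e = np.exp(-float(d @ d) / sigma2)
            dens += e
            g = -2 * d / sigma2 * e   # gradient of the repulsion term
            grad[i] += lam * g
            grad[j] -= lam * g
    return wl + lam * dens, grad

loss0, _ = loss_and_grad(pos)
for _ in range(300):                  # the "training" loop
    _, g = loss_and_grad(pos)
    pos -= 0.005 * g
loss1, _ = loss_and_grad(pos)         # loss drops as cells spread
```

The real system replaces each hand-written term with a fused CUDA operator and lets PyTorch's autograd stitch them together.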

Step-by-Step Flow
01
Initialization
Random Gaussian Scatter at Center

Every cell is placed at the center of the die with tiny Gaussian noise (0.1% of chip width/height). This is simpler than older "bound-to-bound" methods and 21% faster — with less than 0.04% quality difference.

Cells clustered at center — ready for spreading
Think of dumping all LEGO pieces onto the center of a table. Everything is there — just not organized. The optimizer will spread them.

Each cell's (x, y) is stored as a PyTorch tensor parameter — exactly like a neural network weight — and will be updated by the optimizer each iteration.
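A minimal NumPy sketch of this initial guess (function name and die dimensions are illustrative; in DREAMPlace the result becomes a PyTorch parameter tensor):

```python
import numpy as np

# v1.0-style initialization: every movable cell starts at the die
# center plus tiny Gaussian noise (~0.1% of the die width/height).
def init_positions(n_cells, die_w, die_h, noise_frac=0.001, seed=0):
    rng = np.random.default_rng(seed)
    x = die_w / 2 + rng.normal(0.0, noise_frac * die_w, n_cells)
    y = die_h / 2 + rng.normal(0.0, noise_frac * die_h, n_cells)
    return np.stack([x, y], axis=1)

pos = init_positions(1000, die_w=800.0, die_h=600.0)
```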

02
Global Placement · Cost
Smooth Wirelength Gradient (WAW)

The goal is to minimize total wirelength between connected cells. But the raw half-perimeter wirelength (HPWL) is non-differentiable at bounding-box corners — bad for gradient descent. DREAMPlace instead uses the weighted-average wirelength (WAW) approximation, a smooth, differentiable proxy.

Net bounding boxes — each color is one net

GPU acceleration: instead of one thread per net (unbalanced), DREAMPlace decomposes nets into per-pin threads with atomic accumulation. This saturates all CUDA cores and yields a 1.9× speedup over naive approaches.

The wirelength gradient tells each cell: "your connected neighbors are over there — move closer."
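A one-dimensional sketch of the smooth max/min behind the WAW model: the true max of the pin coordinates is replaced by an exponentially weighted average, with gamma controlling how sharp the approximation is (the shift by the true max/min is a standard numerical-stability trick, not part of the model):

```python
import numpy as np

def wa_wirelength_1d(x, gamma):
    # smooth max: sum(x*exp(x/g)) / sum(exp(x/g)); smooth min mirrors it
    ex = np.exp((x - x.max()) / gamma)       # shift for numerical stability
    smax = float((x * ex).sum() / ex.sum())
    en = np.exp(-(x - x.min()) / gamma)
    smin = float((x * en).sum() / en.sum())
    return smax - smin                       # approximates max(x) - min(x)

x = np.array([0.0, 3.0, 10.0])
print(wa_wirelength_1d(x, gamma=0.1))        # close to the true span of 10
```

Small gamma tracks HPWL tightly but gives steep gradients; large gamma is smoother but underestimates the span — the usual accuracy/smoothness trade-off.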
03
Global Placement · Constraint
Electrostatic Density Spreading (DCT)

Minimizing wirelength alone would cram every cell into one point. The density constraint prevents this. DREAMPlace models cells as positively charged particles in an electrostatic field — they naturally repel and spread across the die.

The electric potential field is computed by solving Poisson's equation via 2D DCT/IDCT (the same math as JPEG compression). This is massively parallel on GPU. The N-point FFT variant used is 1.4× faster than the 2N-point version on chip-scale grid sizes.

Density heatmap — red = overcrowded bins
The density gradient tells each cell: "you're in a crowded bin — move away." Combined with the wirelength gradient, cells seek a balance between being close to their neighbors and being spread evenly.
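The Poisson solve itself fits in a few lines with SciPy's DCT. This is a CPU sketch with assumed Neumann (reflective) boundary conditions, not DREAMPlace's CUDA kernel; grid spacing is absorbed into the discrete eigenvalues:

```python
import numpy as np
from scipy.fft import dctn, idctn

def solve_poisson_dct(rho):
    """Solve -laplacian(phi) = rho on a grid via 2D DCT-II."""
    n, m = rho.shape
    a = dctn(rho, type=2, norm='ortho')
    # eigenvalues of the 5-point Laplacian under the DCT-II basis
    ku = 2.0 - 2.0 * np.cos(np.pi * np.arange(n) / n)
    kv = 2.0 - 2.0 * np.cos(np.pi * np.arange(m) / m)
    denom = ku[:, None] + kv[None, :]
    denom[0, 0] = 1.0                 # avoid divide-by-zero at DC
    phi_hat = a / denom
    phi_hat[0, 0] = 0.0               # potential is defined up to a constant
    return idctn(phi_hat, type=2, norm='ortho')
```

Because the transform diagonalizes the Laplacian, the whole solve is two transforms plus an elementwise divide — exactly the kind of work a GPU eats for breakfast.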
04
Global Placement · Optimizer
Nesterov Accelerated Gradient Loop

Both gradients (wirelength + λ·density) are summed and fed into Nesterov's accelerated gradient descent — the same optimizer used in state-of-the-art deep learning. PyTorch's autograd engine handles everything; no manual gradient coding required.

Cells spreading toward equilibrium (animated)

The Lagrangian penalty λ starts small (wirelength dominates) and ramps up (density enforced). This is analogous to curriculum learning in ML — easy objective first, then harder constraints. Loop stops when density overflow < 7%.

A 1-million-cell design converges in roughly one minute on a Tesla V100.
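The loop's shape, on a toy 1-D objective (the λ ramp and the overflow-based stopping test from the real flow are omitted, and all constants are assumed):

```python
# Minimal Nesterov accelerated gradient loop.
def nesterov_minimize(grad, x0, lr=0.1, mu=0.9, steps=100):
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x + mu * v)      # gradient at the look-ahead point
        v = mu * v - lr * g       # momentum update
        x = x + v                 # parameter (cell position) update
    return x

# toy objective: minimize (x - 3)^2, whose gradient is 2(x - 3)
xstar = nesterov_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

In DREAMPlace the same update runs over millions of coordinates at once, with `grad` supplied by the fused wirelength and density CUDA operators.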
05
Post-Processing
Legalization + Detailed Placement

Global placement leaves cells approximately correct but possibly overlapping and off-grid. Legalization snaps every cell to a legal placement site and removes all overlaps. DREAMPlace's GPU legalizer is ~10× faster than NTUplace3's CPU version.

Before / after legalization

Detailed Placement (DP) then does local swaps and moves to squeeze out the last few percent of wirelength savings. In v1.0, this still uses NTUplace3 on CPU — a limitation fixed in v2.0.

GP = finding the right neighborhood. LG = parking in the exact spot. DP = optimally rearranging all the parked cars.
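A toy flavor of row-based legalization — greedy left-to-right packing on a site grid. This is a sketch of the general idea, not NTUplace3's or Abacus's actual algorithm:

```python
# Legalize one row: sort cells by target x, snap each to the nearest
# site, and shift right just enough to remove overlap with the
# previously placed cell.
def legalize_row(targets, widths, site=1.0):
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    placed, cursor = {}, 0.0
    for i in order:
        x = max(cursor, round(targets[i] / site) * site)
        placed[i] = x
        cursor = x + widths[i]
    return [placed[i] for i in range(len(targets))]

print(legalize_row([2.3, 2.1, 5.9], [1.0, 1.0, 2.0]))  # → [3.0, 2.0, 6.0]
```

The GPU legalizer parallelizes this kind of work across rows; the per-row logic stays simple.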
Performance
Global Placement
40× faster
40×
Legalization
10×
10×
Full Flow
~20×
~20×

vs. multi-threaded CPU RePlAce on ISPD 2005, NVIDIA Tesla V100. No quality degradation.

CSTIC 2020 · Lin, Pan, Ren, Khailany

DREAMPlace 2.0
GPU-Accelerated Global and Detailed Placement

v2.0 closes the last CPU bottleneck: detailed placement is now fully GPU-accelerated (ABCDPlace), and movable macros are supported. The result is a 14× speedup on the full flow.

Venue: CSTIC 2020
New: GPU Detailed Placement
New: Movable Macros
Full flow speedup: 14×
Benchmark: ISPD 2005
What's New in 2.0
GPU Detailed Placement (ABCDPlace)
Three batch-parallel DP techniques replace the CPU NTUplace3 bottleneck: independent set matching, local reordering, and global swap. All parallelized on GPU.
🔲
Movable Macro Support
v1.0 only placed standard cells. v2.0 handles large movable macros (memories, IPs) in GP and LG with dedicated macro legalization algorithms.
🔁
Abacus Legalizer
A row-based Abacus legalization algorithm is added alongside the greedy NTUplace3 legalizer, giving more choices for different design styles.
📦
LEF/DEF Format Support
Industrial LEF/DEF file formats now supported in addition to Bookshelf, making it practical to use on real production netlists.
Full Placement Flow — Step by Step
01
Global Placement (unchanged core)
Electrostatic GP + Nesterov Optimizer

The GP engine is inherited from v1.0: WAW wirelength, electrostatic density via DCT/IDCT, Nesterov optimizer. All run on GPU via PyTorch tensor ops.

The key architectural improvement: the software is cleanly decoupled into high-level Python algorithms and low-level C++/CUDA operators. Researchers can swap in new solvers (Adam, SGD with momentum) from PyTorch with almost no extra code.

v2.0's philosophy: decouple the algorithm from the hardware kernel. This dramatically reduces coding overhead when adding new features.
02
New — Macro Handling
Two-Phase Macro Legalization

Real chips have large rectangular macros (SRAMs, analog blocks) that can't simply be snapped to a standard cell row. v2.0 adds a dedicated two-phase legalization: macro legalization first (ignoring standard cells), then standard cell legalization afterward.

Macros (large) vs. standard cells (small) coexisting
Macros are like parked trucks in a parking lot. You park the trucks first, then fit all the cars around them.

Macro orientation is then refined greedily: each macro is tried in its flipped orientations (about x, y, and both); a flip is kept whenever it reduces HPWL. The pass repeats until the improvement falls below 0.02%.
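The greedy orientation pass can be sketched as follows. The data model here is an assumption for illustration: pin offsets inside a w-by-h macro, each pin wired to one fixed sink, whereas DREAMPlace evaluates HPWL over full nets:

```python
def hpwl(points):
    xs = [p[0] for p in points]; ys = [p[1] for p in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def flip(offsets, fx, fy, w, h):
    # mirror pin offsets inside a w-by-h macro about x and/or y
    return [((w - x) if fx else x, (h - y) if fy else y) for x, y in offsets]

def refine_orientation(origin, w, h, pin_offsets, sinks, tol=2e-4):
    ox, oy = origin
    best = pin_offsets
    best_cost = sum(hpwl([(ox + px, oy + py), s])
                    for (px, py), s in zip(best, sinks))
    improved = True
    while improved:                       # keep flipping while HPWL drops
        improved = False
        for fx, fy in [(1, 0), (0, 1), (1, 1)]:
            cand = flip(best, fx, fy, w, h)
            cost = sum(hpwl([(ox + px, oy + py), s])
                       for (px, py), s in zip(cand, sinks))
            if cost < best_cost * (1 - tol):   # stop below 0.02% gain
                best, best_cost, improved = cand, cost, True
    return best, best_cost

orient, cost = refine_orientation((0.0, 0.0), 10.0, 2.0,
                                  [(0.0, 0.0)], [(20.0, 0.0)])
# flipping the macro about x moves its pin next to the sink
```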

03
New — GPU Detailed Placement
ABCDPlace: Batch-Parallel DP

The biggest addition in v2.0. Detailed Placement refines cell positions with local moves after legalization. v1.0 used CPU NTUplace3 for this, which was the remaining bottleneck.

v2.0 introduces three GPU-parallel DP techniques, collectively called ABCDPlace:

Three DP techniques running in parallel batches on GPU

1. Independent Set Matching (ISM): finds non-conflicting cell pairs that can be swapped to reduce wirelength, then swaps all in parallel.
2. Local Reordering: within each row segment, finds the optimal ordering of a small window of cells — batched across all rows simultaneously.
3. Global Swap/Move: identifies globally beneficial cell swaps using a coarse grid search, then executes them in parallel batches.

These three techniques were previously sequential CPU operations. ABCDPlace runs them all in parallel across thousands of GPU threads, achieving ~16× speedup over the CPU equivalent.
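Local reordering is the easiest of the three to sketch: brute-force the best ordering of a small window of cells. This toy handles one window with an assumed single-anchor wirelength model; ABCDPlace batches thousands of such windows across GPU threads:

```python
from itertools import permutations

# slots: fixed x positions inside one row window;
# anchor[c]: the x coordinate of cell c's other (fixed) pin.
def best_window_order(slots, cells, anchor):
    best, best_cost = None, float('inf')
    for perm in permutations(cells):          # feasible for small windows
        cost = sum(abs(x - anchor[c]) for x, c in zip(slots, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best), best_cost

order, cost = best_window_order([0.0, 1.0, 2.0], ['a', 'b', 'c'],
                                {'a': 2.0, 'b': 0.0, 'c': 1.0})
```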
04
Software Architecture
Python + C++/CUDA Decoupled Stack

v2.0 formalizes the software architecture into two distinct layers. High-level algorithms (placement flow, optimizer logic, legalization orchestration) are written in Python — easy to modify and experiment with. Low-level operators (wirelength kernel, density kernel, DCT, legalization primitives) are in C++/CUDA — highly optimized and fast.

Software architecture: Python orchestrates CUDA kernels
This is exactly how PyTorch itself is built: a Python API on top of C++/CUDA. DREAMPlace inherits this design so researchers can iterate on algorithms without touching GPU code.
05
Results
14× Speedup Across Entire Flow

By GPU-accelerating DP as well as GP and LG, v2.0 achieves a 14× speedup over 40-thread CPU RePlAce on the full placement flow — with matching solution quality.

The framework also supports both CPU-only and GPU modes, making it accessible on workstations without an NVIDIA GPU. All benchmarks pass on Bookshelf and LEF/DEF formats.

The 14× speedup over 40 CPU threads means DREAMPlace 2.0 would, under perfect linear scaling, match a hypothetical 560-thread CPU machine.
Performance vs v1.0
Global Placement
40× (inherited)
40×
Detailed Placement
16× new!
16×
Legalization
10× (improved)
10×
Full Flow
14× vs 40 threads
14×

vs. RePlAce running 40 CPU threads, ISPD 2005 benchmarks, NVIDIA Tesla V100.

ICCAD 2020 · Gu, Jiang, Lin, Pan

DREAMPlace 3.0
Multi-Electrostatics + Region Constraints

v3.0 tackles real industrial chip constraints: voltage islands, fence regions, and high-density designs that break earlier ePlace-based placers. Introduces multi-electrostatics, virtual blockages, and self-adaptive robustness techniques.

Venue: ICCAD 2020
New: Region Constraints (Fence)
New: Multi-Electrostatics System
New: Self-Adaptive Optimizer
HPWL improvement: >13%
The Problem v3.0 Solves

Modern chips are not a single flat sea of cells. They have fence regions (voltage islands, power domains) where specific cells must stay within designated rectangular sub-regions. Earlier ePlace-based placers used a single unified electrostatic field — which is completely blind to these boundaries.

Ignoring region constraints during global placement causes catastrophic quality loss at legalization, since the legalizer must then forcibly move huge numbers of cells to satisfy the constraints — destroying all the carefully optimized wirelength.

Additionally, designs with strict density limits (for routability) caused previous ePlace methods to diverge or require tedious manual parameter tuning per benchmark. v3.0 fixes all of this.

What's New in 3.0
🏗️
Multi-Electrostatics System
Instead of one unified field, each fence region gets its own independent electric field. Cells in region k only interact with other cells in region k.
🧱
Virtual Blockage Insertion
Invisible "blockages" are inserted outside each region's boundary to prevent cells from leaking out during optimization.
🔄
Self-Adaptive Rollback
If the optimizer detects divergence, it automatically rolls back to the last good state and injects entropy (noise) to escape. No manual tuning needed.
📐
Quadratic Density Penalty
A smarter density penalty function that automatically accelerates convergence under aggressive density targets without manual λ scheduling.
Step-by-Step Flow
01
Problem Formulation
Region-Constrained Placement Setup

The input now includes a set of fence regions — rectangular sub-areas R₁, R₂, … Rₖ. Each constrained movable cell is tagged with its required region; cells without a fence assignment may be placed anywhere in the default region.

Chip die divided into fence regions (voltage islands)
Think of the chip as a city with zoning laws. Some buildings must stay in their zoning district. The placer must respect these zones while still minimizing total road length.

The constrained optimization problem is: minimize total wirelength, subject to: (a) density ≤ target in every bin, and (b) cells in region k must stay within region k's boundaries.
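In symbols (notation assumed; WL(·) is any smooth wirelength model such as WAW):

```latex
\min_{x,\,y} \; \sum_{e \in E} \mathrm{WL}(e;\, x, y)
\quad \text{s.t.} \quad
\rho_b(x, y) \le \rho_{\text{target}} \;\; \forall\, \text{bins } b,
\qquad
(x_i, y_i) \in R_k \;\; \forall\, i \in \mathrm{cells}(R_k),\; k = 1, \dots, K
```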

02
Core Innovation
Multi-Electrostatics + Field Isolation

The key insight: instead of one global electric field, create K+1 independent electrostatic systems — one for each fence region, plus one for unconstrained cells.

Each system has its own charge density map, potential field (solved via DCT/IDCT), and electric field. Cells in region k only "feel" the repulsion from other cells in region k. This guarantees cells are pushed toward their legal region during optimization.

Separate electrostatic fields per region — no cross-region interaction

Virtual Blockage Insertion: outside each region's boundary, invisible fixed "blockage cells" are placed. These generate repulsive force that pushes the region's real cells inward, preventing them from migrating outside their fence.

The unified field of v1.0/v2.0 was like one big room where everyone mingles. v3.0 creates separate rooms with walls — cells in room k can't push cells in room j, and can't wander into room j either.
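A sketch of the per-region bookkeeping behind this. The data model is assumed for illustration: unit-charge cells on a unit die, one density grid per fence region plus grid 0 for unconstrained cells; each grid would then feed its own Poisson solve:

```python
import numpy as np

def region_density_maps(pos, region_of, n_regions, grid=8, die=1.0):
    # one independent charge-density grid per region (+1 for unconstrained)
    maps = np.zeros((n_regions + 1, grid, grid))
    for (x, y), r in zip(pos, region_of):
        i = min(int(x / die * grid), grid - 1)
        j = min(int(y / die * grid), grid - 1)
        maps[r, i, j] += 1.0          # unit charge per cell (toy)
    return maps

maps = region_density_maps([(0.1, 0.1), (0.9, 0.9)], [1, 2], n_regions=2)
```

Because each map is solved separately, a cell in region 1 never feels repulsion from a cell in region 2 — the "separate rooms" of the analogy above.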
03
Robustness Innovation
Self-Adaptive Quadratic Density Penalty
+

High-density constraints (when the chip is packed very tightly) caused earlier ePlace placers to either diverge or require careful manual tuning of the penalty schedule λ. v3.0 replaces the linear penalty with a quadratic density penalty that automatically adjusts its aggressiveness based on current overflow.

Quadratic penalty ramp — steeper at high density overflow

Each region gets its own independent density controller — the penalty for region k depends only on region k's overflow, not the global average. This allows tight regions to be aggressively spread without disturbing sparser regions.

Analogy: instead of turning up heat in the whole building to heat one cold room, each room has its own thermostat.
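A minimal sketch of per-region multiplier control with a quadratic response. The update rule and gain here are assumptions for illustration, not the paper's exact formula — the point is only that each λₖ responds to its own region's overflow:

```python
# Each region's penalty multiplier grows with the square of that
# region's own density overflow, independently of the others.
def update_multipliers(lams, overflows, gain=2.0):
    return [lam * (1.0 + gain * ov ** 2) for lam, ov in zip(lams, overflows)]

# a tightly packed region (50% overflow) vs. a sparse one (5%)
lams = update_multipliers([1.0, 1.0], [0.5, 0.05])
```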
04
Robustness Innovation
Divergence-Aware Optimizer with Entropy Injection

When strict density constraints force very tight packing, the Nesterov optimizer can get stuck in bad local minima or diverge entirely. v3.0 introduces a divergence detector that monitors the loss curve and automatically intervenes when things go wrong.

Two recovery mechanisms:

Self-Adaptive Rollback: If divergence is detected, the optimizer rolls back to the last checkpoint (the best known state) and reduces the step size.

Entropy Injection: Small random perturbations are added to cell positions to help escape saddle points — similar to adding noise in simulated annealing or using dropout in deep learning.

In ML terms: this is like having a learning rate scheduler + gradient noise injection that activates automatically when training goes unstable — no human tuning needed.
Result: ~10% runtime improvement and ~1% HPWL improvement on ICCAD 2014 and ISPD 2019 benchmarks compared to vanilla DREAMPlace — purely from better optimization stability.
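The two mechanisms can be sketched as one step function. The spike threshold, step-halving factor, and noise scale are all assumed constants, not the paper's values:

```python
import random

# Divergence-aware recovery: checkpoint the best state seen; on a loss
# spike, roll back to it, halve the step size, and inject a little
# random noise ("entropy") to escape the bad basin.
def robust_step(state, loss, best, lr, spike=1.5, noise=0.01, rng=None):
    rng = rng or random.Random(0)
    if best is not None and loss > spike * best[1]:
        x = [xi + rng.uniform(-noise, noise) for xi in best[0]]  # rollback + noise
        return x, best, lr * 0.5                                 # shrink step
    if best is None or loss < best[1]:
        best = (list(state), loss)                               # new checkpoint
    return state, best, lr

state, best, lr = robust_step([1.0], 0.5, None, 0.1)   # checkpoint taken
state, best, lr = robust_step([5.0], 2.0, best, lr)    # spike -> rollback
```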
05
Post-Processing
Region-Aware Legalization + DP

Legalization in v3.0 is region-aware: cells are legalized within their assigned fence region boundaries. The row-based Abacus legalizer (from v2.0) is extended to check region membership at every step.

Cells legalized within their assigned fence regions

Because global placement already respected region constraints, the legalizer needs to make only small adjustments — preserving most of the wirelength quality from GP. Detailed placement follows using the ABCDPlace GPU DP from v2.0.

v3.0 achieves >13% HPWL improvement and >11% overflow reduction on ISPD 2015 region-constrained benchmarks vs. the best prior region-aware placers (Eh?Placer and NTUplace4dr).
Quality vs. Prior Region-Aware Placers
HPWL improvement
>13% vs Eh?Placer
>13%
Overflow reduction
>11% top5
>11%
Runtime (vs plain DREAMPlace)
~10% faster
~10%
HPWL (vs plain DREAMPlace)
~1% better
~1%

ISPD 2015 fence-region benchmarks. Runtime/HPWL delta vs. plain DREAMPlace on ICCAD 2014 / ISPD 2019.
