DREAMPlace
DREAMPlace 1.0
GPU Placement via Deep Learning Toolkit
The original paper: cast chip placement as neural network training, run it on a GPU with PyTorch, and get a 40× speedup for free.
A chip netlist has millions of logic cells that need physical coordinates. Finding good ones is NP-hard. DREAMPlace's breakthrough: the math of placement is identical to the math of training a neural network.
Cell (x,y) positions ↔ model weights. Wirelength + density cost ↔ loss function. Gradient ↔ gradient. Nesterov optimizer ↔ Nesterov optimizer. Every GPU kernel PyTorch uses for deep learning now runs placement instead.
Result: a 1-million-cell chip placed in under one minute on an NVIDIA V100, with no quality loss compared to multi-threaded CPU placers.
Every cell is placed at the center of the die with tiny Gaussian noise (0.1% of chip width/height). This is simpler than older "bound-to-bound" methods and 21% faster — with less than 0.04% quality difference.
Each cell's (x, y) is stored as a PyTorch tensor parameter — exactly like a neural network weight — and will be updated by the optimizer each iteration.
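A minimal sketch of this idea in PyTorch (the die dimensions and cell count are made up for illustration): the positions become ordinary trainable tensors that any optimizer can update.

```python
import torch

num_cells, die_w, die_h = 4, 100.0, 100.0    # toy sizes, illustrative only

# Center-of-die initialization with small Gaussian noise (0.1% of die size).
init = torch.tensor([die_w / 2.0, die_h / 2.0]).repeat(num_cells, 1)
noise = torch.randn(num_cells, 2) * torch.tensor([die_w, die_h]) * 0.001
pos = torch.nn.Parameter(init + noise)       # cell (x, y) acts like a model weight

# Any PyTorch optimizer can now update the positions each iteration.
opt = torch.optim.SGD([pos], lr=1.0, momentum=0.9, nesterov=True)
```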
The goal is to minimize total wirelength between connected cells. But the raw half-perimeter wirelength (HPWL) is non-differentiable at its max/min corners, which is bad for gradient descent. DREAMPlace instead uses the weighted-average (WA) wirelength model, a smooth, differentiable proxy.
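A one-dimensional sketch of the weighted-average idea (the γ smoothing value and pin coordinates below are made up): a softmax-weighted average approximates max and min smoothly, so gradients exist everywhere.

```python
import torch

def wa_wirelength_1d(x, gamma=0.5):
    # Smooth max/min of pin coordinates via the weighted-average model:
    # max(x) ≈ sum(x * e^{x/γ}) / sum(e^{x/γ}), and symmetrically for min.
    w_max = torch.softmax(x / gamma, dim=0)
    w_min = torch.softmax(-x / gamma, dim=0)
    return (x * w_max).sum() - (x * w_min).sum()

pins_x = torch.tensor([1.0, 3.0, 7.0], requires_grad=True)
wl = wa_wirelength_1d(pins_x)
wl.backward()                       # differentiable everywhere, unlike raw HPWL
hpwl = pins_x.max() - pins_x.min()  # exact HPWL in x is 6.0; WA is close
```

As γ shrinks, the WA value approaches the exact HPWL, at the cost of sharper (harder to optimize) gradients.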
GPU acceleration: instead of one thread per net (unbalanced), DREAMPlace decomposes nets into per-pin threads with atomic accumulation. This saturates all CUDA cores and yields a 1.9× speedup over naive approaches.
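The per-pin decomposition can be mimicked in NumPy with scattered accumulation, where `np.add.at` plays the role of CUDA's `atomicAdd` (the pin-to-cell map and gradient values below are made up):

```python
import numpy as np

# Each pin computes its gradient contribution independently; results are
# accumulated per owning cell, as the GPU kernel does with atomicAdd.
pin2cell = np.array([0, 0, 1, 2, 2, 2])     # pin -> owning cell (illustrative)
pin_grad = np.array([0.5, 0.5, 1.0, 0.2, 0.2, 0.2])
cell_grad = np.zeros(3)
np.add.at(cell_grad, pin2cell, pin_grad)    # ≈ [1.0, 1.0, 0.6]
```

One thread per pin keeps the work balanced regardless of net size, which is where the reported speedup over one-thread-per-net comes from.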
Minimizing wirelength alone would cram every cell into one point. The density constraint prevents this. DREAMPlace models cells as positively charged particles in an electrostatic field — they naturally repel and spread across the die.
The electric potential field is computed by solving Poisson's equation via 2D DCT/IDCT (the same math as JPEG compression). This is massively parallel on GPU. The N-point FFT variant used is 1.4× faster than the 2N-point version on chip-scale grid sizes.
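A CPU sketch of the spectral Poisson solve (grid size, scaling, and the point-charge demo are illustrative; the real placer runs batched DCTs on the GPU): transform the charge density, divide by the frequency-domain Laplacian eigenvalues, and transform back.

```python
import numpy as np
from scipy.fft import dctn, idctn

def solve_poisson_dct(rho):
    # Solve ∇²φ = -ρ with Neumann-style boundaries via 2D DCT,
    # the spectral approach used by the ePlace/DREAMPlace family.
    M, N = rho.shape
    a = dctn(rho, type=2, norm='ortho')
    u = np.pi * np.arange(M) / M
    v = np.pi * np.arange(N) / N
    denom = u[:, None] ** 2 + v[None, :] ** 2
    denom[0, 0] = 1.0            # avoid divide-by-zero at the DC mode
    phi_hat = a / denom
    phi_hat[0, 0] = 0.0          # potential is defined only up to a constant
    return idctn(phi_hat, type=2, norm='ortho')

rho = np.zeros((64, 64))
rho[32, 32] = 1.0                # a single point charge at the die center
phi = solve_poisson_dct(rho)     # smooth potential peaking at the charge
```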
Both gradients (wirelength + λ·density) are summed and fed into Nesterov's accelerated gradient descent — the same optimizer used in state-of-the-art deep learning. PyTorch's autograd engine handles everything; no manual gradient coding required.
The Lagrangian penalty λ starts small (wirelength dominates) and ramps up (density enforced). This is analogous to curriculum learning in ML: easy objective first, then harder constraints. The loop stops once density overflow drops below 7%.
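The optimization loop can be sketched with stand-in objectives (the real flow uses the WA wirelength and the electrostatic field energy; all constants here are illustrative):

```python
import torch

torch.manual_seed(0)
pos = torch.nn.Parameter(torch.randn(100, 2))        # toy cell positions
opt = torch.optim.SGD([pos], lr=0.1, momentum=0.9, nesterov=True)

lam = 1e-3                                           # density penalty weight
for it in range(100):
    opt.zero_grad()
    wirelength = pos.abs().sum()                     # stand-in for the WA model
    density = (pos ** 2).sum()                       # stand-in for field energy
    loss = wirelength + lam * density
    loss.backward()                                  # autograd: no manual gradients
    opt.step()
    lam *= 1.05                                      # ramp the density penalty
    # (the real flow instead stops once density overflow falls below threshold)
```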
Global placement leaves cells approximately correct but possibly overlapping and off-grid. Legalization snaps every cell to a legal placement site and removes all overlaps. DREAMPlace's GPU legalizer is ~10× faster than NTUplace3's CPU version.
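The simplest ingredient of legalization can be sketched as snapping coordinates to placement sites (real legalizers such as the Abacus-style row placer also resolve overlaps while minimizing total cell movement; the site width and coordinates below are made up):

```python
def snap_to_sites(xs, site_w=1.0):
    # Snap each x-coordinate to the nearest legal placement site boundary.
    return [round(x / site_w) * site_w for x in xs]

snapped = snap_to_sites([0.4, 2.7, 5.1])   # → [0.0, 3.0, 5.0]
```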
Detailed Placement (DP) then does local swaps and moves to squeeze out the last few percent of wirelength savings. In v1.0, this still uses NTUplace3 on CPU — a limitation fixed in v2.0.
Benchmark setup: vs. multi-threaded CPU RePlAce on the ISPD 2005 benchmarks, measured on an NVIDIA Tesla V100. No quality degradation.
DREAMPlace 2.0
GPU-Accelerated Global and Detailed Placement
v2.0 closes the last CPU bottleneck: detailed placement is now fully GPU-accelerated (ABCDPlace), and macro placement is supported. The result is a 14× speedup on the full placement flow.
The GP engine is inherited from v1.0: WAW wirelength, electrostatic density via DCT/IDCT, Nesterov optimizer. All run on GPU via PyTorch tensor ops.
The key architectural improvement: the software is cleanly decoupled into high-level Python algorithms and low-level C++/CUDA operators. Researchers can swap in new solvers (Adam, SGD with momentum) from PyTorch with almost no extra code.
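Because the flow touches the solver only through `zero_grad()` and `step()`, swapping optimizers really is a near-zero-code change; a sketch (the function name and hyperparameters are illustrative):

```python
import torch

pos = torch.nn.Parameter(torch.zeros(10, 2))

def make_optimizer(name, params):
    # Any PyTorch solver drops in behind the same two-call interface.
    if name == "nesterov":
        return torch.optim.SGD(params, lr=0.1, momentum=0.9, nesterov=True)
    if name == "adam":
        return torch.optim.Adam(params, lr=0.01)
    raise ValueError(name)

opt = make_optimizer("adam", [pos])
```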
Real chips have large rectangular macros (SRAMs, analog blocks) that can't simply be snapped to a standard cell row. v2.0 adds a dedicated two-phase legalization: macro legalization first (ignoring standard cells), then standard cell legalization afterward.
Macro orientation is refined greedily: each macro is tried in its four flip orientations, and a flip is kept if it improves HPWL. Passes repeat until the improvement falls below 0.02%.
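The greedy pass can be sketched as follows (the toy cost function stands in for HPWL, and the macro representation is made up for illustration):

```python
def refine_orientations(macros, cost, tol=2e-4):
    # Greedily flip each macro along X and Y, keeping flips that lower the
    # cost, until a full pass improves it by less than tol (0.02%).
    best = cost(macros)
    while True:
        before = best
        for m in macros:
            for axis in ("flip_x", "flip_y"):
                m[axis] = not m[axis]          # try the flip
                c = cost(macros)
                if c < best:
                    best = c                   # keep the flip
                else:
                    m[axis] = not m[axis]      # revert
        if before - best < tol * before:
            return best

# Toy cost: each macro "wants" to be flipped horizontally (illustrative only).
macros = [{"flip_x": False, "flip_y": False} for _ in range(3)]
cost = lambda ms: sum(1.0 for m in ms if not m["flip_x"]) + 1.0
best = refine_orientations(macros, cost)
```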
The biggest addition in v2.0. Detailed Placement refines cell positions with local moves after legalization. v1.0 used CPU NTUplace3 for this, which was the remaining bottleneck.
v2.0 introduces three GPU-parallel DP techniques, collectively called ABCDPlace:
1. Independent Set Matching (ISM): finds non-conflicting cell pairs that can be swapped to reduce wirelength, then swaps all in parallel.
2. Local Reordering: within each row segment, finds the optimal ordering of a small window of cells — batched across all rows simultaneously.
3. Global Swap/Move: identifies globally beneficial cell swaps using a coarse grid search, then executes them in parallel batches.
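Technique 2 can be sketched for a single window (ABCDPlace batches many such windows across all rows in parallel on the GPU; the toy cost function and per-cell target positions are made up):

```python
from itertools import permutations

def best_window_order(xs, widths, net_cost):
    # Exhaustively reorder a small window of cells in a row segment, packing
    # them left-to-right from the window start, and keep the cheapest order.
    start = xs[0]
    best_cost, best_xs = None, list(xs)
    for perm in permutations(range(len(xs))):
        x, placed = start, [0.0] * len(xs)
        for i in perm:
            placed[i] = x                    # cell i sits at the next slot
            x += widths[i]
        c = net_cost(placed)
        if best_cost is None or c < best_cost:
            best_cost, best_xs = c, placed
    return best_xs, best_cost

# Toy example: three equal-width cells whose nets prefer the reversed order.
widths = [2.0, 2.0, 2.0]
targets = [4.0, 2.0, 0.0]                    # made-up ideal positions
cost = lambda p: sum(abs(a - b) for a, b in zip(p, targets))
new_xs, c = best_window_order([0.0, 2.0, 4.0], widths, cost)
```

Window sizes stay small (a handful of cells) so the factorial enumeration is cheap, and every window is independent, which is what makes the batched GPU version effective.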
v2.0 formalizes the software architecture into two distinct layers. High-level algorithms (placement flow, optimizer logic, legalization orchestration) are written in Python — easy to modify and experiment with. Low-level operators (wirelength kernel, density kernel, DCT, legalization primitives) are in C++/CUDA — highly optimized and fast.
By GPU-accelerating DP as well as GP and LG, v2.0 achieves a 14× speedup over 40-thread CPU RePlAce on the full placement flow — with matching solution quality.
The framework also supports both CPU-only and GPU modes, making it accessible on workstations without an NVIDIA GPU, and it reads both Bookshelf and LEF/DEF benchmark formats.
Benchmark setup: vs. RePlAce running 40 CPU threads on the ISPD 2005 benchmarks, measured on an NVIDIA Tesla V100.
DREAMPlace 3.0
Multi-Electrostatics + Region Constraints
v3.0 tackles real industrial chip constraints: voltage islands, fence regions, and high-density designs that break earlier ePlace-based placers. Introduces multi-electrostatics, virtual blockages, and self-adaptive robustness techniques.
Modern chips are not a single flat sea of cells. They have fence regions (voltage islands, power domains) where specific cells must stay within designated rectangular sub-regions. Earlier ePlace-based placers used a single unified electrostatic field — which is completely blind to these boundaries.
Ignoring region constraints during global placement causes catastrophic quality loss at legalization, since the legalizer must then forcibly move huge numbers of cells to satisfy the constraints — destroying all the carefully optimized wirelength.
Additionally, designs with strict density limits (for routability) caused previous ePlace methods to diverge or require tedious manual parameter tuning per benchmark. v3.0 fixes all of this.
The input now includes a set of fence regions — rectangular sub-areas R₁, R₂, … Rₖ. Each movable cell is tagged with its required region. Cells not assigned to any fence region may be placed anywhere in the remaining die area.
The constrained optimization problem is: minimize total wirelength, subject to: (a) density ≤ target in every bin, and (b) cells in region k must stay within region k's boundaries.
The key insight: instead of one global electric field, create K+1 independent electrostatic systems — one for each fence region, plus one for unconstrained cells.
Each system has its own charge density map, potential field (solved via DCT/IDCT), and electric field. Cells in region k only "feel" the repulsion from other cells in region k. This guarantees cells are pushed toward their legal region during optimization.
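The bookkeeping behind the K+1 systems can be sketched as one charge-density map per region, each seeing only its own cells (grid size, die size, and cell data below are made up):

```python
import numpy as np

def region_density_maps(pos, region_of_cell, num_regions,
                        grid=(8, 8), die=(1.0, 1.0)):
    # One density map per electrostatic system: K fence regions plus one
    # (the last index) for unconstrained cells.
    maps = np.zeros((num_regions + 1, *grid))
    for (x, y), r in zip(pos, region_of_cell):
        i = min(int(x / die[0] * grid[0]), grid[0] - 1)
        j = min(int(y / die[1] * grid[1]), grid[1] - 1)
        maps[r, i, j] += 1.0                 # unit charge per cell, for brevity
    return maps

pos = [(0.1, 0.1), (0.9, 0.9), (0.5, 0.5)]
regions = [0, 0, 1]                          # last index = unconstrained system
maps = region_density_maps(pos, regions, num_regions=1)
```

Each map would then be fed through the same DCT/IDCT Poisson solve independently, so cells in one region never feel forces from another.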
Virtual Blockage Insertion: outside each region's boundary, invisible fixed "blockage cells" are placed. These generate repulsive force that pushes the region's real cells inward, preventing them from migrating outside their fence.
High-density constraints (when the chip is packed very tightly) caused earlier ePlace placers to either diverge or require careful manual tuning of the penalty schedule λ. v3.0 replaces the linear penalty with a quadratic density penalty that automatically adjusts its aggressiveness based on current overflow.
Each region gets its own independent density controller — the penalty for region k depends only on region k's overflow, not the global average. This allows tight regions to be aggressively spread without disturbing sparser regions.
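One way to realize this per-region scheme (a sketch only: the constants are made up, and the paper's actual penalty combines linear and quadratic terms with its own self-adaptive coefficients):

```python
def placement_loss(wl, energies, lams):
    # Quadratic density penalty with one multiplier per electrostatic system:
    # loss = WL + Σ_k λ_k · E_k², so the push grows with each region's energy.
    return wl + sum(l * e * e for l, e in zip(lams, energies))

def update_multipliers(lams, overflows, target=0.07, gain=1.05):
    # Each λ_k ramps only while its own region is above the overflow target,
    # so tight regions spread aggressively without disturbing sparse ones.
    return [l * gain if ov > target else l for l, ov in zip(lams, overflows)]

lams = update_multipliers([1.0, 1.0], [0.30, 0.05])   # only region 0 ramps
loss = placement_loss(10.0, [2.0, 0.5], lams)
```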
When strict density constraints force very tight packing, the Nesterov optimizer can get stuck in bad local minima or diverge entirely. v3.0 introduces a divergence detector that monitors the loss curve and automatically intervenes when things go wrong.
Two recovery mechanisms:
Self-Adaptive Rollback: If divergence is detected, the optimizer rolls back to the last checkpoint (the best known state) and reduces the step size.
Entropy Injection: Small random perturbations are added to cell positions to help escape saddle points — similar to adding noise in simulated annealing or using dropout in deep learning.
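The two mechanisms can be sketched together in one guarded optimizer step (the divergence test, noise scale, and toy descent below are illustrative; the real detector watches the loss curve over many iterations):

```python
import copy
import random

def robust_step(state, step_fn, best, noise=0.01):
    # One guarded step: checkpoint the best state seen so far; on divergence,
    # roll back to the checkpoint and inject small random perturbations.
    new_state, loss = step_fn(state)
    if best is None or loss < best[1] * 1.5:          # progress is acceptable
        if best is None or loss < best[1]:
            best = (copy.deepcopy(new_state), loss)   # self-adaptive checkpoint
        return new_state, best
    rolled = copy.deepcopy(best[0])                   # rollback
    perturbed = [x + random.gauss(0.0, noise) for x in rolled]  # entropy injection
    return perturbed, best

# Toy descent on sum(x²); each step shrinks the state toward the origin.
random.seed(0)
step = lambda s: ([x * 0.9 for x in s], sum(x * x for x in s))
state, best = [1.0, -1.0], None
for _ in range(10):
    state, best = robust_step(state, step, best)
```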
Legalization in v3.0 is region-aware: cells are legalized within their assigned fence region boundaries. The row-based Abacus legalizer (from v2.0) is extended to check region membership at every step.
Because global placement already respected region constraints, the legalizer needs to make only small adjustments — preserving most of the wirelength quality from GP. Detailed placement follows using the ABCDPlace GPU DP from v2.0.
Benchmark setup: ISPD 2015 fence-region benchmarks; runtime and HPWL deltas vs. plain DREAMPlace on ICCAD 2014 and ISPD 2019.

