Featured Project I

Reliability vs. Performance

Support Ostrich –> Parity data
Athlete Ostrich –> Regular data

The ostrich is awesome: it is among the fastest animals in the world. The image shows an ostrich race in which a support ostrich runs alongside the athletes to keep them safe. But if the track has only a few lanes, letting the support ostrich occupy an entire lane is wasteful.

ostrich race

Reliability Can Benefit Performance

Memory and processors communicate through a channel, and the bandwidth of this channel largely determines memory performance. To make memory reliable, modern systems also send extra protection information, called parity or ECC data, over the same channel. This is analogous to the ostrich race: the track lanes are the memory channel, the athlete ostriches are regular data, and the support ostrich is parity data. The channel is already limited, yet part of it carries parity instead of regular data. In DDR5, parity can add about 25% bandwidth overhead, a high cost!
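The 25% figure can be checked with a quick back-of-the-envelope calculation. The sketch below assumes a DDR5 ECC subchannel with 32 data lanes and 8 parity lanes; these widths and the function name `parity_overhead` are illustrative assumptions, not details taken from the project itself.

```python
from fractions import Fraction


def parity_overhead(data_bits: int, parity_bits: int) -> Fraction:
    """Return parity width relative to data width (hypothetical helper)."""
    if data_bits <= 0 or parity_bits < 0:
        raise ValueError("data_bits must be positive and parity_bits non-negative")
    return Fraction(parity_bits, data_bits)


# Assumed lane counts for one DDR5 ECC subchannel (not stated in the text).
DATA_BITS = 32
PARITY_BITS = 8

overhead = parity_overhead(DATA_BITS, PARITY_BITS)
print(f"parity overhead: {float(overhead):.0%}")
```

Under these assumed widths, one lane in five carries parity, which is a 25% overhead relative to the lanes that carry regular data.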

Contribution

Most recent memory technologies, such as DDR5, HBM3, and memory-centric accelerators, focus on increasing total bandwidth. But another important question is often overlooked: are we using the available bandwidth efficiently?

Our work starts from this question. Instead of treating reliability as something that only consumes resources, we ask whether reliability-related bandwidth can be reclaimed and put to better use. One key idea is to reorganize parity in a more compact manner. By reducing the bandwidth needed for parity, we free up part of the channel for regular data transfer. In theory, this can improve bandwidth utilization by up to 1.25× (the bound reached if the full 25% parity overhead were reclaimed), while still maintaining reliability and adding negligible hardware overhead.
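To see where the 1.25× ceiling comes from, here is a rough model. It assumes a channel with 32 data lanes and 8 parity lanes (as in a DDR5 ECC subchannel) and a hypothetical parameter `reclaimed_fraction` describing how much of the parity bandwidth is handed back to regular data. It is only an upper-bound estimate under those assumptions, not a description of the actual design.

```python
def data_bandwidth_gain(data_bits: int, parity_bits: int,
                        reclaimed_fraction: float) -> float:
    """Estimate the data-bandwidth gain when part of the parity share is reclaimed.

    reclaimed_fraction is a hypothetical knob: 0.0 means parity keeps all of
    its lanes, 1.0 means every parity lane is reused for regular data.
    """
    if data_bits <= 0 or parity_bits < 0:
        raise ValueError("data_bits must be positive and parity_bits non-negative")
    if not 0.0 <= reclaimed_fraction <= 1.0:
        raise ValueError("reclaimed_fraction must be between 0 and 1")
    return (data_bits + reclaimed_fraction * parity_bits) / data_bits


# Assumed lane counts (not stated in the text): 32 data + 8 parity.
for fraction in (0.0, 0.5, 1.0):
    gain = data_bandwidth_gain(32, 8, fraction)
    print(f"{fraction:.0%} of parity bandwidth reclaimed -> {gain:.3f}x")
```

In this model the gain reaches 1.25× only when all parity bandwidth is reclaimed. Because some parity must still be transferred to maintain reliability, a real design would land below that bound.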

ASPA design

Limitations

  • No design is perfect!
  • This design works best for sequential memory accesses (e.g., reading a 4KB memory page) and is less suitable for highly random memory accesses.
  • This design also works best for DRAM chips with narrower I/O widths; other chip types may require more hardware changes.
  • We are currently exploring additional solutions that can support broader memory systems and access patterns.

Takeaway

Reliability is often viewed as a cost: we spend extra resources to make systems safer. But reliability can also be a source of opportunity. Our solution is to repurpose reliability resources for performance improvement.