Domain Specific Accelerator (DSA) architectures for Signal Processing, Communications and Machine Learning

Date: 20 September 2021
Time: 8:00AM to 10:00AM Pacific time
Speaker: Dr. Kiran Gunnam

This lecture covers an introduction to domain-specific architectures for communications, signal processing, and machine learning systems. Topics include pipelining and parallel processing, retiming, unfolding, folding, systolic architecture design, and algorithmic transformations. The emphasis is on how to design high-speed, low-area, and low-power DSA architectures for a broad range of applications. This lecture also covers the state of the art in machine learning accelerators and emerging opportunities.
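
As a concrete flavour of the architectures covered, the sketch below simulates an output-stationary systolic array computing a matrix product cycle by cycle. It is an illustrative assumption of one classic systolic design (skewed operand streams, one multiply-accumulate per processing element per cycle), not material taken from the lecture itself.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle model of an output-stationary systolic array.

    PE (i, j) accumulates C[i, j] as skewed rows of A flow in from the
    left and skewed columns of B flow in from the top, one hop per cycle.
    """
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    cycles = n + m + k - 2          # last operand reaches the far-corner PE
    for t in range(cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j       # input skew: one-cycle delay per hop
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

# Sanity check against a dense matrix multiply.
A, B = np.random.rand(4, 6), np.random.rand(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```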

LDPC-based Advanced Error Correction Coding Architectures

Date: 20 September 2021
Time: 10:00AM to 12:00PM Pacific time 
Speaker: Dr. Kiran Gunnam

This lecture covers Low-Density Parity-Check (LDPC) code-based advanced error correction coding architectures. LDPC codes have now been firmly established as coding techniques for communication and storage channels. This talk gives an overview of the development of low-complexity iterative LDPC solutions for storage and communication channels. Complexity is reduced by developing new or modified algorithms and new hardware architectures, namely the system-level hardware architecture, statistical buffer management and queuing, the local-global interleaver, the LDPC decoder, and error-floor mitigation schemes.
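
As a toy illustration of the iterative decoding principle behind LDPC codes, the sketch below implements a hard-decision bit-flipping decoder on a small parity-check matrix. The matrix (a Hamming-code check matrix) and the flipping rule are illustrative assumptions; the low-complexity soft-decision architectures discussed in the lecture (e.g. layered min-sum decoders) pass real-valued messages instead, which is where the hardware complexity addressed by the talk arises.

```python
import numpy as np

def bit_flip_decode(H, y, max_iters=50):
    """Minimal hard-decision bit-flipping decoder (illustrative only).

    H: parity-check matrix (0/1 entries), y: received hard bits.
    Each iteration flips the bits involved in the most failing checks.
    """
    x = y.copy()
    for _ in range(max_iters):
        syndrome = H.dot(x) % 2                 # which parity checks fail
        if not syndrome.any():
            return x, True                      # valid codeword found
        unsat = H.T.dot(syndrome)               # failing checks per bit
        x[unsat == unsat.max()] ^= 1            # flip the worst offenders
    return x, False

# Example: (7,4) Hamming-code check matrix, all-zero codeword, bit 0 in error.
H = np.array([[1, 1, 0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0, 1, 0],
              [0, 1, 1, 1, 0, 0, 1]])
received = np.array([1, 0, 0, 0, 0, 0, 0])
print(bit_flip_decode(H, received))             # recovers the all-zero codeword
```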

Hitchhiker's Guide to Wafer-Scale AI

Date: 21 September 2021
Time: 8:00AM to 12:00PM Pacific time
Speaker: Mr. Michael James

Artificial Intelligence is transforming technology. In this talk, I will show how AI made wafer-scale computing a reality. We will take a technical deep dive into the Cerebras CS-1, an AI supercomputer. The processor's architecture will guide us through a survey of computational issues that arise in the deep-learning landscape. We will look at applications ranging from real-time, high-resolution image processing to multi-trillion-parameter language models. We will also look at new AI algorithms inspired by this novel computer architecture and explore the implications that biologically plausible learning algorithms may have for future computer designs.

The Groq Tensor Streaming Processor (TSP) and the Value of Deterministic Instruction Execution

Date: 22 September 2021
Time: 8:00AM to 12:00PM Pacific time
Speaker: Mr. Andrew Ling

The explosion of machine learning and its many applications has motivated a variety of new domain-specific architectures to accelerate these deep-learning workloads. The Groq Tensor Streaming Processor (TSP) is a functionally sliced microarchitecture with memory units interleaved with vector and matrix functional units. This architecture takes advantage of the dataflow locality of deep-learning operations. The TSP is built on two key observations: (1) machine-learning workloads exhibit abundant data parallelism, which can be readily mapped to tensors in hardware, and (2) a deterministic processor with a stream programming model enables precise reasoning about, and control of, hardware components to achieve good performance and power efficiency. The TSP is designed to exploit the parallelism inherent in machine-learning workloads, including instruction-level parallelism, memory concurrency, and data and model parallelism. It guarantees determinism by eliminating all reactive elements in the hardware, such as arbiters and caches. Instruction ordering is entirely software-controlled; the underlying hardware cannot reorder events, and every operation completes in a fixed amount of time. This has several consequences for system design: zero-variance latency, low latency, high throughput at batch size 1, and reduced total cost of ownership (TCO) for data centers with diverse service-level agreements (SLAs). Early ResNet50 image-classification results demonstrate 20.4K processed images per second at a batch size of one, a 4X improvement over other modern GPUs and accelerators. The first ASIC implementation of the TSP architecture yields a computational density of more than 1 TOp/s per square mm of silicon; the TSP is a 25x29mm 14nm chip operating at a nominal clock frequency of 900MHz. In this talk we discuss the TSP and the design implications of its architecture. The talk will cover our work published at ISCA 2020: https://groq.com/isca-2020-conference/
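
To make the deterministic-execution idea concrete, the toy scheduler below assigns fixed issue and completion cycles to a small dependency graph of operations with known latencies. It is a generic as-soon-as-possible scheduling sketch under assumed op names and latencies, not Groq's actual compiler or ISA; it only illustrates why, with no caches or arbiters, end-to-end latency can be computed exactly before execution.

```python
def static_schedule(ops, deps, latency):
    """Toy ASAP scheduler: with fixed latencies and no reactive hardware,
    every completion cycle is known in software before execution starts.

    ops: op names in topological order; deps: op -> list of predecessors;
    latency: op -> fixed cycle count.
    """
    finish = {}
    for op in ops:
        start = max((finish[d] for d in deps.get(op, [])), default=0)
        finish[op] = start + latency[op]
    return finish

# Hypothetical matmul -> bias -> activation pipeline with assumed latencies.
ops = ["load", "matmul", "bias", "relu", "store"]
deps = {"matmul": ["load"], "bias": ["matmul"], "relu": ["bias"], "store": ["relu"]}
latency = {"load": 4, "matmul": 16, "bias": 1, "relu": 1, "store": 4}
print(static_schedule(ops, deps, latency))   # completion cycle of every op, zero variance
```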

Domain Specific Accelerator (DSA) architectures for efficient execution of sparse workloads on FPGA platforms

Date: 23 September 2021
Time: 8:00AM to 12:00PM Pacific time
Speaker: Dr. Abhishek Jain

DSAs for machine learning (ML), such as the Google TPU, Microsoft Brainwave, and Xilinx xDNN, are becoming prominent because of their high energy efficiency and performance. These DSAs perform dense linear algebra efficiently by minimizing data movement and exploiting high data reuse, regular memory access patterns, and data locality. DSAs for domains such as graph analytics and HPC, where most of the computation revolves around sparse linear algebra, specifically Sparse Matrix-Vector Multiplication (SpMV), are emerging at a rapid pace as well. Designing high-performance and energy-efficient DSAs for SpMV is challenging due to highly irregular and random memory access patterns, poor temporal and spatial locality, and very low data-reuse opportunities. Many SpMV DSAs exploit distributed on-chip memory blocks to store vector entries and avoid random off-chip memory accesses. However, a switching network or a crossbar is usually required as a building block for routing the non-zero matrix elements to the on-chip memory blocks. In this presentation, we will discuss the network architectures and switches employed in most SpMV DSAs, our SpMV DSA based on a 2D-mesh network, design choices for the FPGA implementation of the DSA, deployment scenarios, and scalability aspects. For our use case in particular, we will highlight the importance and challenges of achieving energy-efficient data movement using scalable on-chip network architectures.
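
For readers unfamiliar with SpMV, the sketch below shows the computation in compressed sparse row (CSR) form; the gather x[col_idx[k]] is the irregular, data-dependent access that forces SpMV DSAs to route non-zeros through a switching network or crossbar to the on-chip memory banks holding the vector. The CSR layout and example matrix are illustrative and not taken from the presenter's design.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A stored in CSR form (illustrative sketch)."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]   # irregular, data-dependent read
    return y

# Tiny example: the 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form.
values  = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
print(spmv_csr(values, col_idx, row_ptr, np.ones(3)))   # -> [3. 3. 9.]
```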

Efficient Hardware Implementations for Accelerating DNN-based Inference

Date: 24 September 2021
Time: 8:00AM to 12:00PM Pacific time
Speaker: Dr. Partha Maji

When it comes to enabling AI in embedded devices, hardware implementation is viewed as a critical part of any end-user application design effort. To ensure that embedded AI products meet the required functionality, consume little power, and are secure and reliable, device manufacturers face many challenges during the optimisation and design phase. The process starts with model selection and is typically followed by a sequence of selectively chosen optimisation techniques applied to fine-tune the selected model for a particular target architecture. This tutorial focuses on the intricate details of this optimisation process for embedded AI, including a deep understanding of commonly used DNN architectures and a wide variety of implementation strategies for different target hardware platforms, including CPUs, GPUs, and NPUs. The tutorial also highlights open problems and challenges for future research in this area.
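
As one example of the kind of optimisation step alluded to above, the sketch below applies symmetric per-tensor int8 post-training quantisation to a weight matrix. It is a minimal illustration under assumed shapes and a single scale factor; production flows for CPUs, GPUs, and NPUs typically use per-channel scales, calibration data, and fine-tuning to recover accuracy.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantisation of a weight tensor (illustrative)."""
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical layer weights; the reconstruction error is what fine-tuning
# or per-channel scaling would later compensate for.
w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - q.astype(np.float32) * scale).max())
```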
