Groq Wants To Reimagine High Performance Computing

Article originally published by Forbes

One of the more fun aspects of my job is talking with entrepreneurs who are developing disruptive technology. Jonathan Ross is the CEO and founder of Groq, a maker of next-generation chips that specialize in inference processing and accelerating systems with real-time artificial intelligence (AI) and high performance computing (HPC).

Ross previously invented the tensor processing unit (TPU) that drives Google’s machine learning (ML) software. He founded Groq in 2017, sensing an opportunity as the growth of AI presented computational challenges to traditional computing architecture.

Solving AI inference with legacy designs is a band-aid

AI training creates a neural network or ML algorithm using a training dataset. AI inference refers to the process of using a trained neural network model to make a prediction.

The critical observation is that across this workflow, no single processor, whether a CPU, a GPU, a field-programmable gate array (FPGA), or a tensor processing unit (TPU), is an optimal solution. It is a case of one size not fitting all.

That has not stopped folks from trying, whether with 4,000-core CPUs, reconfigurable FPGAs, or GPUs with more capable or independently programmable cores. These tweaks to existing designs deliver marginal improvements and will not satisfy ML’s thirst for compute.

Groq decided to do something radically different, innovating contrary to conventional semiconductor industry wisdom. Ross expanded on Groq’s mission in light of what’s available in the market today. “We decided our mission is to drive the cost of compute to zero. And everyone hated it. But, if you look at the history of compute, that’s what’s happened. When we say, ‘Driving the cost of compute to zero,’ we’re still selling our solutions at a competitive industry price point. But when we are delivering orders of magnitude performance improvements–200x, 600x, 1000x–we’re giving 200, 600, 1000 times the performance per dollar. So, it’s approaching free.”

AI inference has reached a bottleneck

Using existing architectures and connecting many CPUs solves training challenges. AI inference is much more difficult because it is real-time, latency-sensitive, and needs high performance and efficiency.

Over time CPUs have become larger and more complex, with multiple cores, multiple threads, on-chip networks, and control circuitry. Developers tasked with accelerating software performance and output must deal with complicated programming models, security problems, and loss of visibility into compiler control due to layers of processing abstraction. In short, standard computing architectures have hardware features and elements that offer no inference performance advantages.

GPU architectures are designed for DRAM bandwidth and built on multi-data, or multi-task, fixed-structure processing engines. GPUs perform massively parallel processing tasks, but there is memory access latency, and ML is already pushing the limits of external memory bandwidth.

A less complex chip design is the answer

Groq designed a chip, called the tensor streaming processor (TSP), that delivers predictable and repeatable performance with low latency and high throughput across the system.

The new, simpler processing architecture is designed specifically for the performance requirements of machine learning applications and other compute-intensive workloads.

GroqChip1 compared to a typical CPU (Image credit: Groq)

Explaining the approach to the chip design, Ross commented, “It started off with software. Every chip that is made today is made by hardware engineers. That would be like having auto mechanics design cars - everything is about engine optimization. In an ideal case, you’d have a driver design a car and have mechanics help build it. I am a software engineer, and a lot of our team members are software engineers, so we approached building GroqChip from the ground up as a user of the chip, as opposed to someone who is optimizing for the building of the chip. It’s resulted in a very different architecture that is very easy to program.”

This software-defined hardware approach is what makes Groq radically different. The compiler was the team’s sole focus for the first six months, and only after that did the team start working on the chip architecture.

The result is that execution planning happens in software, with Groq™ Compiler controlling the operation of the hardware and freeing valuable silicon space for more processing. The control provided by this architecture leads to predictable performance and better, faster model deployment.

“Customers usually have some code that they’ve been trying to get working on a GPU, FPGA, or CPU–and they can’t get the performance they need. They come to Groq and we are able to get it running quickly and performing well. That’s the power of Groq™ Compiler,” Ross shared.

Groq also offers GroqWare™ Suite, a host of tools designed to simplify the developer experience. A developer typically develops, compiles, and deploys a program to hardware, runs it multiple times, profiles it, and then goes back to optimize and recompile, often numerous times. With Groq, there is no need to deploy it to hardware. Instead, developers can use GroqView™ Profiler, which visualizes compute and memory usage, tightening the development loop. There is also GroqFlow™, a simple interface for getting models running on GroqNode™ servers by adding one line of code.
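To make the “one line of code” idea concrete, here is a minimal sketch of what a GroqFlow-style workflow could look like in Python. It assumes a groqit()-style entry point that takes a model and example inputs; the exact function name, import path, and arguments in the shipped GroqFlow package may differ.

```python
# Illustrative sketch only: assumes a GroqFlow-style entry point called
# groqit(); exact names, import paths, and arguments may differ.
import torch
from groqflow import groqit  # assumed import path

class TinyNet(torch.nn.Module):
    """A small PyTorch model used only to illustrate the workflow."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 10)

    def forward(self, x):
        return self.fc(x)

model = TinyNet()
inputs = {"x": torch.randn(1, 128)}  # example inputs so shapes can be traced

gmodel = groqit(model, inputs)  # the "one line" that builds the model for Groq hardware
outputs = gmodel(**inputs)      # run inference on a GroqNode server
```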

Because Groq Compiler orchestrates everything, data flows in at the right time and place, ensuring calculations occur immediately, with no delay.

The compiler knows how long each instruction takes and tells the hardware precisely what to do and when. There is no abstraction between the compiler and GroqChip. In a traditional architecture, it takes power and time to move data from DRAM into the processor, and the time to process the same workload varies from run to run.

Groq Compiler controls the flow of instructions to the hardware, making processing fast and predictable, so developers can run the same model multiple times on GroqChip and receive precisely the same result each time.

Ross commented, “When developers compile on GroqChip, we share the exact performance, with no variation from run to run.” This kind of deterministic performance is essential to a growing range of applications.
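To see why compile-time scheduling yields this run-to-run determinism, consider the toy sketch below. It is not Groq’s actual compiler; it assumes hypothetical, fixed per-instruction latencies and shows that when every instruction’s start cycle is planned ahead of time, the total run time is known before execution and is identical on every run.

```python
# Toy illustration of compile-time static scheduling (not Groq's actual
# compiler). Latencies below are hypothetical, fixed cycle counts.
LATENCY = {"load": 4, "matmul": 12, "add": 1, "store": 4}

def compile_schedule(program):
    """Assign each instruction a fixed start cycle; the exact total run
    time is known here, before anything executes."""
    schedule, cycle = [], 0
    for op in program:
        schedule.append((cycle, op))
        cycle += LATENCY[op]
    return schedule, cycle

def run(schedule):
    """'Execute' the plan; elapsed cycles always match the plan exactly."""
    return max(start + LATENCY[op] for start, op in schedule)

program = ["load", "matmul", "add", "store"]
schedule, planned = compile_schedule(program)
for _ in range(3):                     # same cycle count on every run
    assert run(schedule) == planned
print(f"planned (and observed) cycles: {planned}")
```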

Solving the challenge of “batch size 1”

Running at a batch size of one, that is, performing inference computations on a single image or other single input at a time, is required for real-time responsiveness in applications such as natural language processing.

Batch size 1 introduces performance and responsiveness complexities for machine learning applications, particularly on conventional GPU-based inference platforms.

The Groq architecture does not experience latency at batch size 1. The single-threaded, single-core architecture in the TSP delivers maximum performance at any batch size. Groq claims the TSP is nearly 2.5 times faster than GPU-based platforms at large batch sizes and 17.6 times faster at batch size 1.
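A rough, back-of-the-envelope illustration of the batch size 1 problem, using made-up numbers rather than measured figures: a batch-oriented engine must either wait for enough requests to fill a batch or run under-filled, while a batch-1 engine processes each request as it arrives.

```python
# Hypothetical numbers, purely to illustrate the batch size 1 trade-off;
# they are not measurements of any real GPU or of GroqChip.
BATCH_SIZE = 64         # batch a batch-oriented engine needs for peak throughput
BATCH_COMPUTE_MS = 8.0  # time to process one full batch
ARRIVAL_GAP_MS = 1.0    # one inference request arrives per millisecond

# Batch-oriented engine: the first request waits for the batch to fill,
# then for the whole batch to be computed.
fill_wait_ms = (BATCH_SIZE - 1) * ARRIVAL_GAP_MS
batched_latency_ms = fill_wait_ms + BATCH_COMPUTE_MS

# Batch-1 engine: each request is processed immediately on arrival.
BATCH1_COMPUTE_MS = 0.5
batch1_latency_ms = BATCH1_COMPUTE_MS

print(f"batch-oriented first-request latency: {batched_latency_ms:.1f} ms")
print(f"batch-1 latency:                      {batch1_latency_ms:.1f} ms")
```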

Supercomputing-as-a-service

Today, running models in the cloud involves requesting a block of time. Since most customers do not know how long their models will take to run, blocks of time are purchased based on a good guess.

This deterministic performance will make supercomputing-as-a-service possible because Groq Compiler can determine the exact time required to complete the modeling task.

This future potential will disrupt the market by providing the ability to bill for precisely the time a job will take rather than for padded estimates, saving customers TCO and resources across the board.
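A small pricing sketch of what exact-time billing could look like, with entirely made-up prices, block sizes, and job lengths: if the compiler can predict a job’s run time exactly, a provider could charge for only that time instead of a pre-purchased block.

```python
import math

# Hypothetical pricing sketch; the rate, block size, and job length are
# made up for illustration and are not Groq pricing.
PRICE_PER_HOUR = 20.00           # assumed price of one accelerator, $/hour

def billed_exact(run_seconds):
    """Bill for exactly the run time the compiler predicted."""
    return PRICE_PER_HOUR * run_seconds / 3600

def billed_block(run_seconds, block_hours=4):
    """Conventional model: reserve whole blocks sized from a 'good guess'."""
    blocks = math.ceil(run_seconds / (block_hours * 3600))
    return PRICE_PER_HOUR * block_hours * blocks

job_seconds = 2_700              # a job the compiler predicts will take 45 minutes
print(f"exact-time billing: ${billed_exact(job_seconds):.2f}")  # $15.00
print(f"block billing:      ${billed_block(job_seconds):.2f}")  # $80.00
```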

With this big vision, Ross shared, “I’ve always said–get comfortable being uncomfortable. Fear is a sign you’re doing something that matters.”

Wrapping up

As a former "chip guy" myself, I find GroqChip and its architecture "a thing of beauty."

We have known for a while that reaping the benefits of AI, innovative infrastructure, and predictive intelligence will require a much simpler and more scalable processing architecture than a legacy solution.

General-purpose CPUs are great at serial workloads, but orchestrating hundreds or even thousands of them comes with significant overhead that consumes most of the gains. ML is parallel processing, so you would think a GPU would shine, but GPUs also carry extraneous hardware that eats into the gains.

Groq understood that a less complex chip design was the answer and cracked the code: GroqChip delivers answers quicker than a CPU, with better throughput and parallel performance than a GPU, and the ability to do one PetaOp, or one quadrillion operations, per second.

That is an astounding fifteen zeros after the one!
