Microprocessor Report (MPR)

Nvidia Shares Its Deep Learning

Xavier Neural-Network Accelerator Now Available as Open Source

March 26, 2018

By Mike Demler


Designers no longer need to worry about the costs of deep-learning acceleration: Nvidia is making the technology available for free. The company has extracted the deep-learning accelerator (NVDLA) from its Xavier autonomous-driving processor and is offering it for use under a royalty-free open-source license. It’s managing the NVDLA project as a directed community, which it supports with comprehensive documentation and instructions. Users can also download NVDLA hardware and software components from GitHub. Nvidia delivers the NVDLA core as synthesizable Verilog RTL code, along with a step-by-step SoC-integrator manual, a run-time engine, and a software manual.

The company’s strategy in creating the open-source project is to foster more-widespread adoption of neural-network inference engines. It thereby expects to benefit from greater demand for its expensive GPU-based training platforms. Most neural-network developers train their models on Nvidia GPUs, and many use the Cuda deep-neural-network (cuDNN) library and software-development kit (SDK) to run models built in Caffe2, PyTorch, TensorFlow, and other popular frameworks.

The NVDLA is configurable for applications ranging from tiny IoT devices to image-processing inference engines in self-driving cars, but Nvidia’s first RTL release is the “full” model, which is similar to the unit in Xavier. It includes 2,048 INT8 multiply-accumulators (MACs), but they’re configurable at run time as 1,024 INT16 or FP16 units. In a 16nm design optimized to run the ResNet-50 neural network, the full model processes 269 frames per second (fps) and consumes 291mW on average. This performance is roughly half that of the previous-generation Titan-X GPU, which burns 250W (TDP). The company thinks the full NVDLA model is now ready for design prototyping, but it expects to make changes before declaring it tapeout ready—likely in 2Q18.

Also next quarter, Nvidia plans to offer early access to a small NVDLA version that integrates 64 fixed-configuration INT8 MACs. This design can process 7fps on ResNet-50, but it consumes just 17mW (average). The full and small models are just two end-point examples of the accelerator’s configurability, and designers are free to fine-tune the architecture.

Easing Integration

As Figure 1 shows, the NVDLA works with a host CPU (headless mode) or an attached microcontroller (headed mode), either of which handles fine-grain task scheduling for the accelerator’s function units. Although a dedicated controller increases performance by offloading most tasks from the host CPU, the host must still handle coarse-grain scheduling on the NVDLA hardware. For example, it must synchronize with other system components that run tasks on the inference engine. It’s also responsible for allocating the memory that stores input data and neural-network weights, as well as for mapping the NVDLA’s access to external DRAM through a memory-management unit (MMU).

 

Figure 1. Small and large NVDLA configurations. DBBIF=data-backbone interface. For maximum performance, the inference engine can run with an attached microcontroller. For small, low-power designs, it can communicate directly with the host CPU.

The NVDLA allows each layer function block to run independently, or it can fuse operations into a higher-performance pipeline. In the independent mode, each layer incurs the latency required to read input data and write results to memory. In the fused mode, the function blocks communicate through small FIFOs, avoiding the round-trip to memory. The inference engine runs neural-network graphs by executing a sequence of command-execute-interrupt operations. The microcontroller or host CPU issues the commands that activate the function units in each layer (or group of layers), and when those units complete their operations, they send back an interrupt to indicate readiness for the next set of operations. If no interlayer dependencies exist, the units can operate independently.
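As a rough illustration of that command-execute-interrupt pattern, the following C sketch walks a layer list, issuing a command per layer and waiting for the completion interrupt only when the next layer depends on the current one; the function and type names are hypothetical, not part of the NVDLA software.

```c
/* Hypothetical sketch of the command-execute-interrupt flow described above.
 * The function names and layer structure are illustrative, not the real
 * NVDLA driver API. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    const char *name;
    bool        depends_on_previous;  /* interlayer dependency */
} layer_t;

/* Stand-ins for writes to the accelerator's registers and for the
 * completion interrupt; a real driver would touch hardware here. */
static void issue_command(const layer_t *l)      { printf("command: %s\n", l->name); }
static void wait_for_interrupt(const layer_t *l) { printf("interrupt: %s done\n", l->name); }

static void run_graph(const layer_t *layers, int n)
{
    for (int i = 0; i < n; i++) {
        issue_command(&layers[i]);           /* activate the function unit(s) */
        if (i + 1 < n && !layers[i + 1].depends_on_previous)
            continue;                        /* independent units may overlap */
        wait_for_interrupt(&layers[i]);      /* block only on a dependency */
    }
}

int main(void)
{
    layer_t graph[] = {
        { "conv1", false }, { "relu1", true }, { "pool1", true },
    };
    run_graph(graph, 3);
    return 0;
}
```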

Each NVDLA function block integrates double-buffered configuration registers, enabling it to preload the next layer so it can begin execution as soon as the current layer operations complete. Small NVDLA configurations connect to external memory through a single AXI interface, but larger configurations use two. The first interface allows access to shared system memory; the second is optional, but high-performance configurations will use it to access a dedicated high-bandwidth SRAM. Although designers must use a third-party memory compiler to build the necessary SRAMs, the Verilog code includes a SystemC behavioral model of the memories that’s fit for simulation.
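A minimal sketch of the double-buffering idea follows, assuming a hypothetical pair of register groups so software can preload layer N+1 while the hardware executes layer N; the field names and group count are illustrative.

```c
/* Illustrative sketch of double-buffered (ping-pong) register groups:
 * while one group drives the current layer, software programs the other
 * with the next layer. Register names and sizes are hypothetical. */
#include <stdint.h>

#define NUM_REG_GROUPS 2

typedef struct {
    uint32_t regs[16];   /* per-layer configuration (sizes, strides, addresses) */
    int      busy;       /* set while the hardware executes from this group */
} reg_group_t;

static reg_group_t groups[NUM_REG_GROUPS];

/* Program the idle register group while the other one is executing. */
int preload_next_layer(const uint32_t *layer_cfg, int nregs)
{
    for (int g = 0; g < NUM_REG_GROUPS; g++) {
        if (!groups[g].busy) {
            for (int r = 0; r < nregs && r < 16; r++)
                groups[g].regs[r] = layer_cfg[r];
            return g;    /* hardware can switch to this group on completion */
        }
    }
    return -1;           /* both groups in use; wait for an interrupt */
}
```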

The NVDLA uses three interfaces to connect to other system components. The connections include a one-bit interrupt signal, which the neural-network layer function blocks assert when tasks complete or errors occur. The synchronous 32-bit configuration-space bus (CSB) allows a CPU to access the NVDLA configuration registers. To connect the CSB to Amba and other standard buses, however, designers must add another shim-logic layer. The data-backbone interface (DBBIF) is a configurable AXI-compatible port that links to system memory.
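The following fragment sketches how a host might reach those registers through such a memory-mapped shim; the base address, offsets, and write-1-to-clear behavior are assumptions for illustration, not the documented CSB register map.

```c
/* Sketch of host access to NVDLA configuration registers over the CSB,
 * assuming an AXI/APB-to-CSB shim that exposes the registers as a
 * memory-mapped window. The base address and offsets are hypothetical. */
#include <stdint.h>

#define NVDLA_CSB_BASE  0x40000000u    /* hypothetical SoC address map */

static volatile uint32_t *csb = (volatile uint32_t *)NVDLA_CSB_BASE;

static inline void csb_write(uint32_t offset, uint32_t value)
{
    csb[offset / 4] = value;           /* 32-bit, word-aligned accesses only */
}

static inline uint32_t csb_read(uint32_t offset)
{
    return csb[offset / 4];
}

/* A layer-completion interrupt handler would read a status register,
 * clear the pending bit, and notify the scheduler. */
void nvdla_irq_handler(void)
{
    uint32_t status = csb_read(0x0);   /* hypothetical status-register offset */
    csb_write(0x0, status);            /* write-1-to-clear, by assumption */
}
```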

A Modular Design

Although other DLAs allow designers to size the MAC array, the NVDLA’s modular architecture allows separate configuration of each layer (see MPR 6/19/17, “Xavier Simplifies Self-Driving Cars”). As Figure 2 shows, the function blocks include a convolution core (MAC array), single-point data processor (SDP) for activation functions, planar data processor (PDP) for pooling layers, and cross-channel data processor (CDP) that runs normalization functions.

 

Figure 2. NVDLA architecture. Designers can configure the function blocks for each neural-network layer. The convolution core handles most of a CNN’s operations, and it’s configurable for 32 to 2,048 MACs.

The 2D MAC engine scales in binary multiples, starting at 32 MACs: the array width is adjustable from 8 to 64 MACs, and the depth from 4 to 64. A 256-entry FIFO between the convolution accumulator and the SDP (or memory) stores 16 elements per entry, each 32 bits wide. Dedicated data-memory and Rubik reshape functions accelerate memory-to-memory transformations for tensor copying and reshaping.
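These figures imply some simple size arithmetic; the snippet below shows one width-and-depth combination that reaches the 2,048-MAC full model and the resulting accumulator-FIFO capacity (the specific 64x32 split is an assumption, since Nvidia doesn't state how the full model divides the array).

```c
/* Back-of-the-envelope arithmetic from the figures quoted above: the MAC
 * array is width x depth (both within the stated ranges), and the
 * accumulator FIFO holds 256 entries of 16 x 32-bit elements. */
#include <stdio.h>

int main(void)
{
    int width = 64, depth = 32;                    /* one way to reach the full model */
    printf("MACs: %d\n", width * depth);           /* 64 x 32 = 2,048 */

    int fifo_bits = 256 * 16 * 32;                 /* entries x elements x bits */
    printf("FIFO: %d KB\n", fifo_bits / 8 / 1024); /* 16KB */
    return 0;
}
```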

The convolution core has a buffer memory to store input feature data and weights. By eliminating accesses to external memory, the buffer reduces latency and power. The full configuration uses 512KB, and the small one uses 128KB. The buffer is a four-port design, employing dedicated read and write ports for both data and weights. The core includes a unique sparse-compression feature, which reduces memory bandwidth when reading or writing sparse arrays of feature data and weights.

The NVDLA instruction set differs from those of other DLAs by supporting four convolution modes: direct, Winograd, image direct, and batch. The direct mode is the most basic convolution operation, enabling parallelization up to the MAC-array width. To further optimize direct convolution, programmers can apply a Winograd transform to the input data. This algorithm boosts convolutional-neural-network (CNN) performance and power efficiency by reducing the number of MAC operations. For a typical 3x3 filter, for example, Winograd cuts the number of MAC operations by a factor of 2.25.
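The 2.25x figure checks out under the commonly used F(2x2, 3x3) Winograd tiling, in which a 2x2 output tile that would take 36 multiplies in direct convolution is produced with a 4x4 element-wise product.

```c
/* Worked check of the 2.25x figure, assuming the common F(2x2, 3x3)
 * Winograd tile: direct convolution of a 2x2 output tile with a 3x3
 * filter needs 2*2*3*3 = 36 multiplies, while the Winograd transform
 * computes the same tile with a 4x4 element-wise product (16 multiplies). */
#include <stdio.h>

int main(void)
{
    int direct   = 2 * 2 * 3 * 3;   /* 36 multiplies per output tile */
    int winograd = 4 * 4;           /* 16 multiplies per output tile */
    printf("reduction: %.2fx\n", (double)direct / winograd);  /* 2.25x */
    return 0;
}
```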

The image-input mode optimizes performance for the first layer of a computer-vision network. In this type of neural network, the first layer typically processes input features from three image channels (e.g., RGB). For such cases, the NVDLA integrates logic that maintains nearly 50% average MAC utilization, even if the width setting is large (e.g., 16). To reduce bandwidth for fully connected layers, the NVDLA has a batching feature that allows multiple sets of activations to run at the same time. By allowing multiple activation sets to share the same weight data, this technique increases performance and reduces memory bandwidth.
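The bandwidth saving from batching is easy to see in a reference loop for a fully connected layer: each weight row is fetched once and reused across every activation set in the batch (the dimensions here are illustrative).

```c
/* Sketch of why batching reduces memory bandwidth for fully connected
 * layers: the weight matrix is fetched once and reused across every
 * activation set in the batch. Dimensions are illustrative. */
#include <stddef.h>

void fc_batched(const float *weights,   /* [outputs][inputs], fetched once  */
                const float *act,       /* [batch][inputs]                  */
                float *out,             /* [batch][outputs]                 */
                size_t batch, size_t inputs, size_t outputs)
{
    for (size_t o = 0; o < outputs; o++) {
        const float *w_row = &weights[o * inputs];   /* one weight fetch...   */
        for (size_t b = 0; b < batch; b++) {         /* ...shared by N sets   */
            float sum = 0.0f;
            for (size_t i = 0; i < inputs; i++)
                sum += w_row[i] * act[b * inputs + i];
            out[b * outputs + o] = sum;
        }
    }
}
```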

Shapeshifting

The NVDLA directly calculates linear activation functions, such as biasing and scaling, but the SDP uses a lookup table for (optional) nonlinear activations, such as rectified-linear-unit (ReLU) and sigmoid functions. The activation engine is adjustable for 1 to 16 outputs per cycle. For linear activations, the SDP can take bias and scaling factors from the configuration registers and apply them just once for the entire CNN, or it can read them from memory and apply them on a per-channel, per-layer, or per-pixel basis. The PDP is run-time configurable for pooling operations of variable size and function, including average, min, and max pooling. The CDP runs the local-response-normalization (LRN) function, which is useful for normalizing tensors that comprise outputs from unbounded ReLU activations.
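A lookup-table activation of the kind the SDP implements can be sketched as follows; the table size, input range, and linear interpolation are assumptions rather than the SDP's actual parameters.

```c
/* Sketch of a lookup-table nonlinear activation (sigmoid here). The table
 * size, input range, and interpolation scheme are assumptions, not the
 * SDP's actual parameters. */
#include <math.h>
#include <stdio.h>

#define LUT_SIZE 257
#define X_MIN   (-8.0f)
#define X_MAX    (8.0f)

static float lut[LUT_SIZE];

static void lut_init(void)                 /* fill the table with sigmoid samples */
{
    for (int i = 0; i < LUT_SIZE; i++) {
        float x = X_MIN + (X_MAX - X_MIN) * i / (LUT_SIZE - 1);
        lut[i] = 1.0f / (1.0f + expf(-x));
    }
}

static float lut_sigmoid(float x)          /* table lookup + linear interpolation */
{
    if (x <= X_MIN) return lut[0];
    if (x >= X_MAX) return lut[LUT_SIZE - 1];
    float pos  = (x - X_MIN) / (X_MAX - X_MIN) * (LUT_SIZE - 1);
    int   idx  = (int)pos;
    float frac = pos - idx;
    return lut[idx] + frac * (lut[idx + 1] - lut[idx]);
}

int main(void)
{
    lut_init();
    printf("sigmoid(0) ~= %.3f\n", lut_sigmoid(0.0f));   /* ~0.5 */
    return 0;
}
```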

The data-memory and reshape blocks have a bridge-DMA engine to handle data transfers between system DRAM and the NVDLA’s memory interface. The reshape engine runs transformations such as contracting, merging, and slicing/splitting feature sets. For example, the splitting operation can separate regions of interest in an image. The Rubik reshape functions operate on data cubes. 

The reshape functions include a contract mode that supports deconvolution as well as merge/split modes. The split mode transforms a data cube into the four-dimensional (NHWC) tensor format used in TensorFlow. The N dimension of NHWC is the number of images in a batch, H is the number of pixels in the vertical (height) dimension, W is the number of pixels in the horizontal (width) dimension, and C is the number of channels (e.g., one for grayscale and three for RGB). Merge transforms a series of planes to a feature-data cube.
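For reference, the NHWC layout places the channel index fastest-varying, so element (n, h, w, c) of an NxHxWxC tensor sits at the flattened offset computed below.

```c
/* Illustration of the NHWC layout described above: the channel index
 * varies fastest, then width, then height, then batch. */
#include <stddef.h>
#include <stdio.h>

static size_t nhwc_offset(size_t n, size_t h, size_t w, size_t c,
                          size_t H, size_t W, size_t C)
{
    return ((n * H + h) * W + w) * C + c;
}

int main(void)
{
    /* One 4x4 RGB image: N=1, H=4, W=4, C=3. */
    size_t off = nhwc_offset(0, 2, 1, 2, 4, 4, 3);
    printf("offset of (0,2,1,2) = %zu\n", off);   /* (2*4 + 1)*3 + 2 = 29 */
    return 0;
}
```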

A Complete Software Stack

Along with a detailed hardware model, the NVDLA comes with a comprehensive software stack. A parser converts networks trained in Caffe or TensorFlow to the intermediate format that the TensorRT compiler uses. That compiler builds the neural-network architecture model, creating a set of layer descriptions optimized for a specific NVDLA implementation. It then outputs the model in a loadable file, which contains weight data transformed to NVDLA format. The loadable file in turn comprises information about each layer’s dependencies, the layer’s input and output tensors, and the configuration of each block for a layer’s operations.

The NVDLA software includes a run-time environment that manages the inference engine from either an attached microcontroller or the host CPU and submits jobs to it. The run-time environment comprises a user-mode-driver (UMD) API and a kernel-mode-driver (KMD) API, as Figure 3 shows. The UMD is the main interface between user applications and the inference engine. It loads the network model, binds input and output tensors to a defined set of memory-data structures, and submits inference jobs to the KMD. The KMD uses the layer-dependency graph to schedule inference tasks on the DLA, and it configures the DLA’s registers to handle each layer’s functions.
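In outline, a job flows through the stack roughly as the following declarations suggest; these names are hypothetical placeholders, not the NVDLA UMD/KMD's actual entry points.

```c
/* Hypothetical outline of an inference submission through the run-time
 * stack described above. These function names are illustrative; they are
 * not the NVDLA UMD/KMD's actual entry points. */
#include <stddef.h>

typedef struct network network_t;    /* parsed loadable (layers, weights) */
typedef struct tensor  tensor_t;     /* memory-backed input/output buffer */

/* User-mode driver: loads the compiled loadable, binds tensors, submits. */
network_t *umd_load_loadable(const void *loadable, size_t size);
int        umd_bind_tensor(network_t *net, const char *name, tensor_t *buf);
int        umd_submit(network_t *net);    /* hands the job to the KMD */

/* Kernel-mode driver: walks the layer-dependency graph and programs the
 * accelerator's registers for each layer's functions. */
int        kmd_schedule(network_t *net);
```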

 

Figure 3. NVDLA run-time engine. The NVDLA software includes user- and kernel-mode drivers. The former provides the interface to the host CPU, and the latter uses the neural-network layer dependencies to submit tasks to the inference engine.

For compatibility across compute platforms, Nvidia specifies that developers must wrap the UMD and KMD in portability layers that abstract the OS-dependent interface. The portability layer allows the NVDLA software stack to run on either FreeRTOS or Linux-based systems. The NVDLA web site provides detailed specifications for this layer.
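Such a portability layer typically boils down to a table of OS hooks; the structure below is a guess at its general shape, not Nvidia's published specification.

```c
/* Sketch of the kind of OS hooks a portability layer wraps so the same
 * UMD/KMD sources build on Linux or FreeRTOS. The structure and fields
 * are assumptions, not Nvidia's published interface. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    void  *(*mem_alloc)(size_t bytes);                   /* DMA-able memory      */
    void   (*mem_free)(void *ptr);
    int    (*reg_read)(uint32_t offset, uint32_t *val);  /* CSB register access  */
    int    (*reg_write)(uint32_t offset, uint32_t val);
    int    (*irq_register)(void (*handler)(void));       /* completion interrupt */
    void   (*msleep)(unsigned int ms);
} nvdla_os_ops_t;   /* hypothetical name */
```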

Tools Aid Hardware/Software Development

The NVDLA core comes with a performance-model spreadsheet, Synopsys Design Compiler synthesis scripts, a SystemC simulation model, simulation test benches, and a register-accurate virtual-platform model for software development. The performance model includes measurements of average frame run time, frames per second, hardware MAC utilization, and network MAC utilization. The test benches include a version that allows RTL verification using the free Verilator simulator, but designers can perform more-thorough analyses using SystemVerilog on commercial tools. The initial test-bench suite comes with basic memory-copy, register-write, and interrupt tests. Short functional tests evaluate the convolution, pooling, and activation blocks. Longer tests run complete AlexNet and GoogLeNet layers.

Nvidia has published area, performance, and power estimates for a range of small to large NVDLA configurations, as Table 1 shows. These estimates assume a 1.0GHz clock frequency and manufacturing in a 28nm or 16nm process for a design optimized to run the popular ResNet-50 neural network. All the sample configurations operate exclusively from external DRAM. On-chip SRAM will boost performance above these measurements, especially for larger networks.

 

Table 1. NVDLA sample configurations. All configurations have MACs that implement INT8, INT16, or FP16 calculations. Performance and power specifications are for a nominal 1.0GHz clock frequency. *For designs without on-chip SRAM; †for 16nm designs. (Source: Nvidia)

The company uses the large 2,048-MAC design in its 12nm Xavier autonomous-driving SoC; running at 1.25GHz, it delivers five trillion operations per second (5 TOPS), consistent with 2,048 MACs each performing two operations per cycle. Smartphones could use models with 256 to 1,024 MAC units, although the bar is rising quickly in that market. The smaller configurations offer performance more suitable for IoT edge devices. At 128 MACs and below, however, the performance per watt and per area (mm2) fall off dramatically, as Table 1 shows. Unfortunately, the small configurations carry the same overhead and control logic as the large configurations, making them much less efficient.

The open-source package includes models for each configuration, and designers can evaluate them on the GreenSocs QBox virtual-platform development system. QBox uses the open-source QEMU emulator to run transaction-level processor models written in SystemC. The NVDLA emulation models operate along with a Linux driver on an ARMv8 CPU model, providing a register-accurate platform suitable for software development.

Nvidia also enables designers to evaluate the sample configurations by installing Verilog RTL models on FPGA prototyping platforms. The FPGA models use Amazon Web Services’ EC2 F1 platform, which designers can lease hourly. The Amazon FPGA images (AFIs) run in the cloud on 16nm Xilinx UltraScale+ FPGAs. Each FPGA integrates approximately 2.5 million logic elements and 6,800 DSP engines. EC2 customers only pay for compute time. Amazon offers the FPGA-development kit at no charge, and users can reprogram the FPGA an unlimited number of times with no additional fees.

The NVDLA Verilog model works for simulations as well as synthesis. The core uses a single clock and power domain, and it supports coarse- and fine-grain power gating. The model includes sample bus adapters for connecting to an AXI4 system bus. Because the SRAMs come as behavioral models, designers must use a third-party memory compiler for physical implementation. The design employs single- and dual-ported SRAMs.

Custom-Fit Models

The NVDLA provides many more configuration options than other licensable DLAs, allowing designers to not only size the MAC array but also choose the hardware and features they need for each neural-network layer. As Table 2 shows, a designer can use 32 to 2,048 MACs, and these MACs can perform fixed- or floating-point math. Whereas competitors offer a single convolution mode, the NVDLA offers four, boosting MAC efficiency for certain networks. The run-time floating-point option supports greater precision where needed. The wide range of MAC configurations enables designers to use the same base DLA architecture in small IoT devices and large image-recognition engines, reducing software-development costs. Small DLAs are increasingly popular for tasks such as voice recognition in IoT edge devices (see MPR 2/26/18, “RISC-V Enables IoT Edge Processor”).

 

Table 2. Comparison of selected DLA cores. Designers can size the NVDLA convolution engine in binary multiples starting from 32 MACs, but unlike competing alternatives, it permits them to customize the hardware for all of the neural-network layers. *In a 16nm process; †the v-MP6000UDX performance is for a 256-core configuration. (Source: vendors)

By comparison, the Ceva and Videantis DLAs are limited to INT8 or INT16 calculations. The smallest NeuPro model integrates 512 MACs, so it will compete against the large NVDLA configurations. Ceva’s accelerator scales to twice the MACs per core of a single NVDLA. Nvidia’s instruction set allows large inference jobs to be split across multiple units, however, and the company uses two NVDLA cores in its Xavier processor. But raw MAC count is a deceptive specification, because inference engines rarely approach full utilization. The NVDLA excludes the controller CPU, which Ceva includes in the NeuPro VPU, so high-performance designs will require a licensed CPU from another vendor. Designers can avoid additional costs by employing an open-source RISC-V CPU, which Nvidia is also using as a replacement for its internal Falcon controllers (see MPR 8/1/16, “RISC-V Update”).

Designers can use the v-MP6000UDX in multiples of 64 MACs per core, so it’s a better alternative to the smaller NVDLA configurations. But adding more MACs to a Videantis design requires duplicating all the core logic. In comparison, the NVDLA offers configuration options for each set of neural-network layer functions, and the batch mode increases efficiency by aggregating functions for fully connected layers. But as noted above, the NVDLA carries significant overhead that degrades its efficiency in configurations below 256 MACs, which could give competitors an advantage at this level. Neither Ceva nor Videantis has disclosed die area or power numbers, however, making direct comparisons impossible.

Shaking the Market at Its Core

By releasing its DLA as a highly configurable open-source core, Nvidia is attempting to disrupt the market for licensable intellectual property (IP). No other DLA-IP supplier can match its AI expertise or technology breadth. Most neural-network developers already train their models on the Cuda GPU platform, but by offering the NVDLA at no charge to its large user base, the company aims to extend that leadership to inference engines. It hopes that proliferating the NVDLA will boost neural-network adoption, create a quasi-standard for inference engines, and stimulate a virtuous circle of increasing demand for its training platforms. We expect the availability of open-source models will encourage many neural-network developers to make the NVDLA their inference target—at least for the initial model.

Many designers shy away from open-source IP, but unlike academic projects, the NVDLA launched with extensive documentation and tools. The core comes with characterization data, a how-to manual for SoC integrators, simulation models, synthesis scripts, test benches, a software stack, and detailed documentation. Moreover, the core was designed for a commercial SoC and is being tested in initial silicon. Other IP vendors, however, have much more experience in providing cores for a variety of third-party SoC designs. Whereas NVDLA is optimized for its original large configuration, other IP vendors offer designs that are optimized for small configurations as well.

Nvidia is executing on the four-quarter roadmap it published when it launched the NVDLA web site. It has met its 3Q17 and 4Q17 milestones, and the next step is to release the small model and updated software tools by the end of 2Q18. The company hasn’t published a roadmap beyond that point, so its commitment to an ongoing schedule of new improvements is uncertain. Deep learning is a rapidly evolving field, and the established IP vendors typically release new DLAs at least annually. If Nvidia doesn’t upgrade the NVDLA, it could rapidly become out of date. Using the open-source NVDLA also limits differentiation, although SoC designers can add their own modifications and improvements.

Although “free” is always attractive, users of other DLA cores get better support in return for their licensing fees; that support can be critical to a successful design. Designers of high-volume SoCs generally focus on PPA (performance, power, and area) more than license fees, so the NVDLA must compete on those parameters as well. But it also offers extensive configurability, along with a complete tool set and software stack, yielding a package that’s likely to draw some paying customers away from the incumbent IP suppliers. It will also be attractive to the rapidly expanding group of AI startups. The most critical metric for Nvidia’s open-source project will be how many of its users deliver successful commercial products.

Price and Availability

The NVDLA is freely available under Nvidia’s open-source license. Designers can download the core and supporting materials from the project web site and from GitHub. For more information, access www.nvdla.org.
