Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
The datasets generated and/or analyzed during the current study are available in the FDA-NCI Clinical Proteomics Program Databank.
Real-time analysis of patient data during medical procedures can provide vital diagnostic feedback that significantly improves the chances of success. With sensors becoming increasingly fast, models such as Deep Neural Networks (DNNs) must perform their calculations within the strict timing constraints of real-time operation. However, traditional computing platforms responsible for running these algorithms incur a large overhead due to communication protocols, memory accesses, and static (often generic) architectures. In this work, we implement a low-latency Multi-Layer Perceptron (MLP) processor using Field Programmable Gate Arrays (FPGAs). Unlike CPUs and Graphics Processing Units (GPUs), our FPGA-based design can directly interface with sensors, storage devices, display devices, and even actuators, thus reducing the delays of data movement between I/O ports and compute pipelines. Moreover, the compute pipelines themselves are tailored specifically to the application, improving resource utilization and reducing idle cycles. We demonstrate the effectiveness of our approach using mass spectrometry datasets for real-time cancer detection.
We demonstrate that sizing architecture parameters correctly, based on the application, can reduce latency by 20% on average. Furthermore, we show that in an application with a tightly coupled data path and strict latency constraints, adding more computing resources can actually reduce performance. Using mass spectrometry benchmarks, we show that our proposed FPGA design outperforms both CPU and GPU implementations, with average speedups of 144x and 21x, respectively.
In our work, we demonstrate the importance of application-specific optimizations for minimizing latency and maximizing resource utilization in MLP inference. By directly interfacing with sensors and processing their data at ultra-low latency, FPGAs can perform real-time analysis during procedures and provide diagnostic feedback that can be critical to achieving higher rates of successful patient outcomes.
Keywords: FPGA, Machine learning, Multi-layer perceptrons, Real-time, Inference, Cancer, Mass-spectrometry
Machine learning (ML) plays an integral part in solving many key scientific problems. It has been applied to domains as varied as finding cancer treatments [1], weather simulation [2], and the design of new nanocatalysts [3]. Machine learning methods have also been used in medical applications for many years [4]. First, machine learning can be used to monitor patients' vital signs. Muller et al. introduced a Brain-Computer Interface (BCI) [5] based technique that extracts appropriate features from continuous EEG signals [6]. The proposed system collects a number of trials from patients while they are asked to perform fixed tasks. Using the collected datasets as training data, the system infers typical EEG patterns in real time. In [7], Shoeb and Guttag presented a machine learning algorithm that recognizes epileptic seizures from scalp EEG signals. Machine learning is also widely used in medical image retrieval, enhancement, processing, and mapping, as introduced in [8–12]. Zacharaki et al. performed a study on distinguishing different types of brain tumors using the Support Vector Machine Recursive Feature Elimination (SVM-RFE) algorithm [13]. Pereira et al. introduced an approach that uses machine learning algorithms to decode variables of interest from fMRI data [14]. Machine learning is also an effective supplement in diagnosing diseases [15, 16]. Ozcift et al. constructed Rotation Forest (RF) ensemble classifiers from multiple machine learning algorithms in a computer-aided diagnosis (CADx) system used for diagnosing diseases such as diabetes and Parkinson's disease [17]. Li and Zhou proposed a semi-supervised learning algorithm, Co-Forest, that uses undiagnosed samples along with only a small number of diagnosed ones as training data for CAD systems targeting breast cancer diagnosis [18].
Multi-Layer Perceptrons (MLPs) are an important subset of ML models that not only optimize existing applications and practices but also broaden that pool by enabling designs to meet requirements of reliability, performance, complexity, and portability that are too stringent for traditional approaches. MLPs consist of multiple layers of firing neurons, with each layer using the responses of the previous layer as stimulus. Use of MLPs is divided into two stages: training and inference. Training is an iterative process that determines the neuron connection strengths of a given MLP. We assign initial values to the connection strengths and then apply training cases as inputs to the model. The correct results of the training cases are known beforehand; we refer to these as expected values. The corresponding measured outputs of the model are compared with these expected values, and the connection strengths are then updated by a (static or dynamic) factor in order to minimize the error between the two sets of results. This process is repeated until the measured results converge to values that reduce errors to within acceptable bounds. Training is typically an offline operation and thus does not impact analysis timeframes. Inference, on the other hand, is performed in real time and refers to the process of performing classification (or regression) on test input cases using the trained model; a minimal sketch of this computation is shown below. In our work, we focus on improving inference latency in order to achieve small analysis timeframes.
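To make the inference stage concrete, the following is a minimal sketch of an MLP forward pass in Python/NumPy. The activation function is our assumption (the text does not specify one), and on the FPGA this computation is realized as fixed-point hardware pipelines rather than floating-point matrix calls.

```python
import numpy as np

def relu(x):
    # Assumed non-linearity; the trained model's actual activation may differ.
    return np.maximum(x, 0.0)

def mlp_infer(x, weights, biases):
    """Forward pass of a trained MLP: each layer's responses serve as
    stimulus for the next layer."""
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ W + b)
    # Output layer produces raw scores; argmax selects the class
    # (e.g., normal vs. cancer in the benchmarks evaluated later).
    return a @ weights[-1] + biases[-1]
```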
Traditional processors running inference algorithms cannot meet the required timing constraints in most cases due to the large overhead of communication protocols, memory accesses, and static (often generic) architectures. Data must first be moved from sensors to CPU buffers, typically over serial ports, which limits bandwidth. For CPU-based implementations, processing individual test cases results in a large number of cache misses. This is because MLPs have virtually no data reuse within a single test case, and batch processing is not possible under latency bounds. Moreover, meaningful model sizes tend to be larger than the higher cache levels (those closer to the CPU), and hence entire models cannot fit in the L1 or L2 cache; a back-of-the-envelope estimate is given below. For GPU implementations, further memory transactions are required to move data to and from device memory over the PCIe bus, which increases the overhead. Low data reuse and batch-less processing also hurt GPU performance, since the computations may not be sufficiently parallel to fully utilize the thousands of available cores. Even for cores that are assigned work, a significant number of cycles are likely to be idle as threads wait for off-chip data accesses. ASIC-based designs have typically filled this gap by providing massive amounts of resources and specialized pipelines tailored for Deep Neural Networks (DNNs); one example is the Google TPU [19]. However, as the number of diverse applications and their associated models grows, these ASICs effectively address a domain rather than a particular application, and hence are unable to utilize application-specific optimizations at the level needed.
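As a rough check of the cache argument, consider the weight footprint of the Ovarian model evaluated later (dimensions from Table 1) stored in single precision:

```python
# Single-precision weight footprint of the Ovarian model (Table 1).
dims = [15154, 512, 512, 2]
n_weights = sum(a * b for a, b in zip(dims, dims[1:]))
print(f"{n_weights * 4 / 2**20:.1f} MiB")  # ~30.6 MiB
```

At roughly 30 MiB, the weights alone are two to three orders of magnitude larger than typical L1 (tens of KiB) and L2 (hundreds of KiB to ~1 MiB) caches.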
Reconfigurable architectures, such as FPGAs, are becoming increasingly popular since their logic can be configured to construct application-specific architectures [20–25]. For MLPs in particular, arbitrarily sized models can be easily implemented, scaled, and even migrated to newer generation technologies by recompiling the design. Like GPUs, FPGAs are Commercial Off-The-Shelf (COTS) devices, which means users can buy a device, generate logic for their desired model, and ensure that the resulting compute pipelines are optimal not only for MLPs in general but for their specific MLP application. Moreover, since the underlying computation structure does not change, complex HDL code does not have to be written. Instead, simple scripts can be used to create the required architecture, as sketched below. As FPGA on-chip memory grows, reaching several megabytes in the current generation, users can move all model parameters to on-chip SRAM and remove the idle cycles caused by fetching weights from off-chip DRAM. Moreover, FPGAs offer support for most common serial and parallel protocols and can thus further minimize the latency of memory and I/O transactions by supplying data directly (from sensors) to the compute pipelines. Together, these features shift the computation from memory bound to compute bound.
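As an illustration of such a script, the sketch below (entirely hypothetical; the names and allocation policy are ours, not from the design) splits a DSP budget across fully connected layers in proportion to their multiply-accumulate work:

```python
DSP_BLOCKS = 1518  # Arria 10 DSP budget quoted in the Methods section

def lane_allocation(dims, dsp_budget=DSP_BLOCKS):
    """Assign parallel MAC lanes to each layer, proportional to its share
    of the total weight count and floored to a power of two for simple
    address generation."""
    work = [a * b for a, b in zip(dims, dims[1:])]  # weights per layer
    total = sum(work)
    lanes = []
    for w in work:
        share = max(1, dsp_budget * w // total)
        lanes.append(1 << (share.bit_length() - 1))  # power-of-two floor
    return lanes

print(lane_allocation([15154, 512, 512, 2]))  # -> [1024, 32, 1]
```

A script like this can regenerate architecture parameters whenever the model or target device changes, leaving the underlying HDL templates untouched.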
Improving the latency (and hence performance) of compute-bound MLP inference requires all pipelines to operate both stall free and at high bandwidth. The former is achieved by ensuring that modules in the design source and sink data at rates constrained by the latter. Since modules in MLP architectures are tightly coupled and operate within the same clock domain, constraints can propagate even across indirect connectivity. Therefore, selecting higher interface bandwidths for particular ports can significantly increase the complexity (and hence latency) of potentially multiple modules. This increase in complexity can outweigh the benefits of higher module throughput and thus increase the overall latency of the computation, as the toy model below illustrates.
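In the following toy model (all numbers are illustrative assumptions, not measurements from our design), widening the MAC interface of a small output layer first helps and then hurts: the throughput-bound cycles shrink, but the fixed fill latency of the deeper reduction pipeline grows faster.

```python
import math

def layer_cycles(n_in, n_out, lanes, pipe_depth):
    # Cycles to evaluate one fully connected layer: each of the n_out
    # neurons consumes n_in weights at `lanes` MACs per cycle, plus a
    # one-time pipeline fill cost that grows with the adder-tree width.
    return math.ceil(n_in / lanes) * n_out + pipe_depth

# Assumed fill depths for progressively wider reduction trees.
for lanes, depth in [(32, 12), (64, 20), (128, 36)]:
    print(lanes, layer_cycles(512, 2, lanes, depth))
# -> 32 lanes: 44 cycles, 64 lanes: 36 cycles, 128 lanes: 44 cycles
```

Past 64 lanes, the extra pipeline depth outweighs the throughput gain for this 512×2 layer, mirroring the trade-off described above.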
FPGA-based MLP implementations have received significantly less attention than other DNNs, such as Convolutional Neural Networks. The most prominent design is Microsoft's FPGA-based inference architecture, Brainwave [26], which targets low compute-to-data-ratio DNN applications such as MLPs, Long Short-Term Memories (LSTMs), and Gated Recurrent Units (GRUs). The memory bandwidth bound is alleviated by using on-chip block RAM for weight matrix storage. For an 8-bit integer implementation on Stratix V FPGAs, Brainwave achieves 2.0 TOps/s, while on Stratix 10 FPGAs the authors report 31 TOps/s running at 500 MHz. In [27, 28], the authors proposed FPGA-based MLP architectures, but their work serves as a proof of concept and is also constrained by off-chip memory bandwidth. Sharma et al. presented an automated DNN design generation framework, DNNWeaver [29], which also depends on DRAM access speed for performance. Moreover, DNNWeaver's compute units are constrained to Digital Signal Processor (DSP)-only implementations, and logic cells are not used for ALUs. Gomperts et al. introduced a general-purpose FPGA-based architecture for MLPs [30]. Their design generates individual processing elements for each layer, which is not feasible for large neural networks, where the resources of a single FPGA may not be sufficient to compute even a single layer in parallel.
With regard to stand-alone operation with a direct interface to sensors, FPGAs support various forms of connectivity without host involvement. Apart from General Purpose I/O pins [31, 32], which can implement virtually any communication protocol, they also offer direct chip-to-chip connectivity through Multi-Gigabit Transceivers (MGTs) and network connectivity through dedicated Ethernet controllers. MGTs are serial links that provide low-latency, high-bandwidth, and low-energy interfaces [33–38], enabling communication at rates of up to 100 Gbps per MGT. Current high-end FPGAs can have close to 100 MGTs, which can be used to connect multiple FPGAs together and perform computations for models that are too large for a single device. Large multi-FPGA clusters have long been a staple of computational finance and certain other industries. George et al. [39] presented an academic example of a 64-FPGA cluster with direct FPGA-to-FPGA links in a 3D torus network. Similarly, Microsoft's original Catapult system [40, 41] consists of 1632 nodes with FPGAs directly interconnected via MGTs in a series of 6×8 tori.
In our work, we explore the application-aware optimization space for compute-bound MLP inference processors using our proposed FPGA-based architecture. By identifying the modules in the critical path and their interconnectivity, we determine and optimize parameters for a given application in order to minimize the latency of the entire model evaluation and meet real-time constraints. This strategy makes our design feasible for applications such as real-time medical diagnosis.
In this study, we have designed and implemented a real-time Multi-Layer Perceptron inference processor for medical diagnosis using FPGAs. Our focus is on achieving ultra-low analysis latency to meet real-time constraints. The proposed system takes a modular approach with standardized interfaces, enabling the hardware for individual functions to be easily modified in response to changes in the inference model, design practices, or resource availability. We demonstrate that being application specific enables FPGA-based designs to choose architecture parameters that minimize latency and maximize the utilization of computing resources. Using mass spectrometry benchmarks for cancer detection, we show that FPGAs can outperform traditional computing technologies such as CPUs and GPUs for real-time MLP applications.
We have tested our designs on the Intel Arria 10 FPGA (AX115H3F34E2SGE3), which has 427,200 Adaptive Logic Modules (ALMs), 1518 DSP blocks, 53 megabits of on-chip memory, and a maximum of 624 user General Purpose Inputs/Outputs (GPIOs) [42]. The GPU used is a Tesla P100, which has 3584 CUDA cores and 12 gigabytes of High Bandwidth Memory (HBM2) with a peak bandwidth of 549 GB/s. For the baseline, we use an eight-core 2.6 GHz Intel Xeon E5-2650v2 CPU.
FPGA designs are implemented using Altera OpenCL 16.0.2 [43] to ensure that standard optimizations are applied and that the impact of varying design parameters on latency can be fairly determined. Resource usage is measured using the Quartus Prime Pro 16.0 compiler. GPU reference designs are compiled using TensorFlow [44] r1.4, Python 3.6.2, cuDNN 6.0, and CUDA 8.0. CPU code is also compiled using TensorFlow r1.4. Both GPU and CPU designs use single-precision floating point, while the FPGA implementation uses fixed-point arithmetic with variable widths optimized for each computation stage.
We demonstrate the effectiveness of the FPGA-based MLP inference processor by evaluating models for detecting cancer from protein profiles. The two datasets used are obtained from [45] and use Surface-Enhanced Laser Desorption and Ionization (SELDI) protein mass spectrometry to generate ion intensity data at 15154 mass/charge values. The first is Ovarian-8-7-02 (Ovarian), which contains 91 normal and 162 cancer cases. The second is Prostate Cancer Studies (JNCI), which contains 69 cancer and 63 normal cases. Cancer and normal data for both benchmarks are further divided into training and test sets, as shown in Table 1. For both benchmarks, we propose a model with two hidden layers, with the output determining whether the patient is normal or has cancer. We train the proposed models in single-precision floating point using TensorFlow and find that the prediction accuracy for both benchmarks is > 90%. For the FPGA implementation, we convert the trained parameters to 8 bits for weights and 32 bits for biases. This conversion is achieved by scaling them, based on a given layer's maximum and minimum values, in order to utilize the entire integer range; a sketch of this conversion is given after Table 1.
Table 1 Proposed MLP models for mass spectrometry benchmarks
| Model | Dimensions | Training cases | Test cases | Accuracy | F1 Score |
|---|---|---|---|---|---|
| Ovarian | 15154 × 512 × 512 × 2 | 160 | 92 | 98.9% | 0.99 |
| JNCI | 15154 × 64 × 512 × 2 | 100 | 32 | 90.6% | 0.92 |
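A minimal sketch of the parameter conversion described above, assuming an affine min/max mapping onto the signed 8-bit range (the exact scaling scheme and all names here are our reading of the text, not code from the implementation):

```python
import numpy as np

def quantize_layer(w, bits=8):
    """Map one layer's trained weights onto the full signed integer range
    using the layer's minimum and maximum values."""
    qmin, qmax = -(1 << (bits - 1)), (1 << (bits - 1)) - 1  # -128..127
    scale = (w.max() - w.min()) / (qmax - qmin)  # real units per step
    zero = qmin - round(w.min() / scale)         # offset so w.min() -> qmin
    q = np.clip(np.round(w / scale) + zero, qmin, qmax).astype(np.int8)
    return q, scale, zero  # biases use an analogous 32-bit mapping (not shown)
```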