Advanced Hardware and Software Technologies for Ultra-long FFTs
Hahn Kim, Jeremy Kepner, M. Michael Vai, Crystal Kahn
HPEC 2005, 21 September 2005
This work is sponsored by the Department of the Air Force under contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Outline
- Background
- FPGA-Based Hardware Technology
- Parallel Software Technology
- Conclusions

Introduction
- Real-time embedded ISR
- 1 Gigapoint complex FFT: 150M ops, ~1M words
- Radar, ELINT, SIGINT
- 1 Kilopoint complex FFT: 50K ops, ~1K words
- Non-real-time data analysis
- HSI, Monte Carlo, IMINT

Motivation for Real-time, Ultra-long FFTs
[Diagram: Wide-band digital receiver: RF receiver -> ADC -> 1 G-pt FFT -> signal processing application]
[Diagram: Channelized narrow-band digital receiver: RF receiver -> analog channelizer -> per-channel ADC -> 100 M-pt FFT -> signal processing application]
- Can use FFT to detect weak signals
  - Reduce noise floor
  - Longer intervals result in higher gain
- A real-time, ultra-long FFT processor is a critical enabling component

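The claim that longer intervals give higher gain can be made concrete with a small sketch. It assumes the textbook case of a stable tone that falls in a single FFT bin, buried in white noise, where an N-point FFT improves the signal-to-noise ratio by a factor of N, or 10 log10(N) dB. The function name and the choice of 10^8 for "100 M-pt" and 2^30 for "1 G-pt" are illustrative assumptions, not values taken from the slides.

```python
import math

def coherent_gain_db(num_points):
    """Processing gain in dB of an N-point FFT for a tone in white noise.

    Assumes the tone stays in one bin for the whole interval, so the
    signal adds coherently while the noise adds incoherently.
    """
    return 10.0 * math.log10(num_points)

# Hypothetical lengths chosen to mirror the sizes named on the slide.
for label, n in [("1 K-pt", 2**10), ("100 M-pt", 10**8), ("1 G-pt", 2**30)]:
    print(f"{label}: {coherent_gain_db(n):.1f} dB")
```

Under these assumptions a 1 G-pt FFT yields roughly 90 dB of processing gain, about 60 dB more than a 1 K-pt FFT.
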
FFT Technology Space
[Figure: sustained data rate (10 MSPS to 10 GSPS) versus FFT length (100 to 100T points)]
State-of-the-art FFT technology:
- 04: RFEL HyperSpeed IP (FPGA), 32 K-pt @ 3.2 GSPS
- 04: LL SYCO (FPGA), 8 K-pt @ 1.2 GSPS
- 99: LL Discoverer II (ASIC), 128 pt @ 0.5 GSPS
- 03: Xilinx IP (FPGA), 16 K-pt @ 0.2 GSPS
- 02: Transtech DSP Quixilica IP (FPGA), 8 K-pt @ 0.16 GSPS
- 03: DSP Architectures DSP-24 (ASIC), 64 M-pt 1 K-pt @ 0.06 GSPS @ 0.064 GSPS
- 01: CRI Pathfinder II (ASIC), 1 M-pt @ 0.04 GSPS
- 03: Analog Devices ADSP-21262 (DSP), 1 K-pt @ 0.02 GSPS
- 04: RFEL HyperLength IP (FPGA), 256 M-pt @ 0.01 GSPS
- 04: LL pMatlab (128 Proc. w/ GigE), 4 G-pt @ 0.01 GSPS
Region label: OFDM Communication Standards Req.
Additional plotted points: 1 G-pt @ 1.2 GSPS; 128 M-pt @ 0.128 GSPS; 64 G-pt @ 0.04 GSPS
Objective: Extend the state of the art to ultra-long FFTs

Ultra-long FFT Implementation Challenges
[Figure: memory requirement (bytes, 8K to 80G) versus FFT length (points, 10K to 10G); marked points at 8 K-pt, 32 K-pt, 100 M-pt and 1 G-pt; regions labeled "FPGA Embedded Memory Only" and "Off-Chip Memory Required"]
- Beyond ~32 K-pt FFT, off-chip memory for FIFO and twiddle factors is required
- Full-duplex memory access is a challenge
- Lincoln architecture selected for ultra-long FFT: "A Systolic FFT Architecture for Real Time FPGA Systems," HPEC 2004.

Real-Time 1G-Point FFT Architecture
- An MN-pt FFT can be implemented with an M-pt FFT and an N-pt FFT
  - E.g. 1 G-pt FFT: M = 32 K-pt, N = 32 K-pt
  - Each 32 K-pt FFT fits into an FPGA
[Block diagram labels: Corner Turn, 32K FFT, Corner Turn, Twiddle Factors, 32K FFT, Corner Turn]
- Corner turn example (for an 8-point FFT): the input stream 0 1 2 3 4 5 6 7 is written as the rows 0 1 2 3 and 4 5 6 7, then read out by columns as 0 4 1 5 2 6 3 7 (a matrix transpose)
- Lincoln has developed a corner turn architecture that operates at 1 GSPS

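The decomposition of an MN-point FFT into M-point and N-point FFTs can be sketched in a few lines. What follows is a minimal illustration of one common ordering (corner turn, short transforms, twiddle multiply, corner turn, short transforms, corner turn), often called the four-step FFT. It is not the Lincoln hardware design. A direct O(n^2) DFT stands in for the 32K FFT engines, and all function names are hypothetical.

```python
import cmath

def dft(v):
    """Direct O(n^2) DFT, standing in for a small FFT engine."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def corner_turn(rows):
    """Matrix transpose of a list of equal-length rows."""
    return [list(col) for col in zip(*rows)]

def fft_mn(x, m, n):
    """Length m*n DFT built from length-m and length-n DFTs."""
    total = m * n
    assert len(x) == total
    # View the input as an m x n matrix stored in row-major order.
    a = [x[r * n:(r + 1) * n] for r in range(m)]
    # Corner turn: each row now holds one input column (length m).
    b = corner_turn(a)
    # n transforms of length m.
    c = [dft(row) for row in b]
    # Twiddle factors: element (n2, k1) is scaled by exp(-2*pi*i*n2*k1/total).
    c = [[c[n2][k1] * cmath.exp(-2j * cmath.pi * n2 * k1 / total)
          for k1 in range(m)] for n2 in range(n)]
    # Corner turn, then m transforms of length n.
    e = [dft(row) for row in corner_turn(c)]
    # Final corner turn puts output bin k1 + m*k2 at row k2, column k1.
    f = corner_turn(e)
    return [value for row in f for value in row]

if __name__ == "__main__":
    x = [complex(i, -0.5 * i) for i in range(8)]
    direct = dft(x)
    split = fft_mn(x, 2, 4)
    print(max(abs(p - q) for p, q in zip(direct, split)))  # tiny rounding error
```

With m = n = 32K the same structure gives a 1 G-pt transform. The three corner turns are the only steps that touch the whole data set, which is why the corner turn rate matters at this scale.
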
Real-Time FFT Architecture: 1 Gigapoint FFT Configuration
[Block diagram labels: ADC 1 GSPS; HOLD; Real-Time Corner Turn (DMUX and Switch, Cache and MUX); 32K FFT (FPGA); Real-Time Corner Turn (DMUX and Switch, Cache and MUX); 32K FFT (FPGA); Real-Time Corner Turn (DMUX and Switch, Cache and MUX)]
- Reconfigurable architecture allows for multiple implementations, e.g. 1 G-pt @ 1 GSPS or 100 M-pt @ 100 MSPS x 10 channels

Real-time Example FFT Implementation
[Figure labels: 13, 6, 13]
Current FFT Capabilities
- Symbiotic Communications (SYCO) real-time processor
  - 8 K-pt FFT
  - 450 GOPS @ 130 Watts
  - 208 FFT butterflies
  - No on-board memory
- "Rapid Prototyping of a Realtime Range Compression Processor," HPEC 2005 Poster Session
Future Ultra-long FFT Capabilities
- Processor enhancement
  - Provides on-board memory banks for performing real-time corner turns
  - Performs form-factor optimization
- Develop a universal FFT architecture
  - 100M-Pt FFT x 10 channels
  - 1G-Pt FFT
* Initial estimate based on double-precision operation

Motivation for Out-of-Core Technology
- Data are often larger than memory on a single processor.
[Figure labels: Memory, Disk]
- Can address memory limitation with:
  1. Multiple CPUs
  2. Using disk as memory
- Out-of-core technology uses memory as a window into data stored on disk.
Reference: Hughes, G.F. "Wise Drives." IEEE Spectrum, Aug 2002.

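The idea of memory as a window into data on disk can be illustrated with a minimal sketch. It assumes a flat file of float64 values and a fixed-size window, and it applies a simple scaling rather than an FFT. The file layout and all function names are hypothetical and are not part of pMatlab.

```python
import array
import os
import tempfile

ITEM_SIZE = array.array("d").itemsize  # bytes per float64 value

def write_values(path, values):
    """Store a sequence of numbers on disk as raw float64 values."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)

def read_values(path):
    """Read the whole file back (used here only to check the result)."""
    count = os.path.getsize(path) // ITEM_SIZE
    data = array.array("d")
    with open(path, "rb") as f:
        data.fromfile(f, count)
    return list(data)

def scale_on_disk(path, factor, window_items):
    """Scale every value in the file while holding only one window in memory."""
    total = os.path.getsize(path) // ITEM_SIZE
    with open(path, "r+b") as f:
        for start in range(0, total, window_items):
            count = min(window_items, total - start)
            f.seek(start * ITEM_SIZE)
            window = array.array("d")
            window.fromfile(f, count)      # load the window from disk
            for i in range(count):
                window[i] *= factor        # compute on the in-memory window
            f.seek(start * ITEM_SIZE)
            window.tofile(f)               # write the window back to disk

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        write_values(path, range(10))
        scale_on_disk(path, 2.0, window_items=4)
        print(read_values(path))
```

The window size is the only memory-dependent parameter, so in this sketch the problem size is bounded by disk capacity rather than by RAM.
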
MatlabMPI & pMatlab Software Layers
[Layer diagram labels: Application (Input, Analysis, Output); User Interface; Parallel Library: Library Layer (pMatlab) with Vector/Matrix, Comp, Task, Conduit; Kernel Layer with Math (Matlab) and Messaging (MatlabMPI); Hardware Interface; Parallel Hardware]
- Can build a parallel library with a few messaging primitives
- MatlabMPI provides this messaging capability:
    MPI_Send(dest,comm,tag,x);
    X = MPI_Recv(source,comm,tag);
- Can build an application with a few parallel structures and functions
- pMatlab provides parallel arrays and functions:
    X = ones(n,mapx);
    Y = zeros(n,mapy);
    Y(:,:) = fft(X);

Out-of-Core Memory Management: pMatlab eXtreme Virtual Memory (XVM)
Virtual Memory
  + Transparent to user (ease of use)
  - Disk access patterns are often sub-optimal for most algorithms
Out-of-Core
  + Provides control of disk at object level (performance)
  - Exposes swapping mechanism to user
pMatlab
  + Provides mechanism for transparently distributing data across multiple CPUs
Goal of pMatlab XVM is to add out-of-core capability to pMatlab that:
  1. Is easy to use
  2. Has high performance
  3. Can transparently distribute data

Data Organization
1. Data starts as a vector with length MN
2. Divide into M vectors with length N
3. Reorganize as an MxN matrix
4. Distribute rows across processors (P0, P1)

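The four steps above can be sketched as follows. This assumes a block distribution, in which each processor receives a contiguous group of rows. The slide does not say whether the distribution is block or cyclic, and the function name is hypothetical.

```python
def organize(data, m, n, num_procs):
    """Reshape a length m*n vector into m rows of length n and
    assign contiguous blocks of rows to processors."""
    assert len(data) == m * n
    # Steps 1 to 3: cut the vector into m rows of length n (an m x n matrix).
    rows = [data[r * n:(r + 1) * n] for r in range(m)]
    # Step 4: block distribution, using ceiling division for the block size.
    rows_per_proc = -(-m // num_procs)
    return [rows[p * rows_per_proc:(p + 1) * rows_per_proc]
            for p in range(num_procs)]

if __name__ == "__main__":
    # An 8-element vector as a 4 x 2 matrix on two processors (P0, P1).
    for rank, local_rows in enumerate(organize(list(range(8)), 4, 2, 2)):
        print(f"P{rank}: {local_rows}")
```

Each processor then holds whole rows, so row-wise FFTs need no communication. Only the corner turn between the two FFT stages moves data between processors.
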
Hierarchical Matrices and Maps
[Figure: a 32K x 32K matrix combined with a global map over processors P0 to P3 and an out-of-core map; labels: global matrix, out-of-core matrices, core block (out-of-core), core block (in-core), 8K, 2K]
- Hierarchical matrices span processors and the storage hierarchy
- User constructs hierarchy by specifying global and out-of-core maps

Ultra-long FFT
[Figure: FFT -> twiddle -> corner turn -> FFT, distributed across processors 0 to 3]
Software-based Implementations
- MATLAB
- pMatlab
- C/MPI
- pMatlab XVM
Testbed
- Lincoln HPC cluster (LLGrid)
  - 80 dual CPU nodes
  - 2.8-3.06 GHz Pentium 4 Xeon
  - 4 GB memory
  - Gigabit Ethernet

Scalability
[Figure label: 1 Gigapoint FFT]
- pMatlab XVM supports ultra-long FFTs with little degradation in performance
- Maximum problem size = size of available disk space
- 1 TB represents a 64 G-pt double-precision, complex FFT

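The size figures can be checked with a short calculation. It assumes binary prefixes (1 G = 2^30, 1 TB = 2^40 bytes) and 16 bytes per point, that is, two 8-byte floating-point values for the real and imaginary parts. The constant and function names are illustrative.

```python
BYTES_PER_POINT = 2 * 8  # real + imaginary parts, 8 bytes each (double precision)

def storage_bytes(num_points):
    """Bytes needed to hold one complex double-precision vector."""
    return num_points * BYTES_PER_POINT

def max_points(disk_bytes):
    """Largest vector length that fits in the given disk space."""
    return disk_bytes // BYTES_PER_POINT

if __name__ == "__main__":
    print(storage_bytes(64 * 2**30) == 2**40)  # 64 G-pt occupies 1 TB
    print(max_points(2**50) == 64 * 2**40)     # 1 PB holds 64 T-pt
```

This counts one copy of the data only. Any additional working copy, such as a separate output array, would add to the disk requirement.
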
System Development Methodologies
[Flowchart labels: Architecture Analysis; Initial design; Bit-true simulation; Satisfy requirements? (Y / N); Modify word size, precision, etc.; System Prototyping; Performance Verification; Compare; 1 Gigapoint]

Summary
- Presented FPGA architecture for real-time, ultra-long FFTs
  - Can implement 1 G-pt FFT with smaller FFTs
  - Use SYCO processor to implement FFTs
  - Developed real-time corner turn architecture
- Future
  - Develop a universal ultra-long FFT architecture
    - Allows multiple configurations in same hardware: 1 G-Pt FFT; 100 M-Pt FFT x 10 channels
  - Application-specific precision and dynamic analysis

Summary
- Presented parallel software architecture for ultra-long FFTs
  - Added out-of-core capability to pMatlab
  - Supports ultra-long FFTs with little degradation in performance
  - Demonstrated 64 G-pt FFT (1 TB)
- Future
  - Expand cluster to enable 64 T-pt FFT (1 PB)

Acknowledgements
Ken Senne, Gerry Banner, Leslie Alger, Bob Bond, Cy Chan, Hector Chan, Tim Currie, Preston Jackson, Andy McCabe, Peter Michaleas, Michael Moore, Charles Rader, Albert Reuther, Jonathan Scalera, Nadya Travinin, Edmund Wong