SA-FEMIP: A Self-Adaptive Features Extractor and Matcher IP-Core Based on Partially Reconfigurable FPGAs for Space Applications

Original

Availability:
This version is available at: 11583/2571938 since: 2015-11-16T15:58:59Z

Publisher:
IEEE / Institute of Electrical and Electronics Engineers

Published
DOI:10.1109/TVLSI.2014.2357181

Terms of use:
openAccess
This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository

Publisher copyright

(Article begins on next page)
SA-FEMIP: a Self-Adaptive Features Extractor and Matcher IP-core based on Partially Reconfigurable FPGAs for Space Applications

Stefano Di Carlo, Senior Member, IEEE, Giulio Gambardella, Student Member, IEEE, Paolo Prinetto, Senior Member, IEEE, Daniele Rolfo, Student Member, IEEE, Pascal Trotta, Student Member, IEEE.

Abstract—Video-Based Navigation (VBN) is increasingly used in space-applications to enable autonomous Entry, Descent and Landing of aircrafts. VBN algorithms require real-time performances and high computational capabilities, especially to perform Features Extraction and Matching (FEM). In this context, Field-Programmable Gate Arrays (FPGAs) can be employed as efficient hardware accelerators. This paper proposes an improved FPGA based FEM module. On-line self-adaptation of the parameters of both the image noise filter and the features extraction algorithm is adopted to improve the algorithm robustness. Experimental results demonstrate the effectiveness of the proposed self-adaptive module. It introduces a marginal resource overhead and no timing performance degradation when compared to the reference state-of-the-art architecture.

Index Terms—Video-based navigation, image processing, FPGA, hardware acceleration, space applications.

I. INTRODUCTION

UPCOMING plans for solar system exploration in the next 30 years are expected to include landing, sample and return missions to moons, planets, asteroids, and comets [1]. In recent space missions Spirit, Opportunity, and Curiosity, the spacecraft descending trajectory and the final landing point were precomputed and fixed during the mission planning, enabling to reach a maximum landing precision, quantified in terms of area in which the spacecraft is likely to land (landing ellipse) of 20 km² [2]. Spacecraft autonomous precision landing capabilities able to reduce the landing ellipse to sub-kilometer accuracy would provide safe and affordable access to landing sites that promise the highest science return and pose minimal risk to the spacecraft [3]. Video-Based Navigation (VBN) is an area of computer vision that exploits image frames captured by cameras and image processing algorithms to assist navigation in several application domains, including robotics, unmanned vehicles, and avionics [4] [5]. The wide availability of cameras on spacecrafts makes VBN a very interesting approach for the implementation of autonomous Entry, Descent and Landing (EDL) control systems for next generation space missions.

VBN algorithms extract geometrical information from a set of real-time sampled image frames. They basically perform two activities named Feature Extraction and Matching (FEM), and Motion Estimation (ME). During FEM, each frame is processed to detect those pixels that represent features of interest for the image (e.g., corners or edges of surfaces). The detected features are then compared to extract those that can be recognized in two consecutive images (matching points). Eventually, the ME algorithms analyze the detected matching points and estimate the relative position and orientation of the camera (fixed with respect to the moving object). To increase accuracy, ME algorithms require very accurate matching points distributed across the entire frame [6]. While ME algorithms are not computationally intensive, FEM algorithms require high computation capability to guarantee high frame rates and therefore high accuracy. Hence, very efficient hardware accelerators for this task are mandatory.

This paper proposes SA-FEMIP, an optimized FPGA-based self-adaptive FEM architecture based on the well known Harris feature extractor algorithm [7]. This architecture extends the solution proposed by the authors in [8] by enabling self-adaptation of the parameters of the image noise filter and the feature extraction algorithm. Self-adaptation enables to better optimize the FEM algorithm to the environmental conditions, thus increasing the robustness with respect to noise and variations of external conditions that are typical of the space environment. Adaptation is obtained introducing very marginal overhead and guaranteeing high operational rates. This is achieved by resorting to the Dynamic Partial Reconfiguration (DPR) capabilities of modern space-qualified FPGAs. Experimental results clearly show that SA-FEMIP enables increased accuracy and performance compared to previous architectures, two key characteristics for the implementation of VBN systems for next generation space missions.

The rest of the paper is organized as follows: Section II overviews related works, while Sections III, IV and V introduce the proposed architecture and its main improvements and peculiarities. Section VI shows the results obtained from an extensive experimental campaign and, finally, Section VII summarizes the main contributions and concludes the paper.

II. RELATED WORKS

Feature extraction is the most complex activity performed by FEM algorithms. Several feature extraction algorithms have been proposed in the literature (e.g., Beaudet [9], SUSAN [10], Harris [7], SURF [11] and SIFT [12]). From the algorithmic point of view, SURF and SIFT are probably the most robust solutions since they are scale- and rotation-invariant. This
means that features can be matched between two consecutive frames even if they have differences in terms of scale and/or rotation. However, due to their complexity, hardware implementations are very resource hungry. As an example, [13] and [14] propose two FPGA-based implementations of the SURF algorithm. The architecture proposed in [13] consumes almost 100% of the LUTs available on a medium sized Xilinx Virtex 6 FPGA, without guaranteeing real-time performances. Similarly, the architecture proposed in [14] consumes about 90% of the internal memory of a Xilinx Virtex 5 FPGA. It saves logic resources, but it is able to real-time process images with a resolution limited to 640x480 pixels. Another example is presented in [15], where an FPGA-based implementation of the SIFT algorithm is presented. It is able to real-time process 640x480 pixel images, consuming about 30,000 LUTs and 97 internal DSPs in a Xilinx Virtex 5 FPGA.

Among the available feature extraction algorithms, Harris is probably the best trade-off between precision and complexity [16]. Under the assumption of small differences between consecutive frames (i.e., high frame rates or small camera displacements), its accuracy is comparable to SURF and SIFT, with a significant lower complexity. Since high frame-rates are mandatory during the EDL phase to allow real-time correction of the descending trajectory, Harris is a very good candidate to implement a high-speed and low-area FEM accelerator block for space-applications [17]. For each pixel \((x, y)\) of a frame, Harris computes the so called corner response \(R(x, y)\) according to the following equation:

\[
R(x, y) = \text{Det}(N(x, y)) - k \cdot \text{Tr}^2(N(x, y)) \tag{1}
\]

where \(k\) is an empirical correction factor equal to 0.04, and \(N(x, y)\) is the second-moment matrix, which depends on the spatial image derivatives \(L_x\) and \(L_y\), in the respective directions (i.e., \(x\) and \(y\)) [7]. Pixels with high corner response have high probability to represent a corner (i.e., an image feature) of the selected frame and can be selected to search for matching points between consecutive frames.

In [8] we presented an FPGA-based FEM core called FEMP, based on the standard Harris algorithm, overcoming limitations of previous state-of-the-art implementations [17]–[19]. It guarantees high frame rates (up to 33 fps), with very limited hardware resources and without resorting to external co-processors. Nevertheless, FEMP parameters are fixed at design time and do not allow to adapt to the continuous environmental changes (e.g., light, noise, etc.) that are typical of extreme space missions. The architecture presented in this paper, named SA-FEMP, overcomes this limitation by introducing adaptation capability to the architecture presented in [8], thus obtaining high FEM robustness with respect to both noise and image characteristic variations.

III. SA-FEMP ARCHITECTURE

This section shortly introduces the SA-FEMP architecture discussing where and how adaptation to environmental conditions has been introduced.

\(\text{Det}(X)\) denotes the determinant of matrix \(X\), and \(\text{Tr}(X)\) denotes the trace of matrix \(X\).
the central row (row number 4) can be processed and filtered. It is worth to remember here that, using a 7x7 kernel matrix, a 3-pixel wide border of the image is not filtered, and related pixels are therefore discarded during filtering. For each pixel to filter, a 7x7 image patch is extracted from the Row Buffer and stored in the Slide Window Buffer (i.e., a buffer composed of 49 10-bit registers). This can be efficiently done if one considers that the image is received in a raster way as shown in Fig. 2c. At each clock cycle, a full Row Buffer column is shifted into the Sliding Window Buffer (Fig. 2c). After the 7th clock cycle, the first image block is ready and the Sliding Window Buffer is convolved with the Kernel Mask. At each following clock cycle, a new Row Buffer column enters the Sliding Window Buffer and a new filtered pixel of the row is produced. While this process is carried out, new pixels continue to feed the Row Buffer, thus implementing a fully pipelined computation. From (2), taking into account the considered kernel size (i.e., 7x7 pixels), 49 multiplications are required to produce a filtered pixel \( F I(x, y) \). In our architecture, all multiplications are executed in parallel within a single clock cycle. Since kernel factors have been internally represented through constants, 49 constant-multipliers are instantiated. After that, an adder tree (similar to the one presented in [24]) adds the 49 multiplication results to produce the filtered pixel.

![Fig. 2: Gaussian Filter](image)

The main drawback of this architecture is that a fixed Gaussian filter variance \( \sigma^2_f \) works well if the noise level of the processed frames is known a priori. As an example, a high filter variance is useful for high noise levels. Instead, for low noise levels the images are oversmoothed, thus reducing the accuracy of the feature extraction and matching modules [21]. To overcome this problem, SA-FEMIP exploits FPGA DPR to adapt the variance \( \sigma^2_f \) of the Reconfigurable Gaussian Filter frame-by-frame, based on the estimated noise affecting the input frame (see Section IV). This represents the first main contribution to the FEM adaptation introduced in the SA-FEMIP architecture.

The Adaptive Harris Feature Extractor implements the Harris corner detector. It processes the filtered pixels, received from the Reconfigurable Gaussian Filter, and computes the feature frames. Each feature is represented by its coordinates \((x, y)\) in the frame, and by the related corner response \( R(x, y) \), computed according to eq. (1). The computed corner responses must be thresholded in order to identify those features that potentially represent a real corner. However, the value of the threshold strongly depends on the image environment (e.g., Mars or Moon) and condition (e.g., brightness, noise, contrast). To provide a certain level of adaptation, [8] introduced a self-adaptive threshold. The threshold \( TH \) is initialized at 0 at startup (i.e., all features are accepted). It is then updated based on the number of features extracted from the current image, and applied to the next frame. In particular, for each frame, the number of selected features \( NF \) is compared with the number of expected features \( TF \) set to 3,000 in our specific test application. If the two numbers are equal with a tolerance \( \delta \) the threshold is already optimized. If not, the new threshold is computed as \( TH = TH + ((TF - NF) \times (0.5/TF) \times TH) \). The reader may refer to [8] for additional details.

Computing the threshold for the next frame based on information on the current frame is acceptable thanks to the high frame rate of the proposed architecture, that guarantees marginal differences in consecutive frames. However, if the image presents a small rugged region, the extracted features, and subsequently the extracted matching points, will be concentrated in that limited region. This leads to poor information extracted from the input frames, and therefore to errors in the Motion Estimation phase. This drawback derives from the usage of a single global threshold for an entire frame. The Adaptive Harris Features Extractor (AHFE) component proposed in Section V, implements an adaptive cell-based thresholding that relies on frame partitioning to apply different thresholds to different portions of the frame. This ensures that the extracted features uniformly cover the overall frame.

Eventually, the Feature Matcher of Fig. 1 receives and stores in an internal buffer the features extracted by the Adaptive Harris Feature Extractor. All stored features are then analyzed to discard the ones that are too close to each other. Only the strongest feature in a 3x3 pixel neighborhood is considered and stored in an internal buffer called NMS Buffer (Non-Maxima-Suppression (NMS) phase). The NMS Buffer stores up to 1,000 features coordinates using 4 BRAMs. It is split in two sub-buffers named Frame 1 Features buffer and Frame 2 Features buffer. Each feature is stored in two consecutive images that must be analyzed and matched. Features stored in the two NMS sub-Buffers are compared using a cross-correlation-based approach (Matching phase) to identify the matching points. Only potentially correlated features are actually compared. Analyzing the speed of a space-module during the descending phase, and considering the high input frame rate used to sample images, we identified that a feature can perform a maximum movement of 17 pixels between two consecutive images [25] [2]. Thus, two features can be considered as potentially correlated if they are both in a 35x35 pixel neighborhood in the two considered images. Cross-Correlation is therefore only computed on these features, thus reducing the computational load and speeding up the matching task. Basically, the Feature Matcher scans the Frame 1 Features buffer and the Frame 2 Features buffer looking for two correlated features. It compares the coordinates...
A reconfigurable module is a portion of an FPGA design enclosed in an FPGA reconfigurable module (RM in Fig. 3). The proposed approach, the Gaussian Filter (NVE), (ii) the Gaussian Filter Manager, and (iii) the Gaussian Filter. The Gaussian Filter is implemented as described in Section III, but, in order to enable its reconfiguration, the 49 multipliers are enclosed in an FPGA reconfigurable module (RM in Fig. 3). A reconfigurable module is a portion of an FPGA design that can be reconfigured at run-time, without impacting the behavior of the rest of the design. While the Gaussian Filter processes the input image, the NVE estimates the noise level. The NVE is implemented adopting an architecture similar to the one presented in [33]. However, we compute here the noise standard deviation ($\sigma_n$) instead of the noise variance ($\sigma^2_n$). This change has not functional effects since $\sigma_n$ is not actually used to perform calculations like in [33] during the filtering process. When a full frame has been processed, the NVE provides the current estimated $\sigma_n$ to the Reconfiguration Manager. The Reconfiguration Manager exploits this value to look-up into a bitstream address table and to select the proper configuration for the multipliers inside the RM. The multipliers reconfiguration is accomplished by reading the multipliers configuration bitstream from the external memory, choosing the configuration associated with the estimated standard deviation. The bitstream is then used to program the reconfigurable module of the FPGA resorting to the FPGA internal Configuration Port (i.e., ICAP [34] in Xilinx FPGAs). It is worth to highlight here that using the noise standard deviation instead of the noise variance strongly reduces the NVE area, avoiding the multiplier required to compute $\sigma_n^2$. During the reconfiguration process, the Reconfiguration Manager must access the external memory to retrieve the RM configuration bitstream. To avoid the stall of the processing chain, this access must be scheduled when no other module requires information from the external memory. As shown in the timing diagram of Fig. 4, the external memory is accessed by the Reconfigurable Gaussian Filter in write mode to store the computed filtered pixel values. As described in Section III, during this phase the Reconfigurable Gaussian Filter and the Adaptive Harris Feature Extractor work in pipeline, while the noise variance is computed (Image Filtering, Features Extraction and Noise Estimation activities in Fig. 4). At the end of the feature extraction, the NMS phase takes place, and, finally, the Feature Matcher performs the matching phase where it accesses the external memory in read mode to retrieve the data needed to compute the cross-correlation. It is worth noting that the Image Filtering and the Features Extraction slots are not perfectly aligned due to the latency in loading the internal pipeline of the Reconfigurable Gaussian Filter. Looking at Fig. 4, the external memory is always idle during the NMS phase ($t_{\text{idle}}$ in Fig. 4). This time slot can be used to reconfigure the filter (R task in Fig. 4) without stalling the processing chain. This means that no timing overhead is introduced in the feature extraction and matching task. 

Fig. 3: Reconfigurable Gaussian Filter hardware architecture
Finally, since for each value of $\sigma_f^2$ a configuration bitstream must be stored in the external memory, the range of possible $\sigma_f^2$ must be discretized according to the available external memory space (see Section VI for detailed information about the size of each bitstream).

V. Adaptive Harris Feature Extractor

This section introduces the hardware architecture of the Adaptive Harris Feature Extractor (Fig. 5), focusing on the novel thresholding approach that ensures to uniformly distribute the extracted features on the input frame.

![Adaptive Harris Features Extractor internal architecture](image)

The first two modules of the Adaptive Harris Feature Extractor, $L_x$ and $L_y$, compute the spatial image derivatives of the filtered image in the horizontal ($L_x$) and vertical ($L_y$) direction, respectively. This operation is performed by convolving the filtered image, received from the Reconfigurable Gaussian Filter, with the 3x3 Prewitt kernel [21], using an architecture similar to the one proposed for the Gaussian Filter. Then, the Corner Response Calculator module computes the determinant and the trace of the second-moment matrix $N(x,y)$, which are required to calculate the Harris corner response $R(x,y)$ associated with each input pixel. Finally, the Adaptive Cell-based Thresholding (ACTH) module thresholds the computed corner responses, asserting the val_feat signal when the current processed pixel is above the threshold and therefore represents an actual feature.

Selecting a well distributed set of features within the frame improves the motion estimation accuracy. In order to level the distribution of the extracted features on the processed frames, the ACTH module splits the input image in 64 cells of 128x128 pixels each. It then tries to extract the same number of features from each cell. This goal is achieved exploiting a local threshold for each cell, instead of using a single global threshold for the overall image, as done in [8]. The chosen number of cells represents a good trade-off between accuracy and memory requirements. As it will be discussed in Section VI, it ensures to uniformly cover the input image and, at the same time, to avoid the introduction of a large number of cells that would require a lot of memory to store the related information items.

The ACTH module analyzes information related to the current frame implementing the decision process described in Alg. 1, and computes the local thresholds to use for the following frame. The threshold adaptation process requires to know, for each cell composing the frame, (i) the number of extracted features ($NF$), (ii) the current threshold ($TH$) initialized at the highest possible value at startup (i.e., no features are extracted), and (iii) the current number of expected features ($TF$). In our tests, $TF$ has been initialized to 48 to fix the overall number of expected features per frame ($OTF$) to about 3,000 features. This value limits the size of the internal buffer used to store the extracted features in the Feature Matcher module. Since $NF$, $TH$ and $TF$ must be defined for each cell of the frame, they are stored in the form of 8x8 matrices, with the matrix elements associated to the defined frame cells.

Alg. 1 can be split in two main parts. The former (from row 8 to 27) updates the cell threshold values. For every cell $(i,j)$, it compares the number of extracted features $NF[i,j]$ with the number of expected features $TF[i,j] (Disp$ at row 10). If these two values differ no more than a defined tolerance (i.e., the difference is contained in the range $[-\varepsilon,-\delta]$) the threshold is not changed (row 23). Otherwise, the threshold is updated adding $Step$ to its current value (rows 13 and 21). One additional test is performed when the number of extracted features is lower than the number of expected ones (from row 14 to 18). In particular, the updated threshold (new $TH[i,j]$) is considered valid if it is higher than a lower bound value ($LowTH$). If not, the threshold is not changed (row 15). This avoids to over-reduce the threshold value and to provide in output weak features that could be potentially associated with the noise in the input frame. In fact, if a cell represents a flat part of the planetary surface, a high value of the image gradient, and consequently a high value of the computed corner response, is mainly due to the noise.

The second part of Alg. 1 (from row 28 to 51) optimizes the number of features extracted for each cell in order to obtain a total number of features for the frame as close as
possible \( OTF \). To do that, it is worth to remember that all cells that reach the threshold lower bound cannot further update their threshold. If, with this threshold, the number of extracted features for the cell \( NF[i,j] \) is lower than the number of expected features for the cell \( TF[i,j] \) there is a certain amount of features corresponding to \( |Disp| \) that can be redistributed to other cells with threshold higher than the lower bound. To exploit this, each cell with threshold lower than the lower bound is marked through the \( LowTH_{cell}[i,j] \) flag (row 16) and the number of unused features of these cells is accumulated in the \( TF_{slack} \) parameter (row 17) in order to be redistributed to the other cells, according to the decision process described from row 28 to 51. The \( TF_{slack} \) represents the number of expected features that can be borrowed to the cells that have not reached the threshold lower bound (i.e., \( LowTH_{cell}[i,j] = 0 \)). The number of features to borrow to each cell \( (TF_{slackcell}) \) is computed dividing \( TF_{slack} \) by the number of cells composing the image. To ensure a high number of extracted features, the algorithm always borrows at least one feature to each cell with \( LowTH_{cell}[i,j] = 0 \) (rows 29 to 33). If the total number of extracted features \( Curr\_EF \) is lower or equal to \( OTF \) (from row 36 to 41), if the cell has not reached the threshold lower bound the number of expected features for the cell is increased of \( TF_{slackcell} \) (row 38). Otherwise, it is left unchanged (row 40).

Using this approach, the total number of extracted features \( (Curr\_EF) \) could increase more and more due to the borrow mechanism, that increases the \( TF[i,j] \) values. To allow a decrease of the \( TF[i,j] \) values, and so to maintain the overall number of extracted features around \( OTF \), if \( Curr\_EF \) exceeds \( OTF \), the target feature value of each cell is decreased by 1 (from row 43 to 47).

The hardware architecture of the ACTH module is shown in Fig. 6.

![Fig. 6: Adaptive Cell-based Thresholding hardware architecture](image)

It is composed of four main modules: (i) the Features Counter, (ii) the Thresholds & Target Features Updater, (iii) the \( TH \) sh vector, and (iv) the \( NF \) sh vector. The Thresholds & Target Features Updater module implements Alg. 1, while the Features Counter performs the actual thresholding of each corner response \( R(x,y) \) received from the Corner Response Calculator (see Fig. 5). This module reads the thresholds associated with each image cell (i.e., \( TH[i,j] \) in Alg. 1) that are stored in the \( TH \) sh vector, and compares them with the received corner responses, asserting the \( val\_feat \) signal if \( R(x,y) \) is higher than the threshold associated with the image cell containing the currently processed pixel.

![Fig. 7: TH and NF shifter vector hardware architecture](image)
The TH sh vector module is implemented as in Fig. 7. It is composed of eight 8-positions shift registers connected as circular buffers. Each shift register stores eight threshold values associated with a row of image cells (it is worth to remember that the image is split in 64 cells organized in 8 rows with 8 cells each, and a threshold value is associated with each cell). The en signal enables the 1-position right shifting operation, while the Sel signal selects which shift register must be rotated. These two control signals are driven in order to provide in output the threshold associated with the image cell of the currently processed pixel. Since the image is received in a raster way, and each image cell is composed of 128x128 pixels, en is asserted for a clock cycle every 128 received corner responses (i.e., whenever we move from a cell to the following one). Instead, Sel selects the next shift register (i.e., the next row of image cells) after 128x1024 received corner responses (i.e., whenever a complete row of image cells has been processed). To avoid loosing the threshold values, during the thresholding phase each shift register composing the TH sh vector acts as a circular buffer through the multiplexer driven by the th phase signal (see Fig. 6). Instead, during the thresholds updating phase, the content of the TH sh vector is overwritten (exploiting the Data_in port) with new thresholds values computed by the Thresholds & Target Features Updater module.

Simultaneously to the thresholding task, the Features Counter counts (through an accumulator) the number of extracted features for each image cell (i.e., N F[i,j] in Alg. 1), and stores these values inside the NF sh vector. The NF sh vector is implemented as the TH sh vector (Fig. 7), and both modules share the input control signals. Whenever we move from the current image cell to the next one, the content of the internal accumulator is stored inside the NF sh vector, and it is initialized with the output value provided by the NF sh vector. At the end of the operations described by Alg. 1, a local reset is asserted to clear the content of the NF sh vector in order to prepare it for the next image processing cycle. All aforementioned control signals are generated by the Controller module (Fig. 6), which also coordinates the operations of all modules included in the ACTH.

VI. EXPERIMENTAL RESULTS

To estimate the hardware resources and the timing performances, the SA-FEMIP architecture has been synthesized and implemented, resorting to Xilinx ISE Design Suite 14.6, on a space-qualified Xilinx Virtex 4-QV VLX200 FPGA device that, together with the Virtex 5-QV VFX130 FPGA, represents the state-of-the-art architecture for space-qualified reprogrammable FPGAs. The reason to select the Virtex 4 architecture instead of the newer Virtex 5 is twofold. First the SA-FEMIP architecture has been designed to be integrated and tested inside the Thales Alenia Space Avionic Testbench (ATB), i.e., a hardware infrastructure emulating the on-board computing platform of a spacecraft. The ATB is equipped with a Gaisler Research GR-CPCI-XC4V development board [35]. This board integrates a Xilinx Virtex 4 VLX200 FPGA, which provides the same internal logic architecture of the space-qualified version. Second, implementing SE-FEMIP on a Virtex 4 FPGA allowed us to perform fair comparisons with other published architectures, thus highlighting the benefits of the introduced improvements. Post place and route simulations have been done using Modelsim SE 10.0c to annotate the switching activities of internal nodes, and the Xilinx XPower Analyzer has been exploited of power consumption estimation.

Table I compares the proposed adaptive architecture with the fixed architecture proposed in [8] (FEMIP). Comparison is performed in terms of area overhead by considering internal logic and memory resources (i.e., registers, Look-Up Tables (LUTs), and Block-RAMs (BRAMs)) [23]). Percentages in Table I represent the used portion of the hardware resources available in the Xilinx Virtex 4-QV VLX200 FPGA. It is important to point out that the synthesis of both FEMIP and SA-FEMIP architectures has been forced to avoid the use of DSPs. The reasons for this choice will be better elaborated later in this section. Power consumption is analyzed considering an operating frequency of 60 MHz for both architectures.

<table>
<thead>
<tr>
<th>Module</th>
<th>FPGA Area Occupation</th>
<th>Max.Freq. (MHz)</th>
<th>Power [W]</th>
</tr>
</thead>
<tbody>
<tr>
<td>FEMIP (4)</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>GP</td>
<td>608 (3.00%)</td>
<td>1,518 (7.51%)</td>
<td>0.07</td>
</tr>
<tr>
<td>HFE</td>
<td>1,106 (5.62%)</td>
<td>1,081 (5.22%)</td>
<td>0.11</td>
</tr>
<tr>
<td>RF</td>
<td>2,452 (12.56%)</td>
<td>930 (4.53%)</td>
<td>0.03</td>
</tr>
<tr>
<td>Total</td>
<td>4,166 (21.26%)</td>
<td>2,292 (11.28%)</td>
<td>0.22</td>
</tr>
<tr>
<td>Proposed</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>RF</td>
<td>939 (4.95%)</td>
<td>746 (3.81%)</td>
<td>0.02</td>
</tr>
<tr>
<td>HFE</td>
<td>1,962 (10.05%)</td>
<td>1,008 (5.14%)</td>
<td>0.02</td>
</tr>
<tr>
<td>RF</td>
<td>2,452 (12.56%)</td>
<td>930 (4.53%)</td>
<td>0.03</td>
</tr>
<tr>
<td>Total</td>
<td>4,353 (22.66%)</td>
<td>2,058 (10.62%)</td>
<td>0.24</td>
</tr>
<tr>
<td>Overhead</td>
<td>Total</td>
<td>409 (2.18%)</td>
<td>2,292 (11.66%)</td>
</tr>
</tbody>
</table>

Table I shows that SA-FEMIP FPGA occupation is around 10% for logic and memory resources, and the overhead w.r.t. FEMIP is less than 2%. This overhead is due to the additional modules required to perform adaptation in the Reconfigurable Gaussian Filter (RF) described in Section IV and the additional hardware required to implement the Adaptive Harris Features Extractor (AHFE) presented in Section V. In particular, in the RF, the increased occupation is due to the NVE and the Reconfiguration Manager modules. Instead, in the AHFE, the area overhead is introduced by the usage of a more complex thresholding approach, with respect to the simple one adopted in FEMIP. It is worth to highlight here that an effort has been placed to limit the registers overhead. The AHFE architecture strongly relies on shift registers structures to implement the required vectors and matrices included in Alg. 1. This kind of component can be efficiently implemented in Xilinx FPGAs, exploiting the Xilinx SRL capability of the Look-Up Tables (LUTs) [23]. This capability makes it possible to use LUTs as shift registers instead of a chain of Flip-Flops, saving hardware resources. As an example, a single LUT, in a Xilinx Virtex-4, can act as a 16x1-bit shift register avoiding a chain of 16 flip-flops [23].

The maximum operating frequencies of each module reported in Table I demonstrate that no timing penalty is introduced in SA-FEMIP by the introduction of the adaptivity
features.

The power consumption of each module reported in Table I does not take into account the contribution of the clock circuitry and the leakage. These contributions are included in the overall power consumption. By comparing the power consumption of SA-FEMIP with the one of FEMIP a very limited overhead equal to 4.75% is observed. It is worth noting that the power consumption of the RF module does not include the power used during the partial reconfiguration process. However, according to [36] the reconfiguration process consumes few tens of mW, only.

Eventually, the throughput, in terms of frames-per-second (fps), is the same (i.e., 33 fps) for both FEMIP and SA-FEMIP.

In Table II, the performances and the area occupation of SA-FEMIP have been compared with FEIC [17] [37]. FEIC is a Feature Extraction and matching Integrated Circuit, based on the Harris algorithm, that University of Dundee developed for the European Space Agency (ESA) in the framework of the Navigation for Planetary Approach and Landing (NPAL) project. LUTs and BRAMs used by FEIC are reported for a Virtex II device (as in [37]), but the internal logic and memory architecture is the same as in Virtex 4 family devices. The reported data confirm the great improvements of the proposed architecture, both in terms of resources usage and throughput.

<table>
<thead>
<tr>
<th>TABLE II: Resource usage and throughput of FEIC and SA-FEMIP for a Xilinx XQR4VLX200 Virtex 4 FPGA device</th>
</tr>
</thead>
<tbody>
<tr>
<td>Resource Usage</td>
</tr>
<tr>
<td>LUTs</td>
</tr>
<tr>
<td>Proposed</td>
</tr>
<tr>
<td>FEIC [37]</td>
</tr>
<tr>
<td>Improvements</td>
</tr>
</tbody>
</table>

The low area occupation of SA-FEMIP allow designers to exploit the free hardware resources to apply fault mitigation strategies to increase the reliability of the design, a key requirement in space applications. Several fault-mitigation techniques against Single Event Upset (SEU) can be applied on FPGA devices. Following [38], these techniques can be classified as (1) netlist level techniques or (2) register transfer level techniques.

Netlist level techniques include different types of Triple Modular Redundancy (TMR) techniques [38]. Triplication can be limited to the sequential elements of the circuit (i.e., Sequential logic TMR) introducing for each register of the design two additional registers and a 3-input voter. Otherwise, the full design can be triplicated (i.e., Global TMR) introducing a hardware overhead equal to the 200% of the original design.

Register transfer level techniques aim at protecting the Finite State Machines (FSMs) of the design (e.g., Safe FSM Coding, and 3-Hamming distance enhancement in FSMs). Usually, the overhead introduced by these techniques is one order of magnitude lower than the one associated with the TMR techniques.

In general, the total hardware overhead, even if a combination of the aforementioned techniques is exploited, can vary from 60% up to 200% of the original design [38]. It is clear that, given the low amount of resources required by SA-FEMIP, fault tolerance techniques can be freely implemented within the selected device. Moreover, even after the implementation of fault tolerance techniques, space is also available to integrate in the same device additional FPGA-based IP cores useful to accelerate other computational intensive tasks performed during the descending phase (e.g., Hazard map computation [39]). This is very important considering the limited resources available in space applications.

As mentioned at the beginning of this section SA-FEMIP has been synthesized avoiding the use of DSPs. This decision can now be better motivated. DSPs have the advantage of further reducing the area occupation of FEMIP especially when multipliers are implemented. With the use of DSPs the SA-FEMIP occupation would be reduced to 9,029 (5.06%) LUTs, 66 (68.75%) DSPs, while the occupation of registers and BRAMs remains the same. Nevertheless, DSPs are limited resources. With 66 DSPs required out of the 96 available in the Virtex 4 VLX200 FPGA, TMR techniques for this portion of the design would not be possible. Moreover, the intensive use of DSPs increase the routing complexity introducing a 30% frequency penalty in the design.

SA-FEMIP has not been compared to [19], since [19] implements the multi-scale Harris detector (i.e., a rotation-invariant version of the Harris detector). [19] consumes a lot of hardware resources, and implements a feature that is not actually required in EDL applications since rotations between two consecutive images are limited [40].

The proposed architecture has been evaluated in terms of accuracy and robustness, exploiting an image dataset, provided by Thales Alenia Space Italia s.p.a., company, that covers different landing zones (i.e., portions of the Mars surface), environmental conditions (i.e., image quality), and camera movement types, in a synthesized Mars environment. Camera movement types include displacements, up to 30 meters, at different altitudes (from 1,000 meters to 5,000 meters), and angular speed (up to 2.5 °/s, in accordance to [40]), while image quality types include the injection of different levels of Gaussian noise, blur, brightness and contrast variations.

According to [40], the robustness has been evaluated exploiting two parameters: (i) Number of Extracted Matches (NEM), that identifies the number of matching points, and (ii) Spatial Distribution of Points (SDP), that measures how much the extracted matching points are uniformly spread in the image, defined as:

\[ SDP = \frac{\sum_{i=1}^{N} -p_i \log p_i}{\log N} \]  

where \( p_i \) is computed as the number of matching points within an image cell (see Section V) over the total number of extracted matching points in the frame, and \( N \) is the number of image cells (i.e., 64).

Fig. 8 shows the SDP results obtained from FEMIP [8] and SA-FEMIP by providing in input the images composing the aforementioned dataset. Thanks to the adaptive cell-based
thresholding approach, the proposed architecture outperforms FEMIP results in every test case (i.e., Test Index). In particular, the improvements are very high (from Test Index 0 to 76) when the input images represent a landing zone characterized by few small rugged regions. This is visually highlighted in Fig. 9 that depicts the matching points extracted by FEMIP (Fig. 9a) and SA-FEMIP (Fig. 9b). Each figure shows two consecutive input images with lines connecting the features that match in the two images.

Fig. 10 shows the NEM versus different levels of injected Gaussian noise variance \( \sigma_n^2 \) (since FEMIP has a fixed \( \sigma_f^2 = 2 \), its NEM is represented by the dashed line).

A fixed \( \sigma_f^2 \) does not allow to reach the highest NEM for every noise level. Thus, exploiting the reconfigurable filter architecture (see Section IV) it is possible to highly increase the number of extracted matches, as shown by the Optimal line in Fig. 10. In order to follow the trend of this line, in the proposed architecture 5 configurations for the RF module have been chosen. In particular, these configurations are associated to \( \sigma_f^2 \) equal to 0.5, 0.75, 1, 1.5 and 2, for the noise level ranges \([0,100], [100,200], [200,300], [300,600], [600,1600]\), respectively. As can be seen in Fig. 10, the usage of a reconfigurable filter increases the NEM value w.r.t. FEMIP up to 2 times, especially for a \( \sigma_n^2 \) lower than 600.

Moreover, as described in Section IV, the usage of the DPR enables to save resources with respect to use a static hardware architecture including 49 multipliers, each one with a multiplexer to select the right Gaussian kernel value. In the proposed architecture, using the same fixed-point data representation adopted in [8], the RM and the Reconfiguration Controller (Fig. 3) require 5,320 LUTs and few registers. Instead, a static hardware architecture (as the one reported in [33]), with the same data parallelism, would require about 19,000 LUTs, leading to a save of 72% of hardware resources.

Since each bitstream for the RM module is 166 KB (for the selected FPGA device), to store the 5 configurations 830 KB are required in the external memory. Since the throughput of the Reconfiguration Controller is 400 MB/s (i.e., this value is limited by the maximum throughput of the ICAP [34]), the time required to reconfigure the RM is equal to 0.42 ms. This time fits the idle time of the external memory (i.e., \( t_{idle} \) in Fig. 4) that is equal to 1.2 ms (i.e., the time required by the Matcher to perform the NMS phase). For the sake of completeness, considering an operating frequency of SA-FEMIP chain equal to 60 MHz, the time required to perform the filtering and the matching tasks (i.e., \( t_{filtering} \) and \( t_{matching} \) in Fig. 4) is 21.5 ms and 7.6 ms, respectively.

Eventually, Fig. 11 shows the percentages of Correct Matches (CM) for the different filter configurations and injected noise levels. CM has been computed exploiting the knowledge about the camera movement between two consecutive images of the dataset. Starting from the position of a matching point in the first image, it is possible to compute its expected position in the second image by using a three dimensional roto-translation model. For each couple of images in the dataset this process has been automated through a MATLAB script. Then, the CM values have been computed by comparing the outputs of the script with the ones of the proposed architecture.

It is worth noting that, the CM values are not computed for every \( \sigma_f^2 \) since, as shown in Fig. 10, with a filter characterized by a low variance it is not possible to extract matching points for very high noise levels.

As can be seen in Fig. 11, the accuracy of the different filter
configurations is higher than 85% for every noise level, and it is almost equal for a fixed noise level. These data demonstrate that the proposed filter is able to maximize the NEM, while preserving the correctness of its outputs.

VII. CONCLUSION

This paper presented a self-adaptive module for features extraction and matching designed to suit modern space-qualified FPGAs. In order to make the core completely self-adaptive, the Dynamic Partial Reconfiguration (DPR) feature of modern space-qualified FPGAs is exploited to design a self-reconfigurable Gaussian Filter module, while a novel adaptive algorithm and the associated hardware architecture, embedded in the features extraction module, increase the quality of the output results of the entire features extraction and matching processing chain. Experimental results show the accuracy and the limited overhead of resources needed to implemented the proposed architecture, while maintaining the same throughput performances with respect to a state-of-the-art reference architecture, allowing to exploit the free hardware resources to apply fault mitigation strategies to increase the reliability of the design.

ACKNOWLEDGMENT

The authors would like to express their sincere thanks to the whole design team of Thales Alenia Space Italia s.p.a company for their helpful hints, guidelines, and fruitful brainstorming meetings.

REFERENCES

[22] J. Fernández-Berni, R. Carmona-Galan, F. Pozas-Flores, A. Zarándy, and Á. Rodríguez-Vázquez, “Multi-resolution power-law gaussian filtering