I spent the weekend trying to implement SIFT based on Lowe’s paper “Distinctive Image Features from Scale-Invariant Keypoints.” Needless to say it was a fun weekend. The main focus was on the first two steps of the algorithm: “Detection of scale-space extrema” and “Accurate keypoint localization.”

The implementation of the first two steps is not working properly yet. There seems to be a bug in the function that builds the scale-space pyramids. Specifically, the sigma values used to create the five Gaussian images of each octave appear to be incorrect. When the initial sigma was set to 1.0, no keypoints were found in any octave, and when the initial sigma was dropped to 0.7, approximately 4,500 keypoints were found in the test image. The test program used three octaves, and most of the keypoints were found in the first two; the last octave produced under 20. Compare this to Lowe’s results on a 233 x 189 image, where 832 initial keypoints were found and 536 keypoints remained after “thresholding on ratio of principal curvatures.” The test image used in this implementation is 583 x 395, about double the size, so finding nine times as many keypoints seems wrong. In addition, the keypoints are scattered all over the image and don’t appear to correspond to actual corners or edges.

**Fig. 1 - Test image**

Additional implementation notes:

64-bit floating point (double) values between 0.0 and 1.0 are used instead of unsigned char (uchar) values between 0 and 255. This increases the precision and appears to produce more keypoints.

One octave is the set of images at a particular level of the pyramid. For example, octave 1 would consist of five Gaussian-blurred images with resolution 1166 x 790, four Difference of Gaussians (DoG) images at the same resolution, and two images that contain keypoints. Lowe’s paper has a good diagram on page 6.

The maxima and minima of the DoG images are calculated by comparing each pixel against its 8 neighboring pixels in the current image and the 9 pixels each in the images above and below. This is 26 comparisons for each pixel of the image. Luckily, not all comparisons are needed in practice, since the search can stop as soon as any neighbor disqualifies the pixel as the smallest (minimum) or largest (maximum). Lowe’s paper has a good diagram on page 7.

The equation to calculate the DoG requires two Gaussian images, where the first has a sigma that is a factor of `k` larger than the second. Here `D` represents the DoG image, `G` the Gaussian kernel, and `I` the image.

```
D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
```

Since the SIFT algorithm already calculates the Gaussian blurs of the image separately, the convolution with `I` can be distributed into each Gaussian term. In the modified equation, `L(x, y, kσ)` stands for `G(x, y, kσ) ∗ I(x, y)` and `L(x, y, σ)` for `G(x, y, σ) ∗ I(x, y)`:

```
D(x, y, σ) = L(x, y, kσ) − L(x, y, σ)
```

Another thing to note is that the DoG values can be negative, since DoG images are not really images. To visualize the DoG values better, 0.5 can be added to each value.

**Fig. 2 - DoG calculation with 0.5 added to value in matrix to improve visibility for visualization**

This is where my implementation falls apart: specifically, on what value of `k` to use and what sigma to use for each successive Gaussian image.

On page 7 of Lowe’s paper, `k = 2^{1/s}`, where `s` appears to be the number of images desired for calculating the minima and maxima. In this case, `s = 2` and `k = 2^{1/2} = \sqrt{2}`. This is just a guess though, since the paper specifies that there will be `s + 3` blurred images per octave.

That part is clear, but it remains unclear how to calculate successive sigma values. For instance, is it as simple as multiplying the previous sigma by `k`? This SIFT article from AI Shack seems to describe that approach. Here are similar numbers, reproduced using a starting sigma of 1.0 and `k = \sqrt{2}`; each new sigma is calculated by multiplying the previous sigma by `k`. The next octave starts with the sigma of the third image, since Lowe’s paper says to subsample the third Gaussian image of the previous octave, which saves calculations.

| Octave | Image 1 | Image 2 | Image 3 | Image 4 | Image 5 |
| --- | --- | --- | --- | --- | --- |
| 1 | 1.0 | 1.41 | 2.0 | 2.83 | 4.0 |
| 2 | 1.41 | 2.0 | 2.83 | 4.0 | 5.66 |

Looking at the table, the sigma value seems to double at every other image in the octave. Is this correct? Or should the sigma only double at the last image?

One possible error in my implementation could be this line:

```
gaussian_blur(gauss[i - 1], gauss[i], scale[i % num_scales]);
```

Here the Gaussian blur is applied to the previous image rather than to the first image of the octave. Using the first image seems to make more sense, since the previous image already has a Gaussian blur applied. Wouldn’t blurring it again increase the effective sigma?

Of the four SIFT implementations that I studied, there were three different approaches to calculating the Gaussian pyramid.

- Source code for the OpenCV SIFT implementation
- The implementation that OpenCV SIFT is based on
- Yet another OpenCV SIFT implementation
- SIFT implementation using VXL

**Fig. 3 - Gaussian blur on octave 2 image 1**

**Fig. 4 - Gaussian blur on octave 2 image 4**

This part was difficult to understand, mainly because of the linear algebra and multivariable calculus required to understand how edge responses are eliminated. Edge responses are minima and maxima that correspond to edges in the image. Apparently, these points are less stable under changes in scale and noise than corners.

To eliminate edge responses, Lowe uses a Hessian matrix to compute the principal curvatures.

```
H = | Dxx  Dxy |
    | Dxy  Dyy |
```

How are Dxx, Dyy, and Dxy calculated, and what do they represent? Apparently, they are second derivatives that can be estimated by taking differences of neighboring pixels. Once we have these values, we can plug them into the following two equations, where alpha and beta are the eigenvalues. The eigenvalues themselves never need to be computed, since only Dxx, Dyy, and Dxy are used in the calculations.

```
Tr(H) = Dxx + Dyy = α + β and
Det(H) = Dxx * Dyy − (Dxy)^2 = αβ.
```

After that, we divide the square of Tr(H) by Det(H) to find the ratio. We then keep the point if this ratio is less than `(r + 1)^2 / r` for some chosen value `r`; Lowe uses `r = 10` in his paper. These equations are found on page 12.

Once again it was interesting to see how different the implementations were.

**Fig. 5 - Keypoints after edge responses eliminated in octave 2**

**Fig. 6 - Keypoints after edge responses eliminated in octave 3**


```
#include <iostream>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>

// User Settings
#define WINDOW_NAME "Results"
#define IMAGE "chugach-mtns.jpg"
#define SHOW_GAUSSIANS true
#define SHOW_DOGS true
#define SHOW_EXTREMA true

// Grayscale Settings
#define R_WEIGHT 0.21
#define G_WEIGHT 0.71
#define B_WEIGHT 0.07

// Gaussian Blur Settings
#define KERNEL_SIZE 3

// Difference of Gaussians Settings
#define NUM_OCTAVES 3
#define NUM_SCALES 5
#define PRE_BLUR_SIGMA 0.5
#define INITIAL_SIGMA 0.7

// Extrema Detection Settings
#define CONTRAST_THRESHOLD 0.03

// Constant Constants
#define MAX_8BIT 255

using namespace cv;
using namespace std;

void grayscale(Mat src, Mat &gray)
{
    gray = Mat::zeros(src.rows, src.cols, CV_64F);
    Vec3b rgb;
    for (int i = 0; i < src.rows; ++i) {
        for (int j = 0; j < src.cols; ++j) {
            rgb = src.at<Vec3b>(i, j);
            // OpenCV stores pixels as BGR, so channel 2 is red and channel 0 is blue.
            gray.at<double>(i, j) = R_WEIGHT * rgb[2] + G_WEIGHT * rgb[1] + B_WEIGHT * rgb[0];
        }
    }
}

// Renamed from normalize() to avoid clashing with cv::normalize under
// "using namespace cv".
void normalize_image(Mat src, Mat &norm)
{
    norm = Mat::zeros(src.rows, src.cols, CV_64F);
    for (int i = 0; i < src.rows; ++i) {
        for (int j = 0; j < src.cols; ++j) {
            norm.at<double>(i, j) = src.at<double>(i, j) / MAX_8BIT;
        }
    }
}

void scale_down_2x(Mat src, Mat &scaled)
{
    scaled = Mat::zeros(src.rows / 2, src.cols / 2, CV_64F);
    for (int i = 0; i < scaled.rows; ++i) {
        for (int j = 0; j < scaled.cols; ++j) {
            scaled.at<double>(i, j) = src.at<double>(i * 2, j * 2);
        }
    }
}

void scale_up_2x(Mat src, Mat &scaled)
{
    scaled = Mat::zeros(src.rows * 2, src.cols * 2, CV_64F);
    double p0;
    double p1;
    for (int i = 0; i < src.rows; ++i) {
        for (int j = 0; j < src.cols; ++j) {
            scaled.at<double>(i * 2, j * 2) = src.at<double>(i, j);
            // Fill the new pixels by linear interpolation, duplicating at the borders.
            if (i + 1 == src.rows) {
                scaled.at<double>(i * 2 + 1, j * 2) = src.at<double>(i, j);
            } else {
                p0 = src.at<double>(i, j);
                p1 = src.at<double>(i + 1, j);
                scaled.at<double>(i * 2 + 1, j * 2) = p0 + (p1 - p0) * 0.5;
            }
            if (j + 1 == src.cols) {
                scaled.at<double>(i * 2, j * 2 + 1) = src.at<double>(i, j);
            } else {
                p0 = src.at<double>(i, j);
                p1 = src.at<double>(i, j + 1);
                scaled.at<double>(i * 2, j * 2 + 1) = p0 + (p1 - p0) * 0.5;
            }
            if (i + 1 == src.rows || j + 1 == src.cols) {
                scaled.at<double>(i * 2 + 1, j * 2 + 1) = src.at<double>(i, j);
            } else {
                p0 = src.at<double>(i, j);
                p1 = src.at<double>(i + 1, j + 1);
                scaled.at<double>(i * 2 + 1, j * 2 + 1) = p0 + (p1 - p0) * 0.5;
            }
        }
    }
}

void gaussian_blur(Mat src, Mat &blur, double sigma)
{
    blur = src.clone();
    // Squared x- and y-distances of each kernel cell from the centre.
    double xdist[KERNEL_SIZE][KERNEL_SIZE] = {
        {1, 1, 1},
        {0, 0, 0},
        {1, 1, 1}
    };
    double ydist[KERNEL_SIZE][KERNEL_SIZE] = {
        {1, 0, 1},
        {1, 0, 1},
        {1, 0, 1}
    };
    double kernel[KERNEL_SIZE][KERNEL_SIZE];
    double sum = 0;
    double sigFactor = 1.0 / (2.0 * M_PI * sigma * sigma);
    // NOTE: a fixed 3x3 kernel truncates the Gaussian badly for sigmas much
    // above ~0.6, which may itself distort the effective sigmas in the pyramid.
    for (int x = 0; x < KERNEL_SIZE; ++x) {
        for (int y = 0; y < KERNEL_SIZE; ++y) {
            kernel[x][y] = sigFactor * exp(-((xdist[x][y] + ydist[x][y]) / (2.0 * sigma * sigma)));
            sum += kernel[x][y];
        }
    }
    // Normalize the kernel so its weights sum to 1.
    for (int x = 0; x < KERNEL_SIZE; ++x) {
        for (int y = 0; y < KERNEL_SIZE; ++y) {
            kernel[x][y] = kernel[x][y] / sum;
        }
    }
    for (int i = 1; i < blur.rows - 1; ++i) {
        for (int j = 1; j < blur.cols - 1; ++j) {
            blur.at<double>(i, j) =
                src.at<double>(i - 1, j - 1) * kernel[0][0] +
                src.at<double>(i - 1, j)     * kernel[0][1] +
                src.at<double>(i - 1, j + 1) * kernel[0][2] +
                src.at<double>(i, j - 1)     * kernel[1][0] +
                src.at<double>(i, j)         * kernel[1][1] +
                src.at<double>(i, j + 1)     * kernel[1][2] +
                src.at<double>(i + 1, j - 1) * kernel[2][0] +
                src.at<double>(i + 1, j)     * kernel[2][1] +
                src.at<double>(i + 1, j + 1) * kernel[2][2];
        }
    }
}

void diff_of_gaussians(Mat gauss1, Mat gauss2, Mat &dog)
{
    dog = Mat::zeros(gauss1.rows, gauss1.cols, CV_64F);
    for (int i = 0; i < dog.rows; ++i) {
        for (int j = 0; j < dog.cols; ++j) {
            // D = L(kσ) − L(σ): the more-blurred image minus the less-blurred one.
            dog.at<double>(i, j) = gauss2.at<double>(i, j) - gauss1.at<double>(i, j);
        }
    }
}

void scale_space_pyramid(Mat src, Mat gauss[], Mat dogs[], double sigma, int num_octaves, int num_scales)
{
    int num_images = num_octaves * num_scales;
    int dog_count = 0;
    double scale[num_scales];
    double sig_prev = 0;
    double sig_total = 0;
    // s = num_scales - 3, so k = 2^(1/s).
    double k = pow(2.0, 1.0 / (num_scales - 3));
    scale[0] = sigma;
    // scale[i] is the incremental sigma that takes an image already blurred to
    // k^(i-1) * sigma up to k^i * sigma, since Gaussian blurs compose as
    // sig_total^2 = sig_prev^2 + sig_inc^2.
    for (int i = 1; i < num_scales; ++i) {
        sig_prev = pow(k, i - 1) * sigma;
        sig_total = sig_prev * k;
        scale[i] = sqrt(sig_total * sig_total - sig_prev * sig_prev);
    }
    gaussian_blur(src, gauss[0], scale[0]);
    for (int i = 1; i < num_images; ++i) {
        if (i % num_scales == 0) {
            // Start a new octave by subsampling the image of the previous
            // octave that has twice its base sigma (the third of five).
            scale_down_2x(gauss[i - num_scales + (num_scales - 3)], gauss[i]);
        } else {
            gaussian_blur(gauss[i - 1], gauss[i], scale[i % num_scales]);
            diff_of_gaussians(gauss[i - 1], gauss[i], dogs[dog_count++]);
        }
    }
}

void detect_extrema(Mat dogs[], Mat extremas[], int num_octaves, int num_dogs)
{
    int dogs_per_octave = num_dogs / num_octaves;
    Mat image_above;
    Mat image;
    Mat image_below;
    Mat extrema;
    int extrema_count = 0;
    int found = 0;
    double val;
    for (int i = 0; i < num_dogs; ++i) {
        if (i % dogs_per_octave == 0) {
            extrema = Mat::zeros(dogs[i].rows, dogs[i].cols, CV_64F);
        } else if ((i + 1) % dogs_per_octave > 0) {
            int rows = dogs[i].rows - 1;
            int cols = dogs[i].cols - 1;
            image_above = dogs[i + 1];
            image = dogs[i];
            image_below = dogs[i - 1];
            for (int h = 1; h < rows; ++h) {
                for (int w = 1; w < cols; ++w) {
                    val = image.at<double>(h, w);
                    // Discard low-contrast extrema: |D| must EXCEED the
                    // threshold. Note fabs, not abs, which truncates to int.
                    if (fabs(val) < CONTRAST_THRESHOLD) {
                        continue;
                    }
                    // val must be strictly larger (maximum) or strictly smaller
                    // (minimum) than all 26 neighbors in the current, previous,
                    // and next DoG images.
                    bool is_max = true;
                    bool is_min = true;
                    for (int dh = -1; dh <= 1; ++dh) {
                        for (int dw = -1; dw <= 1; ++dw) {
                            double above = image_above.at<double>(h + dh, w + dw);
                            double below = image_below.at<double>(h + dh, w + dw);
                            if (val <= above || val <= below) is_max = false;
                            if (val >= above || val >= below) is_min = false;
                            if (dh == 0 && dw == 0) continue; // skip the pixel itself
                            double cur = image.at<double>(h + dh, w + dw);
                            if (val <= cur) is_max = false;
                            if (val >= cur) is_min = false;
                        }
                    }
                    if (!is_max && !is_min) {
                        continue;
                    }
                    // Second derivatives estimated with central finite differences.
                    double dxx = image.at<double>(h, w - 1) + image.at<double>(h, w + 1) - 2.0 * val;
                    double dyy = image.at<double>(h - 1, w) + image.at<double>(h + 1, w) - 2.0 * val;
                    double dxy = (image.at<double>(h - 1, w - 1) + image.at<double>(h + 1, w + 1) -
                                  image.at<double>(h + 1, w - 1) - image.at<double>(h - 1, w + 1)) / 4.0;
                    double trH = dxx + dyy;
                    double detH = dxx * dyy - dxy * dxy;
                    // Lowe's edge-response threshold with r = 10: (r + 1)^2 / r.
                    double R = ((10.0 + 1) * (10.0 + 1)) / 10.0;
                    if (detH > 0 && trH * trH / detH < R) {
                        extrema.at<double>(h, w) = 1.0;
                        found++;
                    }
                }
            }
            extremas[extrema_count++] = extrema;
        }
    }
    cout << found << endl;
}

int main(int argc, char *argv[])
{
    namedWindow(WINDOW_NAME, WINDOW_AUTOSIZE);
    int num_images = NUM_OCTAVES * NUM_SCALES;
    int dog_count = NUM_OCTAVES * (NUM_SCALES - 1);
    int extrema_count = dog_count - (NUM_OCTAVES * 2);
    Mat src = imread(IMAGE);
    Mat gray;
    Mat norm;
    Mat preblur;
    Mat scaledUp;
    Mat gauss[num_images];
    Mat dogs[dog_count];
    Mat extremas[dog_count - (NUM_OCTAVES * 2)];
    grayscale(src, gray);
    normalize_image(gray, norm);
    gaussian_blur(norm, preblur, PRE_BLUR_SIGMA);
    // Double the input image before building the pyramid.
    scale_up_2x(preblur, scaledUp);
    scale_space_pyramid(scaledUp, gauss, dogs, INITIAL_SIGMA, NUM_OCTAVES, NUM_SCALES);
    detect_extrema(dogs, extremas, NUM_OCTAVES, dog_count);
    if (SHOW_GAUSSIANS) {
        for (int i = 0; i < num_images; ++i) {
            imshow(WINDOW_NAME, gauss[i]);
            waitKey(0);
        }
    }
    if (SHOW_DOGS) {
        for (int i = 0; i < dog_count; ++i) {
            imshow(WINDOW_NAME, dogs[i]);
            waitKey(0);
        }
    }
    if (SHOW_EXTREMA) {
        for (int i = 0; i < extrema_count; ++i) {
            imshow(WINDOW_NAME, extremas[i]);
            waitKey(0);
        }
    }
}
```