/**
   \file Neuro/GistEstimatorBeyondBoF.H

   \brief Implementation of ``Beyond Bags of Features: Spatial Pyramid
   Matching for Recognizing Natural Scene Categories'' by Lazebnik, et
   al.

   The GistEstimatorBeyondBoF class implements the following paper
   within the INVT framework:

   Lazebnik, S., Schmid, C., Ponce, J.
   Beyond Bags of Features: Spatial Pyramid Matching for Recognizing
   Natural Scene Categories
   CVPR, 2006.

   In the paper, the authors describe the use of weak features (oriented
   edge points) and strong features (SIFT descriptors) as the basis for
   classifying images. In this implementation, however, we only concern
   ourselves with strong features clustered into 200 categories (i.e.,
   the vocabulary size is 200 ``words''). Furthermore, we restrict the
   spatial pyramid used as part of the matching process to L = 2, i.e.,
   three levels (0, 1 and 2) with 1x1, 2x2 and 4x4 grids.

   We restrict ourselves to the above configuration because, as Lazebnik
   et al. report, it yields nearly the best results (a vocabulary size
   of 400 does slightly better, but not by much).

   To compute the gist vector for an image, we first divide the image
   into 16x16 pixel patches and compute SIFT descriptors for each of
   these patches. We then assign these descriptors to bins corresponding
   to the nearest of the 200 SIFT descriptors (the vocabulary) gleaned
   from the training phase. This grid of SIFT descriptor indices is then
   converted into a feature map that specifies the grid coordinates for
   each of the 200 feature types. This map allows us to compute the
   multi-level histograms as described in the paper. The gist vectors we
   are interested in are simply the concatenation of all the multi-level
   histograms into a flat array of numbers.

   Once we have these gist vectors, we can classify images using an SVM.
   The SVM kernel is the histogram intersection function, which takes the
   gist vectors for the input and training images and returns the sum of
   the per-dimension minima (once again, see the paper for the gory
   details).

   This class, viz., GistEstimatorBeyondBoF, only computes gist vectors
   (i.e., normalized multi-level histograms) given the 200 ``word''
   vocabulary of SIFT descriptors to serve as the bins for the
   histograms. The actual training and classification must be performed
   by client programs. To assist with the training process, however, this
   class sports a training mode in which it does not require the SIFT
   descriptor vocabulary. Instead, in training mode, it simply returns the
   raw grid of SIFT descriptors for the images it is supplied. This
   allows clients to store those descriptors for clustering, etc.
*/

// //////////////////////////////////////////////////////////////////// //
// The iLab Neuromorphic Vision C++ Toolkit - Copyright (C) 2000-2005 //
// by the University of Southern California (USC) and the iLab at USC. //
// See http://iLab.usc.edu for information about this project. //
// //////////////////////////////////////////////////////////////////// //
// Major portions of the iLab Neuromorphic Vision Toolkit are protected //
// under the U.S. patent ``Computation of Intrinsic Perceptual Saliency //
// in Visual Environments, and Applications'' by Christof Koch and //
// Laurent Itti, California Institute of Technology, 2001 (patent //
// pending; application number 09/912,225 filed July 23, 2001; see //
// http://pair.uspto.gov/cgi-bin/final/home.pl for current status). //
// //////////////////////////////////////////////////////////////////// //
// This file is part of the iLab Neuromorphic Vision C++ Toolkit. //
// //
// The iLab Neuromorphic Vision C++ Toolkit is free software; you can //
// redistribute it and/or modify it under the terms of the GNU General //
// Public License as published by the Free Software Foundation; either //
// version 2 of the License, or (at your option) any later version. //
// //
// The iLab Neuromorphic Vision C++ Toolkit is distributed in the hope //
// that it will be useful, but WITHOUT ANY WARRANTY; without even the //
// implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR //
// PURPOSE. See the GNU General Public License for more details. //
// //
// You should have received a copy of the GNU General Public License //
// along with the iLab Neuromorphic Vision C++ Toolkit; if not, write //
// to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, //
// Boston, MA 02111-1307 USA. //
// //////////////////////////////////////////////////////////////////// //
//
// Primary maintainer for this file: Manu Viswanathan <mviswana at usc dot edu>
// $HeadURL: svn://isvn.usc.edu/software/invt/trunk/saliency/src/Neuro/GistEstimatorBeyondBoF.H $
// $Id: GistEstimatorBeyondBoF.H 11138 2009-04-22 22:02:38Z mviswana $
//

#ifndef GE_BEYOND_BAGS_OF_FEATURES_H_DEFINED
#define GE_BEYOND_BAGS_OF_FEATURES_H_DEFINED

//------------------------------ HEADERS --------------------------------

// Gist specific headers
#include "Neuro/GistEstimator.H"

// Other INVT headers
#include "SIFT/Keypoint.H"
#include "Neuro/NeuroSimEvents.H"

// Standard C++ headers
#include <ostream>
#include <list>
#include <string>

//------------------------- CLASS DEFINITION ----------------------------

/**
   \class GistEstimatorBeyondBoF
   \brief Gist estimator for ``Beyond Bags of Features ...'' by Lazebnik,
   et al.

   This class computes the gist vector for an input image using the
   feature extraction and spatial pyramid matching scheme described in
   sections 4 and 3 (respectively) of ``Beyond Bags of Features: Spatial
   Pyramid Matching for Recognizing Natural Scene Categories'' by
   Lazebnik, et al.

   While the authors of the above-mentioned paper experiment with weak
   features (oriented edge points) and strong features (SIFT descriptors)
   and with different resolutions of the spatial matching pyramid, this
   class only implements strong features clustered into 200 categories
   and a pyramid with L = 2 because other configurations were either not
   as good as this one or did not offer any significant advantages over
   it (as reported in the paper).

   Thus, utilizing the terminology employed by Lazebnik, et al., we have
   the number of channels M = 200 and the finest level of the spatial
   matching pyramid L = 2 (the pyramid has levels 0 through L). This
   results in gist vectors of dimensionality:

   M * (4^(L+1) - 1)/3 = M * (2^(2L+2) - 1)/3
                       = 200 * (2^(2*2+2) - 1)/3
                       = 200 * (2^6 - 1)/3
                       = 200 * 63/3
                       = 200 * 21
                       = 4200

   See the paper for the gory details.
*/
class GistEstimatorBeyondBoF : public GistEstimatorAdapter {
   // The commented-out static const definitions below ought to work
   // fine but, for some weird reason, don't. The 32-bit nightly builds
   // are carried out by GCC version 3.4.1, which compiles this and the
   // .C file fine. But the linker chokes and complains about undefined
   // references to these variables. This does not seem to be a problem
   // with newer versions of GCC (>= 4.2).
   //
   // Oddly enough, other classes (e.g., GistEstimatorContextBased and
   // GistEstimatorTexton) that also define static const members do not
   // suffer from this affliction. But those classes define static const
   // uints rather than static const ints. (A likely explanation: before
   // C++17, a static const integral member with an in-class initializer
   // still needs an out-of-class definition if it is odr-used, e.g.,
   // bound to a const reference, and whether such a use survives to
   // link time varies between compiler versions.)
   //
   // Anyhoo, to work around this issue, we explicitly define these
   // static data members in the .C file.
   //static const int NUM_CHANNELS = 200 ;
   //static const int NUM_LEVELS   = 2 ;
   static int NUM_CHANNELS ;
   static int NUM_LEVELS ;
   static int GIST_VECTOR_SIZE ;

public:
   /// Accessors for some parameters used internally by the Lazebnik
   /// algorithm.
   ///@{
   static int num_channels()     {return NUM_CHANNELS ;}
   static int num_levels()       {return NUM_LEVELS ;}
   static int gist_vector_size() {return GIST_VECTOR_SIZE ;}
   ///@}

   /// Modifiers for some parameters used internally by the Lazebnik
   /// algorithm.
   ///
   /// WARNING: These methods should be used with care. Do not call them
   /// to change the size of the SIFT vocabulary and spatial pyramid size
   /// in the "middle" of an "entire" run. That is, if you use 100
   /// channels and a 4-level pyramid for training, don't switch to 200
   /// channels and a 3-level pyramid during the testing phase!
   ///@{
   static void num_channels(int) ;
   static void num_levels(int) ;
   ///@}

   /// The constructor expects to be passed an option manager, which it
   /// uses to set itself up within the INVT simulation framework.
   GistEstimatorBeyondBoF(OptionManager& mgr,
      const std::string& descrName = "GistEstimatorBeyondBoF",
      const std::string& tagName   = "GistEstimatorBeyondBoF") ;

   /// A SIFT descriptor is just a vector of 128 numbers. The following
   /// inner class provides a convenient wrapper for this vector.
   struct SiftDescriptor {
      static const int SIZE = 128 ;

      SiftDescriptor() ;
      SiftDescriptor(const Image<float>&) ;
      SiftDescriptor(const rutz::shared_ptr<Keypoint>&) ;

      // WARNING: NO RANGE CHECK! i had better be in [0, 128).
      float& operator[](int i) {return values[i] ;}
      const float& operator[](int i) const {return values[i] ;}
   private :
      float values[SIZE] ;
   } ;

   /// Like other gist estimators, this one too filters the input image.
   /// Its filtering process involves subdividing the input image into
   /// 16x16 pixel patches and running SIFT on each of these patches. The
   /// filtering results are, therefore, a grid of SIFT descriptors.
   /// The following type is used to represent these results.
   typedef Image<SiftDescriptor> SiftGrid ;

   /// In order to compute a gist vector, this estimator needs to know
   /// the vocabulary associated with the bag of features. This
   /// vocabulary is usually obtained by a clustering process as part of
   /// training. Each ``word'' or ``vis-term'' in this vocabulary is
   /// also known as a channel. The gist vector essentially represents a
   /// ``spatial histogram'' for each of these channels.
   ///
   /// The vocabulary itself is merely a collection of 200 SIFT
   /// descriptors (the centroids of the clusters) passed in via an
   /// Image of floating point numbers. Thus, the size of this Image
   /// would be 128x200. (The 128 comes from the dimensionality of a
   /// SIFT descriptor.)
   typedef std::list<SiftDescriptor> Vocabulary ;

   /// This method should be called once during the client's
   /// initialization process prior to attempting to obtain gist
   /// vectors for input images. Thus, the clustering phase of the
   /// training must be complete before this estimator can be used to
   /// compute gist vectors.
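   /**
      A minimal sketch, not the toolkit's own implementation, of what the
      vocabulary passed to setVocabulary() is used for: unpacking the
      128x200 Image into individual centroids and binning a patch's SIFT
      descriptor under its nearest ``word''. It assumes plain
      `std::vector` buffers in place of INVT's Image and SiftDescriptor
      types, that each 128-float row of the Image holds one centroid, and
      squared Euclidean distance as the nearness measure. The names
      Descriptor, unpack_vocabulary() and nearest_word() are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a 128-element SIFT descriptor.
typedef std::vector<float> Descriptor;

// Splits a row-major buffer of (num_words x dim) floats into num_words
// descriptors. Assumption: each row of the vocabulary image holds one
// cluster centroid.
std::vector<Descriptor> unpack_vocabulary(const std::vector<float>& buf,
                                          std::size_t dim)
{
  assert(dim > 0 && buf.size() % dim == 0);
  std::vector<Descriptor> vocab;
  for (std::size_t i = 0; i < buf.size(); i += dim)
    vocab.push_back(Descriptor(buf.begin() + i, buf.begin() + i + dim));
  return vocab;
}

// Returns the index of the vocabulary word closest to d in squared
// Euclidean distance (ties go to the lower index).
int nearest_word(const Descriptor& d, const std::vector<Descriptor>& vocab)
{
  assert(!vocab.empty());
  int best = 0;
  float best_dist = std::numeric_limits<float>::max();
  for (std::size_t w = 0; w < vocab.size(); ++w) {
    assert(vocab[w].size() == d.size());
    float dist = 0.0f;
    for (std::size_t k = 0; k < d.size(); ++k) {
      const float diff = d[k] - vocab[w][k];
      dist += diff * diff;
    }
    if (dist < best_dist) {
      best_dist = dist;
      best = static_cast<int>(w);
    }
  }
  return best;
}
```

      Applying nearest_word() to every descriptor in a SiftGrid would
      yield the grid of word indices from which the spatial histograms
      are built.
   */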
   void setVocabulary(const Image<float>&) ;

   /// To assist with training, GistEstimatorBeyondBoF can be configured
   /// to operate in a special training mode in which it does not have a
   /// vocabulary from which to form gist vectors but rather simply
   /// passes back (to its client) the grid of SIFT descriptors for the
   /// input image, i.e., the results of the filtering step. The
   /// client may then store these descriptors, perform the clustering
   /// required to create the vocabulary necessary for subsequent normal
   /// use of this gist estimator, and then run the estimator in
   /// non-training mode to compute the actual gist vectors.
   ///
   /// Training mode is set by specifying a hook function that accepts
   /// the filtering results, i.e., the grid/Image of SIFT
   /// descriptors.
   typedef void (*TrainingHook)(const SiftGrid&) ;

   /// This method should be called once during the client's
   /// initialization sequence to specify the training mode hook
   /// function that configures GistEstimatorBeyondBoF to run in training
   /// mode. If this hook is not specified, the estimator will run in
   /// ``normal'' mode and compute gist vectors from the vocabulary.
   ///
   /// It is an error to specify neither the training hook nor the
   /// vocabulary. If both are specified, the training hook takes
   /// precedence, i.e., the estimator runs in training mode, wherein it
   /// passes back filtering results (grid of SIFT descriptors) to the
   /// client rather than computing gist vectors.
   void setTrainingHook(TrainingHook) ;

   /// Return the gist vector (meaningless in training mode, since no
   /// gist vector is computed then).
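   /**
      A hedged sketch of how a gist vector like the one returned by
      getGist() could be assembled from a grid of word indices, together
      with the histogram intersection kernel used for classification. It
      is not the toolkit's implementation: the names pyramid_histogram()
      and histogram_intersection() are hypothetical, the layout (level by
      level, cells in row-major order, M channels per cell) is an
      assumption, and the normalization this class applies is not
      reproduced. The level weights (1/2^L for level 0 and 1/2^(L-l+1)
      for level l >= 1) follow Lazebnik et al.; because
      min(w a, w b) = w min(a, b) for w > 0, intersecting two such
      weighted vectors gives their pyramid match kernel value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a concatenated spatial pyramid histogram from a W x H grid of
// visual-word indices (row-major, each in [0, M)). Levels run 0..L and
// level l has 2^l x 2^l cells, giving M * (4^(L+1) - 1) / 3 entries.
std::vector<double> pyramid_histogram(const std::vector<int>& words,
                                      int W, int H, int M, int L)
{
  assert(W > 0 && H > 0 && M > 0 && L >= 0);
  assert(words.size() == static_cast<std::size_t>(W) * H);
  std::vector<double> gist;
  for (int l = 0; l <= L; ++l) {
    const int cells = 1 << l;  // cells per side at this level
    const double weight = (l == 0) ? 1.0 / (1 << L)
                                   : 1.0 / (1 << (L - l + 1));
    std::vector<double> level(static_cast<std::size_t>(cells) * cells * M,
                              0.0);
    for (int y = 0; y < H; ++y)
      for (int x = 0; x < W; ++x) {
        const int m = words[static_cast<std::size_t>(y) * W + x];
        assert(m >= 0 && m < M);
        const int cx = x * cells / W;  // cell column containing x
        const int cy = y * cells / H;  // cell row containing y
        level[(static_cast<std::size_t>(cy) * cells + cx) * M + m] += weight;
      }
    gist.insert(gist.end(), level.begin(), level.end());
  }
  return gist;
}

// Histogram intersection: the sum over dimensions of the minimum value.
double histogram_intersection(const std::vector<double>& a,
                              const std::vector<double>& b)
{
  assert(a.size() == b.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += std::min(a[i], b[i]);
  return sum;
}
```

      With M = 200 and L = 2, a call such as
      pyramid_histogram(words, 4, 4, 200, 2) returns 200 * 21 = 4200
      entries, matching the dimensionality derived in the class
      documentation.
   */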
   Image<double> getGist() ;

   /// Destructor
   virtual ~GistEstimatorBeyondBoF() ;

protected:
   /// Callback for when a new input (retina) frame is available
   SIMCALLBACK_DECLARE(GistEstimatorBeyondBoF, SimEventRetinaImage);

private :
   Image<double> itsGistVector ; // gist feature vector
   Vocabulary    itsVocabulary ;
   TrainingHook  itsTrainingHook ;
} ;

//---------------------- MISCELLANEOUS FUNCTIONS ------------------------

std::ostream& operator<<(std::ostream&,
                         const GistEstimatorBeyondBoF::SiftDescriptor&) ;

//-------------------- INLINE FUNCTION DEFINITIONS ----------------------

inline Image<double> GistEstimatorBeyondBoF::getGist()
{
   return itsGistVector ;
}

inline void
GistEstimatorBeyondBoF::
setTrainingHook(GistEstimatorBeyondBoF::TrainingHook H)
{
   itsTrainingHook = H ;
}

//-----------------------------------------------------------------------

#endif

/* So things look consistent in everyone's emacs... */
/* Local Variables: */
/* indent-tabs-mode: nil */
/* End: */