https://tracker.cse.buffalo.edu/Ticket/Display.html?id=19513

ContactPerson: yiminkwu@cse.buffalo.edu
Remote host: chopin.cse.buffalo.edu
### Begin Citation ### Do not delete this line ###
%R 2003-03
%U /home/csegrad/yiminkwu/public_html/wu_tr.pdf
%A Wu, Yimin
%A  Zhang, Aidong
%T Adaptively Discovering Meaningful Patterns in High-Dimensional Nearest Neighbor Search
%D April 08, 2003
%I Department of Computer Science and Engineering, SUNY Buffalo
%Y Database
%X To query high-dimensional databases, similarity search (or k nearest neighbor search) is the most extensively used method. However, since 
each attribute of high dimensional data records only contains very small amount of information, the distance of two high-dimensional records may not always correctly reflect their similarity. So, a multi-dimensional 
query may have a k-nearest-neighbor set which only contains few 
relevant records. To address this issue, we present an adaptive pattern 
discovery method to search high dimensional data spaces both effectively and efficiently. With our method, the user is allowed to participate in the database search by labeling the returned records 
as relevant or irrelevant. By using user-labeled data records as training samples, our method employs an adaptive pattern discovery technique 
to learn the distribution patterns of relevant records in 
the data space, and drastically reduces irrelevant data records. From the reduced data set, our approach returns the top-k nearest neighbors of the query to the user -- this interaction between the user and the DBMS 
can be repeated multiple times. To achieve the adaptive pattern discovery,  
we employ a pattern classification algorithm called random forests, which is a machine learning algorithm with proven good performance on many traditional classification problems.By using a novel two-level resampling method, we adapt the original random forests to an interactive algorithm, which achieves noticeable gains in efficiency over the original algorithm. We empirically compare our method with previously well-known related approaches on large-scaled, high-dimensional and real-world data sets, and report promising results of our method.