https://tracker.cse.buffalo.edu/Ticket/Display.html?id=19513 ContactPerson: yiminkwu@cse.buffalo.edu Remote host: chopin.cse.buffalo.edu ### Begin Citation ### Do not delete this line ### %R 2003-03 %U /home/csegrad/yiminkwu/public_html/wu_tr.pdf %A Wu, Yimin %A Zhang, Aidong %T Adaptively Discovering Meaningful Patterns in High-Dimensional Nearest Neighbor Search %D April 08, 2003 %I Department of Computer Science and Engineering, SUNY Buffalo %Y Database %X To query high-dimensional databases, similarity search (or k nearest neighbor search) is the most extensively used method. However, since each attribute of high dimensional data records only contains very small amount of information, the distance of two high-dimensional records may not always correctly reflect their similarity. So, a multi-dimensional query may have a k-nearest-neighbor set which only contains few relevant records. To address this issue, we present an adaptive pattern discovery method to search high dimensional data spaces both effectively and efficiently. With our method, the user is allowed to participate in the database search by labeling the returned records as relevant or irrelevant. By using user-labeled data records as training samples, our method employs an adaptive pattern discovery technique to learn the distribution patterns of relevant records in the data space, and drastically reduces irrelevant data records. From the reduced data set, our approach returns the top-k nearest neighbors of the query to the user -- this interaction between the user and the DBMS can be repeated multiple times. To achieve the adaptive pattern discovery, we employ a pattern classification algorithm called random forests, which is a machine learning algorithm with proven good performance on many traditional classification problems.By using a novel two-level resampling method, we adapt the original random forests to an interactive algorithm, which achieves noticeable gains in efficiency over the original algorithm. We empirically compare our method with previously well-known related approaches on large-scaled, high-dimensional and real-world data sets, and report promising results of our method.