Publisher: Information and Media Technologies Editorial Board
Abstract: Patent processing is important in fields such as industry, business, and law. We classified patent documents into F-term categories (Schellner 2002) using the k-nearest neighbor method. Because F-term categories are fine-grained, they are useful for classifying patent documents. Through experiments we clarified three points: i) which variation of the k-nearest neighbor method is best for patent classification, ii) which method of calculating similarity is best for patent classification, and iii) from which regions of a patent terms should be extracted. Our experiments used the patent data from the F-term categorization task in the NTCIR-5 Patent Workshop (NTCIR committee 2005; Iwayama, Fujii, and Kando 2005). Among the variations of the k-nearest neighbor method examined in this study, summing the scores of the k extracted documents was the most effective. We also found that SMART (Singhal, Buckley, and Mitra 1996; Singhal, Choi, Hindle, and Pereira 1997), which is known to be effective in information retrieval, was the most effective method of calculating similarity. For term extraction, using the abstract and claim regions together was the best among all combinations of the abstract, claim, and description regions. These results were confirmed with a statistical test. Finally, we varied the amount of training data and found that performance improved as more data were used, up to the full amount provided in the NTCIR-5 Patent Workshop.
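The score-summing variant of k-nearest neighbor classification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `knn_score_sum`), the toy data, and the category labels are hypothetical, and plain cosine similarity over raw term counts stands in for the SMART weighting, which the abstract does not specify in detail.

```python
from collections import Counter, defaultdict
import math


def cosine(a, b):
    """Cosine similarity between two term-count mappings.

    Assumption: a simple stand-in for the SMART similarity named in the abstract.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_score_sum(query_terms, training, k=3):
    """Rank categories by summing the similarity scores of the k nearest documents.

    training: list of (terms, labels) pairs, where labels is a set of
    category names (hypothetical placeholders for F-term codes).
    """
    q = Counter(query_terms)
    # Score every training document against the query and keep the top k.
    nearest = sorted(
        ((cosine(q, Counter(terms)), labels) for terms, labels in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    # Each neighbor adds its similarity score to every category it carries.
    totals = defaultdict(float)
    for sim, labels in nearest:
        for label in labels:
            totals[label] += sim
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy example with made-up terms and labels.
training = [
    (["lens", "optical", "focus"], {"A"}),
    (["lens", "optical", "zoom"], {"A", "B"}),
    (["engine", "piston"], {"C"}),
]
ranked = knn_score_sum(["lens", "optical"], training, k=2)
```

In this toy run, category `A` appears on both of the two nearest documents and so accumulates twice the score of `B`, while `C` is never scored because its document falls outside the top k. A simple majority-vote variant would instead count each neighbor once regardless of how similar it is.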