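With the continuing development of information technology, content-based information retrieval and data mining have become research areas of growing interest. Text classification is an important foundation of information retrieval and text mining. Its main task is to determine the class of a document from its content, given a set of training texts and their class labels.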
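After introducing text classification techniques, this thesis compares two classification algorithms, Naive Bayes and KNN, and applies them to Chinese text classification. In preprocessing, the texts are first segmented into words with the Chinese Academy of Sciences word segmentation system (ICTCLAS). Feature selection based on document-frequency mutual information is then applied to reduce dimensionality, and the texts are weighted with TF-IDF to obtain a vector representation. Finally, the two algorithms are used to classify the Chinese texts.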
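The feature selection and weighting steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation: it assumes the texts have already been segmented into tokens (for example by ICTCLAS), the function names `mutual_information`, `select_features`, and `tfidf_vectors` are hypothetical, and the MI and TF-IDF formulas are the common textbook variants, which may differ from the ones actually used.

```python
import math
from collections import Counter

def mutual_information(docs, labels):
    """Score each term by its largest document-frequency-based MI over classes.

    Assumed formula: MI(t, c) = log(P(t, c) / (P(t) * P(c))), with all
    probabilities estimated from document counts (a term counts once per document).
    """
    n = len(docs)
    class_count = Counter(labels)
    term_df = Counter()        # number of documents containing the term
    term_class_df = Counter()  # number of documents of a class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_df[term] += 1
            term_class_df[(term, label)] += 1
    scores = {}
    for (term, label), joint in term_class_df.items():
        mi = math.log(joint * n / (term_df[term] * class_count[label]))
        scores[term] = max(scores.get(term, float("-inf")), mi)
    return scores

def select_features(docs, labels, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    scores = mutual_information(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def tfidf_vectors(docs, vocabulary):
    """Weight each document as tf * log(N / df) over the selected vocabulary."""
    n = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n / df[t]) if df[t] else 0.0
                        for t in vocabulary])
    return vectors

# Toy, already-segmented corpus (illustrative data only).
docs = [
    ["今天", "足球", "比赛", "球队"],
    ["今天", "股票", "市场", "上涨"],
    ["今天", "足球", "球员", "比赛"],
    ["今天", "股票", "银行", "市场"],
]
labels = ["sports", "finance", "sports", "finance"]

vocabulary = select_features(docs, labels, k=8)
vectors = tfidf_vectors(docs, vocabulary)
```

In this toy corpus the term "今天" occurs in every document, so its MI score and its IDF are both zero and it is dropped by the selection step, which is the dimensionality-reduction effect the paragraph refers to.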
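The experimental results show that each of the two text classification algorithms has its own characteristics: Naive Bayes classifies quickly but less accurately, whereas KNN, applied to the high-dimensional sparse vectors obtained after weighting, achieves higher accuracy at a lower classification speed.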
Keywords: Chinese text classification; Naive Bayes; KNN
Graduation Project Report (Thesis): Foreign-Language Abstract
Title Research on Text Classification Technology
Abstract
With the development of information technology, content-based information retrieval and data mining have become increasingly active fields of research. Text categorization (TC) is regarded as an important foundation of information retrieval and text mining. Its key task is to assign a class label to a text on the basis of its content, given a set of training texts and their class labels.
This paper compares two algorithms, Naive Bayes and KNN, on Chinese text categorization. First, the Chinese texts are segmented into words using ICTCLAS. Then, feature selection is performed by applying mutual information based on document frequency (DF), and the selected features are weighted with TF-IDF so that every text has a uniform vector representation that is easy to process. Finally, the texts to be predicted are classified using the two algorithms.
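As a rough illustration of the final classification step, the sketch below implements both classifiers in plain Python. It is an assumption-laden toy, not the thesis code: the Naive Bayes variant shown is the multinomial model with add-one smoothing, KNN uses cosine similarity with majority voting, and all function names and data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Collect per-class document counts and term counts (multinomial model)."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab

def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_docs):
        total_terms = sum(term_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for term in tokens:
            if term in vocab:  # terms unseen in training are ignored
                score += math.log((term_counts[label][term] + 1)
                                  / (total_terms + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cosine(u, v):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_knn(train_vectors, train_labels, query, k=3):
    """Vote among the k training vectors most similar to the query."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: cosine(pair[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training data (illustrative only): segmented texts and weighted vectors.
docs = [["足球", "比赛", "球队"], ["足球", "球员", "比赛"],
        ["股票", "市场", "上涨"], ["股票", "银行", "市场"]]
labels = ["sports", "sports", "finance", "finance"]
model = train_naive_bayes(docs, labels)
nb_label = predict_naive_bayes(model, ["足球", "比赛"])

train_vectors = [[1, 1, 0, 0], [2, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 2]]
knn_label = predict_knn(train_vectors, labels, [1, 0, 0, 0], k=3)
```

The structure of the two predictors hints at the speed difference: Naive Bayes scores a text once per class using precomputed counts, while KNN must compare the query against every training vector before voting.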
The experimental results show that each of the two text categorization algorithms has its own characteristics. Naive Bayes is faster than KNN but less accurate, whereas KNN has better accuracy and categorization capability but is much slower.
Keywords Chinese text categorization; Naive Bayes; KNN