Research on Clustering with the K-means Algorithm in Web Text Mining

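Abstract: Data mining is an active research area. This study examines data mining on Sina Weibo; microblogging, a network application form that emerged in the Web 2.0 era, has grown rapidly in recent years. We retrieve Sina Weibo content through the Sina Weibo API and collect the posts of a single blogger. We then extract high-frequency words, construct a stop-word list, remove rarely used words, assign weights using TF-IDF values, and refine the vector table with a synonym table. Finally, we compute text similarity with the cosine measure and cluster the posts with the K-means algorithm, in order to study the characteristics of the clustered text and the strengths and weaknesses of K-means for mining web text data.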

Thesis keywords: clustering, Web, data mining, K-means, microblog text similarity

Graduation Project Report (Thesis): Foreign-Language Abstract

Title: Research on Clustering in Web Text Mining

Abstract

Data mining is an active research area. This study focuses on data mining of Sina Weibo; microblogging, a network application form that emerged in the Web 2.0 era, has developed rapidly in recent years. The paper retrieves Sina Weibo content through the Sina Weibo API, tallies the posts, removes common words, assigns weights using TF-IDF values, optimizes the vector table with a synonym table, and finally computes the similarity table. Using the cosine measure and the K-means algorithm, the posts are then clustered. The aim is to study the clustered text and to identify the advantages and disadvantages of K-means in web text data mining.

Keywords: clustering, Web, data mining, K-means, microblog, text similarity
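
The pipeline summarized in the abstract (TF-IDF weighting, cosine similarity, K-means clustering) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the posts have already been segmented into tokens and stripped of stop words, omits the synonym-table step, uses a smoothed IDF (`log(n/df) + 1`), and initializes the centroids with the first `k` vectors so the result is reproducible. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into dense TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    # Assumption: smoothed IDF, so a term present in every document keeps some weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        total = len(d) or 1
        vectors.append([tf[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeans_cosine(vectors, k, max_iter=100):
    """K-means that assigns each vector to its most cosine-similar centroid."""
    centroids = [list(v) for v in vectors[:k]]  # assumption: deterministic init
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-segmented posts (two about weather, two about football).
docs = [
    ["weather", "rain", "today"],
    ["football", "match", "goal"],
    ["rain", "weather", "cold"],
    ["goal", "football", "team"],
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans_cosine(vectors, k=2)
print(labels)
```

On this toy input the two weather posts fall into one cluster and the two football posts into the other. With real microblog text the outcome depends strongly on the choice of `k` and on the initial centroids, which is one of the K-means limitations the abstract sets out to examine.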

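1 Introduction

1.1 Research background

1.1.1 Scope of data mining

1.1.2 Significance of data mining

1.1.3 The knowledge discovery process

1.2 Cluster analysis

1.2.1 Introduction to cluster analysis

1.2.2 Current state of research

1.2.3 Overview of traditional clustering algorithms

1.2.4 Partition-based methods

1.2.5 Hierarchy-based methods

1.2.6 Density-based methods

1.2.7 Grid-based methods

1.3 Measures of inter-cluster distance

1.3.1 Euclidean distance

1.3.2 City-block distance

1.3.3 Density-based distance

2 Text data acquisition and word segmentation

2.1 Web data acquisition

2.1.1 Sina Weibo Open Platform

2.1.2 Sina Weibo authorization mechanism

2.1.3 Sina Weibo API

2.2 Web data cleaning

2.2.1 Chinese word segmentation and its methods

2.2.2 Limitations of word segmentation algorithms

2.2.3 Stop-word list

2.2.4 Introduction to the Pangu Chinese word segmentation algorithm

3 Introduction to clustering algorithms

3.1 Text similarity algorithms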
