The Design and Implementation of a Distributed Crawler for Vertical Website Networks

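As technology advances, the Internet is growing rapidly and the resources available on it keep expanding. Faced with the resulting mass of information, extracting and using it effectively has become a challenge. Centralized search engines find it increasingly difficult to quickly retrieve the information users actually need, so search engines are moving toward distributed processing, scaling out the system to increase its information-processing capacity. Distributed search engines have emerged in response.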

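The distributed crawler in this project is built on nutch, an open-source Java search engine that provides all the tools needed to run a search engine, including a web crawler and full-text search. Because web crawling is a heavy workload that demands substantial processing power and network bandwidth, this project deploys nutch on top of Hadoop to crawl in a distributed fashion. The project also uses solr, a search engine based on Apache Lucene, to make the crawl results easy to query and retrieve and so provide a better user experience.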

Thesis keywords: nutch, Hadoop, solr, distributed crawling

Graduation Design Report: Foreign-Language Abstract

Title The Design and Implementation of a Distributed Crawler for Vertical Website Networks

Abstract With the rapid development of science and technology, the Internet has grown quickly and network resources have been continuously enriched. Faced with the resulting massive amount of information, effectively extracting and making use of it has become a challenge. It is increasingly difficult for centralized search engines to promptly retrieve the information users really need from this mass of data. As a result, search engines are moving toward distributed processing, enhancing their information-processing ability by expanding system scale. Distributed search engines have emerged to meet this need.

This project designs and implements a distributed crawler based on nutch, an open-source Java search engine that provides all the tools required to run a search engine, including web crawling and full-text search. Because the heavy workload of web crawling requires powerful processing capacity and high network bandwidth, this project deploys nutch on Hadoop to implement distributed crawling. In addition, the project uses the solr search engine, which is based on Apache Lucene, to make crawl results convenient to search and retrieve and to provide a better user experience.

Keywords nutch hadoop solr distributed crawling


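1 Introduction

1.1 Project Background

1.2 Background and History of Distributed Search Engine Development

1.3 Research Work of This Thesis

1.4 Structure of This Thesis

2 Technical Overview

2.1 Introduction to nutch

2.1.1 nutch File Composition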
