A Search Engine for Professional Computer Science Articles
Abstract: This thesis presents the design and development of a search engine for professional computer science articles. When we use Baidu to look for learning resources, the results are usually mixed with a large number of advertisements and irrelevant content. To address this, I designed and implemented a search engine dedicated to computer-related technical content. The system is written mainly in Python and consists of two parts, a crawler and a search site. Crawlers built with the open-source Scrapy framework fetch data from websites and store it in a database and in the open-source search engine Elasticsearch. A website built with the Django framework then provides the search interface, and searches on it are limited to computer-related content. The main difficulty of the project lies on the crawler side: most websites deploy some form of anti-crawler mechanism, so the central problem is how to avoid being detected by such mechanisms while still collecting data efficiently. The second major challenge is querying the data stored in Elasticsearch and presenting the search results in the browser. With sustained effort on my part and careful guidance from my advisor, the graduation project was completed. The final version implements the functionality originally planned and runs stably.
Keywords: Python, search engine, Elasticsearch, Django, Scrapy
Contents
Abstract
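1 Introduction
1.1 Search Engine Fundamentals
1.1.1 Core Problems of Search Engines
1.1.2 Classification of Search Engines
1.1.3 Structure of Search Engines
1.1.4 Working Principle of Search Engines
1.2 Mainstream Search Engines
1.2.1 Google
1.2.2 Bing
1.2.3 Baidu
1.3 Analysis of Search Results from Mainstream Search Engines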