当前位置: 首页 > 编程日记 > 正文

Lucene.net: the main concepts

2019独角兽企业重金招聘Python工程师标准>>> hot3.png

In the previous post you learnt how to get a copy of Lucene.net and where to go in order to look for more information. As you noticed the documentation is far from being complete and easy to read. So in the post I’ll write about the main concepts behind Lucene.net and which are the main steps in the development of a solution based on Lucene.net.

Some of the main concepts

Before looking at the development phases, it’s important to have a look at the main actors of Lucene.net.

Directoy

The directory is where Lucene indexes are stored: it can be a physical folder on the filesystem (FSDirectory) or an area in memory where files are stored (RAMDirectory). The index structure is compatible with all ports of Lucene, so you could also have the indexing done with .NET and searched with Java, or the other way around (obviously, using the filesystem directory).

IndexWriter

This component is responsible for the management of the indexes. It creates indexes, adds documents to the index, optimizes the index.

Analyzer

This is where the complexity of the indexing resides. In a few words the analyzer contains the policy for extracting index terms from the text. There are several analyzers available both in the core library and in the contrib project. And the java version has even more analyzers that have not been ported to .net yet.

Probably the analyzer you’ll use the most is the StandardAnalyzer, which tokenizes the text based on European-language grammars, sets everything to lowercase and removes English stopwords.

Another interesting analyzer is the SnowballAnalyzer, which works exactly like the standard one, but adds one more step at the end: the stemming phase, using the Snowball stemming language. Stemming is the process of reducing inflected words to their root. For example, if you are looking for “developing”, probably you are also interested in the word “developed” or “develop” or “developer”. During the indexing phase, the stemming process normalizes all these inflected words to their root “develop”. And does the same when querying the index (if you search for “development” it will search for “develop”). Obviously this is tied to the language of the text, so the snowball analyzer comes with many different “grammars” for that.

Document and Fields

A document is a single entity that is put into the index. And it contains many fields which are, like in a database, the single different pieces of information that make a document. Different fields can be indexed using different algorithm and analyzers. For example you might just want to store the document id, without being able to search on it. But you want to be able to search by tags as single keywords, and, finally you want to index the body of blog post for full text search (thus using the Analyzer and the tokenizers).

Since this is an important topic, I’ll write a more in depth post in the future.

Searcher and IndexReader

The searcher is the component that, with the help of the IndexReader, scans the index files and returns results based on the query supplied.

QueryParser

The query parser is responsible for parsing a string of text to create a query object. It evaluates Lucene query syntax and uses an analyzer (which should be the same you used to index the text) to tokenize the single statements.

The main development steps

And now let’s have a brief overview at the logical steps involved in integrating Lucene.net into your applications:

1 – Initialize Directory and IndexWriter

The first step is initializing the Directory and the IndexWriter. In a web application, like Subtext, this is most likely done in the application startup and then the instance stored in a global variable somewhere (or accessed through a Singleton) since only one Writer can read the Dictionary at the same time.

And when you create the IndexWriter you can supply the analyzer that will be used by default to index all the text.

2 – Add Documents to the Index

Each document is made by various Fields. You have to create a Document with all the Fields that must be indexed and also the ones you need in order to link the result to the real document that is being indexed (for example the id of the post).

And once created the Document, you have to add it to the Directory with the IndexWriter.

At this point, you could either add more documents or close the IndexWriter. The index will be saved to the Directory and can be re-opened later for adding more Documents or to perform queries on in.

3 – Create the Query

Once you have all your documents in the index, it’s time to do some queries.

You can create the query either via the QueryParser or creating a Query object directly via API.

4 – Pass the Query to the IndexSearcher

And once you have the Query object you have to pass it to the Search method of a IndexSearcher.

One caveats is that the IndexSearcher sees the index only at the point it was at the time it was opened. So in order to search over the most recent set of documents you have to re-open theIndexSearcher. But re-opening takes time and resources, so in a web application you might want to cache it somehow and re-open it periodically.

5 – Iterates over the results

The Search method returns the results, inside a Hit object, which contains all the documents that match the query, ordered by Score, which is a very complex math formula that should tell you how much the document found is related to your query. For more information refer to Lucene website:Scoring.

6 – Close everything

And once you are done with everything, close the IndexWriterIndexSearcher and the Directoryobject. In a web application this is typically performed in the application shutdown event.

Next

You just read about the main concepts behind Lucene.net. In a future post I’ll write how to implement Lucene.net into a sample console application that puts together all the concepts discussed here.

Tags:  Lucene.net

Kick it • Digg This! • Technorati Links • Save to del.icio.us (86 saves, tagged: .net asp.net fulltextsearch)

posted on Monday, August 31, 2009 12:11 PM

Related Links

Lucene.net: your first application (9/2/2009) Dissecting Lucene.net storage: Documents and Fields (9/4/2009) Lucene - or how I stopped worrying, and learned to love unstructured data (9/8/2009) How Subtext’s Lucene.net index is structured (9/10/2009) Lucene.net is powering Subtext 2.5 search (2/26/2010)

转载于:https://my.oschina.net/u/138995/blog/178778

相关文章:

einsum,一个函数走天下

作者 | 永远在你身后转载自知乎【导读】einsum 全称 Einstein summation convention(爱因斯坦求和约定),又称为爱因斯坦标记法,是爱因斯坦 1916 年提出的一种标记约定,本文主要介绍了einsum 的应用。简单的说&#xff…

常用排序算法的C++实现

排序是将一组”无序”的记录序列调整为”有序”的记录序列。假定在待排序的记录序列中,存在多个具有相同的关键字的记录,若经过排序,这些记录的相对次序保持不变,即在原序列中,rirj,且ri在rj之前&#xff0…

4.65FTP服务4.66测试登录FTP

2019独角兽企业重金招聘Python工程师标准>>> FTP服务 测试登录FTP 4.65FTP服务 文件传输协议(FTP),可以上传和下载文件。比如我们可以把Windows上的文件shan上传到Linux,也可以把Linux上的文件下载到Windows上。 Cent…

JavaScript的应用

DOM, BOM, XMLHttpRequest, Framework, Tool (Functionality) Performance (Caching, Combine, Minify, JSLint) ---------------- 人工做不了,交给程序去做,这样可以流程化。 Maintainability (Pattern) http://www.jmarshall.com/easy/http/ http://dj…

miniz库简介及使用

miniz:Google开源库,它是单一的C源文件,紧缩/膨胀压缩库,使用zlib兼容API,ZIP归档读写,PNG写方式。关于miniz的更详细介绍可以参考:https://code.google.com/archive/p/miniz/miniz.c is a loss…

iOS之runtime详解api(三)

第一篇我们讲了关于Class和Category的api,第二篇讲了关于Method的api,这一篇来讲关于Ivar和Property。 4.objc_ivar or Ivar 首先,我们还是先找到能打印出Ivar信息的函数: const char * _Nullable ivar_getName(Ivar _Nonnull v) …

亚马逊首席科学家李沐「实训营」国内独家直播,马上报名 !

开学了,别人家的学校都开始人工智能专业的学习之旅了,你呢?近年来,国内外顶尖科技企业的 AI 人才抢夺战愈演愈烈。华为开出200万年薪吸引 AI 人才,今年又有 35 所高校新增人工智能本科专业,众多新生即将开展…

人脸检测库libfacedetection介绍

libfacedetection是于仕琪老师放到GitHub上的二进制库,没有源码,它的License是MIT,可以商用。目前只提供了windows 32和64位的release动态库,主页为https://github.com/ShiqiYu/libfacedetection,采用的算法好像是Mult…

倒计时1天 | 2019 AI ProCon报名通道即将关闭(附参会指南)

2019年9月5-7日,面向AI技术人的年度盛会—— 2019 AI开发者大会 AI ProCon,震撼来袭!2018 年由 CSDN 成功举办 AI 开发者大会一年之后,全球 AI 市场正发生着巨大的变化。顶尖科技企业和创新力量不断地进行着技术的更迭和应用的推…

法院判决:优步无罪,无人车安全员可能面临过失杀人控诉

据路透社报道,负责优步无人车在亚利桑那州致人死亡事件调查的律师事务所发布公开信宣布,优步在事故中“不承担刑事责任”,但是当时在车上的安全员Rafaela Vasquez要接受进一步调查,可能面临车辆过失杀人罪指控。2018年3月&#xf…

09 Storage Structure and Relationships

目标:存储结构:Segments分类:Extents介绍:Blocks介绍:转载于:https://blog.51cto.com/eread/1333894

边界框的回归策略搞不懂?算法太多分不清?看这篇就够了

作者 | fivetrees来源 | https://zhuanlan.zhihu.com/p/76477248本文已由作者授权,未经允许,不得二次转载【导读】目标检测包括目标分类和目标定位 2 个任务,目标定位一般是用一个矩形的边界框来框出物体所在的位置,关于边界框的回…

人脸识别引擎SeetaFaceEngine简介及在windows7 vs2013下的编译

SeetaFaceEngine是开源的C人脸识别引擎,无需第三方库,它是由中科院计算所山世光老师团队研发。它的License是BSD-2.SeetaFaceEngine库包括三个模块:人脸检测(detection)、面部特征点定位(alignment)、人脸特征提取与比对(identification)。人…

当移动数据分析需求遇到Quick BI

我叫洞幺,是一名大型婚恋网站“我在这等你”的资深老员工,虽然在公司五六年,还在一线搬砖。“我在这等你”成立15年,目前积累注册用户高达2亿多,在我们网站成功牵手的用户达2千多万。目前我们的公司在CEO的英名带领下&…

为什么选择数据分析师这个职业?

我为什么选择做数据分析师? 我大学专业是物流管理,学习内容偏向于管理学和经济学,但其实最感兴趣的还是心理学,即人在各种刺激下反应的机制以及原理。做数据分析师,某种意义上是对群体行为的研究和量化,两者…

人脸识别引擎SeetaFaceEngine中Detection模块使用的测试代码

人脸识别引擎SeetaFaceEngine中Detection模块用于人脸检测&#xff0c;以下是测试代码&#xff1a;int test_detection() {std::vector<std::string> images{ "1.jpg", "2.jpg", "3.jpg", "4.jpeg", "5.jpeg", "…

基于Pygame写的翻译方法

发布时间&#xff1a;2018-11-01技术&#xff1a;pygameeasygui概述 实现一个翻译功能&#xff0c;中英文的互相转换。并可以播放翻译后的内容。 翻译接口调用的是百度翻译的api接口。详细 代码下载&#xff1a;http://www.demodashi.com/demo/14326.html 一、需求分析 使用pyg…

冠军奖3万元!CSDN×易观算法大赛开赛啦

伴随着5G、物联网与大数据形成的后互联网格局的逐步形成&#xff0c;日益多样化的用户触点、庞杂的行为数据和沉重的业务体量也给我们的数据资产管理带来了不容忽视的挑战。为了建立更加精准的数据挖掘形式和更加智能的机器学习算法&#xff0c;对不断生成的用户行为事件和各类…

快速把web项目部署到weblogic上

weblogic简介 BEA WebLogic是用于开发、集成、部署和管理大型分布式Web应用、网络应用和数据库应 用的Java应用服务器。将Java的动态功能和Java Enterprise标准的安全性引入大型网络应用的开发、集成、部署和管理之中。 BEA WebLogic Server拥有处理关键Web应用系统问题所需的性…

使GDAL库支持中文路径或中文文件名的处理方法

之前生成的gdal 2.1.1动态库&#xff0c;在通过命令行执行时&#xff0c;遇到有中文路径或中文图像名时&#xff0c;GDALOpen函数不能正确的被调用&#xff0c;如下图&#xff1a;解决方法&#xff1a;1. 在所有使用GDALAllRegister();语句后面加上一句CPLSetConfigOption…

创新工场论文入选NeurIPS 2019,研发最强“AI蒙汗药”

9月4日&#xff0c;被誉为机器学习和神经网络领域的顶级会议之一的 NeurIPS 2019 揭晓收录论文名单&#xff0c;创新工场人工智能工程院的论文《Learning to Confuse: Generating Training Time Adversarial Data with Auto-Encoder》被接收在列。这篇论文围绕现阶段人工智能系…

Flutter环境搭建(Windows)

SDK获取 去官方网站下载最新的安装包 &#xff0c;或者在Github中的Flutter项目去 下载 。 将下载的安装包解压 注意&#xff1a;不要将Flutter安装到高权限路径&#xff0c;例如 C:\Program Files\ 配置环境变量&#xff0c;在Path中添加flutter\bin的全路径(如&#xff1a;D…

Android在eoe分享一篇推荐开发组件或者框架的文章

http://www.eoeandroid.com/thread-311194-1-1.html y4078275315 主题 62 帖子 352 e币实习版主 积分314发消息电梯直达楼主 回复 发表于 2013-11-7 09:58:45 | 只看该作者 |只看大图 34本帖最后由 y407827531 于 2013-11-28 15:07 编辑感谢版主推荐&#xff0c;本贴会持续更新…

如何打造高质量的机器学习数据集?这份超详指南不可错过

作者 | 周岩&#xff0c;夕小瑶&#xff0c;霍华德&#xff0c;留德华叫兽转载自知乎博主『运筹OR帷幄』导读&#xff1a;随着计算机行业的发展&#xff0c;人工智能和数据科学近几年成为了学术和工业界关注的热点。特别是这些年人工智能的发展日新月异&#xff0c;每天都有新的…

人脸识别引擎SeetaFaceEngine中Alignment模块使用的测试代码

人脸识别引擎SeetaFaceEngine中Alignment模块用于检测人脸关键点&#xff0c;包括5个点&#xff0c;两个眼的中心、鼻尖、两个嘴角&#xff0c;以下是测试代码&#xff1a;int test_alignment() {std::vector<std::string> images{ "1.jpg", "2.jpg"…

微软宣布 Win10 设备数突破8亿,距离10亿还远吗?

开发四年只会写业务代码&#xff0c;分布式高并发都不会还做程序员&#xff1f; >>> 微软高管 Yusuf Mehdi 昨天在推特发布了一条推文&#xff0c;宣布运行 Windows 10 的设备数已突破 8 亿&#xff0c;比半年前增加了 1 亿。 根据之前的报道&#xff0c;两个月前 W…

领导者必须学会做的十件事情

在这个世界上你可以逃避很多事情并且仍然能够获得成功&#xff0c;但是有些事情你根本就无法走捷径。商业领导者们并不存在于真空之中。成功总是与竞争有关。你可能拥有伟大的产品、服务、概念、战略、团队等等&#xff0c;如果它不能从某种程度上超越竞争对手&#xff0c;这对…

人脸识别引擎SeetaFaceEngine中Identification模块使用的测试代码

人脸识别引擎SeetaFaceEngine中Identification模块用于比较两幅人脸图像的相似度&#xff0c;以下是测试代码&#xff1a;int test_recognize() {const std::string path_images{ "E:/GitCode/Face_Test/testdata/recognization/" };seeta::FaceDetection detector(&…

李沐亲授加州大学伯克利分校深度学习课程移师中国,现场资料新鲜出炉

2019 年 9 月 5 日&#xff0c;AI ProCon 2019 在北京长城饭店正式拉开帷幕。大会的第一天&#xff0c;以亚马逊首席科学家李沐面对面亲自授课完美开启&#xff01;“大神”&#xff0c;是很多人对李沐的印象。除了是亚马逊首席科学家李&#xff0c;李沐还拥有多重身份&#xf…