Overview

Introduction

Zend_Search_Lucene is a general-purpose text search engine written entirely in PHP 5. Because it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:
Document and Field Objects

Zend_Search_Lucene operates with documents as atomic objects for indexing. A document is divided into named fields, and fields have content that can be searched.

A document is represented by the Zend_Search_Lucene_Document class, and objects of this class contain instances of Zend_Search_Lucene_Field that represent the fields of the document.

It is important to note that any information can be added to the index. Application-specific information or metadata can be stored in the document fields and later retrieved with the document during search.

It is the responsibility of your application to control the indexer. This means that data can be indexed from any source that is accessible to your application, for example the filesystem, a database, or an HTML form.

The Zend_Search_Lucene_Field class provides several static methods to create fields with different characteristics:
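As a rough sketch of how these factory methods are typically used, the snippet below builds one document containing each of the five field types (Keyword, UnIndexed, Binary, Text and UnStored). The field names and values are invented for illustration, and the require_once line assumes Zend Framework 1.x is on the include_path.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

$doc = new Zend_Search_Lucene_Document();

// Keyword: stored and indexed, but not tokenized (exact-match values such as URLs or IDs).
$doc->addField(Zend_Search_Lucene_Field::Keyword('url', 'http://example.com/articles/42'));

// UnIndexed: stored only; returned with search hits but not searchable.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', (string) time()));

// Binary: stored only, kept as raw bytes; this method takes no $encoding argument.
$doc->addField(Zend_Search_Lucene_Field::Binary('thumbnail', "\x00\x01\x02"));

// Text: stored, indexed and tokenized (short searchable text you also want to display).
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Indexing with Zend_Search_Lucene'));

// UnStored: indexed and tokenized, but not stored (large bodies kept elsewhere).
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'Full article text goes here.'));
```

Keyword and UnIndexed fields suit metadata that must come back unchanged with a hit, while UnStored keeps the index small when the original text can be fetched from its source.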
Each of these methods (excluding the Zend_Search_Lucene_Field::Binary() method) has an optional $encoding parameter for specifying input data encoding. Encoding may differ for different documents as well as for different fields within one document:
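A minimal sketch of mixed encodings within one document follows. The sample strings are invented, and the byte sequences are spelled out so that the source encoding of each value is unambiguous; Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

$title    = "Caf\xE9 menu";                        // ISO-8859-1 bytes
$contents = "Cr\xC3\xA8me br\xC3\xBBl\xC3\xA9e";   // UTF-8 bytes

$doc = new Zend_Search_Lucene_Document();

// The third argument names the encoding of the value being passed in.
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));

// The field can report its value converted to UTF-8.
$utf8Title = $doc->getFieldUtf8Value('title');
```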
If the encoding parameter is omitted, the current locale is used at processing time. For example:
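The following sketch relies on the locale instead of an explicit encoding. The locale name 'de_DE.iso-8859-1' is only an example and may not be installed on a given system, in which case setlocale() returns false and the previous locale stays in effect.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

// Example locale only; setlocale() returns false if it is not installed.
$locale = setlocale(LC_ALL, 'de_DE.iso-8859-1');

$doc = new Zend_Search_Lucene_Document();

// No $encoding argument: the value is interpreted using the current locale
// when the field is processed.
$doc->addField(Zend_Search_Lucene_Field::Text('title', 'Document title'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'Document contents'));
```

Because the result depends on the server's locale settings, passing the encoding explicitly is usually the more predictable choice.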
Fields are always stored in and returned from the index in UTF-8 encoding. Any required conversion to UTF-8 happens automatically.

Text analyzers (see below) may also convert text to other encodings. In fact, the default analyzer converts text to 'ASCII//TRANSLIT' encoding. Be careful, however: this transliteration may depend on the current locale.

Field names are defined at your discretion in the addField() method.

Java Lucene uses the 'contents' field as the default field to search. Zend_Search_Lucene searches through all fields by default, but this behavior is configurable. See the "Default search field" chapter for details.

Understanding Field Types
HTML documents

Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string.

The Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so the HTML does not need to be well formed or to be XHTML. On the other hand, it is sensitive to the encoding specified by the "meta http-equiv" header tag.

The Zend_Search_Lucene_Document_Html class recognizes the document title, body, and header meta tags.

The 'title' field is the /html/head/title value. It is stored within the index, tokenized, and available for search.

The 'body' field is the body content of the HTML file or string. It does not include scripts, comments, or attributes.

The loadHTML() and loadHTMLFile() methods of the Zend_Search_Lucene_Document_Html class take an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

The optional third parameter of loadHTML() and loadHTMLFile() specifies the source HTML document encoding. It is used only if the encoding is not specified with a Content-type HTTP-EQUIV meta tag.

Other document header meta tags produce additional document fields. The field name is taken from the 'name' attribute, and the 'content' attribute supplies the field value. These fields are tokenized, indexed, and stored, so documents may be searched by their meta tags (for example, by keywords).

Parsed documents may be augmented by the programmer with any other field:
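A short sketch of the whole flow: parse an HTML string, ask for the body to be stored, supply a fallback encoding, and then add application-specific fields. The markup, the 'created' and 'annotation' field names, and their values are illustrative only; Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

$html = '<html><head><title>Example page</title>'
      . '<meta name="keywords" content="lucene, search">'
      . '</head><body><p>Hello world</p></body></html>';

// Second argument TRUE: store the body as well as indexing it.
// Third argument: encoding to assume when the markup declares none.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($html, true, 'UTF-8');

// Add fields the parser does not produce on its own.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', (string) time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'));

// Fields now present include 'title', 'body', 'keywords', 'created', 'annotation'.
$fieldNames = $doc->getFieldNames();
```

loadHTMLFile() accepts the same second and third arguments, with a file path in place of the HTML string.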
Document links are not included in the generated document, but may be retrieved with the Zend_Search_Lucene_Document_Html::getLinks() and Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:
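For illustration, the sketch below parses a small, made-up page and reads back both kinds of links. It assumes Zend Framework 1.x is on the include_path.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

$html = '<html><head><title>Links</title>'
      . '<link rel="stylesheet" href="/css/site.css">'
      . '</head><body>'
      . '<a href="http://example.com/docs">Docs</a>'
      . '</body></html>';

$doc = Zend_Search_Lucene_Document_Html::loadHTML($html);

// Links found in the document body (anchor hrefs).
$links = $doc->getLinks();

// Links declared by <link> elements in the document head.
$headerLinks = $doc->getHeaderLinks();
```

A crawler-style indexer could feed the returned arrays back into its queue of pages to fetch.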
Starting from Zend Framework 1.6 it's also possible to exclude links marked "nofollow" from the returned link lists.
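The nofollow behavior can be sketched as follows, using the static setExcludeNoFollowLinks() setter that pairs with the getter described here. The flag is class-wide, so the example restores its previous value afterwards. The sample markup is invented, and Zend Framework 1.6 or later on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.6+ is available on the include_path.
require_once 'Zend/Search/Lucene.php';

$html = '<html><head><title>Links</title></head><body>'
      . '<a href="http://example.com/docs">Docs</a>'
      . '<a href="http://example.com/ads" rel="nofollow">Sponsored</a>'
      . '</body></html>';

// Remember the current flag, then turn nofollow exclusion on.
$previous = Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks();
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks(true);

$doc   = Zend_Search_Lucene_Document_Html::loadHTML($html);
$links = $doc->getLinks();   // the rel="nofollow" link is left out

// Restore the flag so other code is not affected.
Zend_Search_Lucene_Document_Html::setExcludeNoFollowLinks($previous);
```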
The Zend_Search_Lucene_Document_Html::getExcludeNoFollowLinks() method returns the current state of the "Exclude nofollow links" flag.

Word 2007 documents

Zend_Search_Lucene offers a Word 2007 parsing feature. Documents can be created directly from a Word 2007 file:
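A minimal sketch of loading a .docx file and adding it to an index. The helper name indexDocxFile() is hypothetical, and Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: parse a Word 2007 file and add it to an index.
 */
function indexDocxFile(Zend_Search_Lucene_Interface $index, $path)
{
    // Parses document properties and body text from the .docx package.
    $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path);
    $index->addDocument($doc);

    return $doc;
}
```

It might be called as indexDocxFile(Zend_Search_Lucene::create('/tmp/docs-index'), '/path/to/report.docx'), where both paths are placeholders.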
Zend_Search_Lucene_Document_Docx class uses the
The Zend_Search_Lucene_Document_Docx class recognizes document metadata and document text. Depending on the document contents, the metadata consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, and created.

The 'filename' field is the Word 2007 file name. The 'title' field is the document title. The 'subject' field is the document subject. The 'creator' field is the document creator. The 'keywords' field contains the document keywords. The 'description' field is the document description. The 'lastModifiedBy' field is the name of the user who last modified the document. The 'revision' field is the document revision number. The 'modified' field is the date and time the document was last modified. The 'created' field is the date and time the document was created.

The 'body' field is the body content of the Word 2007 document. It includes only normal text; comments and revisions are not included.

The loadDocxFile() method of the Zend_Search_Lucene_Document_Docx class takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:
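The sketch below stores the body and then adds two extra fields. The helper name loadAnnotatedDocx() and the 'indexTime' and 'annotation' field names are hypothetical; Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: load a .docx file with a stored body and extra fields.
 */
function loadAnnotatedDocx($path, $annotation)
{
    // TRUE as the second argument stores the body in addition to indexing it.
    $doc = Zend_Search_Lucene_Document_Docx::loadDocxFile($path, true);

    // Stored but not searchable: when this document was indexed.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('indexTime', (string) time()));

    // Stored and searchable: a free-text note supplied by the application.
    $doc->addField(Zend_Search_Lucene_Field::Text('annotation', $annotation));

    return $doc;
}
```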
PowerPoint 2007 documents

Zend_Search_Lucene offers a PowerPoint 2007 parsing feature. Documents can be created directly from a PowerPoint 2007 file:
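A minimal sketch of loading a .pptx file and adding it to an index. The helper name indexPptxFile() is hypothetical, and Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: parse a PowerPoint 2007 file and add it to an index.
 */
function indexPptxFile(Zend_Search_Lucene_Interface $index, $path)
{
    // Parses document properties plus the text of the presentation.
    $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($path);
    $index->addDocument($doc);

    return $doc;
}
```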
Zend_Search_Lucene_Document_Pptx class uses the
The Zend_Search_Lucene_Document_Pptx class recognizes document metadata and document text. Depending on the document contents, the metadata consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, and created.

The 'filename' field is the PowerPoint 2007 file name. The 'title' field is the document title. The 'subject' field is the document subject. The 'creator' field is the document creator. The 'keywords' field contains the document keywords. The 'description' field is the document description. The 'lastModifiedBy' field is the name of the user who last modified the document. The 'revision' field is the document revision number. The 'modified' field is the date and time the document was last modified. The 'created' field is the date and time the document was created.

The 'body' field is the content of all slides and slide notes in the PowerPoint 2007 document.

The loadPptxFile() method of the Zend_Search_Lucene_Document_Pptx class takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:
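One possible augmentation is a Keyword field holding the source path, which gives the application an exact-match handle on the document later. The helper name loadPptxWithPath() and the 'path' field are hypothetical; Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: load a .pptx file and tag it with its source path.
 */
function loadPptxWithPath($path)
{
    $doc = Zend_Search_Lucene_Document_Pptx::loadPptxFile($path);

    // Keyword fields are stored and indexed without tokenizing,
    // so the path can be matched exactly in a later query.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('path', $path));

    return $doc;
}
```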
Excel 2007 documents

Zend_Search_Lucene offers an Excel 2007 parsing feature. Documents can be created directly from an Excel 2007 file:
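A minimal sketch of loading an .xlsx file and adding it to an index. The helper name indexXlsxFile() is hypothetical, and Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: parse an Excel 2007 file and add it to an index.
 */
function indexXlsxFile(Zend_Search_Lucene_Interface $index, $path)
{
    // Parses document properties plus the text of the workbook.
    $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);
    $index->addDocument($doc);

    return $doc;
}
```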
Zend_Search_Lucene_Document_Xlsx class uses the
The Zend_Search_Lucene_Document_Xlsx class recognizes document metadata and document text. Depending on the document contents, the metadata consists of filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, and created.

The 'filename' field is the Excel 2007 file name. The 'title' field is the document title. The 'subject' field is the document subject. The 'creator' field is the document creator. The 'keywords' field contains the document keywords. The 'description' field is the document description. The 'lastModifiedBy' field is the name of the user who last modified the document. The 'revision' field is the document revision number. The 'modified' field is the date and time the document was last modified. The 'created' field is the date and time the document was created.

The 'body' field is the content of all cells in all worksheets of the Excel 2007 document.

The loadXlsxFile() method of the Zend_Search_Lucene_Document_Xlsx class takes an optional second argument. If it is set to TRUE, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

Parsed documents may be augmented by the programmer with any other field:
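As a sketch, the helper below adds a caller-supplied set of extra fields to a parsed workbook and reports which field names the document ends up with, since the available metadata depends on the file. The helper name loadXlsxWithFields() and the example field names are hypothetical; Zend Framework 1.x on the include_path is assumed.

```php
<?php
// Assumption: Zend Framework 1.x is available on the include_path.
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: load an .xlsx file and add extra searchable fields.
 *
 * @param string $path   Path to the workbook
 * @param array  $extras Map of field name => text value
 */
function loadXlsxWithFields($path, array $extras)
{
    $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($path);

    foreach ($extras as $name => $value) {
        // Text fields are stored, indexed and tokenized.
        $doc->addField(Zend_Search_Lucene_Field::Text($name, $value));
    }

    // Metadata fields vary with the file, so return the names actually present.
    return array($doc, $doc->getFieldNames());
}
```

A call such as loadXlsxWithFields('/path/to/budget.xlsx', array('department' => 'Finance')) uses placeholder values throughout.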