ankur's blog
Working with Zend Lucene Search
Zend Search Lucene (ZSL) is a fulltext search engine written in PHP; it is a PHP implementation of the Java-based Apache Lucene. ZSL is part of Zend Framework (ZF), and large parts of ZF, ZSL included, can be used standalone and integrated into any existing application.

Before getting into the details of ZSL, we should first understand what a fulltext search is. It is a combination of two steps: indexing content and then searching that index. In ZSL, an index is made up of documents, and each document contains fields, which hold the content that is queried. Web applications generally keep their content in databases, so an index can be thought of as a table/view and its documents as the records in that table. However, ZSL is not limited to indexing tables. We can also index HTML documents, XLS sheets, Word documents, PowerPoint slides, PDFs etc.

Present-day web applications store content in powerful RDBMSs like MySQL/PostgreSQL, which have native support for fulltext search. This leads to a genuine question: why should the development team invest in a third-party library for fulltext search? ZSL wins over MySQL (and most popular RDBMSs) for several reasons:
- ZSL returns a ranked result-set.
- ZSL searches for keywords in indexed documents stored on the web server's file system, instead of fetching content from the DB over established connections. This can make ZSL faster than a conventional database search.
- MySQL fulltext search works with the MyISAM storage engine only (InnoDB gained FULLTEXT indexes only in MySQL 5.6).
- In MySQL, search performance decreases drastically as document size increases, because the indexes are cached in RAM.
- In MySQL, an INSERT into a table with a fulltext index becomes very slow because the whole index is rebuilt.
- In MySQL, stop words and tokenization are hard to customize per application, while in ZSL we can define our own stop words and tokens.
- ZSL offers many advanced features like proximity search, wildcard search, similarity search, range search, phrase queries etc.
- ZSL has a simple but very powerful query syntax.
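To make the "index first, then search" idea and the ranked result set more concrete, here is a minimal sketch in plain PHP (7.1 or later). It is not the ZSL API: the function names `tokenize`, `buildIndex` and `search` are hypothetical, and the score is simply the number of term occurrences, which is far cruder than Lucene's real scoring.

```php
<?php
// Toy fulltext search: hypothetical helpers, not Zend_Search_Lucene.

/** Split text into lower-case word tokens (a very rough analyzer). */
function tokenize(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

/** Build an inverted index: term => [docId => number of occurrences]. */
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docId => $fields) {
        foreach ($fields as $content) {
            foreach (tokenize($content) as $term) {
                $index[$term][$docId] = ($index[$term][$docId] ?? 0) + 1;
            }
        }
    }
    return $index;
}

/** Return matching document ids, best match first (score = summed term counts). */
function search(array $index, string $query): array
{
    $scores = [];
    foreach (tokenize($query) as $term) {
        foreach ($index[$term] ?? [] as $docId => $count) {
            $scores[$docId] = ($scores[$docId] ?? 0) + $count;
        }
    }
    arsort($scores); // highest score first
    return array_keys($scores);
}

// Each document is a set of named fields, as in ZSL.
$documents = [
    'doc1' => ['title' => 'Lucene in PHP', 'body' => 'Zend Search Lucene is a search engine.'],
    'doc2' => ['title' => 'MySQL notes',   'body' => 'Fulltext search in MySQL.'],
];
$index = buildIndex($documents);
print_r(search($index, 'search lucene'));
```

The query is answered from the prebuilt term map alone; the original documents are never scanned at search time, which is the point of indexing.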
Let us see how ZSL works. As discussed above, ZSL first builds an index and then runs search queries against it. So let us create the index first:
// Function to create the index.
public function createindexAction() {
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
    $patientModel = $this->_getPatientTable(); // Instantiate the patient table object.
    $patientArray = $patientModel->fetchAllRecordsForIndex();
    // Specify the path where the index will be stored on the file system.
    $patientIndexPath = Zend_Registry::get('configuration')->patient_index_path;
    // The second argument (true) creates a new index at that path.
    $index = new Zend_Search_Lucene($patientIndexPath, true);
    foreach ($patientArray as $patientRow) {
        $index->addDocument($this->generateZSLDocument($patientRow));
    }
    $index->commit();
    $index->optimize();
    echo "Indexes created successfully.";
}
public function generateZSLDocument($patientRow) {
    $doc = new Zend_Search_Lucene_Document();
    // These fields are tokenized and indexed, and are stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::Text('PTPTNO', $patientRow['PTPTNO']));
    $doc->addField(Zend_Search_Lucene_Field::Text('PTPLN', $patientRow['PTPLN']));
    $doc->addField(Zend_Search_Lucene_Field::Text('PTPFN', $patientRow['PTPFN']));
    $doc->addField(Zend_Search_Lucene_Field::Text('PTPDOB', $patientRow['PTPDOB']));
    $doc->addField(Zend_Search_Lucene_Field::Text('PTPSSN', $patientRow['PTPSSN']));
    // These fields are tokenized and indexed, but are not stored in the index.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('PTCHRT', $patientRow['PTCHRT']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('PPPHPH', $patientRow['PPPHPH']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('PPPWPH', $patientRow['PPPWPH']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('PPPCPH', $patientRow['PPPCPH']));
    return $doc;
}
While adding fields to a document, it is very important to specify the appropriate field types. Remember: the larger the index, the slower the response. Indexes are meant for searching, not for bulk storage. Suppose we have three fields in a table (file_id, file_name, file_data) and we want to fetch the name of the file whose content is the closest match for the query. Then we should add file_id as Keyword, file_name as UnIndexed, and file_data as UnStored. The query will return the file_id and file_name of each match; if we later need the file content, we can fetch it from the DB table for that particular file. The following table summarizes the different field types:
| Field type | Stored | Indexed | Tokenized |
|---|---|---|---|
| Keyword | yes | yes | no |
| UnIndexed | yes | no | no |
| Binary | yes | no | no |
| Text | yes | yes | yes |
| UnStored | no | yes | yes |
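As a rough illustration of what the three flags in the table mean for the file_id / file_name / file_data example, the following plain PHP (7.1 or later) sketch splits a document into the values kept for retrieval and the terms made searchable. The constant `FIELD_TYPES` and the function `processDocument` are hypothetical names for this sketch only; ZSL handles this internally.

```php
<?php
// Flags per field type, mirroring the table: [stored, indexed, tokenized].
const FIELD_TYPES = [
    'Keyword'   => [true,  true,  false],
    'UnIndexed' => [true,  false, false],
    'Binary'    => [true,  false, false],
    'Text'      => [true,  true,  true],
    'UnStored'  => [false, true,  true],
];

/**
 * Split a document into the values kept for retrieval ("stored")
 * and the terms made searchable ("terms").
 * $fields maps a field name to [type, value].
 */
function processDocument(array $fields): array
{
    $stored = [];
    $terms = [];
    foreach ($fields as $name => [$type, $value]) {
        [$isStored, $isIndexed, $isTokenized] = FIELD_TYPES[$type];
        if ($isStored) {
            $stored[$name] = $value;
        }
        if ($isIndexed) {
            // Tokenized fields are split into words; others are indexed as one term.
            $terms[$name] = $isTokenized
                ? preg_split('/\W+/', strtolower($value), -1, PREG_SPLIT_NO_EMPTY)
                : [$value];
        }
    }
    return ['stored' => $stored, 'terms' => $terms];
}

$result = processDocument([
    'file_id'   => ['Keyword',   '42'],
    'file_name' => ['UnIndexed', 'report.pdf'],
    'file_data' => ['UnStored',  'Quarterly sales report'],
]);
print_r($result);
```

Here file_data is searchable but takes no storage space in the hit, while file_name comes back with each hit without adding anything to the searchable terms.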
How to fetch from the index created above:
public function patientsearchAction() {
    $field = $this->_getParam('fieldToSearch');
    // Terms built through the query API are not analyzed, so lower-case the
    // search string to match the terms written by the case-insensitive analyzer.
    $str = strtolower(trim($this->_getParam('searchStr'))) . '*';
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
    $this->_helper->layout->disableLayout();
    $this->_helper->viewRenderer->setNoRender(true);
    if ($field == 'PHONE') {
        // Remove whitespace, parentheses and hyphens from the phone number.
        $str = preg_replace('/[\s()-]+/', '', $str);
        $query = new Zend_Search_Lucene_Search_Query_Boolean();
        // Search the home, work and cell phone fields.
        foreach (array('PPPHPH', 'PPPWPH', 'PPPCPH') as $phoneField) {
            $queryPattern = new Zend_Search_Lucene_Index_Term($str, $phoneField);
            $subQuery = new Zend_Search_Lucene_Search_Query_Wildcard($queryPattern);
            // null specifies that the subquery is neither required nor prohibited.
            $query->addSubquery($subQuery, null);
        }
    } else {
        $pattern = new Zend_Search_Lucene_Index_Term($str, $field);
        $query = new Zend_Search_Lucene_Search_Query_Wildcard($pattern);
    }
    $index = Zend_Search_Lucene::open(
        Zend_Registry::get('configuration')->patient_index_path);
    // find() accepts a query object as well as a query string.
    $hits = $index->find($query);
    $patientArray = array();
    foreach ($hits as $hit) {
        // Append one row per hit; only stored fields can be read back.
        $patientArray[] = array(
            'PTPTNO' => $hit->PTPTNO,
            'PTPLN'  => $hit->PTPLN,
            'PTPFN'  => $hit->PTPFN,
            'PTPDOB' => $hit->PTPDOB,
            'PTPSSN' => $hit->PTPSSN,
        );
    }
    $jsonData = Zend_Json::encode($patientArray);
    $this->_response->appendBody($jsonData);
}
A ZSL index segment is not updatable by nature, so adding a new document to an index always generates a new segment. This in turn decreases index quality. So, to restore performance, optimize the index, which merges the segments into one.
// Function to update the index. It is called for both add and edit patient.
public function updateindexAction() {
    $patientIdToUpdate = $this->_getParam('PTPTNO');
    Zend_Search_Lucene_Analysis_Analyzer::setDefault(
        new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive());
    $patientIndexPath = Zend_Registry::get('configuration')->patient_index_path;
    $index = Zend_Search_Lucene::open($patientIndexPath);
    $hits = $index->find('PTPTNO:' . $patientIdToUpdate);
    // Delete the existing index document(s) for the updated patient.
    foreach ($hits as $hit) {
        $index->delete($hit->id);
    }
    $patientModel = $this->_getPatientTable();
    // Fetch the current row and add it back as a fresh document.
    $patientRow = $patientModel->fetchPatientById($patientIdToUpdate);
    $index->addDocument($this->generateZSLDocument($patientRow));
    $index->commit();
    $index->optimize();
    echo "Indexes updated successfully.";
}
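The delete-then-add update pattern and the effect of optimization can be pictured with a small toy model in plain PHP (7.1 or later). This is only a sketch under simplifying assumptions: the class `ToyIndex` and its methods are hypothetical, every added document becomes its own segment, and deleted documents are merely marked until the segments are merged.

```php
<?php
// Toy model of append-only segments; hypothetical names, not the ZSL API.
final class ToyIndex
{
    private $segments = [];    // list of segments, each [docNumber => content]
    private $deleted = [];     // docNumber => true for documents marked deleted
    private $nextDocNumber = 0;

    /** Segments are never modified in place, so every add creates a new one. */
    public function addDocument(string $content): int
    {
        $docNumber = $this->nextDocNumber++;
        $this->segments[] = [$docNumber => $content];
        return $docNumber;
    }

    /** Deleting only marks the document; its segment stays untouched. */
    public function delete(int $docNumber): void
    {
        $this->deleted[$docNumber] = true;
    }

    public function segmentCount(): int
    {
        return count($this->segments);
    }

    /** Merge all segments into one, dropping documents marked as deleted. */
    public function optimize(): void
    {
        $merged = $this->liveDocuments();
        $this->segments = $merged === [] ? [] : [$merged];
        $this->deleted = [];
    }

    /** All documents that have not been deleted, keyed by document number. */
    public function liveDocuments(): array
    {
        $live = [];
        foreach ($this->segments as $segment) {
            foreach ($segment as $docNumber => $content) {
                if (!isset($this->deleted[$docNumber])) {
                    $live[$docNumber] = $content;
                }
            }
        }
        return $live;
    }
}

$toy = new ToyIndex();
$old = $toy->addDocument('Smith, John');
$toy->addDocument('Doe, Jane');
// An "update" is a delete followed by an add, which creates one more segment.
$toy->delete($old);
$toy->addDocument('Smith, Jonathan');
$segmentsBefore = $toy->segmentCount();
$toy->optimize();
$segmentsAfter = $toy->segmentCount();
```

In this model a search would have to visit every segment, so the fewer segments there are, the less work each query does.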
A few quick points to be aware of:
- Index quality is completely determined by the number of segments.
- Index size is limited to 2 GB on 32-bit platforms.
- Index optimization is the process of merging several segments into one.
- The minimum prefix length of a wildcard search word is 3 characters, but this is configurable (Zend_Search_Lucene_Search_Query_Wildcard::setMinPrefixLength()).
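The minimum-length rule for wildcard searches can be sketched in plain PHP as follows. This is an illustrative approximation, not ZSL's own code: `wildcardPrefixLength` and `matchWildcard` are hypothetical helpers, and the default of 3 is taken from the point above.

```php
<?php
/** Length of the literal prefix before the first wildcard (* or ?). */
function wildcardPrefixLength(string $pattern): int
{
    return strcspn($pattern, '*?');
}

/**
 * Match a wildcard pattern against a list of indexed terms, rejecting
 * patterns whose literal prefix is shorter than $minPrefixLength.
 */
function matchWildcard(string $pattern, array $terms, int $minPrefixLength = 3): array
{
    if (wildcardPrefixLength($pattern) < $minPrefixLength) {
        throw new InvalidArgumentException(
            "At least $minPrefixLength non-wildcard characters are required at the start of the pattern."
        );
    }
    // Translate the wildcard pattern into an anchored regular expression:
    // * matches any run of characters, ? matches exactly one character.
    $regex = '/^' . str_replace(['\*', '\?'], ['.*', '.'], preg_quote($pattern, '/')) . '$/';
    return array_values(preg_grep($regex, $terms));
}

$terms = ['smith', 'smithson', 'smyth', 'jones'];
print_r(matchWildcard('smi*', $terms));

try {
    matchWildcard('sm*', $terms); // prefix "sm" is too short
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

Requiring a literal prefix keeps the number of index terms that must be compared against the pattern small, which is presumably why a short prefix is rejected by default.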
This blog post is based on an excerpt of a session on Zend Lucene Search given by Ankur Aeren at OSScamp Delhi, September 2009.