Monday, 31 October 2011

Lucene: sample Java code to index a file folder


Below is sample Lucene code that indexes the files inside a folder. For each file it creates fields for the file path, file title, last-modified date and file contents.

The program expects the index path (where the index files will be created) and the path of the folder to index as program arguments, like "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH]". If -index is omitted, the index is written to a folder named "index"; -docs is required.

The code walks through every file in the folder, recursing into subfolders, and calls the method indexDoc() for each file. That method creates the fields listed above and adds them to a Document object. Each file therefore gets its own Document, and every Document is added to the IndexWriter.

Below is a screenshot of the indexed file folder:



import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class IndexFiles {
 public static void main(String[] args) {
  String usage = "java IndexFiles  [-index INDEX_PATH] [-docs DOCS_PATH] \n\n"
   + "This indexes the documents in DOCS_PATH, creating a Lucene index in "
   + "INDEX_PATH that can be searched with SearchFiles";
  String indexPath = "index";
  String docsPath = null;
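  for (int i = 0; i < args.length; i++) {
   // Only read a value if one actually follows the flag
   if ("-index".equals(args[i]) && i + 1 < args.length) {
    indexPath = args[i + 1];
    i++;
   } else if ("-docs".equals(args[i]) && i + 1 < args.length) {
    docsPath = args[i + 1];
    i++;
   }
  }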
  if (docsPath == null) {
   System.err.println("Usage: " + usage);
   System.exit(1);
  }
  final File docDir = new File(docsPath);
  if (!docDir.exists() || !docDir.canRead()) {
   System.out.println("Document directory "
   + docDir.getAbsolutePath()
   + " does not exist or is not readable, please check the path");
   System.exit(1);
  }
  Date start = new Date();
  try {
   System.out.println("Indexing to directory '" + indexPath + "'...");
   Directory dir = FSDirectory.open(new File(indexPath));

   Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
   IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_31,analyzer);
   iwc.setOpenMode(OpenMode.CREATE);
   IndexWriter writer = new IndexWriter(dir, iwc);
   findFilesAndIndex(writer, docDir);

   writer.close();
   Date end = new Date();
   System.out.println(end.getTime() - start.getTime()+ " total milliseconds");
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass()+ "\n with message: " + e.getMessage());
  }
 }

 static void findFilesAndIndex(IndexWriter writer, File file) throws IOException {
  FileInputStream fis = null;
  try {
   if (file.canRead()) {
    if (file.isDirectory()) {
     String[] files = file.list();
     if (files != null) {
      for (int i = 0; i < files.length; i++) {
       findFilesAndIndex(writer, new File(file, files[i]));
      }
     }
    } else {
     fis = new FileInputStream(file);
     indexDoc(writer, file, fis);
    }
   }
  } catch (IOException e) {
   System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
  } finally {
   if (fis != null) {
    fis.close();
   }
  }
 }

 static void indexDoc(IndexWriter writer, File file,FileInputStream fis) throws IOException {
  Document doc = new Document();
  Field pathField = new Field("path", file.getPath(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  pathField.setOmitTermFreqAndPositions(true);
  doc.add(pathField);

  Field titleField = new Field("title", file.getName(),Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
  titleField.setOmitTermFreqAndPositions(true);
  doc.add(titleField);

  NumericField modifiedField = new NumericField("modified");
  modifiedField.setLongValue(file.lastModified());
  doc.add(modifiedField);

  doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));

  System.out.println("adding " + file);
  writer.addDocument(doc);
 }
}

Friday, 14 October 2011

Good features of Eclipse 3.5 (Eclipse Galileo) JDT


This post lists the new features of Eclipse Galileo JDT. I will write another post about the features of Eclipse Helios and Eclipse Indigo.

Read about Eclipse Helios features @ http://tips4ufromsony.blogspot.com/2011/11/good-features-of-eclipse-36-eclipse.html

==========================================================
1. Toggle Breadcrumb --> Shows the file name and the method at your cursor position in a bar at the top of the editor. From there you can go to other methods, other classes in the same package, and so on.

Screen shot of Toggle Breadcrumb:



==========================================================
2. From a method call, you can go either to its declaration or to its implementation

Screen shot of implementation call:



==========================================================
3. Advanced Open Type --> You can restrict Open Type to a selected working set only.

Screen shot of Advanced Open Type:



==========================================================
4. Embedded Telnet connection window --> You can open a Telnet connection as a window inside Eclipse

Screen shot of Telnet connection:



==========================================================
5. Embedded SQL developer --> You can view the database tables, run queries, and see the history of the queries you ran along with their results

Screen shot of Sql Developer:



==========================================================
6. Enhanced Local History --> You can view the changes you made to a file at every save. Just as with ClearCase, SVN or CVS, you can compare against previous versions of the file to see the changes line by line

Screen shot of Enhanced Local History:



==========================================================
7. A new property window to view the properties of the selected file

Screen shot of property window:



==========================================================

8. Exclude selected packages or files from the build path

Screen shot of Exclusion of build path:



==========================================================
9. Ctrl+3 --> Quick access to the available screens by typing their first letters

Screen shot of Ctrl+3:



==========================================================

10. XML files can be opened in a Design view

Screen shot of XML Design View:



==========================================================

11. Quick search in Window > Preferences

Screen shot of quick search in Window > Preferences:



Monday, 10 October 2011

Apache Lucene quick links






Thursday, 6 October 2011

Apache Lucene Search Engine’s Features


Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java, originally by Doug Cutting. It started out as part of the Apache Jakarta Project and is now a top-level Apache project. While suitable for any application that requires full-text indexing and searching, Lucene is widely recognized for its use in Internet search engines and in local, single-site search. The name Lucene is the middle name of Doug Cutting's wife!

Features

1. Scalable, High-Performance Indexing

  • Over 95GB/hour on modern hardware
  • Small RAM requirements — only 1MB heap
  • Incremental indexing as fast as batch indexing
  • Index size roughly 20-30% the size of text indexed


2. Powerful, Accurate and Efficient Search Algorithms

  • Ranked searching — best results returned first
  • Sorting by any field
  • Multiple-index searching with merged results
  • Allows simultaneous update and searching


3. Flexible Queries

  • Phrase queries --> like "star wars" --> search for the exact phrase star wars.
  • Wildcard queries --> like star* or sta? --> * stands for any number of characters and ? for exactly one character in the search word
  • Fuzzy queries --> like star~0.8 --> search for words spelled similarly to star; the optional number (between 0 and 1) sets how close the match must be
  • Proximity queries --> like "star wars"~10 --> search for "star" and "wars" within 10 words of each other in a document
  • Range queries --> like {star TO stun} --> search for documents whose field value lies between star and stun. Exclusive ranges are denoted by curly brackets
  • Fielded searching --> like title:star --> search in a named field such as title, author or contents
  • Date-range searching --> like [2006 TO 2007] --> search for documents with a field value between 2006 and 2007. Inclusive ranges are denoted by square brackets
  • Boolean operators --> like star AND wars. OR is the default operator when none is given between terms.
  • Boosting a term --> like star^4 wars --> make documents containing the term star more relevant
  • + operator --> like +star wars --> search for documents that must contain "star" and may contain "wars"
  • - operator --> like star -wars --> search for documents that contain "star" but not "wars"
  • Grouping --> like (star AND wars) OR website --> use parentheses to group clauses into sub-queries
  • Escaping special characters --> The current special characters are + - && || ! ( ) { } [ ] ^ " ~ * ? : \ . To escape one of these characters, put a \ before it.
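
The escaping rule in the last item can be sketched in a few lines of plain Java. Lucene's QueryParser already provides a static escape() method for this purpose, so the class below (QueryEscaper, a hypothetical name) is only a standard-library illustration of the idea, not Lucene's implementation. It assumes that escaping each & and | individually is enough to cover && and ||.

```java
public class QueryEscaper {
    // Single-character specials from the list above; "&&" and "||" are
    // covered because each '&' and '|' is escaped on its own.
    private static final String SPECIALS = "+-&|!(){}[]^\"~*?:\\";

    /** Returns the input with a backslash in front of every special character. */
    public static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() * 2);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints \(1\+1\)\:2
        System.out.println(escape("(1+1):2"));
    }
}
```

Text without special characters, such as star wars, passes through unchanged.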


4. Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible


At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of the file format. Text from PDFs, HTML, Microsoft Word, and OpenDocument documents, as well as many others (except images), can all be indexed as long as their textual information can be extracted.
Index --> sequence of documents (stored in a Directory)
Document --> sequence of fields
Field --> named sequence of terms
Term --> a text string (e.g., a word)
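
To make the Index / Document / Field / Term hierarchy concrete, here is a toy model in plain Java. ToyIndex and its methods are hypothetical names invented for this sketch. It lower-cases and splits on whitespace instead of using an Analyzer, and it scans every document on search, which is not how Lucene searches (Lucene looks terms up in an inverted index). It is only meant to show how the four levels nest.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ToyIndex {
    // Index: a sequence of documents. Document: field name -> sequence of terms.
    private final List<Map<String, List<String>>> documents = new ArrayList<>();

    /** Adds one document given as name/value pairs; returns its position in the index. */
    public int addDocument(String... nameValuePairs) {
        if (nameValuePairs.length % 2 != 0) {
            throw new IllegalArgumentException("expected name/value pairs");
        }
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (int i = 0; i < nameValuePairs.length; i += 2) {
            List<String> terms = new ArrayList<>();
            // Crude stand-in for analysis: lower-case and split on whitespace
            for (String t : nameValuePairs[i + 1].toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!t.isEmpty()) {
                    terms.add(t);
                }
            }
            doc.put(nameValuePairs[i], terms);
        }
        documents.add(doc);
        return documents.size() - 1;
    }

    /** Returns the positions of all documents whose given field contains the term. */
    public List<Integer> search(String field, String term) {
        List<Integer> hits = new ArrayList<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        for (int id = 0; id < documents.size(); id++) {
            List<String> terms = documents.get(id).get(field);
            if (terms != null && terms.contains(wanted)) {
                hits.add(id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("title", "Star Wars", "contents", "a long time ago");
        index.addDocument("title", "Star Trek", "contents", "space the final frontier");
        System.out.println(index.search("title", "star")); // prints [0, 1]
        System.out.println(index.search("title", "wars")); // prints [0]
    }
}
```
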
Terms:
A search query is broken up into terms and operators. There are two types of terms: Single Terms and Phrases. A Single Term is a single word such as "test" or "hello". A Phrase is a group of words surrounded by double quotes such as "hello dolly". Multiple terms can be combined together with Boolean operators to form a more complex query.

Fields:
When performing a search you can either specify a field, or use the default field. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
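
The field rule can be sketched as a small helper that splits a single clause into its field and term, falling back to the default field when no field name is given. FieldClause is a hypothetical class written for this post, not part of Lucene. It handles only one plain clause and ignores quoted phrases, escaped colons and operators, all of which Lucene's QueryParser takes care of.

```java
public class FieldClause {
    public final String field;
    public final String term;

    private FieldClause(String field, String term) {
        this.field = field;
        this.term = term;
    }

    /** Splits "field:term" at the first colon; uses defaultField when there is no field prefix. */
    public static FieldClause parse(String clause, String defaultField) {
        int colon = clause.indexOf(':');
        // No usable "field:" prefix, so the whole clause is a term for the default field
        if (colon <= 0 || colon == clause.length() - 1) {
            return new FieldClause(defaultField, clause);
        }
        return new FieldClause(clause.substring(0, colon), clause.substring(colon + 1));
    }

    public static void main(String[] args) {
        FieldClause a = parse("title:star", "contents");
        FieldClause b = parse("star", "contents");
        System.out.println(a.field + " / " + a.term); // prints title / star
        System.out.println(b.field + " / " + b.term); // prints contents / star
    }
}
```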