Document Full Text Search - FAQ


 

What does the document full text search feature do?

The Document Full Text search feature allows you to search for documents in a Sohodox DB based on their content. The Full Text Search feature works by extracting text from documents that you add to a Sohodox DB and then indexing the text. The text can be automatically extracted in the background after you add a document. Otherwise the text extraction and indexing can be performed manually later.

 

Since text extraction happens in the background, the process continues even when you close Sohodox. To stop text extraction...

 

Open Control Panel > Administrative Tools > Services. Select ITAZ Sohodox Indexing Services under the Name column. Right-click the entry and select the Stop option.


Why is it useful?

Without the full text search feature you can find documents either...

using the indexing information that you have stored along with each document, or...
using the properties of the document (e.g. file name, file size, file type).

Enabling the full text search provides you with a third method for quickly finding documents.


For what file types does the document full text search feature work?

Depending on the file type (i.e. file format), text extraction from documents is now done using OCR, built-in text extractors or IFilters installed on the user's machine.

For example, for TIFF, JPG, PNG and other image file types, Sohodox uses its built-in OCR engine to extract text. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine). Note: Starting with MS Office 2010, Microsoft no longer ships MS Office Document Imaging with MS Office.

Sohodox uses its built-in text extractor for MS Word (DOC, DOCX), MS Excel (XLS, XLSX) and PDF files (PDF files which contain text and not only scanned images).

For other file types, Sohodox uses IFilters installed on your machine to extract text.

PDF files are handled a little differently. PDF files created by Sohodox contain scanned images. So Sohodox extracts text from them using OCR. For all other PDF files, Sohodox first uses its built-in text extractor and if that does not return any text, Sohodox tries OCR to extract text from the PDF file.

IFilters act as plug-ins and are a part of Microsoft Indexing Service (they are also used by Windows Desktop Search). Using the IFilter mechanism improves the accuracy and performance of text extraction in Sohodox.

For Sohodox to be able to extract text from a file of a particular format, an IFilter for that file format must be installed on the user's machine.
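The selection rules described above can be summarized in a short Python sketch. This is only an illustration of the decision order as this FAQ describes it, not Sohodox's actual code: the function name, the `created_by_sohodox` flag and the exact sets of file extensions are assumptions made for the example.

```python
from pathlib import Path

# Illustrative extension sets; the FAQ also mentions "other image file types".
IMAGE_TYPES = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}
BUILT_IN_TYPES = {".doc", ".docx", ".xls", ".xlsx"}


def extraction_methods(file_name, created_by_sohodox=False):
    """Return the extraction methods to try, in order, for a given file."""
    ext = Path(file_name).suffix.lower()
    if ext in IMAGE_TYPES:
        return ["ocr"]
    if ext in BUILT_IN_TYPES:
        return ["built-in"]
    if ext == ".pdf":
        # PDFs created by Sohodox contain scanned images, so OCR is used.
        if created_by_sohodox:
            return ["ocr"]
        # Other PDFs: built-in extractor first, then OCR if no text came back.
        return ["built-in", "ocr"]
    # Everything else depends on an IFilter installed on the machine.
    return ["ifilter"]
```

For instance, `extraction_methods("slides.ppt")` yields `["ifilter"]`, which is why a PPT file is only searchable by content when a matching IFilter is installed.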

IFilters for the following file formats are installed by default on Windows 2000/XP/2003/Vista machines...

PPT (Microsoft PowerPoint presentation)
DOC (Microsoft Word document) - By default Sohodox does not use this because it uses its built-in extractor for MS Word files.
XLS (Microsoft Excel spreadsheet) - By default Sohodox does not use this because it uses its built-in extractor for MS Excel files.
HTML documents
TXT documents

 

You can also install third-party IFilters to enable Sohodox to extract text from other file types.

More information and download links for various IFilters (both free and commercial) are available at...


Why aren’t all IFilters automatically installed along with Sohodox?

Although some IFilters are available for free, we cannot ship them with Sohodox as they are published by different companies. You will find download links for available IFilters (both free and commercial) at…
http://www.ifilter.org/Links.htm


Is OCR available in Sohodox?

Yes, OCR is available in Sohodox. You can use the built-in OCR engine to extract text from TIFF, JPG, PNG and other image file types. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office 2007 or earlier with MS Office Document Imaging installed on the machine).


What is the Use built-in OCR engine setting?

The Use built-in OCR engine option allows you to use the built-in engine to OCR your documents.


What is the Use Microsoft OCR engine setting?

The Use Microsoft OCR engine option allows you to use the Microsoft OCR engine to OCR your documents. To use the Microsoft Office OCR engine, you need to have MS Office Document Imaging installed on the system (this is available if you have MS Office 2007 or earlier installed; it is not available with MS Office 2010).


How can I stop background text extraction on a machine?

Background text extraction only happens on the machine on which Sohodox has been installed in server mode (a single user installation of Sohodox is always installed in server mode). On this machine, the extraction of text from newly added documents continues in the background even when Sohodox itself is not running. To stop background text extraction...

Open Control Panel > Administrative Tools > Services. Select ITAZ Sohodox Indexing Service under the Name column. Right-click the entry and select the Stop option.
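If you prefer to script this step, the same service can usually be stopped with the standard Windows `net stop` command. The Python sketch below is an illustration, not part of Sohodox: it assumes the name shown in the Services list is exactly "ITAZ Sohodox Indexing Service" and that the script runs on Windows with administrator rights. The function names are hypothetical.

```python
import subprocess
import sys

# Assumption: this matches the entry shown under the Name column in Services.
SERVICE_NAME = "ITAZ Sohodox Indexing Service"


def build_stop_command(service_name=SERVICE_NAME):
    """Return the 'net stop' command line for the given service name."""
    return ["net", "stop", service_name]


def stop_indexing_service(service_name=SERVICE_NAME):
    """Try to stop the service; return True if 'net stop' exited successfully."""
    if sys.platform != "win32":
        raise RuntimeError("This sketch only applies to Windows machines.")
    result = subprocess.run(
        build_stop_command(service_name), capture_output=True, text=True
    )
    return result.returncode == 0

# Usage (on the server machine, from an elevated prompt):
#     stop_indexing_service()
```

Stopping the service this way has the same effect as the Stop option in the Services list.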

 


Why does Sohodox not extract text from my document?

Sohodox uses different methods depending on the file type (i.e. file format) to extract text from documents.

For example, for TIFF, JPG, PNG and other image file types, Sohodox uses its built-in OCR engine to extract text. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine).

For MS Word and MS Excel files, Sohodox uses its built-in text extractor. For other file types such as .TXT and .HTM, Sohodox uses IFilters installed on your machine to extract text.

PDF files are handled a little differently. PDF files created by Sohodox contain scanned images. So Sohodox extracts text from them using OCR. For all other PDF files, Sohodox first uses its built-in text extractor and if that does not return any text, Sohodox tries OCR to extract text from the PDF file.


When I search for some text, documents (which I am sure contain that text) are not listed in the search results.

For the full text search feature to work, the text must first be extracted from the document. Depending on the file type (i.e. file format), text extraction from documents is done using OCR, built-in text extractors or IFilters installed on the user's machine.

 

The reason for this could be that the IFilter for that particular file format is not installed on the machine. For Sohodox to be able to extract text from a file of a particular format, the IFilter for that file format must be installed on the machine.

 

It could also be that the file for which text extraction is failing is password protected.

 

Another reason could be that the document is larger than the size specified in the Maximum size of documents to extract text from option.
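The three possible causes above can be written as a small checklist in Python. This is a hypothetical helper for reasoning about a missing search result, not a Sohodox function: you supply the facts about the file yourself, and the sketch assumes the 1 MB default limit corresponds to 1,048,576 bytes.

```python
# Assumption: the default limit of 1 MB is taken to mean 1024 * 1024 bytes.
DEFAULT_MAX_BYTES = 1024 * 1024


def possible_reasons(size_bytes, ifilter_installed, password_protected,
                     max_bytes=DEFAULT_MAX_BYTES):
    """List the reasons from this FAQ that could explain missing text."""
    reasons = []
    if not ifilter_installed:
        reasons.append("no_ifilter")
    if password_protected:
        reasons.append("password_protected")
    if size_bytes > max_bytes:
        reasons.append("too_large")
    return reasons
```

For example, a 2 MB file with an IFilter installed and no password gives `["too_large"]`, which points to the Maximum size of documents to extract text from option.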


Will Sohodox complain if it cannot extract text from a particular document?

No. Sohodox attempts to find the IFilter for a document and proceeds without complaining (and without extracting text) if the IFilter for a particular file cannot be found on the machine.


What is the Maximum size of documents to extract text from option?

Specify the maximum file size that should be indexed in this box. By default, the limit is set to 1 MB, which means that files larger than 1 MB will not be indexed. For slower machines, a lower value is recommended. A larger value affects the performance of the MS Access DB. This option is useful in a multi-user scenario, where you can disable extraction and indexing of text for large files on slow machines without disabling full text search.


How do I use the document full text search feature to search for documents?

To search for documents using the document full text search feature...

In Sohodox, select Workspace > All Documents in the Navigation pane. The documents will be displayed in the List View pane.
Click the Double Arrow button to bring up the Advanced Search pane.
Select the Document Text option from the Field Name drop down, to search for text in the document.
Select the appropriate comparison operator (e.g. contains, begins with, equal to) from the Comparison drop down. For example, to search for text beginning with specific letters, use the "begins with" operator in your query condition.
Enter the value which will be used for comparison in the Compare To box.
You can add more criteria to your search by clicking this button. To remove a criterion, click this button.
To get results which match all the criteria you have specified, select the Match all conditions option from the Conditions drop down. To get results which match any of the criteria, select the Match any conditions option from the Conditions drop down.
Click the Search button to begin the search. The search results will be displayed in the List View pane.

If you choose does not contain from the Comparison drop-down list, the search returns all documents which do not contain the text you have specified.
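To illustrate how the comparison operators and the Match all conditions / Match any conditions options combine, here is a minimal Python sketch. It is not how Sohodox implements its search: the function name is hypothetical, and case-insensitive matching is an assumption made for the example.

```python
# Operator names follow the Comparison drop down described above.
OPERATORS = {
    "contains": lambda text, value: value in text,
    "does not contain": lambda text, value: value not in text,
    "begins with": lambda text, value: text.startswith(value),
    "equal to": lambda text, value: text == value,
}


def document_matches(text, conditions, match_all=True):
    """Check extracted document text against (operator, value) conditions.

    match_all=True  behaves like "Match all conditions";
    match_all=False behaves like "Match any conditions".
    """
    text = text.lower()  # assumption: comparisons ignore case
    results = (OPERATORS[op](text, value.lower()) for op, value in conditions)
    return all(results) if match_all else any(results)
```

With the text "Invoice 2021 for ACME", the conditions `[("contains", "acme"), ("contains", "xyz")]` fail under Match all conditions but succeed under Match any conditions.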

 


Related Topics
Extract Text from Document
Search for text in a document

View the Extracted Text of the Document

 

 

 


Page URL: http://www.sohodox.com/docs/help/index.htm?document_full_text_search_faq.htm