Document Full Text Search - FAQ
The Document Full Text Search feature allows you to search for documents in a Sohodox DB based on their content. It works by extracting text from the documents that you add to a Sohodox DB and then indexing that text. The text can be extracted automatically in the background after you add a document. Otherwise, text extraction and indexing can be performed manually later.
Since text extraction happens in the background, the process continues even when you close Sohodox. To stop text extraction...
Open Control Panel > Administrative Tools > Services. Select ITAZ Sohodox Indexing Service under the Name column. Right-click the entry and select the Stop option.
Without the full text search feature you can find documents either...
Enabling full text search gives you a third method for quickly finding documents.
Depending on the file type (i.e. file format), text extraction from documents is now done using OCR, built-in text extractors and IFilters installed on the user's machine.
For example, for TIFF, JPG, PNG and other image file types, Sohodox uses its built-in OCR engine to extract text. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine). Note: Starting with MS Office 2010, Microsoft no longer ships MS Office Document Imaging with MS Office.
Sohodox uses its built-in text extractor for MS Word (DOC, DOCX), MS Excel (XLS, XLSX) and PDF files (PDF files which contain text and not only scanned images). For other file types, Sohodox uses IFilters installed on your machine to extract text.
PDF files are handled a little differently. PDF files created by Sohodox contain scanned images, so Sohodox extracts text from them using OCR. For all other PDF files, Sohodox first uses its built-in text extractor and, if that does not return any text, tries OCR to extract text from the PDF file.
IFilters act as plug-ins and are a part of Microsoft Indexing Service (they are also used by Windows Desktop Search). Using the IFilter mechanism improves the accuracy and performance of text extraction in Sohodox. For Sohodox to be able to extract text from a file of a particular format, an IFilter for that file format must be installed on the user's machine. IFilters for the following file formats are installed by default on Windows 2000/XP/2003/Vista machines...
You can also install third-party IFilters to enable Sohodox to extract text from other file types, e.g.:
More information and download links for various IFilters (both free and commercial) are available at...
Although some IFilters are available for free, we cannot ship them with Sohodox as they are published by different companies. You will find download links for available IFilters (both free and commercial) at…
Yes, OCR is available in Sohodox. You can use the built-in OCR engine to extract text from TIFF, JPG, PNG and other image file types. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office 2007 or earlier with MS Office Document Imaging installed on the machine).
The Use built-in OCR engine option allows you to use the built-in engine to OCR your documents.
The Use Microsoft OCR engine option allows you to use the Microsoft Office OCR engine to OCR your documents. To use it, you need to have MS Office Document Imaging installed on the system (this is available if you have MS Office 2007 or earlier installed; it is not available with MS Office 2010).
Background text extraction only happens on the machine on which Sohodox has been installed in server mode (a single user installation of Sohodox is always installed in server mode). On this machine, the extraction of text from newly added documents continues in the background even when Sohodox itself is not running. To stop background text extraction...
Open Control Panel > Administrative Tools > Services. Select ITAZ Sohodox Indexing Service under the Name column. Right-click the entry and select the Stop option.
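The same stop action can also be scripted. The following is a minimal Python sketch, not an official Sohodox tool. It assumes the service's display name is exactly "ITAZ Sohodox Indexing Service" as shown in the Services list, and it relies on the standard Windows `net stop` command, which accepts a display name. The function names are hypothetical.

```python
import subprocess
import sys

# Assumption: this is the display name shown under the Name column in Services.
SERVICE_DISPLAY_NAME = "ITAZ Sohodox Indexing Service"


def build_stop_command(display_name: str = SERVICE_DISPLAY_NAME) -> list[str]:
    """Return the command line that asks Windows to stop the named service."""
    # "net stop" accepts a service's display name as well as its key name.
    return ["net", "stop", display_name]


def stop_indexing_service() -> int:
    """Run the stop command and return its exit code (Windows only)."""
    if sys.platform != "win32":
        raise OSError("Stopping a Windows service requires Windows.")
    # The command must be run from an administrator (elevated) prompt to succeed.
    return subprocess.run(build_stop_command(), check=False).returncode
```

Calling `stop_indexing_service()` from an elevated prompt should have the same effect as the Stop option in the Services list. A non-zero return value means Windows refused or could not find the service.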
Sohodox uses two different methods to extract text from documents, depending on the file type (i.e. file format). For example, for TIFF, JPG, PNG and other image file types, Sohodox uses its built-in OCR engine to extract text. You can configure Sohodox to use the faster Microsoft Office OCR engine if it is installed (this is available if you have MS Office Document Imaging installed on the machine). For file types such as .DOC, .XLS, .TXT and .HTM, Sohodox uses IFilters installed on your machine to extract text.
PDF files are handled a little differently. PDF files created by Sohodox contain scanned images, so Sohodox extracts text from them using OCR. For all other PDF files, Sohodox first uses its built-in text extractor and, if that does not return any text, tries OCR to extract text from the PDF file.
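The selection logic described in this answer can be summarized in a short Python sketch. This is only an illustration of the rules as stated here, not Sohodox's actual code. The function name, the method labels and the exact set of image extensions are assumptions made for the example.

```python
from pathlib import Path

# Assumption: a representative set of image extensions; the text only names
# TIFF, JPG, PNG "and other image file types".
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".png"}


def extraction_methods(filename: str, created_by_sohodox: bool = False) -> list[str]:
    """Return the extraction methods to try, in order, for a given file."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        # Image files go straight to OCR.
        return ["ocr"]
    if ext == ".pdf":
        if created_by_sohodox:
            # PDFs created by Sohodox contain scanned images, so OCR only.
            return ["ocr"]
        # Other PDFs: built-in extractor first, OCR if it returns no text.
        return ["built_in_extractor", "ocr"]
    # Everything else depends on an IFilter installed on the machine.
    return ["ifilter"]
```

For example, `extraction_methods("invoice.pdf")` lists the built-in extractor followed by OCR, which mirrors the fallback order for PDF files that were not created by Sohodox.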
For the full text search feature to work, text must first be extracted from the document. Depending on the file type (i.e. file format), text extraction is done using OCR and IFilters installed on the user's machine.
The reason for this could be that the IFilter for that particular file format is not installed on the machine. For Sohodox to be able to extract text from a file of a particular format, the IFilter for that file format must be installed on the machine.
It could also be that the file for which text extraction is failing is password protected.
Another reason could be that the document is larger than the size specified in the Maximum size of documents to extract text from option.
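The three possible causes above can be treated as a checklist. The following Python sketch is a hypothetical troubleshooting aid, not part of Sohodox. The `DocumentInfo` type, the function name and the way installed IFilters are represented (a set of file extensions) are all assumptions. The IFilter check only makes sense for file types that Sohodox handles through IFilters.

```python
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    """Hypothetical description of a document whose text was not extracted."""
    extension: str          # e.g. ".txt"
    size_bytes: int
    password_protected: bool


def extraction_problems(doc: DocumentInfo,
                        installed_ifilters: set[str],
                        max_size_bytes: int) -> list[str]:
    """Return every reason from the checklist that applies to the document."""
    problems = []
    if doc.extension.lower() not in installed_ifilters:
        problems.append("no IFilter installed for this file format")
    if doc.password_protected:
        problems.append("file is password protected")
    if doc.size_bytes > max_size_bytes:
        problems.append("file exceeds the maximum size for text extraction")
    return problems
```

An empty result means none of the three listed causes applies, so the problem lies elsewhere.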
No. Sohodox attempts to find the IFilter for a document. If the IFilter for a particular file cannot be found on the machine, Sohodox proceeds without reporting an error (and without extracting text).
Specify the maximum size of files that should be indexed in this box. By default, the file size limit is set to 1 MB. This means that files larger than 1 MB will not be indexed. For slower machines, a lower value is recommended. A larger value affects the performance of the MS Access DB. This option is useful in a multi-user scenario, where you can disable extraction and indexing of text for large files on slow machines without disabling full text search.
To search for documents using the document full text search feature... In Sohodox, select Workspace > All Documents in the Navigation pane. The documents will be displayed in the List View pane. If you had chosen does not contain from the Comparison drop-down list, the search would have returned all documents that do not contain the text you specified.
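The difference between the two comparison choices can be illustrated with a small Python sketch. It is a simplified model, not how Sohodox implements its search: the function name and the dictionary of extracted text are hypothetical, and the case-insensitive matching is an assumption.

```python
def search_documents(extracted_text: dict[str, str],
                     query: str,
                     comparison: str = "contains") -> list[str]:
    """Return the names of documents whose extracted text matches the comparison."""
    needle = query.lower()  # assumption: matching ignores case
    if comparison == "contains":
        return [name for name, text in extracted_text.items()
                if needle in text.lower()]
    if comparison == "does not contain":
        return [name for name, text in extracted_text.items()
                if needle not in text.lower()]
    raise ValueError(f"Unsupported comparison: {comparison!r}")
```

With the same search text, the two comparisons return complementary sets: every document appears in exactly one of the two results.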
Related Topics
Extract Text from Document
Search for text in a document
View the Extracted Text of the Document
Page URL: http://www.sohodox.com/docs/help/index.htm?document_full_text_search_faq.htm