DocFetcher Wiki

Desktop search application

Brought to you by: qforce

FAQ

Can you please add feature XY?

No new features are planned for DocFetcher, only bugfixes. Development continues in DocFetcher Pro.

How can I donate to the DocFetcher project?

Buying a copy of DocFetcher Pro, the commercial big brother of DocFetcher, is equivalent to making a donation, plus you get a bunch of new features.

If you don't need those features and/or DocFetcher Pro costs more than you're willing to donate, you can "buy" the otherwise free demo of DocFetcher Pro for a price of your choosing.

I can't start DocFetcher at all. What do I do?

If you experienced problems with the installed version of DocFetcher, consider using the portable version instead (download page). The latter runs on all supported platforms and does not try to detect or download Java runtimes. However, note that the portable version has to be put in a location where you have write permissions. The reason for this is that on the first start, the program figures out what operating system it's running on and whether the operating system is 32-bit or 64-bit. It then tries to unpack the right library files into a subfolder under its own folder, and this will fail without write permissions.

In some cases where DocFetcher doesn't start, the solution is to uninstall all currently installed Java runtimes and then reinstall the latest Java runtime from the Java website. On that website, be sure to pick either the 32-bit or the 64-bit Java runtime, depending on whether your operating system is 32-bit or 64-bit.

Another potential problem: Running DocFetcher with a memory setting of more than 1 GB requires a 64-bit Java runtime. It will not work with 32-bit.

On some systems, the embedded web browser that is used for displaying the manual and HTML files can crash the entire program. As a workaround, disable the embedded web browser by modifying DocFetcher's settings file. Look for the settings file in one of the following locations:

If DocFetcher was installed on Windows:
- Windows 2000/XP: C:\Documents and Settings\<UserName>\Application Data\DocFetcher\conf\settings-conf.txt
- Windows Vista and later: C:\Users\<UserName>\AppData\Roaming\DocFetcher\conf\settings-conf.txt
Portable DocFetcher: DocFetcher\conf\settings-conf.txt
OS X Application Bundle: /Users/<UserName>/.docfetcher/conf/settings-conf.txt

If the settings file doesn't exist at the expected location, create a new, empty text file there named settings-conf.txt. Now, first close DocFetcher, then open the settings file in a text editor and set ShowManualOnStartup = false and PreferHtmlPreview = false in it. While you're at it, you may also set HotkeyEnabled = false to disable the global hotkey. Save and close the file, then try to start DocFetcher.
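If you'd rather script the edit than do it by hand, here's a minimal Python sketch. It assumes the settings file consists of simple "Key = value" lines, which is inferred from the instructions above rather than from any format specification; SomeOtherSetting is a made-up entry for illustration. The function only transforms text, so reading and writing settings-conf.txt is left to you.

```python
def set_settings(text, updates):
    """Return settings text with every key in `updates` set to its new value.

    Assumes one "Key = value" entry per line (an assumption, not a spec).
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Replace the first occurrence of a known key.
            out.append(f"{key} = {remaining.pop(key)}")
        else:
            out.append(line)
    # Keys that were not present in the file are appended at the end.
    out.extend(f"{k} = {v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# The three settings mentioned in the FAQ entry above.
WORKAROUND = {
    "ShowManualOnStartup": "false",
    "PreferHtmlPreview": "false",
    "HotkeyEnabled": "false",
}

if __name__ == "__main__":
    # "SomeOtherSetting" is a hypothetical entry, used only for illustration.
    sample = "ShowManualOnStartup = true\nSomeOtherSetting = 42\n"
    print(set_settings(sample, WORKAROUND))
```

Remember to close DocFetcher before changing the file.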

Some users reported startup issues caused by faulty NVIDIA drivers, version 378.xx. See this thread.

If none of the above helps, try launching DocFetcher via one of the alternative launchers:

Windows: In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Move this file one level up into the DocFetcher folder. Then open a command prompt and use the cd command to navigate to the DocFetcher folder, like so: cd C:\Program Files (x86)\DocFetcher. Then try launching DocFetcher from the command prompt by entering DocFetcher.bat and pressing Enter. If DocFetcher doesn't start, then chances are an error message will be printed in the command prompt. Post this error message on the DocFetcher forum.
Linux: First, make sure the executable flag has been set on the DocFetcher.sh launcher. If that doesn't help, do the following: Open a terminal and use the cd command to navigate to the DocFetcher folder. Then try launching DocFetcher from the terminal by running ./DocFetcher.sh. If DocFetcher doesn't start, then chances are an error message will be printed in the terminal. Post this error message on the DocFetcher forum.
OS X: On OS X, DocFetcher is started via a shell script located inside the DocFetcher application bundle. If you're using portable DocFetcher, the application bundle is just the "DocFetcher" entry inside the DocFetcher folder that looks like an executable but is actually a folder with the extension ".app". Open a terminal and use the cd command to navigate to the folder Contents/MacOS inside the application bundle, then launch the DocFetcher script from there. If it doesn't start, an error message might be printed in the terminal. Post this error message on the DocFetcher forum.

I can't start the PortableApps version of DocFetcher! Help!

If double-clicking the DocFetcherPortable.exe launcher gives you an error message saying something along the lines of "DocFetcher requires a Java Runtime Environment", you need to make sure jPortable is installed in the right location.

There are actually two versions of jPortable, called jPortable and jPortable 64. The former runs on both 32-bit and 64-bit operating systems, and the latter only on 64-bit operating systems. However, in the 32-bit version, DocFetcher's so-called memory limit cannot be set higher than about 1 GB. This could result in out-of-memory errors when trying to index large files. In the 64-bit version, the maximum value for the memory limit is much higher.

The main problem with installing jPortable or jPortable 64 is that their installers give no clue about where they need to be installed to make DocFetcher work. There are basically two rules you need to follow:

The CommonFiles folder needs to be placed beside the DocFetcherPortable folder, not inside it or anywhere else. For example, if you installed the PortableApps version of DocFetcher into X:\, then the CommonFiles folder should also be in X:\, like so:
X:\DocFetcherPortable
X:\CommonFiles
There must be a Java folder inside the CommonFiles folder, containing the Java Runtime to run DocFetcher on. Importantly, if you go with jPortable 64, the folder must still be named Java, and not Java64, unlike what the jPortable 64 installer's default path would suggest.
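The two rules above can be checked with a few lines of Python. This is only a sketch of the folder layout described here; the function name is made up, and it does not verify that the Java folder actually contains a working Java runtime.

```python
from pathlib import Path


def check_jportable_layout(docfetcher_portable):
    """Return a list of layout problems (empty if none were found).

    `docfetcher_portable` is the path to the DocFetcherPortable folder.
    """
    problems = []
    # Rule 1: CommonFiles must sit beside DocFetcherPortable.
    common_files = Path(docfetcher_portable).parent / "CommonFiles"
    if not common_files.is_dir():
        problems.append(f"Missing folder: {common_files}")
    # Rule 2: CommonFiles must contain a folder named exactly "Java".
    elif not (common_files / "Java").is_dir():
        problems.append(f"Missing folder: {common_files / 'Java'}")
        if (common_files / "Java64").is_dir():
            problems.append("Found a Java64 folder; it must be named Java")
    return problems


if __name__ == "__main__":
    for problem in check_jportable_layout(r"X:\DocFetcherPortable"):
        print(problem)
```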

On Windows 2000, DocFetcher crashes when I try to start it, what do I do?

In the crash report, there's probably a line that says "SWTException: Unable to load graphics library [GDI+ is required]". This indicates that you need to install a package called GDI+ for supporting advanced graphics operations. Here's where you can download GDI+: http://www.microsoft.com/en-us/download/details.aspx?id=18909

How is the score for each document calculated?

Short answer: It's complicated. Last time I checked it had something to do with vector spaces and stuff. For further information, have a look at the scoring page of Lucene (DocFetcher's underlying search engine) and the Wikipedia article about the Vector Space Model.

Here's an extremely simplified explanation of how the scoring works: Suppose you have two files file1.doc and file2.doc with the following contents:

file1.doc contains the word "dog" 10 times, and nothing else
file2.doc contains 100 words, 20 of which are "dog"

Now, if you search for "dog", both files will show up in the results, but file1.doc gets a higher score because 10/10=100%, and 20/100=20%. This illustrates the basic idea: Dividing the number of hits by the word count gives you a measure of how "relevant" a document is with respect to your query. Why is that so? Because:

the higher the number of hits, the higher the relevance of the document
the higher the number of words that aren't hits, the lower the relevance of the document

Occasionally, you'll see score values greater than 100%. This is because the actual formula used is much more complicated, and the calculated score is not really a percentage, but a fraction greater than or equal to 0.
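For the curious, the simplified explanation above boils down to the following Python snippet. To be clear, this is only the toy "hits divided by word count" idea, not Lucene's actual scoring formula; the function name and the filler words are made up.

```python
def naive_score(words, query):
    """Toy relevance: fraction of the words that equal the query term."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word.lower() == query.lower())
    return hits / len(words)


# file1.doc: the word "dog" 10 times, and nothing else.
file1 = ["dog"] * 10
# file2.doc: 100 words, 20 of which are "dog".
file2 = ["dog"] * 20 + ["filler"] * 80

if __name__ == "__main__":
    print(naive_score(file1, "dog"))  # 1.0, i.e. 100%
    print(naive_score(file2, "dog"))  # 0.2, i.e. 20%
```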

A hit count column on the result table would be nice. How about it?

Some people have asked for a column on the result table that displays for each file the “hit count”, i.e., the number of occurrences of the query string in the file. This information is currently only displayed for the selected file in a small box at the top of the preview pane.

There are currently no plans to implement a hit count column, due to performance reasons:

Searching in DocFetcher is very fast because it runs entirely on the index; the indexed files are not accessed at all.
It's not possible to store a per-file hit count in the index, as such a hit count would be different for every possible query string.
Thus, a per-file hit count can only be determined by loading and reparsing the relevant files.
Currently, when a file is selected in the result table, DocFetcher will load and reparse that file in order to show its text in the preview pane and to determine its hit count.
Displaying a hit count column in the result table would require loading and reparsing all files in the result table, which would come with a huge performance hit. Effectively, it's almost as if DocFetcher had to reindex every file in the results.

DocFetcher can't always find numeric strings like 20090614_P6140036.jpg.

This is not really a bug, but a consequence of the fact that DocFetcher splits documents into individual words during indexing, a.k.a. tokenization. This is done in order to build a dictionary (i.e. the index), which DocFetcher then uses to do quick searches. In general, DocFetcher works best with natural language, but not quite as well with text containing digits or special characters.
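To get a feel for why this happens, here's a toy Python tokenizer and index. Big caveat: the splitting rule below (runs of letters and digits) is an assumption chosen for illustration; it is not the rule that DocFetcher's analyzer actually applies, and the file name a.txt is made up. The point is only that an index stores the pieces, so a query has to line up with the pieces.

```python
import re


def toy_tokenize(text):
    """Split text into lowercase runs of letters and digits (toy rule)."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs):
    """Map each token to the set of document names that contain it."""
    index = {}
    for name, text in docs.items():
        for token in toy_tokenize(text):
            index.setdefault(token, set()).add(name)
    return index


if __name__ == "__main__":
    print(toy_tokenize("20090614_P6140036.jpg"))
    # ['20090614', 'p6140036', 'jpg']
    index = build_index({"a.txt": "photo 20090614_P6140036.jpg"})
    # The full string is not a key in this toy index, but its pieces are.
    print("20090614_p6140036.jpg" in index)  # False
    print(index["p6140036"])                 # {'a.txt'}
```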

That being said, there's an Analyzer option in the Advanced Settings which allows you to switch to an alternative tokenization mechanism that works better with source code and other kinds of text not written in natural language.

Additionally, take a look at the Query Syntax section in the manual. Some of the concepts explained in there, e.g. wildcards and phrase searches, might help to work around the above issues.

How do I exclude ".svn" folders from indexing?

On the indexing dialog, use this regex exclusion pattern: .*/\.svn/.*

Note the usage of forward slashes to match against path separators (even on Windows!), and escaping the "." with a backward slash.

In addition, "Match Against" must be set to "Absolute path".
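If you want to try out the pattern before starting a long indexing run, here's a rough Python check. Two assumptions are baked in: that the pattern has to match the whole absolute path, and that backslashes are turned into forward slashes before matching. Also keep in mind that DocFetcher uses Java regexes, not Python's, although simple patterns like this one behave the same in both.

```python
import re

# The exclusion pattern from the FAQ entry above.
SVN_PATTERN = re.compile(r".*/\.svn/.*")


def is_excluded(absolute_path):
    """Return True if the path lies inside a .svn folder."""
    # Assumption: paths are matched with forward slashes, even on Windows.
    normalized = absolute_path.replace("\\", "/")
    return SVN_PATTERN.fullmatch(normalized) is not None


if __name__ == "__main__":
    print(is_excluded(r"C:\projects\app\.svn\entries"))  # True
    print(is_excluded(r"C:\projects\app\src\main.c"))    # False
    # Thanks to the escaped dot, a folder like "xsvn" is not excluded.
    print(is_excluded("/home/user/xsvn/file.txt"))       # False
```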

Why does DocFetcher not search folder names and file paths?

DocFetcher does not include folder names or file paths in the search, only filenames and file contents. That was a fundamental design decision made back when the core program was written. It may or may not have been a good decision, but the idea was that (1) most of the important stuff the user may want to search for is in the filename and file contents, (2) if searching for words matching some file path component brings up all files on that path, this will decrease the overall quality of search results, and (3) there are already a lot of programs to search filenames and folder names, such as Everything.

Note: DocFetcher Pro is capable of finding folders by name.

What can I do about all those errors during indexing?

DocFetcher uses third-party libraries to perform text extraction. For example, Apache POI is used for MS Office files, and Apache PDFBox for PDF files. Most of the errors that are shown during indexing come directly from the respective extraction libraries, without further translation by DocFetcher.

If DocFetcher gives an error on some file, there's usually not much one can do about it, except waiting for the developers of the respective library to release an update of their software, and then waiting for this update to be included in DocFetcher.

Certain errors can be circumvented as follows:

Re-saving files in an old format in a newer or an alternative format, e.g. re-saving old MS Office files in the newer MS Office 2007 format or with LibreOffice.
Treating unreadable files as plain text. See this bug report for further explanation.
Enabling mime-type detection for files on which DocFetcher fails because they have the wrong file extension. For example, if some of your .doc files aren't MS Word files, you can enable mime-type detection for .doc files by putting the pattern .*\.doc in the pattern table on the indexing configuration dialog and setting "Detect mime type" as the action to be performed.

If DocFetcher crashes during indexing, can I still use or salvage the partially created index?

If during indexing an error dialog pops up that says something like "Oops, this program just died", then the program has crashed and left the partially created index in a potentially broken state. There's no telling what will happen if you try to use this potentially broken index; it may or may not work correctly. There's no guaranteed way to salvage such an index.

If the crash occurred because the program ran out of memory, and you fixed the problem by raising the memory limit and/or adding RAM, then you still need to rebuild the index to be sure that there are no issues with it.

Where does DocFetcher put its index files? How can I change the location of the index files?

The location of the index files depends on the version of DocFetcher and the operating system:

If you're using portable DocFetcher, the index files will be in the indexes folder inside the DocFetcher folder.
If DocFetcher was installed on Windows (i.e. is non-portable), the index files can be found at the following places:
- Windows 2000/XP: C:\Documents and Settings\<UserName>\Application Data\DocFetcher
- Windows Vista and later: C:\Users\<UserName>\AppData\Roaming\DocFetcher
If you're using the OS X Application Bundle of DocFetcher, the index files can be found at: /Users/<UserName>/.docfetcher

For customizing the location of the index files, have a look at the file misc/paths.txt inside the DocFetcher folder.

I indexed files on an external drive. What happens if I unplug the drive and plug it back in later?

1. Unplugging the drive = search still works, but opening files and previewing won't.
2. Unplugging the drive and plugging it back in under the same drive letter = everything works.
3. Unplugging the drive and plugging it back in under a different drive letter = search still works, but opening files and previewing won't.
4. After (1) or (3), plugging the drive back in under the original drive letter = everything works again.

I want to index a huge amount of data (100 GB or more). Can DocFetcher do this?

In principle, DocFetcher should work with any amount of data. In practice, however, when dealing with a massive amount of data, there's a high risk that there are some problematic files in there that either cause DocFetcher to run out of memory or to crash altogether. The first often happens with large PDF files, and the second may happen with corrupt or otherwise unusual files. There are a few other potential problems as well. All in all, the following is recommended:

Allow DocFetcher to use as much memory as possible. Currently, the Windows version of DocFetcher ships with prebuilt launchers for memory limits of up to 8 GB. More than 1 GB of RAM requires a 64-bit Java runtime. See the section "How to raise the memory limit" in the program manual for more info.
Instead of trying to index all files in one go, better index the subfolders separately. That way you avoid wasting a lot of time with failed indexing if DocFetcher crashes on a file in one of the subfolders. Note also that on DocFetcher's indexing dialog there's a little "+" button at the top right that allows you to put multiple folders in an indexing queue.
DocFetcher may crash if the folder structure to be indexed is very deep (i.e., there are folders located deep down the folder hierarchy). You can prevent this crash by raising DocFetcher's so-called stack size. See the question dealing with the "Folder hierarchy is too deep" error for more info on this.
Make sure there's enough disk space for DocFetcher's index files. Now, there are no clear-cut rules as to how much disk space will be sufficient, since this depends on the type of data to be indexed. However, with 100 GB or more, it's probably best to have at least 1 GB of free disk space reserved for the index files. As to the exact location of the index files, see the question "Where does DocFetcher put its index files?" on this page.

I'm using portable DocFetcher. How can I avoid rebuilding my indexes when upgrading to a new version?

Copy the folders conf and indexes into the new DocFetcher folder. conf contains the program settings, while indexes contains the indexes.

I'm running the PortableApps.com version of DocFetcher. How do I increase the memory limit (a.k.a. heap size)?

In the DocFetcher folder, open the following file with a text editor: App\AppInfo\Launcher\DocFetcherPortable.ini

In that file, there's a line starting with CommandLineArguments=. This line contains various launch parameters, including an -Xmx parameter. To set the memory limit to 8 GB, for example, you can change the parameter to -Xmx8g. After editing, save and close the file, then restart the application.

Note that the application won't start if the chosen memory limit exceeds the amount of physical RAM available. Moreover, you need to take into account that obviously the OS and other processes also need some of that RAM. So, for example, if the computer has 8 GB of RAM, a reasonable memory limit to try would be 4 GB.
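The edit described above can also be expressed as a small Python function. This is just a sketch: the sample line in the usage part is made up, and your DocFetcherPortable.ini will contain other launch parameters besides -Xmx, which the function leaves untouched.

```python
import re


def set_heap_limit(ini_text, limit):
    """Replace the -Xmx value on the CommandLineArguments= line.

    `limit` is a value such as "8g" or "4g".
    """
    def patch(line):
        if line.startswith("CommandLineArguments="):
            # Only the -Xmx parameter is changed; everything else stays.
            return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{limit}", line)
        return line

    return "\n".join(patch(line) for line in ini_text.split("\n"))


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "[Launch]\nCommandLineArguments=-Xmx512m -Xss2m\n"
    print(set_heap_limit(sample, "8g"))
```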

I installed DocFetcher on Linux from a Snap package. How do I increase the memory limit (a.k.a. heap size)?

Open a terminal and launch DocFetcher in it via the command docfetcher. In the terminal, you will see a message pointing to the configuration file in which you can set the memory limit. The configuration file contains further instructions.

DocFetcher is causing constantly high CPU usage, what do I do?

The likely reason for high CPU usage is that (1) the operating system or some other program is constantly updating some files in one of the indexed folders, and (2) the option "Watch folders for file changes" was selected when the index was created, causing DocFetcher to frequently run index updates.

Accordingly, the workaround is to rebuild the affected index(es) with the option "Watch folders for file changes" turned off.

How can I search by filename only, ignoring the file contents?

You can use so-called field searches to search in filenames only. Example: filename:dog

For more info, see the section "Field Searches" on the "Query Syntax" page of the built-in program manual.

How can I search by file extensions not listed in the Document Types pane?

For instance, to list all files with the file extension .mm, enter this query: filename:*.mm

This syntax is explained under the section "Field Searches" on the "Query Syntax" page of the built-in program manual.

Note: DocFetcher Pro addresses this problem via a feature called Custom Types. In essence, it is a customizable version of the Document Types pane, allowing you to define your own file types to filter the search results by.

How can I index files that don't have a file extension?

To index files without file extension (i.e. without the dot in the filename), add the following rule in the pattern table on the indexing dialog:

Pattern (regex): [^\.]*
Match Against: Filename
Action: Detect mime type (slower)

This rule matches all files whose filenames do not contain a dot, and it will make DocFetcher recognize the matched files as plain text files.
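Here's a quick Python check of what the pattern matches, assuming the pattern has to match the entire filename. As before, DocFetcher itself uses Java regexes, but this particular pattern behaves identically in Python.

```python
import re

# The pattern from the rule above: any number of characters that are not dots.
NO_EXTENSION = re.compile(r"[^\.]*")


def has_no_extension(filename):
    """Return True if the filename contains no dot at all."""
    return NO_EXTENSION.fullmatch(filename) is not None


if __name__ == "__main__":
    print(has_no_extension("README"))     # True
    print(has_no_extension("notes.txt"))  # False
    # Dotfiles contain a dot, so the rule does not cover them.
    print(has_no_extension(".bashrc"))    # False
```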

Note: In DocFetcher Pro, files without file extension can be indexed by ticking the checkbox "Index files without file extension as text files" on the indexing dialog.

The indexing failed with a "Folder hierarchy is too deep" error message. What can I do?

The error message means the folder you're trying to index contains more levels of subfolders than DocFetcher can handle with its current settings. The workaround is to either move the subfolders around in order to reduce the maximum folder depth, or to change the settings. How the latter is done depends on your operating system:

Windows: Move the DocFetcher.bat file from the misc folder inside the DocFetcher folder one level up into the DocFetcher folder (important, otherwise the DocFetcher.bat won't run). Now open the DocFetcher.bat file in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m. From now on, always launch DocFetcher through the DocFetcher.bat.
Linux and OS X: Open the launch script DocFetcher.sh in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m.

FYI, the -Xss setting is the so-called "thread stack size" in megabytes that, among other things, limits the number of folder levels DocFetcher can handle.

Note: DocFetcher Pro is completely immune to the above problem. It can index folder hierarchies of any depth.

Why am I not allowed to create overlapping indexes?

If you index a certain folder, say C:\path\to\folder, and then try to index a subfolder of that folder, say C:\path\to\folder\subfolder, DocFetcher will refuse and complain that overlapping indexes aren't allowed. There are technical reasons for this:

If you create overlapping indexes, the same file may show up multiple times on the result list. Theoretically, DocFetcher could make an effort to remove duplicate results, but this would probably slow down searching.
If a certain folder occurs multiple times in the Search Scope pane, multiple checkboxes will be associated with it, so it may be possible for the folder to have conflicting check states, e.g. checked and unchecked at the same time. This would raise the question of whether DocFetcher should filter out results from that folder or not. And even if we set an arbitrary rule to resolve this conflict (e.g. show files under this folder if the folder is checked at least once), then this scenario would still complicate the underlying program logic.
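In case you're wondering what "overlapping" means exactly, here's a Python sketch of such a check for Windows paths. It only illustrates the concept; how DocFetcher detects overlaps internally may well differ.

```python
from pathlib import PureWindowsPath


def overlaps(existing_root, candidate_root):
    """Return True if one folder equals or contains the other."""
    a = PureWindowsPath(existing_root)
    b = PureWindowsPath(candidate_root)
    # `parents` lists all ancestors of a path, from nearest to the drive root.
    return a == b or a in b.parents or b in a.parents


if __name__ == "__main__":
    print(overlaps(r"C:\path\to\folder", r"C:\path\to\folder\subfolder"))  # True
    print(overlaps(r"C:\path\to\folder", r"C:\path\to\folder2"))           # False
```

Note that a plain string prefix comparison would wrongly treat C:\path\to\folder2 as being inside C:\path\to\folder, which is why the sketch compares path components instead.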

Why doesn't DocFetcher show a progress bar during indexing?

Showing the indexing progress would require determining in advance how many files need to be indexed, which is both complex and time-consuming due to a number of factors such as incremental indexing (i.e., skip all files that have already been indexed), error handling, unlimited nesting of archives, and file inclusion/exclusion rules. If there is a huge number of files to be indexed, DocFetcher could easily spend 10 minutes just determining what needs to be indexed, before getting to the actual indexing.

Some unicode characters in the preview pane aren't displayed correctly, how can I fix this?

This might be a problem with the specific fonts you are using. Try different font settings on DocFetcher's preferences dialog.

The preview pane is empty for some PDF files...

It's possible that the text in these PDF files exists only as scanned images, and is therefore not extractable. You can check this by opening the PDF files and trying to select the visible text. If you can't select the text, that means it's actually just an image. If this is indeed the problem, run your PDF files through OCR software.

Proximity search only works "forwards" in the preview pane...

To give an example of the problem:

Suppose you indexed a file containing the following text: The quick brown fox jumps over the lazy dog.
The proximity search "brown jumps"~10 will bring up the file and highlight the match correctly.
The proximity search "jumps brown"~10 will also bring up the file, but the match won't be highlighted.

This is a known limitation of the preview highlighting; the searching itself is not affected.

As a workaround, you can combine a proximity search and its reverse with an OR operator, like so: "dog cat"~10 OR "cat dog"~10. With this, you will get highlighting in both directions. Do note that you have to use the OR operator, not the AND operator; OR and AND behave totally differently.
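If you build queries programmatically, the workaround can be wrapped in a tiny Python helper. The function name is made up; all it does is assemble the query string described above.

```python
def bidirectional_proximity(term_a, term_b, distance):
    """Build a proximity query that matches the two terms in either order."""
    forward = f'"{term_a} {term_b}"~{distance}'
    backward = f'"{term_b} {term_a}"~{distance}'
    # OR is required here; see the note on OR vs. AND above.
    return f"{forward} OR {backward}"


if __name__ == "__main__":
    print(bidirectional_proximity("brown", "jumps", 10))
    # "brown jumps"~10 OR "jumps brown"~10
```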

After indexing my Outlook e-mails, I have difficulty starting Outlook. What's going on?

This appears to be an issue with the DocFetcher daemon on Windows. The daemon runs whenever DocFetcher isn't running, and seems to prevent Outlook from starting. As a workaround, rename the file docfetcher-daemon-windows.exe in the DocFetcher folder to prevent the daemon from starting, and then reboot Windows.

Note that disabling the daemon comes with the downside that you'll have to update your indexes by hand; otherwise your search results will be out of date after file changes.

For more information about what the daemon does, please see the first section in the DocFetcher manual.

How can I switch the user interface to a different language?

DocFetcher ships with translations of its user interface for a couple of languages. At program start, it will detect the language of your operating system and then either choose a matching translation if available or use the English default.

You can override this auto-detection and explicitly set the user interface language as follows. First, take a look at the contents of the lang folder in the DocFetcher program folder. It contains all available translations as files named Resource_XX.properties, where XX is a lowercase two-letter ISO-639-1 language code identifying the language. A complete list of ISO-639-1 language codes for all languages can be found here.

Now, to manually set the user interface language, you have to add a language parameter at the end of the launcher file, which depends on your operating system:

Windows: In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Open the file in a text editor, add the parameter described in the next section, then save and close the file. Importantly, move the DocFetcher.bat file one level up into the DocFetcher folder. From now on, always start DocFetcher by double-clicking the DocFetcher.bat file.
Linux: Open the file DocFetcher.sh in a text editor, add the parameter described in the next section, then save and close the file. Launch DocFetcher via the modified DocFetcher.sh file.
OS X: Follow the instructions given in this forum thread.

Now, regarding the language parameter that needs to be added: The last line of the launcher file starts with a "java" command which launches the DocFetcher process. It looks like this:

java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9

Modify this line like so:

java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib -Duser.language=XX net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9

Replace XX with the ISO-639-1 language code of the language you want DocFetcher to use.
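For those who prefer to script this change, here's a Python sketch that inserts the parameter right before the main class name in the "java" line shown above. The function name is made up, and the sketch assumes the launcher's last line looks like the one quoted on this page.

```python
MAIN_CLASS = "net.sourceforge.docfetcher.Main"


def add_language(java_line, lang_code):
    """Insert -Duser.language=<lang_code> before the main class name."""
    if "-Duser.language=" in java_line:
        # A language has already been set; leave the line alone.
        return java_line
    option = f"-Duser.language={lang_code}"
    return java_line.replace(MAIN_CLASS, f"{option} {MAIN_CLASS}", 1)


if __name__ == "__main__":
    line = (
        "java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% "
        "-Djava.library.path=lib net.sourceforge.docfetcher.Main "
        "%1 %2 %3 %4 %5 %6 %7 %8 %9"
    )
    print(add_language(line, "de"))
```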

Does DocFetcher have a web interface?

No, but the commercial software DocFetcher Server has a web interface.

What does the DocFetcher daemon actually do?

When DocFetcher isn't running, the daemon detects file changes in the indexed folders, and marks the corresponding indexes as "needs to be updated".

When DocFetcher is running, the daemon remains inactive, because then DocFetcher assumes the responsibility of detecting file changes, provided that the indexes have been created with the folder watching option enabled.

By itself, the daemon does not do any indexing, it only marks indexes as to be updated. When DocFetcher is started the next time, DocFetcher picks up the information left behind by the daemon and runs the required index updates.

Does DocFetcher support searching via command-line or some sort of search API?

In DocFetcher 1.1.20 and later versions, DocFetcher supports Python-based scripting. This can be used to programmatically execute searches and retrieve the results. For an example of how this is done, see the explanation at the top of the file search.py, which can be found in the DocFetcher program folder.

Alternatively, if you feel like tinkering with the DocFetcher source code, have a look at the [Source code] page for instructions on how to obtain the source code and build DocFetcher.

For Java-based indexing and searching in general, have a look at these Apache projects:

Why are the DocFetcher installer and the other packages so large (> 30 MB)?

This is mainly due to the fact that DocFetcher is shipped with lots of built-in text extraction libraries, some of which are quite big. The worst offenders are the libraries for MS Office and PDF files. However, the developers of these libraries aren't to blame here: The libraries have to be big because the respective file formats are immensely complex.

Why does DocFetcher require this god-awful thing called Java?

The word "Java" refers both to a platform for programs to run on, and to a programming language for writing such programs. Here's why DocFetcher was written in the Java language: Java is a far easier and far more convenient language to develop in than, say, C++. Java's advantages include: Automatic memory management, 10x less error-prone, 10x less effort to make it work on different platforms. If DocFetcher had been written in C++ instead, development time would probably have been twice as long, and the resulting program would have only half the features, but twice the number of bugs. And perhaps you would have to pay for it, or download some crack, because far fewer developers are willing to go through the ordeal of messing with C++ in their unpaid spare time.

Also, while Java programs still start up slowly and memory usage is still high, the runtime performance has improved significantly in recent years and is now comparable to native code as produced by C/C++ programs, according to Wikipedia. (Case in point: I've never heard anybody say that DocFetcher's indexing algorithm is "slow".)

As for Java security, here’s the Truth most non-tech people never seem to quite understand:

The Java runtime by itself is no more dangerous than the .NET framework or any other application runtime. Update the Java runtime if you feel like doing so, but those updates are primarily for the sake of improved performance and fewer bugs and crashes, not for the sake of security.
The only serious danger when using Java comes from the Java plugin running in your browser. It’s best to disable the plugin altogether. If you’re using a modern, up-to-date browser, the plugin is probably already disabled. If you must run the plugin, then you’d be well advised to keep it up to date at all times.

Why does DocFetcher take up so much RAM?

One part of the answer is that it's a Java program. The other part is that you're feeding it with huge amounts of data.

Why does DocFetcher only have a text-only preview (except for HTML files)?

Because a preview with full formatting, tables, paging, etc. would require a tremendous amount of programming effort. It's sort of like implementing a miniature version of MS Office inside DocFetcher for every single supported document format. That being said, there are some ready-made solutions for MS Office and PDF files out there, although integrating them into DocFetcher wouldn't be easy either. The cost-benefit ratio is really low here, so there are currently no plans to improve the situation.
