No new features are planned for DocFetcher, only bugfixes. Development continues in DocFetcher Pro.
Buying a copy of DocFetcher Pro, the commercial big brother of DocFetcher, is equivalent to making a donation, plus you get a bunch of new features.
If you don't need those features and/or DocFetcher Pro costs more than you're willing to donate, you can "buy" the otherwise free demo of DocFetcher Pro for a price of your choosing.
If you experienced problems with the installed version of DocFetcher, consider using the portable version instead (download page). The latter runs on all supported platforms and does not try to detect or download Java runtimes. However, note that the portable version has to be put in a location where you have write permissions. The reason for this is that on the first start, the program figures out what operating system it's running on and whether the operating system is 32-bit or 64-bit. It then tries to unpack the right library files into a subfolder under its own folder, and this will fail without write permissions.
In some cases where DocFetcher doesn't start, the solution is to uninstall all currently installed Java runtimes and then reinstall the latest Java runtime from the Java website. On that website, be sure to pick either the 32-bit or the 64-bit Java runtime, depending on whether your operating system is 32-bit or 64-bit.
Another potential problem: Running DocFetcher with a memory setting of more than 1 GB requires a 64-bit Java runtime. It will not work with 32-bit.
On some systems, the embedded web browser that is used for displaying the manual and HTML files can crash the entire program. As a workaround, disable the embedded web browser by modifying DocFetcher's settings file. Look for the settings file in one of the following locations:
C:\Documents and Settings\<UserName>\Application Data\DocFetcher\conf\settings-conf.txt
C:\Users\<UserName>\AppData\Roaming\DocFetcher\conf\settings-conf.txt
DocFetcher\conf\settings-conf.txt
/Users/<UserName>/.docfetcher/conf/settings-conf.txt
If the settings file doesn't exist at the expected location, create a new, empty text file there named settings-conf.txt. Now, first close DocFetcher, then open the settings file in a text editor and set ShowManualOnStartup = false and PreferHtmlPreview = false in it. While you're at it, you may also set HotkeyEnabled = false to disable the global hotkey. Save and close the file, then try to start DocFetcher.
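The settings change described above can also be scripted. The following is a minimal Python sketch, assuming the settings file consists of simple "Key = value" lines. The function name set_options and the sample text are made up for illustration and are not part of DocFetcher. As with manual editing, DocFetcher should be closed before the real file is modified.

```python
def set_options(text, options):
    """Return text with each 'Key = value' line updated; missing keys are appended."""
    remaining = dict(options)
    lines = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        if "=" in line and key in remaining:
            # Overwrite the existing value for this key.
            lines.append(f"{key} = {remaining.pop(key)}")
        else:
            lines.append(line)
    # Keys that were not present in the file are added at the end.
    for key, value in remaining.items():
        lines.append(f"{key} = {value}")
    return "\n".join(lines) + "\n"


# Hypothetical file contents, for illustration only.
sample = "ShowManualOnStartup = true\nHotkeyEnabled = true\n"
updated = set_options(sample, {
    "ShowManualOnStartup": "false",
    "PreferHtmlPreview": "false",
    "HotkeyEnabled": "false",
})
print(updated)
```

To apply this to a real settings file, read its contents into a string, pass it through set_options, and write the result back to the same path.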
Some users reported startup issues caused by faulty NVIDIA drivers, version 378.xx. See this thread.
If none of the above helps, try launching DocFetcher via one of the alternative launchers:
Windows: In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Move this file one level up into the DocFetcher folder. Then open a command prompt and use the cd command to navigate to the DocFetcher folder, like so: cd C:\Program Files (x86)\DocFetcher. Then try launching DocFetcher from the command prompt by entering DocFetcher.bat and pressing Enter. If DocFetcher doesn't start, then chances are an error message will be printed in the command prompt. Post this error message on the DocFetcher forum.
Linux: Try the DocFetcher.sh launcher. If that doesn't help, do the following: Open a terminal and use the cd command to navigate to the DocFetcher folder. Then try launching DocFetcher from the terminal by running ./DocFetcher.sh. If DocFetcher doesn't start, then chances are an error message will be printed in the terminal. Post this error message on the DocFetcher forum.
macOS: Open a terminal and use the cd command to navigate to the folder Contents/MacOS inside the application bundle, then launch the DocFetcher script from there. If it doesn't start, an error message might be printed in the terminal. Post this error message on the DocFetcher forum.
If double-clicking the DocFetcherPortable.exe launcher gives you an error message saying something along the lines of "DocFetcher requires a Java Runtime Environment", you need to make sure jPortable is installed in the right location.
There are actually two versions of jPortable, called jPortable and jPortable 64. The former runs on both 32-bit and 64-bit operating systems, and the latter only on 64-bit operating systems. However, in the 32-bit version, DocFetcher's so-called memory limit cannot be set higher than about 1 GB. This could result in out-of-memory errors when trying to index large files. In the 64-bit version, the maximum value for the memory limit is much higher.
The main problem with installing jPortable or jPortable 64 is that their installers give no clue about where they need to be installed to make DocFetcher work. There are basically two rules you need to follow:
1. The CommonFiles folder needs to be placed beside the DocFetcherPortable folder, not inside it or anywhere else. For example, if you installed the PortableApps version of DocFetcher into X:\, then the CommonFiles folder should also be in X:\, like so:
X:\DocFetcherPortable
X:\CommonFiles
2. There must be a Java folder inside the CommonFiles folder, containing the Java Runtime to run DocFetcher on. Importantly, if you go with jPortable 64, the folder must still be named Java, and not Java64, unlike what the jPortable 64 installer's default path would suggest.
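The two rules above can be checked with a few lines of Python. This is only a sketch: the function name check_jportable_layout is hypothetical, and it verifies nothing beyond the folder names mentioned in the text (CommonFiles beside DocFetcherPortable, with a Java subfolder).

```python
from pathlib import Path


def check_jportable_layout(docfetcher_dir):
    """Return a list of problems with the folder layout; empty if it looks right."""
    docfetcher_dir = Path(docfetcher_dir)
    # Rule 1: CommonFiles must sit beside the DocFetcherPortable folder.
    common = docfetcher_dir.parent / "CommonFiles"
    problems = []
    if not common.is_dir():
        problems.append(f"missing folder: {common}")
    elif not (common / "Java").is_dir():
        # Rule 2: the runtime folder must be named Java, even for jPortable 64.
        if (common / "Java64").is_dir():
            problems.append("found Java64; the folder must be named Java")
        else:
            problems.append(f"missing folder: {common / 'Java'}")
    return problems
```

Calling check_jportable_layout with the path of the DocFetcherPortable folder returns an empty list when both rules are satisfied.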
In the crash report, there's probably a line that says "SWTException: Unable to load graphics library [GDI+ is required]". This indicates that you need to install a package called GDI+ for supporting advanced graphics operations. Here's where you can download GDI+: http://www.microsoft.com/en-us/download/details.aspx?id=18909
Short answer: It's complicated. Last time I checked it had something to do with vector spaces and stuff. For further information, have a look at the scoring page of Lucene (DocFetcher's underlying search engine) and the Wikipedia article about the Vector Space Model.
Here's an extremely simplified explanation of how the scoring works: Suppose you have two files file1.doc and file2.doc with the following contents:
file1.doc contains the word "dog" 10 times, and nothing else
file2.doc contains 100 words, 20 of which are "dog"
Now, if you search for "dog", both files will show up in the results, but file1.doc gets a higher score because 10/10=100%, and 20/100=20%. This illustrates the basic idea: Dividing the number of hits by the word count gives you a measure of how "relevant" a document is with respect to your query. Why is that so? Because:
Occasionally, you'll see score values greater than 100%. This is because the actual formula used is much more complicated, and the calculated score is not really a percentage, but a fraction greater than or equal to 0.
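The simplified hits-divided-by-word-count idea can be written down in a few lines of Python. This sketch only reproduces the toy calculation from the example; it is not the formula Lucene actually uses, and the filler word "cat" in file2 is an assumption made to reach 100 words.

```python
def simplified_score(words, query):
    """Fraction of the words in a document that equal the query term."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word == query)
    return hits / len(words)


file1 = ["dog"] * 10                  # "dog" 10 times, nothing else
file2 = ["dog"] * 20 + ["cat"] * 80   # 100 words, 20 of which are "dog"

print(simplified_score(file1, "dog"))  # 1.0
print(simplified_score(file2, "dog"))  # 0.2
```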
Some people have asked for a column on the result table that displays for each file the “hit count”, i.e., the number of occurrences of the query string in the file. This information is currently only displayed for the selected file in a small box at the top of the preview pane.
There are currently no plans to implement a hit count column, due to performance reasons:
This is not really a bug, but a consequence of the fact that DocFetcher splits documents into individual words during indexing, a.k.a. tokenization. This is done in order to build a dictionary (i.e. the index), which DocFetcher then uses to do quick searches. In general, DocFetcher works best with natural language, but not quite as well with text containing digits or special characters.
That being said, there's an Analyzer option in the Advanced Settings which allows you to switch to an alternative tokenization mechanism that works better with source code and other kinds of text not written in natural language.
Additionally, take a look at the Query Syntax section in the manual. Some of the concepts explained in there, e.g. wildcards and phrase searches, might help to work around the above issues.
On the indexing dialog, use this regex exclusion pattern: .*/\.svn/.*
Note the usage of forward slashes to match against path separators (even on Windows!), and escaping the "." with a backward slash.
In addition, "Match Against" must be set to "Absolute path".
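To see what the exclusion pattern does, here is a small Python sketch. It assumes that the pattern must match the entire absolute path and that the path is written with forward slashes, as described above. Python's re module stands in for the regex engine DocFetcher uses, and the sample paths are made up.

```python
import re

# The exclusion pattern from the text above.
SVN_PATTERN = re.compile(r".*/\.svn/.*")


def is_excluded(absolute_path):
    """True if the whole path matches the pattern (paths use forward slashes)."""
    return SVN_PATTERN.fullmatch(absolute_path) is not None


print(is_excluded("C:/projects/app/.svn/entries"))   # True
print(is_excluded("C:/projects/app/src/Main.java"))  # False
print(is_excluded("C:/projects/app/xsvn/entries"))   # False: "\." only matches a literal dot
```

The third path shows why the dot is escaped: an unescaped "." would match any character, so a folder named xsvn would be excluded as well.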
DocFetcher does not include folder names or file paths in the search, only filenames and file contents. That was a fundamental design decision made when the core program was written. It may or may not have been a good decision, but the idea was that (1) most of the important stuff the user may want to search for is in the filename and file contents, (2) if searching for words matching some file path component brings up all files on that path, this will decrease the overall quality of search results, and (3) there are already a lot of programs to search filenames and folder names, such as Everything.
Note: DocFetcher Pro is capable of finding folders by name.
DocFetcher uses third-party libraries to perform text extraction. For example, Apache POI is used for MS Office files, and Apache PDFBox for PDF files. Most of the errors that are shown during indexing come directly from the respective extraction libraries, without further translation by DocFetcher.
If DocFetcher gives an error on some file, there's usually not much one can do about it, except waiting for the developers of the respective library to release an update of their software, and then waiting for this update to be included in DocFetcher.
Certain errors can be circumvented as follows:
If some of your .doc files aren't MS Word files, you can enable mime-type detection for .doc files by putting the pattern .*\.doc in the pattern table on the indexing configuration dialog and setting "Detect mime type" as the action to be performed.
If during indexing an error dialog pops up that says something like "Oops, this program just died", then the program has crashed and left the partially created index in a potentially broken state. There's no telling what will happen if you try to use this potentially broken index; it may or may not work correctly. There's no guaranteed way to salvage such an index.
If the crash occurred because the program ran out of memory, and you fixed the problem by raising the memory limit and/or adding RAM, then you still need to rebuild the index to be sure that there are no issues with it.
The location of the index files depends on the version of DocFetcher and the operating system:
The indexes folder inside the DocFetcher folder
C:\Documents and Settings\<UserName>\Application Data\DocFetcher
C:\Users\<UserName>\AppData\Roaming\DocFetcher
/Users/<UserName>/.docfetcher
For customizing the location of the index files, have a look at the file misc/paths.txt inside the DocFetcher folder.
In principle, DocFetcher should work with any amount of data. In practice, however, when dealing with a massive amount of data, there's a high risk that there are some problematic files in there that either cause DocFetcher to run out of memory or to crash altogether. The first often happens with large PDF files, and the second may happen with corrupt or otherwise unusual files. There are a few other potential problems as well. All in all, the following is recommended:
Copy the folders conf and indexes into the new DocFetcher folder. conf contains the program settings, while indexes contains the indexes.
In the DocFetcher folder, open the following file with a text editor: App\AppInfo\Launcher\DocFetcherPortable.ini
In that file, there's a line starting with CommandLineArguments=. This line contains various launch parameters, including an -Xmx parameter. To set the memory limit to 8 GB, for example, you can change the parameter to -Xmx8g. After editing, save and close the file, then restart the application.
Note that the application won't start if the chosen memory limit exceeds the amount of physical RAM available. Moreover, you need to take into account that the OS and other processes also need some of that RAM. So, for example, if the computer has 8 GB of RAM, a reasonable memory limit to try would be 4 GB.
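The -Xmx edit can be illustrated with a short Python sketch. The helper name set_memory_limit is hypothetical, and the sample line is a shortened stand-in for the real CommandLineArguments= line, which contains more parameters.

```python
import re


def set_memory_limit(arguments_line, limit):
    """Replace the value of the -Xmx parameter in a launcher arguments line."""
    new_line, count = re.subn(r"-Xmx\S+", f"-Xmx{limit}", arguments_line)
    if count == 0:
        raise ValueError("no -Xmx parameter found")
    return new_line


# Hypothetical line; the real one in DocFetcherPortable.ini contains more parameters.
line = "CommandLineArguments=-Xmx512m -Xss2m"
print(set_memory_limit(line, "8g"))  # CommandLineArguments=-Xmx8g -Xss2m
```

Only the -Xmx value is touched; all other launch parameters on the line are left as they are.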
Open a terminal and launch DocFetcher in it via the command docfetcher. In the terminal, you will see a message pointing to the configuration file in which you can set the memory limit. The configuration file contains further instructions.
The likely reason for high CPU usage is that (1) the operating system or some other program is constantly updating some files in one of the indexed folders, and (2) the option "Watch folders for file changes" was selected when the index was created, causing DocFetcher to frequently run index updates.
Accordingly, the workaround is to rebuild the affected index(es) with the option "Watch folders for file changes" turned off.
You can use so-called field searches to search in filenames only. Example: filename:dog
For more info, see the section "Field Searches" on the "Query Syntax" page of the built-in program manual.
For instance, to list all files with the file extension .mm, enter this query: filename:*.mm
This syntax is explained under the section "Field Searches" on the "Query Syntax" page of the built-in program manual.
Note: DocFetcher Pro addresses this problem via a feature called Custom Types. In essence, it is a customizable version of the Document Types pane, allowing you to define your own file types to filter the search results by.
To index files without file extension (i.e. without the dot in the filename), add the following rule in the pattern table on the indexing dialog:
[^\.]*
This rule matches all files whose filenames do not contain a dot, and it will make DocFetcher recognize the matched files as plain text files.
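The behaviour of this rule can be tried out in Python. The sketch below assumes the pattern is matched against the whole filename (not the path), with Python's re module standing in for DocFetcher's regex engine. The filenames are made-up examples.

```python
import re

# The pattern from the rule above: any number of characters, none of them a dot.
NO_EXTENSION = re.compile(r"[^\.]*")


def has_no_extension(filename):
    """True if the entire filename contains no dot."""
    return NO_EXTENSION.fullmatch(filename) is not None


print(has_no_extension("README"))      # True
print(has_no_extension("notes.txt"))   # False
print(has_no_extension(".gitignore"))  # False
```

Note that a name such as .gitignore is not matched, since it contains a dot even though it has no conventional extension.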
Note: In DocFetcher Pro, files without file extension can be indexed by ticking the checkbox "Index files without file extension as text files" on the indexing dialog.
The error message means the folder you're trying to index contains more levels of subfolders than DocFetcher can handle with its current settings. The workaround is to either move the subfolders around in order to reduce the maximum folder depth, or to change the settings. How the latter is done depends on your operating system:
Windows: Move the DocFetcher.bat file from the misc folder inside the DocFetcher folder one level up into the DocFetcher folder (important, otherwise DocFetcher.bat won't run). Now open the DocFetcher.bat file in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m. From now on, always launch DocFetcher through DocFetcher.bat.
Linux: Open DocFetcher.sh in a text editor. In the last line, you can see a setting -Xss2m. Set this to a higher value, e.g. -Xss4m.
FYI, the -Xss setting is the so-called "thread stack size" in megabytes that, among other things, limits the number of folder levels DocFetcher can handle.
Note: DocFetcher Pro is completely immune to the above problem. It can index folder hierarchies of any depth.
If you index a certain folder, say C:\path\to\folder, and then try to index a subfolder of that folder, say C:\path\to\folder\subfolder, DocFetcher will refuse and complain that overlapping indexes aren't allowed. There are technical reasons for this:
Showing the indexing progress would require determining in advance how many files need to be indexed, which is both complex and time-consuming due to a number of factors such as incremental indexing (i.e., skip all files that have already been indexed), error handling, unlimited nesting of archives, and file inclusion/exclusion rules. If there is a huge number of files to be indexed, DocFetcher could easily spend 10 minutes just determining what needs to be indexed, before getting to the actual indexing.
This might be a problem with the specific fonts you are using. Try different font settings on DocFetcher's preferences dialog.
It's possible that the text in these PDF files exists only as scanned images, and is therefore not extractable. You can check this by opening the PDF files and trying to select the visible text. If you can't select the text, that means it's actually just an image. If this is indeed the problem, run your PDF files through OCR software.
To give an example of the problem: Suppose a file contains the following text:
The quick brown fox jumps over the lazy dog.
The query "brown jumps"~10 will bring up the file and highlight the match correctly.
The query "jumps brown"~10 will also bring up the file, but the match won't be highlighted.
This is a known limitation of the preview highlighting; the searching itself is not affected.
As a workaround, you can combine a proximity search and its reverse with an OR operator, like so: "dog cat"~10 OR "cat dog"~10. With this, you will get highlighting in both directions. Do note that you have to use the OR operator, not the AND operator; OR and AND behave totally differently.
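If you build such queries often, the workaround is easy to generate. The following Python sketch only assembles the query string; the function name bidirectional_proximity is made up for illustration.

```python
def bidirectional_proximity(word1, word2, distance):
    """Build a query that matches the two words in either order."""
    forward = f'"{word1} {word2}"~{distance}'
    backward = f'"{word2} {word1}"~{distance}'
    # OR, not AND: either word order should count as a match.
    return f"{forward} OR {backward}"


print(bidirectional_proximity("dog", "cat", 10))
# "dog cat"~10 OR "cat dog"~10
```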
This appears to be an issue with the DocFetcher daemon on Windows. The daemon runs whenever DocFetcher isn't running, and seems to prevent Outlook from starting. As a workaround, rename the file docfetcher-daemon-windows.exe in the DocFetcher folder to prevent the daemon from starting, and then reboot Windows.
Note that disabling the daemon comes with the downside that you'll have to update your indexes by hand; otherwise your search results will be out of date after file changes.
For more information about what the daemon does, please see the first section in the DocFetcher manual.
DocFetcher ships with translations of its user interface for a couple of languages. At program start, it will detect the language of your operating system and then either choose a matching translation if available or use the English default.
You can override this auto-detection and explicitly set the user interface language as follows. First, take a look at the contents of the lang folder in the DocFetcher program folder. The lang folder contains files named Resource_XX.properties, where XX is a lowercase two-letter ISO-639-1 language code that specifies a certain language. The lang folder contains all available translations. A complete list of ISO-639-1 language codes for all languages can be found here.
Now, to manually set the user interface language, you have to add a language parameter at the end of the launcher file, which depends on your operating system:
Windows: In the folder DocFetcher\misc, there's a file named DocFetcher.bat. Open the file in a text editor, add the parameter described below, then save and close the file. Importantly, move the DocFetcher.bat file one level up into the DocFetcher folder. From now on, always start DocFetcher by double-clicking the DocFetcher.bat file.
Linux: Open DocFetcher.sh in a text editor, add the parameter described below, then save and close the file. Launch DocFetcher via the modified DocFetcher.sh file.
Now, regarding the language parameter that needs to be added: The last line of the launcher file starts with a "java" command which launches the DocFetcher process. It looks like this:
java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9
Modify this line like so:
java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% -Djava.library.path=lib -Duser.language=XX net.sourceforge.docfetcher.Main %1 %2 %3 %4 %5 %6 %7 %8 %9
Replace XX with the ISO-639-1 language code of the language you want DocFetcher to use.
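The position of the new parameter matters: it has to come before the main class name, because everything after the class name is passed to the program as an argument. The following Python sketch shows the same edit on the java line quoted above. The helper name add_language_parameter is hypothetical, and "de" is just an example language code.

```python
MAIN_CLASS = "net.sourceforge.docfetcher.Main"


def add_language_parameter(java_line, language_code):
    """Insert -Duser.language=<code> directly before the main class name."""
    if "-Duser.language=" in java_line:
        raise ValueError("line already sets a language")
    if MAIN_CLASS not in java_line:
        raise ValueError("main class not found in line")
    parameter = f"-Duser.language={language_code}"
    return java_line.replace(MAIN_CLASS, f"{parameter} {MAIN_CLASS}", 1)


line = ("java -enableassertions -Xmx512m -Xss2m -cp %libclasspath% "
        "-Djava.library.path=lib net.sourceforge.docfetcher.Main "
        "%1 %2 %3 %4 %5 %6 %7 %8 %9")
print(add_language_parameter(line, "de"))
```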
No, but the commercial software DocFetcher Server has a web interface.
When DocFetcher isn't running, the daemon detects file changes in the indexed folders, and marks the corresponding indexes as "needs to be updated".
When DocFetcher is running, the daemon remains inactive, because then DocFetcher assumes the responsibility of detecting file changes, provided that the indexes have been created with the folder watching option enabled.
By itself, the daemon does not do any indexing; it only marks indexes as needing an update. When DocFetcher is started the next time, DocFetcher picks up the information left behind by the daemon and runs the required index updates.
In DocFetcher 1.1.20 and later versions, DocFetcher supports Python-based scripting. This can be used to programmatically execute searches and retrieve the results. For an example of how this is done, see the explanation at the top of the file search.py, which can be found in the DocFetcher program folder.
Alternatively, if you feel like tinkering with the DocFetcher source code, have a look at the Source code page for instructions on how to obtain the source code and build DocFetcher.
For Java-based indexing and searching in general, have a look at these Apache projects:
This is mainly due to the fact that DocFetcher is shipped with lots of built-in text extraction libraries, some of which are quite big. The worst offenders are the libraries for MS Office and PDF files. However, the developers of these libraries aren't to blame here: The libraries have to be big because the respective file formats are immensely complex.
The word "Java" refers both to a platform for programs to run on, and to a programming language for writing such programs. Here's why DocFetcher was written in the Java language: Java is a far easier and far more convenient language to develop in than, say, C++. Java's advantages include: Automatic memory management, 10x less error-prone, 10x less effort to make it work on different platforms. If DocFetcher had been written in C++ instead, development time would probably have been twice as long, and the resulting program would have only half the features, but twice the number of bugs. And perhaps you would have to pay for it, or download some crack, because far fewer developers are willing to go through the ordeal of messing with C++ in their unpaid spare time.
Also, while Java programs still start up slowly and memory usage is still high, the runtime performance has improved significantly in recent years and is now comparable to native code as produced by C/C++ programs, according to Wikipedia. (Case in point: I've never heard anybody say that DocFetcher's indexing algorithm is "slow".)
As for Java security, here’s the Truth most non-tech people never seem to quite understand:
One part of the answer is that it's a Java program. The other part is that you're feeding it with huge amounts of data.
Because a preview with full formatting, tables, paging, etc. would require a tremendous amount of programming effort. It's sort of like implementing a miniature version of MS Office inside DocFetcher for every single supported document format. That being said, there are some ready-made solutions for MS Office and PDF files out there, although integrating them into DocFetcher wouldn't be easy either. The benefit is really low compared to the cost here, so there are currently no plans to improve the situation.