diff --git a/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG b/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG
index dadd9f4e71edc4622f27c9d5e45b3ff114d3e6de..7c0856902dbf388c3061f465c2b7a983c3ecce04 100644
Binary files a/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG and b/docs/doxygen-user/images/keyword-search-configuration-dialog-general.PNG differ
diff --git a/docs/doxygen-user/keyword_search.dox b/docs/doxygen-user/keyword_search.dox
index 48142ee4a7101aa070f203e480b17d5d46411e51..d388643cafdef4682e452a5d25daed4809659bed 100644
--- a/docs/doxygen-user/keyword_search.dox
+++ b/docs/doxygen-user/keyword_search.dox
@@ -5,22 +5,30 @@
 \section keyword_module_overview What Does It Do
-The Keyword Search module facilitates both the \ref ingest_page "ingest" portion of searching and also supports manual text searching after ingest has completed (see \ref ad_hoc_keyword_search_page). It extracts text from files being ingested, selected reports generated by other modules, and results generated by other modules. This extracted text is then added to a Solr index that can then be searched.
+The Keyword Search module facilitates both the \ref ingest_page "ingest" portion of searching and also supports manual text searching after ingest has completed (see \ref ad_hoc_keyword_search_page). It extracts text from files being ingested, selected reports generated by other modules, and results generated by other modules.
-Autopsy tries its best to extract the maximum amount of text from the files being indexed. First, the indexing will try to extract text from supported file formats, such as pure text file format, MS Office Documents, PDF files, Email, and many others. If the file is not supported by the standard text extractor, Autopsy will fall back to a string extraction algorithm.
String extraction on unknown file formats or arbitrary binary files can often extract a sizeable amount of text from a file, often enough to provide additional clues to reviewers. String extraction will not extract text strings from encrypted files.
+Autopsy tries its best to extract the maximum amount of text from the files being indexed. First, it will try to extract text from supported file formats, such as pure text file format, MS Office Documents, PDF files, Email, and many others. If the file is not supported by the standard text extractor, Autopsy will fall back to a string extraction algorithm. String extraction on unknown file formats or arbitrary binary files can often extract a sizeable amount of text from a file, often enough to provide additional clues to reviewers. String extraction will not extract text strings from encrypted files.
 Autopsy ships with some built-in lists that define regular expressions and enable the user to search for Phone Numbers, IP addresses, URLs and E-mail addresses. However, enabling some of these very general lists can produce a very large number of hits, and many of them can be false-positives. Regular expressions can potentially take a long time to complete.
-Once files are placed in the Solr index, they can be searched quickly for specific keywords, regular expressions, or keyword search lists that can contain a mixture of keywords and regular expressions. Search queries can be executed automatically during the ingest run or at the end of the ingest, depending on the current settings and the time it takes to ingest the image.
-
 Refer to \ref ad_hoc_keyword_search_page for more details on specifying regular expressions and other types of searches.
+As of the Autopsy 4.21.0 release, two types of keyword searching are supported: Solr search with full text indexing, or a built-in Autopsy "In-Line" Keyword Search.
It is also possible to combine both searches during the ingest process: perform In-Line keyword search as well as index all extracted text in Solr to allow for ad-hoc searching after the ingest has completed. See \ref keyword_ingest_settings for details on search type configuration.
+
+\subsection keyword_SolrSearch Solr Search With Indexing
+
+Full text indexing with Solr gives the user the flexibility to run ad-hoc manual text searches after ingest has completed (see \ref ad_hoc_keyword_search_page). However, full text indexing can greatly slow down ingest for large data sources and/or cases. Once files are placed in the Solr index, they can be searched quickly for specific keywords, regular expressions, or keyword search lists that can contain a mixture of keywords and regular expressions. Search queries can be executed automatically at the end of the ingest.
+
+\subsection keyword_InlineSearch In-Line Keyword Search
+
+The In-Line Keyword Search performs the searching during ingest at the time of text extraction and only indexes small sections of the files that have keyword hits. Our profiling runs show that in most cases this cuts data source ingest time roughly in half compared with ingesting and searching the same data source using Solr indexing. The downside is that all of the search terms must be specified ahead of the ingest, and there is no way to run ad-hoc search on the entire extracted text after ingest has completed.
+
 \section keyword_search_configuration_dialog Keyword Search Configuration Dialog
 The keyword search configuration dialog has three tabs, each with its own purpose:
 \li The \ref keyword_keywordListsTab is used to add, remove, and modify keyword search lists.
 \li The \ref keyword_stringExtractionTab is used to enable language scripts and extraction type.
-\li The \ref keyword_generalSettingsTab is used to configure the ingest timings and display information.
+\li The \ref keyword_generalSettingsTab is used to configure display information.
 \subsection keyword_keywordListsTab Lists tab
@@ -57,18 +65,19 @@ The user can also use the String Viewer first and try different script/language
 \subsubsection keyword_nsrl NIST NSRL Support
 The hash lookup ingest service can be configured to use the NIST NSRL hash set of known files. The keyword search advanced configuration dialog "General" tab contains an option to skip keyword indexing and search on files that have previously marked as "known" and uninteresting files. Selecting this option can greatly reduce size of the index and improve ingest performance. In most cases, user does not need to keyword search for "known" files.
-\subsubsection keyword_update_freq Result update frequency during ingest
-To control how frequently searches are executed during ingest, the user can adjust the timing setting available in the keyword search advanced configuration dialog "General" tab. Setting the number of minutes lower will result in more frequent index updates and searches being executed and the user will be able to see results more in real-time. However, more frequent updates can affect the overall performance, especially on lower-end systems, and can potentially lengthen the overall time needed for the ingest to complete.
-
-One can also choose to have no periodic searches. This will speed up the ingest. Users choosing this option can run their keyword searches once the entire keyword search index is complete.
-
 \section keyword_usage Using the Module
-Search queries can be executed manually by the user at any time, as long as there are some files already indexed and ready to be searched. Searching before indexing is complete will naturally only search indexes that are already compiled.
+After the ingest has completed, \ref ad_hoc_keyword_search_page will be available for manual search. The amount of files/text available for Ad Hoc Search depends on the Keyword Search module settings at the time of the ingest. As of the Autopsy 4.21.0 release, two types of keyword searching are supported: Solr search with full text indexing and/or a built-in Autopsy "In-Line" Keyword Search. It is also possible to combine both searches during the ingest process: perform In-Line keyword search as well as index all extracted text in Solr to allow for ad-hoc searching after the ingest has completed.
+
+If full text indexing with Solr was enabled during ingest, then ad-hoc manual text searching will be able to search all of the text extracted from all of the files and artifacts.
-See \ref ingest_page "Ingest" for more information on ingest in general.
+The In-Line Keyword Search performs the searching during ingest at the time of text extraction and only indexes small sections of the files that have keyword hits. Therefore, unless full text indexing with Solr is enabled, the ad-hoc search will only be able to search those small sections of the files that had keyword hits (as opposed to all of the text extracted from all of the files and artifacts).
-Once there are files in the index, \ref ad_hoc_keyword_search_page will be available for use to manually search at any time.
+Other situations which will result in not being able to search all of the text extracted from all of the files and artifacts include:
+<ul>
+<li>If file filtering was used during ingest, resulting in only a subset of files getting ingested. See \ref file_filters for information on file filtering.
+<li>If the Autopsy case contains multiple data sources and one or more of those data sources was not indexed during its ingest.
+</ul>
 \subsection keyword_ingest_settings Ingest Settings