Newspaper research with intergator and PPS_Finder

Posted on 21.08.2018 by

Thomas Aurich

Intelligent search in digital newspaper archives

Modern periodicals and publishing houses often hold large databases of publications, which are increasingly digitized and thus available for comprehensive research. Newspapers and magazines that date from before the computer age are more problematic, because they often exist only on paper. They are captured and digitized in large-scale projects with special scanners. At present, however, the process often ends with the scan itself and leaves out capturing the content.

Together with PPS PREPRESS SYSTEME GmbH, based in Hesse, intergator was adapted to the requirements of searching newspaper and magazine archives. Combining experience from enterprise search projects with know-how from digitization processes, the PPS_Finder was developed into a comprehensive research tool.

The PPS_Finder as a research tool

The search in detail

As with any search, the central entry point is a search input field into which a search term is entered. The results are then listed clearly, with each hit preceded by a preview image and accompanied by some metadata. Since all content is automatically processed by OCR and converted to PDF, the newspaper can be opened immediately in a reader. The search term is highlighted in color, which makes it easier for the user to find the passage within the document. The search and its results remain unaffected, i.e. no new search is needed and the current one can be continued right away.

Document preview with excerpt from the reference
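To make the idea of an excerpt with a highlighted search term more concrete, here is a minimal Python sketch. It is not how intergator or the PPS_Finder implement highlighting; the function name `highlight_excerpt`, the `[[ ]]` markers (standing in for the colored highlight) and the sample sentence are all assumptions made for illustration.

```python
import re


def highlight_excerpt(text: str, term: str, context: int = 30) -> str:
    """Return an excerpt around the first match, with matches wrapped in [[ ]].

    Hypothetical helper: `context` is the number of characters kept on
    each side of the first match. Returns "" if the term does not occur.
    """
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return ""
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    excerpt = text[start:end]
    # Mark every occurrence inside the excerpt, keeping the original casing.
    return pattern.sub(lambda m: f"[[{m.group(0)}]]", excerpt)


# Invented sample text standing in for an OCR-processed page.
page_text = "The concert in Rostock drew thousands of visitors to the harbour."
print(highlight_excerpt(page_text, "rostock", context=15))
```

The match is case-insensitive, so a lower-case query still marks the capitalized word in the page text.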

Freely configurable facets allow search results to be narrowed down sensibly. In addition to years of publication and sources (if, for example, a publisher issues several publications), authors can also be used as filters. Selected facets can be deactivated at any time with a single click, without restarting the search. Facets can also be used for exclusions: if, for example, you are looking for a term and explicitly want to leave out a particular year, that year can be set as an excluding facet. Additional facets, such as filters for agency reports or regional editions, are possible in principle.

Narrow down the search results with facets
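The combination of including and excluding facets can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hit` class, its fields (`year`, `source`, `author`), the `apply_facets` function and the sample hits are invented here and do not reflect intergator's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit with a few facet fields."""
    title: str
    year: int
    source: str
    author: str


def apply_facets(hits, include=None, exclude=None):
    """Keep hits that match every include facet and no exclude facet.

    `include` and `exclude` map a field name to a set of values.
    """
    include = include or {}
    exclude = exclude or {}
    result = []
    for hit in hits:
        if any(getattr(hit, name) not in values for name, values in include.items()):
            continue  # fails at least one including facet
        if any(getattr(hit, name) in values for name, values in exclude.items()):
            continue  # hit by at least one excluding facet
        result.append(hit)
    return result


# Invented sample data.
hits = [
    Hit("Harbour festival opens", 1995, "Daily Gazette", "A. Example"),
    Hit("Harbour festival closes", 1996, "Daily Gazette", "B. Example"),
    Hit("Festival review", 1996, "Weekly Review", "A. Example"),
]

# Narrow to one source, but explicitly exclude the year 1995.
filtered = apply_facets(
    hits,
    include={"source": {"Daily Gazette"}},
    exclude={"year": {1995}},
)
print([hit.title for hit in filtered])
```

Because the facets are applied to the existing hit list, removing a facet simply means calling the filter again with one entry fewer; the underlying search does not have to be repeated.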

During indexing, the elements are annotated. During a search, hits can additionally be enriched with simple tags so that they can be specifically included in later research. The tags stay on the hit after the search ends, but can be removed at any time. This makes it easier to research more complex topics that sometimes only emerge from context.

Extensible detailed information
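How tags can outlive a single search is easiest to see with a small data structure. The following Python sketch is an assumption-laden illustration: the `TagStore` class, its method names and the hit identifiers are hypothetical and not part of intergator.

```python
from collections import defaultdict


class TagStore:
    """Hypothetical store for user-assigned tags, kept per hit ID.

    The tags live outside any individual search, so they are still
    available when the same hit turns up in a later search.
    """

    def __init__(self):
        self._tags = defaultdict(set)

    def add(self, hit_id: str, tag: str) -> None:
        self._tags[hit_id].add(tag)

    def remove(self, hit_id: str, tag: str) -> None:
        # discard() does nothing if the tag was never set.
        self._tags[hit_id].discard(tag)

    def tags_of(self, hit_id: str) -> set:
        return set(self._tags.get(hit_id, set()))

    def hits_with(self, tag: str) -> list:
        return sorted(h for h, tags in self._tags.items() if tag in tags)


# Invented hit IDs and tag names.
store = TagStore()
store.add("gazette-1995-06-12-p3", "harbour-project")
store.add("review-1996-02-01-p7", "harbour-project")
store.add("review-1996-02-01-p7", "follow-up")
store.remove("review-1996-02-01-p7", "follow-up")
print(store.hits_with("harbour-project"))
```

Collecting hits from several searches under one tag is what makes it possible to assemble a topic that no single search term would cover.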

Beyond the simple search in the primary search field, the field can be extended with additional fields so that results are restricted from the outset. As in Google search, these parameters can also be entered directly in the standard search field using special abbreviations. Searching by file type, date, author, etc. allows experienced users to search quickly. Boolean operators (AND, OR, NOT, etc.) can also be used.

Boolean search using the terms "Rostock" and "Marteria"
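A query that mixes Boolean operators with field abbreviations can be evaluated with a very small parser. The Python sketch below is only meant to show the principle. The `field:value` prefixes (`author:`, `year:`), the `matches` function and the sample documents are assumptions; the actual query syntax of intergator may differ. Parentheses are deliberately left out, and NOT binds more tightly than AND, which binds more tightly than OR.

```python
def matches(query: str, doc: dict) -> bool:
    """Evaluate a query with AND, OR, NOT and optional field:value terms.

    `doc` is a dict with a "text" entry plus arbitrary metadata fields.
    Plain terms are matched case-insensitively against doc["text"].
    """
    tokens = query.split()
    pos = 0

    def term_matches(token: str) -> bool:
        name, sep, value = token.partition(":")
        if sep and name in doc and name != "text":
            return value.lower() in str(doc[name]).lower()
        return token.lower() in doc["text"].lower()

    def parse_not() -> bool:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("query ended unexpectedly")
        token = tokens[pos]
        pos += 1
        if token == "NOT":
            return not parse_not()
        return term_matches(token)

    def parse_and() -> bool:
        nonlocal pos
        value = parse_not()
        while pos < len(tokens) and tokens[pos] == "AND":
            pos += 1
            right = parse_not()  # always parse, so the position advances
            value = value and right
        return value

    def parse_or() -> bool:
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "OR":
            pos += 1
            right = parse_and()
            value = value or right
        return value

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return result


# Invented sample documents.
doc_a = {"text": "Marteria announces a concert in Rostock", "author": "A. Example", "year": 2018}
doc_b = {"text": "Rostock harbour expands", "author": "B. Example", "year": 2017}

print(matches("Rostock AND Marteria", doc_a))      # both terms occur in doc_a
print(matches("Rostock AND NOT Marteria", doc_b))  # doc_b mentions only Rostock
print(matches("Marteria OR year:2017", doc_b))     # matched via the year field
```

A real search engine would resolve such a query against its index rather than scanning each document, but the operator logic is the same.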

Machine learning as part of the search

Machine learning methods make it even easier to prepare documents for more efficient searches in the future. With various machine learning methods, documents are no longer simply indexed and cataloged but "understood". At the same time, the search approach changes radically. Where the frequency and occurrence of a search term used to be decisive, the intelligent search engine now independently recognizes the relevant terms of a document and relates them to each other. After a short training phase with sample content, the machine decides on its own what a document is about and how it should be classified. The results of machine learning can then be assigned to the respective document, so that the content is enriched with further information, for example in the form of metadata.
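The training-then-classifying workflow can be illustrated with a deliberately simple text classifier. The sketch below uses a multinomial Naive Bayes model written in plain Python. This technique is chosen here purely for illustration; the article does not say which methods intergator uses, and the class name, topics and training sentences are invented.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text: str) -> list:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


class NaiveBayesTopics:
    """Tiny multinomial Naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> training documents
        self.vocabulary = set()

    def train(self, text: str, topic: str) -> None:
        words = tokens(text)
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocabulary.update(words)

    def classify(self, text: str) -> str:
        total_docs = sum(self.doc_counts.values())
        best_topic, best_score = None, -math.inf
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior of the topic
            for word in tokens(text):
                if word not in self.vocabulary:
                    continue  # ignore words never seen in training
                # Laplace smoothing avoids log(0) for unseen combinations.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


# Short training phase with invented sample content.
model = NaiveBayesTopics()
model.train("The team won the match with a late goal", "sport")
model.train("The striker scored twice in the cup final", "sport")
model.train("The parliament passed the new budget law", "politics")
model.train("The minister announced an election reform", "politics")

# The predicted topic is attached to the document as metadata.
article = "The minister defended the budget in parliament"
metadata = {"topic": model.classify(article)}
print(metadata)
```

In practice the training set would be far larger, but the principle is the same: labeled examples go in first, and afterwards the model assigns a topic that can be stored alongside the document.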

Machine learning methods now also make images searchable. Whereas manual tagging used to be the only way for a search engine to understand the content of an image, a trained machine now recognizes the content without human intervention. Here, too, initial training is required.

Large amounts of data in particular can be categorized, searched and meaningfully linked more easily with machine learning. Publishers with extensive archives and image collections who have so far concentrated solely on digitization, but not on making use of the results, gain an effective research tool in the PPS_Finder. Machine learning methods in the field of cognitive search have progressed rapidly in the last two years, and individual elements are already part of intergator and thus also of the PPS_Finder.

This entry was posted in Corporate Blog and tagged digitization, Enterprise Search Solution, intergator, partners, newspapers.