Deep Learning – the master key for document analyses? (Part 1)

29.11.2016

by Matthias Neidhardt

Business decisions based on complex data analyses: Challenges & opportunities

The steadily advancing digitization of business processes, from personnel administration, product management and production to customer support, makes it possible to put complex decisions on a more sustainable footing by basing them on comprehensive data. However, this presupposes that the relevant data is available at the decision-making level and can be analyzed in the context of the decision at hand.

So far, only a few information aspects could be derived automatically from unstructured data objects, because the technical connections in the data or documents (e-mails, text documents, wikis, etc.) are hidden and therefore inaccessible to classical analysis methods. In companies, approximately 80% of the relevant information is located in unstructured data objects. In contrast to data warehouse/BI methods, this unstructured content must first be extracted from various heterogeneous data sources before it can be analyzed and visualized. Not least because of ever-shorter product cycles, enterprises need cross-department and system-wide analyses to discover interrelations, for example to recognize interactions between content in the helpdesk (ticket system), quality documents, contracts and work instructions. A further complicating factor is that there are too few data analysis experts who could carry out such analyses. Ultimately, it is all about unlocking the knowledge hidden in Dark Data[1].

Automatic information retrieval not always “easy”

The art of modern information management lies in the organic linking of known facts with automatically determined information aspects. The particular challenge lies in establishing an economical and sustainable solution. This requires a balancing act between creating and using a curated information and concept model and having it interact with automatically acquired entities, other concepts and the connections between them. This is relatively easy to manage as long as information aspects are acquired automatically by deterministic rules and without contradictions. If, on the other hand, machine learning methods are used, strategies are needed that compensate for the errors of such methods or make them tolerable. In addition to the quality of the learned aspects, the resource requirements of the applied methods play a decisive role. Today it is not always possible to find an economical solution whose results are suitable for practical use.

Machine learning requires state-of-the-art technology

Complex learning methods require powerful hardware because the corresponding algorithms rely on massively parallel operations. Since the number of cores in today’s CPUs is not yet sufficient for this purpose, graphics cards, which can execute thousands of floating-point operations simultaneously, are increasingly being used.

Why this is essential can be illustrated with a small numerical example. Suppose 100 million documents are to be analyzed in a medium-sized company; this is a typical volume for companies or authorities with several thousand employees. If the analysis takes only 1 second per document, sequential processing would take more than 3 years (100 million seconds are roughly 3.2 years). This example quickly demonstrates that it is essential to use efficient methods and to rely on state-of-the-art technology.
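The arithmetic behind this estimate can be checked with a short Python sketch. The worker count in the second calculation is an illustrative assumption, not a figure from this article, and the calculation assumes that the work can be split perfectly across workers.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def analysis_time_years(num_documents, seconds_per_document, parallel_workers=1):
    """Return the wall-clock analysis time in years.

    Assumes the work splits evenly across workers with no overhead,
    which is an idealization.
    """
    total_seconds = num_documents * seconds_per_document / parallel_workers
    return total_seconds / SECONDS_PER_YEAR


# Figures from the text: 100 million documents at 1 second each, one at a time.
sequential_years = analysis_time_years(100_000_000, 1.0)

# Hypothetical: 1,000 documents processed in parallel (assumed value).
parallel_days = analysis_time_years(100_000_000, 1.0, parallel_workers=1000) * 365

print(f"sequential: {sequential_years:.2f} years")
print(f"1,000 parallel workers: {parallel_days:.1f} days")
```

Under these assumptions the sequential case comes to about 3.17 years, while the idealized parallel case shrinks to little more than a day.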

Semantic analyses with neural networks

Procedures based on special neural networks, often referred to as Deep Learning, have been promising a major breakthrough for the last few years. With Deep Learning, great progress has been made in automatic translation, image analysis and the semantic analysis of texts. Approaches that use neural networks for the semantic analysis of documents have existed for many years. Publications by Google[2] in particular have given this application a significant boost since 2013. Thanks to a clever combination of already well-known procedures, Google was able to provide a toolkit for text analysis (Word2Vec[3]) that can determine semantic relations between words very efficiently on modern hardware.
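To make the idea of semantic relations between words more concrete: toolkits such as Word2Vec represent each word as a vector, and related words end up with vectors that point in similar directions. The following sketch shows only the comparison step, using cosine similarity. The three-dimensional vectors are made-up toy values for illustration, not the output of a trained model; real word vectors typically have a few hundred dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (assumed values, not learned from data).
toy_vectors = {
    "invoice": [0.9, 0.1, 0.2],
    "bill": [0.8, 0.2, 0.3],
    "holiday": [0.1, 0.9, 0.1],
}


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine similarity."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("invoice", toy_vectors))
```

With these toy values, "bill" is the nearest neighbour of "invoice". In a real setting the vectors would come from a model trained on the company's own documents.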

Despite all the euphoria, one should not forget that semantic analyses are diverse and that no single toolkit can cover all purposes.

Document analyses sometimes have completely different objectives, such as

  • Extraction of properties (metadata) for indexing and for the provision of filter criteria for the search
  • Classification of documents based on specific categories
  • Finding semantic relationships between terms, topics and different documents
  • Automatic creation of a company-specific thesaurus
  • Statistics on various properties of document contents
  • Automatic translation

and many more.

Deep Learning for document analysis

Deep Learning is a modern form of semantic analysis. The analysis of digital documents (or, to use the fashionable term, ‘analytics’) has always been an integral part of the process. In addition to examining the document content and its other characteristics, this also includes analyzing how the documents are used (access frequency, etc.). Texts are analyzed using a variety of methods, and meaningful terms (people’s names, e-mail addresses, product names, order numbers, etc.) are extracted. For this purpose, classical statistical procedures and rule-based methods are applied alongside neural networks.
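As an illustration of the rule-based side of term extraction, the following minimal sketch pulls e-mail addresses and order numbers out of a text with regular expressions. The order-number format ("ORD-" followed by six digits) and the sample sentence are assumptions made up for this example, and the e-mail pattern is deliberately simplified.

```python
import re

# Simplified e-mail pattern; real-world address validation is more involved.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical order-number format (assumption): "ORD-" followed by six digits.
ORDER_PATTERN = re.compile(r"\bORD-\d{6}\b")


def extract_terms(text):
    """Return the e-mail addresses and order numbers found in `text`."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "order_numbers": ORDER_PATTERN.findall(text),
    }


sample = "Please send the invoice for ORD-004711 to jane.doe@example.com by Friday."
print(extract_terms(sample))
```

Deterministic rules like these work well for terms with a fixed shape. Terms without one, such as people's names or product names, are where statistical procedures and neural networks would typically come in.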

The results of semantic analyses are stored in a knowledge graph. This graph forms the knowledge base for wizards and all other forms of user guidance. This shows that analytics cannot be reduced to the visualization (charts) of statistical data.
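How extracted terms might end up in a knowledge graph can be sketched with a very small in-memory structure of (subject, relation, object) triples. The class, the relation names and the sample documents are hypothetical and serve only as an illustration; a production system would more likely use a dedicated graph database.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Tiny in-memory graph of (subject, relation, object) triples."""

    def __init__(self):
        # Maps each subject to the set of (relation, object) edges leaving it.
        self._outgoing = defaultdict(set)

    def add(self, subject, relation, obj):
        self._outgoing[subject].add((relation, obj))

    def related(self, subject, relation=None):
        """Return objects linked from `subject`, optionally filtered by relation, sorted."""
        edges = self._outgoing.get(subject, set())
        return sorted(o for r, o in edges if relation is None or r == relation)


# Hypothetical documents and relations (assumed for illustration).
graph = KnowledgeGraph()
graph.add("contract_17.pdf", "mentions_person", "Jane Doe")
graph.add("contract_17.pdf", "mentions_order", "ORD-004711")
graph.add("ticket_882", "mentions_order", "ORD-004711")

print(graph.related("contract_17.pdf"))
print(graph.related("ticket_882", "mentions_order"))
```

Because both the contract and the helpdesk ticket point to the same order number, a guided search could use the graph to surface the connection between the two documents.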


[1] A term coined by the consultancy firm Gartner for unknown and therefore unused information in a company
[2] T. Mikolov, K. Chen, G. Corrado and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013

