What is knowledge extraction?
Expressing the knowledge acquired by learning systems in a linguistic representation is an important issue for user understanding. With this in mind, and to make sure that these systems will be accepted and used, several techniques have been developed by the artificial intelligence community, under both the symbolic and the connectionist approaches.
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodologically similar to information extraction (NLP) and ETL (data warehousing), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema. It requires either the reuse of existing formal knowledge (reusing identifiers or ontologies) or the generation of a schema based on the source data.
Essentially, knowledge extraction is the process of drawing on several sources of data and information in order to build up a cohesive knowledge bank. As part of the process, the extraction will often pull information from a wide range of both structured and unstructured sources. When it is successful, knowledge extraction yields solid data that can easily be read and interpreted by a program, enabling the end user to make use of that formal knowledge for whatever purpose they have in mind.
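To make this concrete, extracted facts are commonly stored as subject–predicate–object triples that reuse existing identifiers. Below is a minimal sketch using the rdflib Python library; the DBpedia URIs and the example fact are only illustrative assumptions, not part of any specific extraction pipeline.

```python
from rdflib import Graph, Namespace

# Illustrative namespaces; reusing public identifiers (here DBpedia)
# is one way to satisfy the "reuse of existing formal knowledge" criterion.
DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")

g = Graph()
# One extracted fact expressed as a machine-interpretable triple:
# "Paris is the capital of France"
g.add((DBR.France, DBO.capital, DBR.Paris))

# Serialize to Turtle so other tools can read it and reason over it.
print(g.serialize(format="turtle"))
```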
Various sources of data can be used in the process of knowledge extraction. On the structured side, knowledge can be extracted from relational databases or from extensible markup language (XML) sources. Unstructured sources such as images, word processing documents, spreadsheets, and plain-text notes can also be used as part of the extraction process.
Basically, if the program managing the knowledge extraction process can read a source of data, that source can be used to expand the scope of the project and contribute to a final body of knowledge that is actually usable.
What are some of the knowledge extraction techniques that you could use?
1. Knowledge graph completion: link prediction
Translating Embeddings for Modelling Multi-relational Data by Bordes et al. (2013) is one of the first dedicated methods for KG completion. It learns embeddings for the entities and the relations in the same low-dimensional vector space. The objective function constrains the embedding of entity e2 to be close to e1 + r, by assigning a higher score to existing triplets than to random triplets obtained through negative sampling. This model is known as TransE, and the work is related to word2vec by Mikolov et al., where relations between concepts naturally take the form of translations in the embedding space.
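Here is a minimal sketch of the TransE idea in PyTorch; the dimensions, margin, and sampling scheme are placeholder choices for illustration, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Score a triplet (head, relation, tail) by -||h + r - t||."""
    def __init__(self, n_entities, n_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h, r, t):
        # Higher score = more plausible triplet (negated translation distance).
        return -torch.norm(self.ent(h) + self.rel(r) - self.ent(t), p=2, dim=-1)

# Margin ranking loss: existing triplets should score higher than
# corrupted ("negative") triplets obtained by replacing the head or tail.
model = TransE(n_entities=1000, n_relations=20)
h, r, t = torch.tensor([0]), torch.tensor([3]), torch.tensor([42])
t_neg = torch.randint(0, 1000, (1,))  # naive negative sampling
margin = 1.0
loss = torch.clamp(margin - model.score(h, r, t) + model.score(h, r, t_neg), min=0).mean()
loss.backward()
```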
2. Triplet extraction from raw text
Triplet extraction can be done in a purely unsupervised way. Usually, the text is first parsed with one or more tools (such as a TreeBank parser, MiniPar, or the OpenNLP parser); the text spans between entities (along with the parser annotations) are then clustered and finally simplified. While attractive at first glance because no supervision is needed, this approach has a few drawbacks.
First, it requires a lot of tedious work to hand-craft rules, which depend on the parser used. Moreover, the clusters found contain semantically related relations, but they do not give us fine-grained implications. Typically, a cluster may contain "is-capital-of" and "is-city-of", which are semantically close relations. However, with the unsupervised approach, we will fail to discover that "is-capital-of" implies "is-city-of" and not the other way around.
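To illustrate the general idea (not the exact pipelines cited above), here is a rough sketch of unsupervised triplet extraction using spaCy as a stand-in parser; the subject–verb–object heuristic is deliberately simplistic and assumes the small English model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: model downloaded beforehand

def extract_triplets(text):
    """Very naive (subject, relation, object) extraction from dependency parses."""
    triplets = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ in ("VERB", "AUX"):
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
                for s in subjects:
                    for o in objects:
                        triplets.append((s.text, token.lemma_, o.text))
    return triplets

print(extract_triplets("Google acquired DeepMind. Paris is the capital of France."))
```

A real system would cluster and simplify the extracted relation phrases afterwards, which is exactly where the drawbacks described above appear.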
3. Schema-based supervised learning
In this case, the available data is a collection of sentences, each annotated with the triplet extracted from it; in other words, raw text aligned with a KG built from that text. Two recent papers (both published in 2016) give cutting-edge solutions to this problem.
End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures by Miwa and Bansal presents an approach that uses two stacked networks: a bidirectional LSTM for entity detection (it creates an embedding of the entities) and a tree-based LSTM for detecting the relation that links the entities found (see the architecture figure in the original paper).
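A heavily simplified sketch of the entity-detection half of that architecture is shown below: a bidirectional LSTM that emits per-token entity tags. The vocabulary size, tag set, and dimensions are assumptions, and the tree-based LSTM relation layer is omitted entirely.

```python
import torch
import torch.nn as nn

class EntityTagger(nn.Module):
    """BiLSTM over word embeddings, followed by a per-token tag classifier (e.g. BIO tags)."""
    def __init__(self, vocab_size, n_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)

    def forward(self, token_ids):
        x = self.emb(token_ids)   # (batch, seq, emb_dim)
        h, _ = self.lstm(x)       # (batch, seq, 2 * hidden)
        return self.out(h)        # per-token tag logits

# Toy usage: one 6-token sentence scored against 5 entity tags.
model = EntityTagger(vocab_size=10000, n_tags=5)
logits = model(torch.randint(0, 10000, (1, 6)))
print(logits.shape)  # torch.Size([1, 6, 5])
```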
What are the applications and uses of knowledge extraction?
Knowledge extraction has various applications. One of the most common is capturing data from an unstructured source and incorporating it into some type of structured knowledge source.
Another example of how knowledge extraction can speed up the sharing of formal knowledge, without manually re-entering data that is already available elsewhere, is using it to pull data out of relational databases in order to create new documents, or conversely to import data from electronic documents into relational databases. This reuse of existing knowledge in a new format tends to be helpful in a wide range of scenarios, making it possible to employ that knowledge in ways that may not have been possible with the original source. The user can thus create sources suited to a range of different applications, rather than just those relevant to the original home of the formal knowledge.
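As a trivial illustration of that reuse pattern, the sketch below reads rows from a relational database and re-emits them as a JSON document; the database file, table, and column names are hypothetical.

```python
import json
import sqlite3

# Hypothetical database and schema, purely for illustration.
conn = sqlite3.connect("company.db")
rows = conn.execute("SELECT name, role, department FROM employees").fetchall()

# Re-expose the relational data as a document other tools can consume.
document = [
    {"name": name, "role": role, "department": dept}
    for name, role, dept in rows
]
with open("employees.json", "w") as f:
    json.dump(document, f, indent=2)
```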
It is also possible to apply knowledge extraction to a large data warehouse, importing and exporting data in a fairly straightforward manner to create a new source of data that serves a particular purpose.