
Relation Extraction in Textual Information

These days we find a large amount of textual data on the web, partly encouraged by the proliferation of social networks and other web pages. This textual information is a “goldmine”, because many applications can be built on the knowledge it provides. For example, it is possible to analyze the sentiment and/or opinion of customers; this kind of information can be used to improve the accuracy of customer targeting, or to predict and prevent customer churn more accurately. Another kind of application is the Semantic Web (SW, Web 3.0). The core idea behind the SW is to link information according to relations, so that computers will be able to “understand the information by themselves”. In the present paper, we first present an overview of the architecture of a generic Information Extraction (IE) system, followed by a short description of each component of this architecture, and we end by presenting the principle of a Relation Extraction system.

Textual Information Extraction Architecture

A generic IE system consists of four main components (Figure 1) [1]:

- Raw Text Processing (Sentence Segmentation and Tokenization)

- Part Of Speech Tagging (POST)

- Named Entity Recognition (NER)

- Relation Extraction (RE)


Figure 1. A generic IE system architecture

Raw Text Processing

During Raw Text Processing (RTP), the textual corpus is divided into sentences, which are themselves split into single words (tokens).
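As an illustration, here is a minimal sketch of this step using NLTK (the toolkit behind [1]); it assumes the "punkt" tokenizer models have been downloaded once with nltk.download('punkt'), and the input text is an invented example.

```python
# Raw Text Processing: sentence segmentation followed by tokenization.
# Assumes: nltk.download('punkt') has been run once.
import nltk

text = "The R&D Department of DYNADMIC is located in LIMOGES. It was created recently."
sentences = nltk.sent_tokenize(text)                 # corpus -> sentences
tokens = [nltk.word_tokenize(s) for s in sentences]  # sentences -> single words
print(tokens)
```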

Part Of Speech Tagging (POST)

The second block is Part Of Speech Tagging (POST), which grammatically labels the words issued from the RTP phase (Figure 2). This helps resolve ambiguities in the meaning of words.

This part of the system assumes that the grammatical tags of the words in any sentence belong to eight different classes. These classes, whose tradition goes back to the Greek linguist and grammarian Dionysius Thrax, are: noun, verb, adjective, adverb, preposition, pronoun, conjunction, and interjection.

The difficulty of the POST task is that one word can have several different tags depending on its context.

The general problem to solve in POST can be formalized as follows. Consider a sentence $W = w_1 w_2 \ldots w_n$; the objective is to find the sequence of tags $T = t_1 t_2 \ldots t_n$ of these $n$ words.

To summarize, this leads to the following optimization problem:

$$\hat{T} = \underset{T}{\mathrm{arg\,max}}\ P(T \mid W)$$
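As a toy illustration of this argmax, the sketch below decodes the most likely tag sequence under a first-order HMM factorization, $P(T \mid W) \propto \prod_i P(t_i \mid t_{i-1})\, P(w_i \mid t_i)$; the tag set, vocabulary and probabilities are invented values for the example, not estimates from a real corpus.

```python
# Toy Viterbi decoder for T* = argmax_T P(T | W) under an HMM factorization.
# Tags, vocabulary and probabilities are invented for the example.
TAGS = ["Noun", "Verb"]
trans = {("<s>", "Noun"): 0.7, ("<s>", "Verb"): 0.3,
         ("Noun", "Noun"): 0.3, ("Noun", "Verb"): 0.7,
         ("Verb", "Noun"): 0.6, ("Verb", "Verb"): 0.4}   # P(t_i | t_{i-1})
emit = {("Noun", "dogs"): 0.6, ("Verb", "dogs"): 0.1,
        ("Noun", "lead"): 0.4, ("Verb", "lead"): 0.3}    # P(w_i | t_i)

def viterbi(words):
    # best[t] = (probability, tag sequence) of the best path ending in tag t
    best = {t: (trans[("<s>", t)] * emit.get((t, words[0]), 1e-6), [t]) for t in TAGS}
    for w in words[1:]:
        best = {t: max(((p * trans[(prev, t)] * emit.get((t, w), 1e-6), seq + [t])
                        for prev, (p, seq) in best.items()),
                       key=lambda x: x[0])
                for t in TAGS}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["dogs", "lead"]))   # -> ['Noun', 'Verb']
```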
Following early manual POST systems, several automatic POST systems have been introduced:

- Supervised POST: systems based on algorithms such as Hidden Markov Models (HMMs) [2], Maximum Entropy Markov Models (MEMMs) [3], or Conditional Random Fields (CRFs) [4] (an off-the-shelf tagger is sketched after this list)

- Unsupervised POST: for example, systems that use the Baum-Welch algorithm, well known in Automatic Speech Recognition (ASR) systems

- Transformation-Based Learning (TBL), also known as the Brill tagger [5][6]
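As a quick illustration of an off-the-shelf supervised tagger, the following sketch uses NLTK's pretrained tagger; it assumes the "punkt" and "averaged_perceptron_tagger" resources have been downloaded, and it outputs Penn Treebank tags, a finer-grained tag set than the eight classes above.

```python
# POST with NLTK's pretrained perceptron tagger.
# Assumes: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger').
import nltk

tokens = nltk.word_tokenize("The R&D Department of DYNADMIC is located in LIMOGES")
print(nltk.pos_tag(tokens))   # list of (word, Penn Treebank tag) pairs
```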

In addition to IE, POST can be used in other areas, such as Text To Speech (TTS). It can help a TTS system resolve the pronunciation of a word (for example, “lead” is pronounced differently according to its grammatical nature).


Figure 2. POST system

Named Entity Recognition (NER)

The next block in the IE architecture is Named Entity Recognition. Entity Extraction is the process which aims to automatically extract from any text some key entities such as person names, locations, dates, percentages, etc. An intermediate step, prior to NER, is Noun Phrase Chunking (NP-Chunking). This step uses the POST information to split sentences into non-overlapping groups of words. The NP-chunker can be built by defining a “grammar” and using a regular expression parser (see the example in Table 1 and the sketch below).
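A minimal NP-chunker along these lines can be sketched with NLTK's regular expression parser; the grammar below (an optional determiner, any adjectives, then nouns) is a simplistic assumption, not the only possible one.

```python
# NP-chunking with a hand-written grammar and a regular expression parser.
# Assumes the RTP and POST steps above (punkt + averaged_perceptron_tagger data).
import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # det? + adjectives* + nouns+
chunker = nltk.RegexpParser(grammar)

tagged = nltk.pos_tag(nltk.word_tokenize(
    "The R&D Department of DYNADMIC is located in LIMOGES"))
print(chunker.parse(tagged))          # a tree whose NP subtrees are the chunks
```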

We can observe that a named entity is an NP referring to a specific type of individual. One can distinguish two types of NER approaches: symbolic and statistical.

The symbolic NER, also known as the linguistic approach, is based on linguistic rules established manually by expert linguists.

The statistical approach is based on Machine Learning systems. Its goal is to determine the sequence of labels maximizing $P(L \mid W)$, where $L$ is the sequence of named-entity labels and $W$ the sequence of words. Like POST, statistical NER can be supervised, unsupervised or hybrid (semi-supervised).
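For instance, NLTK ships a pretrained statistical NER chunker; the sketch below assumes the "maxent_ne_chunker" and "words" resources have been downloaded (note that NLTK labels locations GPE rather than LOCATION).

```python
# Statistical NER with NLTK's pretrained chunker.
# Assumes: nltk.download('maxent_ne_chunker') and nltk.download('words'),
# in addition to the punkt and tagger data used above.
import nltk

tagged = nltk.pos_tag(nltk.word_tokenize(
    "The R&D Department of DYNADMIC is located in LIMOGES"))
print(nltk.ne_chunk(tagged))   # subtrees labelled ORGANIZATION, GPE, PERSON, ...
```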

| Sentence | POST | NP Chunker | NER |
|---|---|---|---|
| The | Determiner | NP 1 | ORGANIZATION |
| R&D | Noun | NP 1 | ORGANIZATION |
| Department | Noun | NP 1 | ORGANIZATION |
| of | Preposition | NP 1 | ORGANIZATION |
| DYNADMIC | Noun | NP 1 | ORGANIZATION |
| is located | Verb | | |
| in | Adverb | | |
| LIMOGES | Noun | NP 2 | LOCATION |

Table 1. POST, NP-Chunker and NER processes on the sentence “The R&D Department of DYNADMIC is located in LIMOGES” → 2 Named Entities (ORGANIZATION and LOCATION)

Relation Extraction (RE)

The final objective in IE is Relation Extraction (RE). After an NER process, the goal of RE is to establish the relations between the recognized Named Entities. An $n$-ary relation $R$ between $n$ entities can be defined by the tuple $R(e_1, e_2, \ldots, e_n)$, where the $e_i$ are predefined Named Entities. For example, the sentence in Table 1 expresses a binary “is located in” relation between its ORGANIZATION and LOCATION entities.

The objectives of RE are on the one hand to create new structured knowledge bases and on the other hand to enrich the existing knowledge bases.

One of the most popular programs in this area is ACE (Automatic Content Extraction) [7]. This program was developed to extract semantics from multimedia sources (text, audio and image). In ACE, there are 17 relations classified into 6 main categories: person-social, physical, general affiliation, part-whole, organization affiliation, and agent-artifact.

Another type of relation database is DBpedia [8], which uses information from Wikipedia to build structured information available on the web. There are two main types of RE: rule-based (hand-written pattern RE) and learning-based systems. The latter can be subclassified into three categories: supervised, semi-supervised and unsupervised RE.

Hand-written pattern RE

One of the earliest hand-written pattern REs, “Automatic Acquisition of Hyponyms”, was introduced by Hearst [9]. The templates defined by Hearst, where X is a hyponym of Y, are as follows:

“X and other Y”, “X or other Y”, “Y such as X”, “such Y as X”, “Y including X”, and “Y, especially X”.
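As a toy illustration, the sketch below implements only the “Y such as X” template with a plain regular expression over raw text; real pattern-based REs match over POS-tagged and chunked text, and the sentence used here is an invented example.

```python
# Toy hand-written pattern RE: the Hearst template "Y such as X".
import re

pattern = re.compile(r"(\w+(?: \w+)?) such as "
                     r"(\w+(?:, \w+)*(?:,? (?:and|or) \w+)?)")

text = "He studies European capitals such as Paris, Madrid and Lisbon."
m = pattern.search(text)
if m:
    hypernym = m.group(1)                                      # the Y term
    hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", m.group(2))  # the X terms
    print(hypernym, "->", hyponyms)  # European capitals -> ['Paris', 'Madrid', 'Lisbon']
```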

Since hand-written pattern REs are human-made systems, they achieve high accuracy and can be adapted to specific domains. However, building all the rules manually is time consuming, which is why automatic REs have been designed.

Supervised RE

In Supervised RE, the set of relations to model and the relevant named entities must first be chosen. Then, the data of the corpus must be labelled (NER followed by manual labeling of the relations between the Named Entities). To describe Supervised RE, we will focus on binary relations. Consider a sentence $S$ containing the entities $e_1, e_2, \ldots, e_n$. First, all the pairs of entities $(e_i, e_j)$ must be determined; then a classifier (an SVM or Naïve Bayes, for example) is used to determine whether $e_i$ is in relation with $e_j$.

Let us denote by $R$ the relation to extract; the classifier $f_R$ modelling the RE is defined as follows:

$$f_R\big(\phi(S, e_i, e_j)\big) = \begin{cases} 1 & \text{if } e_i \text{ and } e_j \text{ are in relation } R \\ 0 & \text{otherwise} \end{cases}$$

where $\phi$ is a feature extractor function.

The different types of features that can be used are words, POST tags, or Named Entities, according to the application.
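To make this concrete, here is a toy sketch of the classifier $f_R$ with a simple feature extractor $\phi$ (the entities plus the words between them), using scikit-learn's SVM; both the feature design and the two labelled pairs are hypothetical placeholders, not a real training set.

```python
# Toy supervised binary RE: f_R(phi(S, e1, e2)) -> 1 if the relation holds.
# Feature design and training pairs are hypothetical placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def phi(tokens, e1, e2):
    """Feature extractor: the two entities plus the words between them."""
    i, j = tokens.index(e1), tokens.index(e2)
    feats = {"between_" + w: 1 for w in tokens[min(i, j) + 1:max(i, j)]}
    feats["e1"], feats["e2"] = e1, e2      # categorical features
    return feats

train = [
    (phi("DYNADMIC is located in LIMOGES".split(), "DYNADMIC", "LIMOGES"), 1),
    (phi("DYNADMIC recently hired Alice".split(), "DYNADMIC", "Alice"), 0),
]
X, y = zip(*train)

f_R = make_pipeline(DictVectorizer(), LinearSVC())   # vectorize, then classify
f_R.fit(X, y)
print(f_R.predict([phi("ACME is located in PARIS".split(), "ACME", "PARIS")]))
```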

Nevertheless, the drawback of Supervised RE systems is that they require enough training data to reach high accuracy. Consequently, to overcome this, unsupervised and semi-supervised techniques have been introduced.

Semi-supervised and Unsupervised RE

When little training data is available, the principle of semi-supervised RE is to iteratively use the output obtained from the initial training data as additional input data to train the model. Hence, at each iteration, the outputs are used as new training data for the next iteration. An example is DIPRE (Dual Iterative Pattern Relation Expansion), introduced by Brin in 1998 [10], where the target relation is (author, book title).
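A toy sketch of the DIPRE idea is given below: seed (title, author) pairs induce string patterns from their occurrences, and those patterns extract new pairs that feed the next iteration. The two-document corpus and the naive pattern generalization are invented simplifications of Brin's method.

```python
# Toy DIPRE-style bootstrapping over an invented two-document corpus.
import re

corpus = [
    "The Art of War, by Sun Tzu, is a classic.",
    "Dune, by Frank Herbert, is a classic.",
]
seeds = {("The Art of War", "Sun Tzu")}      # seed (title, author) pairs

for _ in range(2):                           # a couple of bootstrap iterations
    # 1) induce patterns from occurrences of the current seed pairs
    patterns = set()
    for title, author in seeds:
        for doc in corpus:
            if title in doc and author in doc:
                pat = re.escape(doc)
                pat = pat.replace(re.escape(title), "(.+?)", 1)
                pat = pat.replace(re.escape(author), "(.+?)", 1)
                patterns.add(pat)
    # 2) apply the patterns to the corpus to harvest new pairs
    for pat in patterns:
        for doc in corpus:
            m = re.fullmatch(pat, doc)
            if m:
                seeds.add((m.group(1), m.group(2)))

print(seeds)   # now also contains ('Dune', 'Frank Herbert')
```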

In an unsupervised RE system, there are no training data. Moreover, no set of relations has to be predefined, because the system discovers the relations and entities on its own. Such systems are called Open Information Extraction (OIE) systems. One implementation of an OIE system is TextRunner [11]. Like OIEs in general, TextRunner has three main parts: the learner, the extractor and the assessor.

The learner is a self-supervised system. First, it parses a small text corpus (using a syntactic parser), identifies NPs, and automatically labels the different pairs of NPs as positive (there is a relationship between the elements) or negative (there is no relationship between them). Once this small set of training data is labelled, it is used to train the extractor. The extractor then makes a single pass over the entire corpus to extract the relations. Finally, the assessor assigns a score to each extracted relation based on redundancy.

References

[1] NLTK Book, Chapter 7: “Extracting Information from Text.” http://www.nltk.org/book/ch07.html

[2] L. R. Welch. “Hidden Markov models and the Baum-Welch algorithm.” IEEE Information Theory Society Newsletter, vol. 53, no. 4, December 2003.

[3] A. McCallum, D. Freitag, and F. Pereira. “Maximum entropy Markov models for information extraction and segmentation.” In Proceedings of ICML, 2000.

[4] J. Lafferty, A. McCallum, and F. Pereira. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.” In Proceedings of ICML, 2001.

[5] E. Brill. “Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging.” Computational Linguistics, vol. 21, no. 4, pp. 543-565, 1995.

[6] http://en.wikipedia.org/wiki/Brill_tagger/

[7] G. Doddington, A. Mitchell, M. Przybocki, L. Ramshaw, S. Strassel, and R. Weischedel. “The automatic content extraction (ACE) program: Tasks, data, and evaluation.” In Proceedings of LREC, 2004.

[8] http://wiki.dbpedia.org/

[9] M. A. Hearst. “Automatic acquisition of hyponyms from large text corpora.” In Proceedings of the 14th International Conference on Computational Linguistics (COLING), Nantes, France, 1992.

[10] S. Brin. “Extracting patterns and relations from the World Wide Web.” In WebDB Workshop at the 6th International Conference on Extending Database Technology (EDBT), 1998.

[11] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. “Open information extraction from the web.” In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, 2007.
