Text (information) retrieval deals with the problem of how to find relevant (useful) documents for any given query from a collection of text documents. Documents are typically preprocessed and represented in a format that facilitates efficient and accurate retrieval. In this section, we provide a brief overview of some basic concepts in classical text retrieval.
The contents of a document may be represented by the words it contains. Some words, such as "a", "of", and "is", carry little semantic information. These words are called stop words and are usually not used for document representation. The remaining words are content words and can be used to represent the document. Variations of the same word may be mapped to the same term. For example, the words "beauty", "beautiful", and "beautify" can be denoted by the term "beaut". This can be achieved by a stemming program, which removes suffixes or replaces them with other characters. After stop-word removal and stemming, each document can be logically represented by a vector of n terms, where n is the total number of distinct terms in the set of all documents in a document collection.
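The preprocessing steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list and the suffix rules in `SUFFIXES` are toy examples invented here (production systems typically use a full stemmer such as Porter's), and raw term counts stand in as placeholder weights.

```python
import re

# Illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "is", "and", "in", "to"}

# Toy suffix rules (assumption); longer suffixes are listed before
# the shorter suffixes they end with, e.g. "ify" before "y".
SUFFIXES = ("iful", "ify", "ing", "ies", "y", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_terms(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_vocabulary(documents):
    """Collect the n distinct terms of the whole collection, in sorted order."""
    return sorted({t for doc in documents for t in to_terms(doc)})


def to_vector(text, vocabulary):
    """Represent one document as a vector with one entry per vocabulary term."""
    terms = to_terms(text)
    return [terms.count(t) for t in vocabulary]  # raw counts as placeholder weights


docs = [
    "The beauty of a garden is in the flowers",
    "Flowers beautify a garden",
    "Retrieval of text documents",
]
vocabulary = build_vocabulary(docs)
print(vocabulary)   # ['beaut', 'document', 'flower', 'garden', 'retrieval', 'text']
print(to_vector(docs[0], vocabulary))  # [1, 0, 1, 1, 0, 0]
```

With these toy rules, "beauty", "beautiful", and "beautify" all map to the term "beaut", and each document vector has one entry per distinct term in the collection.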
Suppose the document d is represented by the vector (d_1, ..., d_i, ..., d_n), where d_i is a number (weight) indicating the importance of the i-th term in representing the contents of the document d. Most of the entries in the vector will be zero because most terms do not appear in any given document. When a term is present in a document, the weight assigned to the term is usually based
on two factors, namely the term frequency (tf) factor and the document frequency (df) factor. The term frequency of a term in a document is the number of times the term appears in the document. Intuitively, the higher the term frequency of a term, the more important the term is in representing the contents of the document. Consequently, the term frequency weight (tfw) of a term in a document is usually a monotonically increasing function of its term frequency. The document frequency of a term is the number of documents in the entire document collection that contain the term. Usually, the higher the document frequency of a term, the less important the term is in differentiating documents that contain the term from documents that do not. Thus, the weight of a term with respect to document frequency is usually a monotonically decreasing function of its document frequency and is called the inverse document frequency weight (idfw).
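As an illustration of the two factors, the following Python sketch computes a weight for each term in each document. The specific formulas are assumptions chosen for the example, not ones prescribed by the text: tfw = 1 + ln(tf) is one common monotonically increasing choice, idfw = ln(N / df) is one common monotonically decreasing choice (N being the number of documents), and multiplying the two is one common way to combine them. Documents are assumed to be already reduced to lists of terms.

```python
import math
from collections import Counter


def term_frequency_weight(tf):
    """Monotonically increasing in tf; 1 + ln(tf) is one common choice."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def inverse_document_frequency_weight(df, num_docs):
    """Monotonically decreasing in df; ln(N / df) is one common choice."""
    return math.log(num_docs / df)


def weight_vectors(term_lists):
    """Return the vocabulary and one weighted vector per document."""
    num_docs = len(term_lists)
    vocabulary = sorted({t for terms in term_lists for t in terms})
    # Document frequency: count each term at most once per document.
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)  # missing terms have frequency 0, hence weight 0
        vectors.append([
            term_frequency_weight(tf[t])
            * inverse_document_frequency_weight(df[t], num_docs)
            for t in vocabulary
        ])
    return vocabulary, vectors


# Hypothetical collection of three already-stemmed documents.
term_lists = [
    ["beaut", "garden", "flower", "flower"],
    ["flower", "garden"],
    ["text", "document"],
]
vocabulary, vectors = weight_vectors(term_lists)
for term, weight in zip(vocabulary, vectors[0]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "flower" outweighs "garden" because it occurs twice rather than once, while "beaut" outweighs both because it appears in only one document of the collection.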