McGraw-Hill OnlineMcGraw-Hill Higher EducationLearning Center
Student Center | Instructor's Center | Introduction to Algorithms | Home
Glossary A-B
Glossary B-D
Glossary D-F
Glossary F-J
Glossary J-M
Glossary M-O
Glossary O-P
Glossary P-S
Glossary S-T
Glossary T-Z
Chapter Overview
Glossary
PowerPoint Slides
Algorithm Pseudocode
Feedback
Help Center


small text cover image
Introduction to Algorithms, 2/e
Thomas H. Cormen, Dartmouth College
Charles E. Leiserson, Massachusetts Institute of Technology
Ronald L. Rivest, Massachusetts Institute of Technology
Clifford Stein, Columbia University

String Matching

Chapter Overview - Chapter 32

Finding all occurrences of a pattern in a text is a problem that arises frequently in text-editing programs. Typically, the text is a document being edited, and the pattern searched for is a particular word supplied by the user. Efficient algorithms for this problem can greatly aid the responsiveness of the text-editing program. String-matching algorithms are also used, for example, to search for particular patterns in DNA sequences.

We formalize the string-matching problem as follows. We assume that the text is an array T[1..n] of length n and that the pattern is an array P[1..m] of length mn. We further assume that the elements of P and T are characters drawn from a finite alphabet Σ. For example, we may have Σ = {0, 1} or Σ = {a, b, …, z}. The character arrays P and T are often called strings of characters.

We say that pattern P occurs with shift s in text T (or, equivalently, that pattern P occurs beginning at position s+ 1 in text T) if 0 ≤sn - m and T[s + 1..s + m] = P[1..m] (that is, if T[s + j] = P[j], for 1 ≤jm). If P occurs with shift s in T, then we call s a valid shift; otherwise, we call s an invalid shift. The string-matching problem is the problem of finding all valid shifts with which a given pattern P occurs in a given text T. Figure 32.1 illustrates these definitions.

Except for the naive brute-force algorithm, which we review in Section 32.1, each string-matching algorithm in this chapter performs some preprocessing based on the pattern and then finds all valid shifts; we will call this latter phase "matching." Figure 32.2 shows the preprocessing and matching times for each of the algorithms in this chapter. The total running time of each algorithm is the sum of the preprocessing and matching times. Section 32.2 presents an interesting string-matching algorithm, due to Rabin and Karp. Although the Θ((n - m + 1)m) worst-case running time of this algorithm is no better than that of the naive method, it works much better on average and in practice. It also generalizes nicely to other pattern-matching problems. Section 32.3 then describes a string-matching algorithm that begins by constructing a finite automaton specifically designed to search for occurrences of the given pattern P in a text. This algorithm takes O(m|Σ|) preprocessing time but only Θ(n) matching time. The similar but much cleverer Knuth-Morris-Pratt (or KMP) algorithm is presented in Section 32.4; the KMP algorithm has the same Θ(n) matching time, and it reduces the preprocessing time to only Θ(m).

Notation and terminology

We shall let Σ* (read "sigma-star") denote the set of all finite-length strings formed using characters from the alphabet Σ. In this chapter, we consider only finite-length strings. The zero-length empty string, denoted ε, also belongs to Σ*. The length of a string x is denoted |x|. The concatenation of two strings x and y, denoted xy, has length |x| + |y| and consists of the characters from x followed by the characters from y.

We say that a string w is a prefix of a string x, denoted w [ x, if x = wy for some string y ∈ Σ*. Note that if w [ x, then |w| ≤ |x|. Similarly, we say that a string w is a suffix of a string x, denoted w ] x, if x = yw for some y ∈ Σ*. It follows from w ] x that |w| ≤ |x|. The empty string ε is both a suffix and a prefix of every string. For example, we have ab [ abcca and cca ] abcca. It is useful to note that for any strings x and y and any character a, we have x ] y if and only if xa ] ya. Also note that [ and ] are transitive relations. The following lemma will be useful later.

Lemma 32.1 (Overlapping-suffix lemma)
Suppose that x, y, and z are strings such that x ] z and y ] z. If |x| ≤ |y|, then x ] y. If |x| ≥ |y|, then y ] x. If |x| = |y|, then x = y.

Proof See Figure 32.3 for a graphical proof.

For brevity of notation, we shall denote the k-character prefix P[1..k] of the pattern P[1..m] by Pk. Thus, P0 = ε and Pm = P = P[1..m]. Similarly, we denote the k-character prefix of the text T as Tk. Using this notation, we can state the string-matching problem as that of finding all shifts s in the range 0 ≤ sn - m such that P ] Ts+m.

In our pseudocode, we allow two equal-length strings to be compared for equality as a primitive operation. If the strings are compared from left to right and the comparison stops when a mismatch is discovered, we assume that the time taken by such a test is a linear function of the number of matching characters discovered. To be precise, the test "x = y" is assumed to take time Θ(t + 1), where t is the length of the longest string z such that z [ x and z [ y. (We write Θ(t + 1) rather than Θ(t) to handle the case in which t = 0; the first characters compared do not match, but it takes a positive amount of time to perform this comparison.)