Lookahead Assertions in Regular Expressions

Lookahead Assertions

There are two types of lookaround assertions for regular expressions: lookahead and lookbehind. In both cases, the assertion is a condition that must be satisfied for the expression to return a match.

A lookahead assertion has the form (?=test) and can appear anywhere in a regular expression. MATLAB® looks ahead of the current location in the string for the test condition. If MATLAB matches the test condition, it continues processing the rest of the expression to find a match.
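
As a small illustration of this form, the following sketch asks regexp for the match together with its start and end indices. The sample text and variable names are made up for this example. The point is only that the characters tested by (?=USD) are not part of the returned match.

```matlab
% Hypothetical sample text, chosen only for illustration.
txt = 'price: 100USD, 250EUR';

% Match digits only when they are immediately followed by 'USD'.
% Outputs are returned in the order the options are listed.
[m, s, e] = regexp(txt, '\d+(?=USD)', 'match', 'start', 'end');

disp(m)        % the digits that precede 'USD'
disp([s e])    % the indices span the digits only, not 'USD'
```

The digits before EUR are skipped because the lookahead test fails at that location.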

For example, look ahead in a path string to find the name of the folder that contains a program file (in this case, fileread.m).

str = which('fileread')
str =
   matlabroot\toolbox\matlab\iofun\fileread.m
regexp(str,'\w+(?=\\\w+\.[mp])','match')
ans = 
    'iofun'

The match expression, \w+, searches for one or more alphanumeric or underscore characters. Each time regexp finds a string that matches this condition, it looks ahead for a backslash (specified with two backslashes, \\), followed by a file name (\w+) with an .m or .p extension (\.[mp]). The regexp function returns the match that satisfies the lookahead condition, which is the folder name iofun.
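
The expression above assumes Windows-style backslash separators. A minimal sketch of one way to accept either separator is to replace \\ with the character class [\\/] inside the lookahead. The two paths below are hypothetical strings used only for illustration, not real installation paths.

```matlab
% Hypothetical example paths (assumptions for illustration only).
winPath  = 'C:\apps\toolbox\iofun\fileread.m';
unixPath = '/opt/apps/toolbox/iofun/fileread.m';

% [\\/] matches either a backslash or a forward slash, so the
% lookahead accepts both path styles.
expr = '\w+(?=[\\/]\w+\.[mp])';

winFolder  = regexp(winPath,  expr, 'match');
unixFolder = regexp(unixPath, expr, 'match');

disp(winFolder)
disp(unixFolder)
```

Both calls should return the folder name that directly contains the program file.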

Overlapping Matches

Lookahead assertions do not consume any characters in the string. As a result, you can use them to find overlapping character sequences.

For example, use lookahead to find every sequence of six nonwhitespace characters in a string by matching each nonwhitespace character that is immediately followed by five more nonwhitespace characters:

string = 'Locate several 6-char. phrases';
startIndex = regexpi(string,'\S(?=\S{5})')
startIndex =
     1     8     9    16    17    24    25

The starting indices correspond to these phrases:

Locate   severa   everal   6-char   -char.   phrase   hrases
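
Because the lookahead match itself is only one character long, the six-character phrases have to be rebuilt from the starting indices. One way to do that, sketched here with illustrative variable names (txt, phrases), is to index six characters from each start position:

```matlab
txt = 'Locate several 6-char. phrases';

% Start index of every nonwhitespace character that is followed
% by five more nonwhitespace characters.
startIndex = regexp(txt, '\S(?=\S{5})');

% Extract the six characters that begin at each start index.
phrases = arrayfun(@(k) txt(k:k+5), startIndex, 'UniformOutput', false);

disp(phrases)
```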

Without the lookahead operator, MATLAB parses a string from left to right, consuming the string as it goes. If matching characters are found, regexp records the location and resumes parsing from the character that follows the end of the most recent match. There is no overlapping of characters in this process.

string = 'Locate several 6-char. phrases';
startIndex = regexpi(string,'\S{6}')
startIndex =
     1     8    16    24

The starting indices correspond to these phrases:

Locate   severa   6-char   phrase

Logical AND Conditions

Another way to use a lookahead operation is to perform a logical AND between two conditions. This example initially attempts to locate all lowercase consonants in a text string. The text string is the first 50 characters of the help for the normest function:

helptext = help('normest');
str = helptext(1:50)
str =
 NORMEST Estimate the matrix 2-norm.
    NORMEST(S

Merely searching for non-vowels ([^aeiou]) does not return the expected answer. The class [^aeiou] matches any character other than a lowercase vowel, so the output includes capital letters, space characters, and punctuation:

c = regexp(str,'[^aeiou]','match')
c = 
  Columns 1 through 14

    ' '    'N'    'O'    'R'    'M'    'E'    'S'    'T'    ' '    
        'E'    's'    't'    'm'    't'
  ...

Try this again, using a lookahead operator to create the following AND condition:

(lowercase letter) AND (not a vowel)

This time, the result is correct:

c = regexp(str,'(?=[a-z])[^aeiou]','match')
c = 
  's'  't'  'm'  't'  't'  'h'  'm'  't'  'r'  'x'
     'n'  'r'  'm'

When you use a lookahead operator to perform an AND, place the match expression expr after the test expression test, so that both are evaluated at the same position in the string:

(?=test)expr or (?!test)expr
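
The (?!test)expr form expresses an AND NOT condition. As a sketch, the same lowercase-consonant search can be written as (lowercase letter) AND NOT (vowel). The input below is a fixed string taken from the first line of the help text shown earlier, used here so that the result does not depend on the installed help content.

```matlab
% Fixed sample text (the first line of the help text shown earlier).
txt = 'Estimate the matrix 2-norm.';

% Positive lookahead: (lowercase letter) AND (not a vowel).
c1 = regexp(txt, '(?=[a-z])[^aeiou]', 'match');

% Negative lookahead: (not a vowel) AND (lowercase letter).
c2 = regexp(txt, '(?![aeiou])[a-z]', 'match');

disp(isequal(c1, c2))   % both forms select the same characters
disp(c1)
```

In both forms the test comes first and the match expression follows it.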
