Regular expressions in Java, Part 2: The Regex API

Simplify common coding tasks with the Regex API

The first half of this tutorial introduced you to regular expressions and the Regex API. You learned about the Pattern class, then worked through examples demonstrating regex constructs, from basic pattern matching with literal strings to more complex matches using ranges, boundary matchers, and quantifiers.

In Part 2 we’ll pick up where we left off, exploring methods associated with the Pattern, Matcher, and PatternSyntaxException classes. You’ll also be introduced to two tools that use regular expressions to simplify common coding tasks. The first extracts comments from code for documentation purposes. The second is a reusable library for performing lexical analysis, which is an essential component of assemblers, compilers, and similar software.

Download the source code for example applications in this tutorial. Created by Jeff Friesen for JavaWorld

Explore the Regex API

Pattern, Matcher, and PatternSyntaxException are the three classes that comprise the Regex API. Each class offers methods that you can use to integrate regexes into your code.

Pattern methods

An instance of the Pattern class describes a compiled regex, also known as a pattern. Regexes are compiled to increase performance during pattern-matching operations. The following static methods support compilation.

Pattern compile(String regex) compiles regex’s contents into an intermediate representation stored in a new Pattern object. This method either returns the object’s reference upon success, or throws PatternSyntaxException if it detects invalid syntax in the regex. Any Matcher object used by or returned from this Pattern object adheres to various default settings, such as case-sensitive searching. As an example, Pattern p = Pattern.compile("(?m)^\\."); creates a Pattern object that stores a compiled representation of the regex for matching all lines starting with a period character.
Pattern compile(String regex, int flags) accomplishes the same task as Pattern compile(String regex), but is able to account for flags: a bitwise-inclusive ORed set of flag constant bit values. Pattern declares CANON_EQ, CASE_INSENSITIVE, COMMENTS, DOTALL, LITERAL, MULTILINE, UNICODE_CASE, UNICODE_CHARACTER_CLASS, and UNIX_LINES constants that can be bitwise ORed together (e.g., CASE_INSENSITIVE | DOTALL) and passed to flags. Except for CANON_EQ and LITERAL, these constants are an alternative to embedded flag expressions, which were demonstrated in Part 1. The Pattern compile(String regex, int flags) method throws java.lang.IllegalArgumentException when it detects a flag bit other than those defined by Pattern’s constants. For example, Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE); is equivalent to the previous example, where the Pattern.MULTILINE constant and the (?m) embedded flag expression accomplish the same task.
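
As a quick check of that equivalence, the following sketch counts the lines that start with a period using both forms, and then combines two flag constants with the bitwise OR operator. The CompileFlagsDemo class, its countMatches() helper, and the sample text are assumptions made for illustration; they aren’t part of this tutorial’s source download.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class CompileFlagsDemo
{
   // Count how many times the compiled pattern matches in the text.
   static int countMatches(Pattern p, String text)
   {
      Matcher m = p.matcher(text);
      int count = 0;
      while (m.find())
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String text = ".first\nsecond\n.third";

      // Embedded flag expression versus flag constant.
      Pattern embedded = Pattern.compile("(?m)^\\.");
      Pattern constant = Pattern.compile("^\\.", Pattern.MULTILINE);
      System.out.println(countMatches(embedded, text)); // 2
      System.out.println(countMatches(constant, text)); // 2

      // Two flag constants combined with bitwise OR.
      Pattern both = Pattern.compile("^\\.first$",
                                     Pattern.MULTILINE |
                                     Pattern.CASE_INSENSITIVE);
      System.out.println(countMatches(both, ".FIRST\nsecond")); // 1
   }
}
```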

At times you will need to obtain a copy of an original regex string that has been compiled into a Pattern object, along with the flags it is using. You can do this by calling the following methods:

String pattern() returns the original regex string that was compiled into the Pattern object.
int flags() returns the Pattern object’s flags.
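
A minimal sketch of these two accessors follows. Because flags() returns a bit mask, an individual flag is tested with the bitwise AND operator. The PatternInfoDemo class and its hasFlag() helper are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class PatternInfoDemo
{
   // Report whether the pattern was compiled with the given flag constant.
   static boolean hasFlag(Pattern p, int flag)
   {
      return (p.flags() & flag) != 0;
   }

   public static void main(String[] args)
   {
      Pattern p = Pattern.compile("^\\.", Pattern.MULTILINE | Pattern.DOTALL);
      System.out.println(p.pattern());                          // ^\.
      System.out.println(hasFlag(p, Pattern.MULTILINE));        // true
      System.out.println(hasFlag(p, Pattern.DOTALL));           // true
      System.out.println(hasFlag(p, Pattern.CASE_INSENSITIVE)); // false
   }
}
```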

After obtaining a Pattern object, you’ll typically use it to obtain a Matcher object, so that you can perform pattern-matching operations. The Matcher matcher(CharSequence input) method creates a Matcher object that matches the provided input text against the Pattern object’s compiled regex. When called, it returns a reference to this Matcher object. For example, Matcher m = p.matcher(args[1]); returns a Matcher for the Pattern object referenced by variable p.

Splitting text

Most developers have written code to break input text into its component parts, such as converting a text-based employee record into a set of fields. Pattern offers a quicker way to handle this tedium, via a pair of text-splitting methods:

String[] split(CharSequence text, int limit) splits text around matches of the Pattern object’s pattern and returns the results in an array. Each entry specifies a text sequence that’s separated from the next text sequence by a pattern match (or the text’s end). All array entries are stored in the same order as they appear in the text. In this method, the number of array entries depends on limit, which also controls the number of matches that occur:
- A positive value means that at most limit - 1 matches are considered and the array’s length is no greater than limit.
- A negative value means all possible matches are considered, and the array can be of any length.
- A zero means all possible matches are considered, the array can have any length, and trailing empty strings are discarded.
String[] split(CharSequence text) invokes the previous method with zero as the limit and returns the method call’s result.
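
The three limit cases can be compared side by side. The sketch below splits the same text around commas with a positive, a negative, and a zero limit; the SplitLimitDemo class and the sample text are assumptions chosen to make the trailing empty strings visible.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class SplitLimitDemo
{
   // Split text around commas using the given limit.
   static String[] split(String text, int limit)
   {
      return Pattern.compile(",").split(text, limit);
   }

   public static void main(String[] args)
   {
      String text = "a,b,c,,";

      // Positive: at most one match is used, so at most two entries.
      System.out.println(Arrays.toString(split(text, 2)));  // [a, b,c,,]

      // Negative: all matches are used; trailing empty strings are kept.
      System.out.println(Arrays.toString(split(text, -1))); // [a, b, c, , ]

      // Zero: all matches are used; trailing empty strings are discarded.
      System.out.println(Arrays.toString(split(text, 0)));  // [a, b, c]
   }
}
```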

Here’s how split(CharSequence text) handles the task of splitting an employee record into its field components of name, age, street address, and salary:

Pattern p = Pattern.compile(",\\s");
String[] fields = p.split("John Doe, 47, Hillsboro Road, 32000");
for (int i = 0; i < fields.length; i++)
   System.out.println(fields[i]);

The above code specifies a regex that matches a comma character immediately followed by a single whitespace character. Here’s the output:

John Doe
47
Hillsboro Road
32000

Pattern predicates and the Streams API

Java 8 introduced the Predicate<String> asPredicate() method to Pattern. This method creates a predicate (Boolean-valued function) that’s used for pattern matching. The code below demonstrates asPredicate():

List<String> progLangs = Arrays.asList("apl", "basic", "c", "c++", "c#", "cobol",
                                       "java", "javascript", "perl", "python", 
                                       "scala");
Pattern p = Pattern.compile("^c");
progLangs.stream().filter(p.asPredicate()).forEach(System.out::println);

This code creates a list of programming language names, then compiles a pattern for matching all of the names that start with the lowercase letter c. The last line above obtains a sequential stream with the list as its source. It installs a filter that uses asPredicate()’s Boolean function, which returns true when a name begins with c, and iterates over the stream, outputting matched names to the standard output.

That last line is equivalent to the following traditional loop, which you might remember from the RegexDemo application in Part 1:

for (String progLang: progLangs) 
   if (p.matcher(progLang).find())
      System.out.println(progLang);

Matcher methods

An instance of the Matcher class describes an engine that performs match operations on a character sequence by interpreting a Pattern’s compiled regex. Matcher objects support different kinds of pattern-matching operations:

boolean find() scans input text for the next match. This method starts its scan either at the beginning of the given text, or at the first character following the previous match. The latter option is only possible when the previous method invocation has returned true and the matcher hasn’t been reset. In either case, Boolean true is returned when a match is found. You will find an example of this method in the RegexDemo from Part 1.
boolean find(int start) resets the matcher and scans text for the next match. The scan begins at the index specified by start. Boolean true is returned when a match is found. For example, m.find(1); scans text beginning at index 1. (Index 0 is ignored.) If start contains a negative value or a value exceeding the length of the matcher’s text, this method throws java.lang.IndexOutOfBoundsException.
boolean matches() attempts to match the entire text against the pattern. This method returns true when the entire text matches. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.matches()); outputs false because the ! symbol isn’t a word character.
boolean lookingAt() attempts to match the given text against the pattern, starting at the beginning of the text. This method returns true when a prefix of the text matches. Unlike matches(), the entire text doesn’t need to be matched. For example, Pattern p = Pattern.compile("\\w*"); Matcher m = p.matcher("abc!"); System.out.println(m.lookingAt()); outputs true because the beginning of the abc! text consists of word characters only.
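
To see how find(), matches(), and lookingAt() differ, it helps to run all three against the same inputs. The following sketch does so for a pattern that matches one or more digits. MatchKindsDemo and classify() are hypothetical names, and each call uses a fresh matcher so that no state carries over between the three operations.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class MatchKindsDemo
{
   private static final Pattern DIGITS = Pattern.compile("\\d+");

   // Apply each kind of match operation to its own fresh matcher.
   static String classify(String text)
   {
      boolean whole = DIGITS.matcher(text).matches();     // entire text
      boolean prefix = DIGITS.matcher(text).lookingAt();  // start of text
      boolean anywhere = DIGITS.matcher(text).find();     // any position
      return "matches=" + whole + " lookingAt=" + prefix +
             " find=" + anywhere;
   }

   public static void main(String[] args)
   {
      // matches=true lookingAt=true find=true
      System.out.println(classify("123"));
      // matches=false lookingAt=true find=true
      System.out.println(classify("123abc"));
      // matches=false lookingAt=false find=true
      System.out.println(classify("abc123"));
      // matches=false lookingAt=false find=false
      System.out.println(classify("abc"));
   }
}
```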

Unlike Pattern objects, Matcher objects record state information. Occasionally, you might want to reset a matcher to clear that information after performing a pattern match. The following methods reset a matcher:

Matcher reset() resets a matcher’s state, including the matcher’s append position (which is cleared to zero). The next pattern-match operation begins at the start of the matcher’s text. A reference to the current Matcher object is returned. For example, m.reset(); resets the matcher referenced by m.
Matcher reset(CharSequence text) resets a matcher’s state and sets the matcher’s text to text. The next pattern-match operation begins at the start of the matcher’s new text. A reference to the current Matcher object is returned. For example, m.reset("new text"); resets the m-referenced matcher and also specifies new text as the matcher’s new text.
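
Here is a small sketch of both reset methods at work on a single Matcher object. The ResetDemo class and its sample text are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class ResetDemo
{
   // Collect the matches found before and after resetting the matcher.
   static List<String> run()
   {
      List<String> found = new ArrayList<>();
      Matcher m = Pattern.compile("\\d+").matcher("10 20");

      m.find();
      found.add(m.group()); // 10
      m.find();
      found.add(m.group()); // 20

      m.reset();            // same text; the next scan restarts at index 0
      m.find();
      found.add(m.group()); // 10 again

      m.reset("30 40");     // new text; the compiled pattern is reused
      m.find();
      found.add(m.group()); // 30
      return found;
   }

   public static void main(String[] args)
   {
      System.out.println(run()); // [10, 20, 10, 30]
   }
}
```

Reusing one matcher with reset(CharSequence text) is the same approach that Listing 1 takes when it processes a file line by line.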

Appending text

A matcher’s append position identifies the start of the matcher’s text that’s appended to a java.lang.StringBuffer object. The following methods use the append position:

Matcher appendReplacement(StringBuffer sb, String replacement) reads the matcher’s text characters and appends them to the sb-referenced StringBuffer object. This method stops reading after the last character preceding the previous pattern match. Next, the method appends the characters in the replacement-referenced String object to the StringBuffer object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters ($) and capturing group numbers.) Finally, the method sets the matcher’s append position to the index of the last matched character plus one, then returns a reference to the current matcher. The Matcher appendReplacement(StringBuffer sb, String replacement) method throws java.lang.IllegalStateException when the matcher hasn’t yet made a match, or when the previous match attempt has failed. It throws IndexOutOfBoundsException when replacement specifies a capturing group that doesn’t exist in the pattern.
StringBuffer appendTail(StringBuffer sb) appends the matcher’s remaining text, starting at the append position, to the StringBuffer object and returns that object’s reference. Following a final call to the appendReplacement(StringBuffer sb, String replacement) method, call appendTail(StringBuffer sb) to copy the remaining text to the StringBuffer object.

The following code calls appendReplacement(StringBuffer sb, String replacement) and appendTail(StringBuffer sb) to replace all occurrences of cat with caterpillar in the provided text:

Pattern p = Pattern.compile("(cat)");
Matcher m = p.matcher("one cat, two cats, or three cats on a fence");
StringBuffer sb = new StringBuffer();
while (m.find())
   m.appendReplacement(sb, "$1erpillar");
m.appendTail(sb);
System.out.println(sb);

Placing a capturing group and a reference to the capturing group in the replacement text instructs the program to insert erpillar after each cat match. The above code results in the following output:

one caterpillar, two caterpillars, or three caterpillars on a fence

Replacing text

Matcher provides a pair of text-replacement methods that complement appendReplacement(StringBuffer sb, String replacement). These methods let you replace either the first match or all matches:

String replaceFirst(String replacement) resets the matcher, creates a new String object, copies all of the matcher’s text characters (up to the first match) to the string, appends the replacement characters to the string, copies remaining characters to the string, and returns the String object. (The replacement string may contain references to text sequences captured during the previous match, via dollar-sign characters and capturing-group numbers.)
String replaceAll(String replacement) operates similarly to replaceFirst(String replacement), but replaces all matches with replacement’s characters.

The \s+ regex detects one or more occurrences of whitespace characters in the input text. Below, we use this regex and call the replaceAll(String replacement) method to remove duplicate whitespace:

Pattern p = Pattern.compile("\\s+");
Matcher m = p.matcher("Remove     the \t\t duplicate whitespace.   ");
System.out.println(m.replaceAll(" "));

Here is the output:

Remove the duplicate whitespace.
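
The difference between replaceFirst() and replaceAll(), and the use of capturing-group references in the replacement string, can be sketched as follows. The ReplaceDemo class, its two helper methods, and the key=value sample text are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class ReplaceDemo
{
   // Group 1 is the text before the equals sign; group 2 is the text after.
   private static final Pattern PAIR = Pattern.compile("(\\w+)=(\\w+)");

   // Swap the two sides of the first pair only.
   static String swapFirst(String text)
   {
      return PAIR.matcher(text).replaceFirst("$2=$1");
   }

   // Swap the two sides of every pair.
   static String swapAll(String text)
   {
      return PAIR.matcher(text).replaceAll("$2=$1");
   }

   public static void main(String[] args)
   {
      System.out.println(swapFirst("a=1, b=2")); // 1=a, b=2
      System.out.println(swapAll("a=1, b=2"));   // 1=a, 2=b
   }
}
```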

Capturing group-oriented methods

The source code for the RegexDemo application includes an m.group() method call. The group() method is one of several capturing group-oriented Matcher methods:

int groupCount() returns the number of capturing groups in a matcher’s pattern. This count doesn’t include the special capturing group number 0, which denotes the entire pattern.
String group() returns the previous match’s characters. This method returns an empty string to indicate a successful match against the empty string. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
String group(int group) resembles the previous method, except that it returns the previous match’s characters as recorded by the capturing group number that group specifies. Note that group(0) is equivalent to group(). If no capturing group with the specified group number exists in the pattern, the method throws IndexOutOfBoundsException. It throws IllegalStateException when either the matcher hasn’t yet attempted a match or the previous match operation failed.
String group(String name) returns the previous match’s characters as recorded by the named capturing group. If there is no capturing group in the pattern with the given name, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.

The following example demonstrates the groupCount() and group(int group) methods:

Pattern p = Pattern.compile("(.(.(.)))");
Matcher m = p.matcher("abc");
m.find();
System.out.println(m.groupCount());
for (int i = 0; i <= m.groupCount(); i++)
   System.out.println(i + ": " + m.group(i));

It results in the following output:

3
0: abc
1: abc
2: bc
3: c
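
The group(String name) method works the same way for named capturing groups, which are written as (?<name>...) in the regex. The sketch below is an illustration under assumed names: NamedGroupDemo, monthSlashYear(), and the year and month group names don’t come from the tutorial’s code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class NamedGroupDemo
{
   private static final Pattern DATE =
      Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})");

   // Return "month/year" for the first date in the text, or null if the
   // text contains no date.
   static String monthSlashYear(String text)
   {
      Matcher m = DATE.matcher(text);
      if (!m.find())
         return null; // calling group() now would throw IllegalStateException
      return m.group("month") + "/" + m.group("year");
   }

   public static void main(String[] args)
   {
      System.out.println(monthSlashYear("Released 2014-03")); // 03/2014
      System.out.println(monthSlashYear("no date here"));     // null
   }
}
```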

Match-position methods

Matcher provides several methods that return the start and end indexes of a match:

int start() returns the previous match’s start index. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
int start(int group) resembles the previous method, except that it returns the previous match’s start index associated with the capturing group that group specifies. If no capturing group with the specified capturing group number exists in the pattern, IndexOutOfBoundsException is thrown. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
int start(String name) resembles the previous method, except that it returns the previous match’s start index associated with the capturing group that name specifies. If no capturing group with the specified name exists in the pattern, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
int end() returns the index of the last matched character plus one in the previous match. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
int end(int group) resembles the previous method, except that it returns the previous match’s end index associated with the capturing group that group specifies. If no capturing group with the specified group number exists in the pattern, IndexOutOfBoundsException is thrown. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.
int end(String name) resembles the previous method, except that it returns the previous match’s end index associated with the capturing group that name specifies. If no capturing group with the specified name exists in the pattern, IllegalArgumentException is thrown. IllegalStateException is thrown when either the matcher hasn’t yet attempted a match or the previous match operation failed.

The following example demonstrates two of the match-position methods reporting start/end match positions for capturing group number 2:

Pattern p = Pattern.compile("(.(.(.)))");
Matcher m = p.matcher("abcabcabc");
while (m.find())
{
   System.out.println("Found " + m.group(2));
   System.out.println("  starting at index " + m.start(2) +
                      " and ending at index " + (m.end(2) - 1));
   System.out.println();
}

This example produces the following output:

Found bc
  starting at index 1 and ending at index 2

Found bc
  starting at index 4 and ending at index 5

Found bc
  starting at index 7 and ending at index 8
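
Because end() returns the index of the last matched character plus one, its value can be passed directly to methods that treat the end index as exclusive, such as String’s substring(). The following sketch shows this relationship; PositionDemo and describe() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class PositionDemo
{
   // Describe each match as "start-end:text", where end is exclusive.
   static String describe(String regex, String text)
   {
      StringBuilder sb = new StringBuilder();
      Matcher m = Pattern.compile(regex).matcher(text);
      while (m.find())
      {
         if (sb.length() > 0)
            sb.append(' ');
         // substring() also treats its second argument as exclusive, so
         // start() and end() are passed to it unchanged.
         sb.append(m.start()).append('-').append(m.end()).append(':')
           .append(text.substring(m.start(), m.end()));
      }
      return sb.toString();
   }

   public static void main(String[] args)
   {
      System.out.println(describe("bc", "abcabc")); // 1-3:bc 4-6:bc
   }
}
```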

PatternSyntaxException methods

An instance of the PatternSyntaxException class describes a syntax error in a regex. This exception is thrown from Pattern’s compile() and matches() methods, and is constructed via the following constructor:

PatternSyntaxException(String desc, String regex, int index)

The constructor stores the specified description, regex, and index where the syntax error occurs in the regex. The index is set to -1 when the syntax error location isn’t known.

Although you’ll probably never need to instantiate PatternSyntaxException, you will need to extract the aforementioned values when creating a formatted error message. Invoke the following methods to accomplish this task:

String getDescription() returns the syntax error’s description.
int getIndex() returns either the approximate index (within a regex) where the syntax error occurs or -1 when the index is unknown.
String getPattern() returns the erroneous regex.

Additionally, the inherited String getMessage() method returns a multiline string containing the values returned from the aforementioned methods along with a visual indication of the syntax error position in the pattern.

What constitutes a syntax error? Here’s an example:

java RegexDemo (?itree Treehouse

In this case we’ve failed to specify the closing parenthesis metacharacter ()) in the embedded flag expression. The error results in the following output:

regex = (?itree
input = Treehouse
Bad regex: Unknown inline modifier near index 3
(?itree
   ^
Description: Unknown inline modifier
Index: 3
Incorrect pattern: (?itree
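
A report like this one can be produced by catching the exception and calling its accessor methods. The following is only a sketch of that idea, not RegexDemo’s actual source: the SyntaxCheckDemo class and its check() helper are hypothetical, and the exact description and index text depend on the Java runtime.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; not part of the tutorial's download.
class SyntaxCheckDemo
{
   // Return null when the regex compiles, or the exception when it doesn't.
   static PatternSyntaxException check(String regex)
   {
      try
      {
         Pattern.compile(regex);
         return null;
      }
      catch (PatternSyntaxException pse)
      {
         return pse;
      }
   }

   public static void main(String[] args)
   {
      PatternSyntaxException pse = check("(?itree");
      if (pse != null)
      {
         System.out.println("Bad regex: " + pse.getMessage());
         System.out.println("Description: " + pse.getDescription());
         System.out.println("Index: " + pse.getIndex());
         System.out.println("Incorrect pattern: " + pse.getPattern());
      }
   }
}
```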

Build useful regex-oriented applications with the Regex API

Regexes let you create powerful text-processing applications. This section presents a pair of useful applications that invite you to further explore the classes and methods in the Regex API. The second application also introduces Lexan: a reusable library for performing lexical analysis.

Regex for documentation

Documentation is one of the necessary tasks of developing professional quality software. Fortunately, regex can help with many aspects of documentation. The code in Listing 1 extracts the lines containing single-line and multiline C-style comments from one source file to another. Comments must be located on a single line for the code to work:

Listing 1. Extracting comments

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ExtCmnt
{
   public static void main(String[] args)
   {
      if (args.length != 2)
      {
         System.err.println("usage: java ExtCmnt infile outfile");
         return;
      }

      Pattern p;
      try
      {
         // The following pattern defines multiline comments that appear on the
         // same line (e.g., /* same line */) and single-line comments (e.g., //
         // some line). The comment may appear anywhere on the line.

         p = Pattern.compile(".*/\\*.*\\*/|.*//.*$");
      }
      catch (PatternSyntaxException pse)
      {
         System.err.printf("Regex syntax error: %s%n", pse.getMessage());
         System.err.printf("Error description: %s%n", pse.getDescription());
         System.err.printf("Error index: %s%n", pse.getIndex());
         System.err.printf("Erroneous pattern: %s%n", pse.getPattern());
         return;
      }

      try (FileReader fr = new FileReader(args[0]);
           BufferedReader br = new BufferedReader(fr);
           FileWriter fw = new FileWriter(args[1]);
           BufferedWriter bw = new BufferedWriter(fw))
      {
         Matcher m = p.matcher("");
         String line;
         while ((line = br.readLine()) != null)
         {
            m.reset(line);
            if (m.matches()) /* entire line must match */
            {
               bw.write(line);
               bw.newLine();
            }
         }
      }
      catch (IOException ioe)
      {
         System.err.println(ioe.getMessage());
         return;
      }
   }
}

Listing 1’s main() method first validates its command line and then compiles a regex for locating single-line and multiline comments into a Pattern object. Assuming no PatternSyntaxException arises, main() opens the source file and creates the target file, obtains a matcher to match each read line against the pattern, and reads the source file’s contents line by line. For each line, the matcher tries to match the line against the comment pattern. If there’s a match, main() writes the line (followed by a new-line) to the target file. (We’ll explore file I/O logic in a future Java 101 tutorial.)

Compile Listing 1 as follows:

javac ExtCmnt.java

Run the application against ExtCmnt.java:

java ExtCmnt ExtCmnt.java out

You should observe the following output in the out file:

         // The following pattern defines multiline comments that appear on the
         // same line (e.g., /* same line */) and single-line comments (e.g., //
         // some line). The comment may appear anywhere on the line.
         p = Pattern.compile(".*/\\*.*\\*/|.*//.*$");
            if (m.matches()) /* entire line must match */

In the ".*/\\*.*\\*/|.*//.*$" pattern string, the vertical bar metacharacter (|) acts as a logical OR operator telling a matcher to use that operator’s left regex construct operand to locate a match in the matcher’s text. If no match exists, the matcher uses that operator’s right regex construct operand in another match attempt. (The parentheses metacharacters in a capturing group form another logical operator.)
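
To watch the two operands at work, the following sketch applies the same regex to a few individual lines. The CommentLineDemo class, the isCommentLine() helper, and the sample lines are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the tutorial's download.
class CommentLineDemo
{
   // Left operand: /* ... */ on one line. Right operand: // to end of line.
   private static final Pattern COMMENT =
      Pattern.compile(".*/\\*.*\\*/|.*//.*$");

   // Report whether the entire line matches either operand.
   static boolean isCommentLine(String line)
   {
      return COMMENT.matcher(line).matches();
   }

   public static void main(String[] args)
   {
      // Matched by the left operand.
      System.out.println(isCommentLine("int x; /* counter */")); // true
      // The left operand fails, so the right operand is tried and matches.
      System.out.println(isCommentLine("int y; // total"));      // true
      // Neither operand matches.
      System.out.println(isCommentLine("int z = 1;"));           // false
      System.out.println(isCommentLine("/* unterminated"));      // false
   }
}
```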

Regex for lexical analysis

An even more useful application of regexes is a reusable library for performing lexical analysis, a key component of any code compiler or assembler. In this case, an input stream of characters is grouped into an output stream of tokens, which are names representing sequences of characters that have a collective meaning. For example, upon encountering the letter sequence c, o, u, n, t, e, r in the input stream, a lexical analyzer might output token ID (identifier). The character sequence associated with the token is known as the lexeme.

Regexes are much more efficient than state-based lexical analyzers, which must be written by hand and are typically not reusable. An example of the regex-based approach is JLex, a lexical analyzer generator for Java, which relies on regexes to specify the rules for breaking an input stream into tokens. Another example is Lexan.

Getting to know Lexan

Lexan is a reusable Java library for lexical analysis. It’s based on code in the Cogito Learning website’s Writing a Parser in Java blog series. The library consists of the following classes, which you will find in the ca.javajeff.lexan package included with the source download for this article:

Lexan: the lexical analyzer
LexanException: an exception arising from Lexan’s constructor
LexException: an exception arising from bad syntax during lexical analysis
Token: a name with a regex attribute
TokLex: a token/lexeme pair

The Lexan(java.lang.Class<?> tokensClass) constructor creates a new lexical analyzer. It requires a single java.lang.Class object argument denoting a class of static Token constants. Using the Reflection API, the constructor reads each Token constant into a Token[] array of values. If no Token constants are present, LexanException is thrown.

Lexan also provides the following pair of methods:

List<TokLex> getTokLexes() returns this lexical analyzer’s list of TokLexes.
void lex(String str) lexes an input string into a list of TokLexes. LexException is thrown if a character is encountered that doesn’t match any of the Token[] array’s patterns.

LexanException provides no methods, but relies on its inherited getMessage() method to return the exception’s message. In contrast, LexException also provides the following methods:

int getBadCharIndex() returns the index of the character that didn’t match any token patterns.
String getText() returns the text that was being lexed when the exception occurred.

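Token overrides the toString() method to return the token’s name. It also provides a getPattern() method that returns the token’s regex attribute; Listing 4 calls matcher() on the returned value, so it is a compiled Pattern object rather than a plain string.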

TokLex provides a Token getToken() method that returns its token. It also provides a String getLexeme() method that returns its lexeme.

Demonstrating Lexan

I’ve created a LexanDemo application that demonstrates the library. This application consists of LexanDemo, BinTokens, MathTokens, and NoTokens classes. Listing 2 presents LexanDemo’s source code.

Listing 2. Demonstrating Lexan

import ca.javajeff.lexan.Lexan;
import ca.javajeff.lexan.LexanException;
import ca.javajeff.lexan.LexException;
import ca.javajeff.lexan.TokLex;

public final class LexanDemo
{
   public static void main(String[] args)
   {
      lex(MathTokens.class, " sin(x) * (1 + var_12) ");
      lex(BinTokens.class, " 1 0 1 0 1");
      lex(BinTokens.class, "110");
      lex(BinTokens.class, "1 20");
      lex(NoTokens.class, "");
   }

   private static void lex(Class<?> tokensClass, String text)
   {
      try
      {
         Lexan lexan = new Lexan(tokensClass);
         lexan.lex(text);
         for (TokLex tokLex: lexan.getTokLexes())
            System.out.printf("%s: %s%n", tokLex.getToken(), 
                              tokLex.getLexeme());
      }
      catch (LexanException le)
      {
         System.err.println(le.getMessage());
      }
      catch (LexException le)
      {
         System.err.println(le.getText());
         for (int i = 0; i < le.getBadCharIndex(); i++)
            System.err.print("-");
         System.err.println("^");
         System.err.println(le.getMessage());
      }
      System.out.println();
   }
}

Listing 2’s main() method invokes the lex() utility method to demonstrate lexical analysis via Lexan. Each call to this method passes a Class object for a class of tokens and a string to analyze.

The lex() method first instantiates the Lexan class, passing the Class object to Lexan’s constructor. It then invokes Lexan’s lex() method on the string.

If lexical analysis succeeds, Lexan’s getTokLexes() method is called to return a list of TokLex objects. For each object, TokLex’s getToken() method is called to return the token and its getLexeme() method is called to return the lexeme. Both values are output. If lexical analysis fails, either LexanException or LexException is thrown and handled appropriately.

For brevity, let’s consider only MathTokens out of the remaining classes making up this application. Listing 3 presents this class’s source code.

Listing 3. Describing a set of tokens for a small math language

import ca.javajeff.lexan.Token;

public final class MathTokens
{
   public final static Token FUNC = new Token("FUNC", "sin|cos|exp|ln|sqrt");
   public final static Token LPAREN = new Token("LPAREN", "\\(");
   public final static Token RPAREN = new Token("RPAREN", "\\)");
   public final static Token PLUSMIN = new Token("PLUSMIN", "[+-]");
   public final static Token TIMESDIV = new Token("TIMESDIV", "[*/]");
   public final static Token CARET = new Token("CARET", "\\^");
   public final static Token INTEGER = new Token("INTEGER", "[0-9]+");
   public final static Token ID = new Token("ID", "[a-zA-Z][a-zA-Z0-9_]*");
}

Listing 3 reveals that MathTokens defines a sequence of Token constants. Each constant is initialized to a Token object. That object’s constructor receives a string naming the token, along with a regex that describes all character strings belonging to that token. The string-based token name should match the name of the constant (for clarity), but this isn’t mandatory.

The position of a Token constant in the list of Tokens is important. Token constants higher in the list take precedence over constants that are lower down. For example, when sin is encountered, Lexan chooses FUNC instead of ID as the token. If ID appeared before FUNC, ID would be chosen.
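
The effect of ordering can be reproduced with plain java.util.regex classes. The sketch below doesn’t use the Lexan library: TokenOrderDemo, firstToken(), and rules() are hypothetical names, and an insertion-ordered map stands in for the list of Token constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical demo class; it imitates rule ordering without using Lexan.
class TokenOrderDemo
{
   // Return the name of the first rule whose regex matches at the start of
   // the text. Rules are tried in insertion order, so earlier rules win.
   static String firstToken(Map<String, Pattern> rules, String text)
   {
      for (Map.Entry<String, Pattern> rule: rules.entrySet())
         if (rule.getValue().matcher(text).lookingAt())
            return rule.getKey();
      return null;
   }

   // Build an ordered rule set from alternating names and regexes.
   static Map<String, Pattern> rules(String... namesAndRegexes)
   {
      Map<String, Pattern> rules = new LinkedHashMap<>();
      for (int i = 0; i < namesAndRegexes.length; i += 2)
         rules.put(namesAndRegexes[i],
                   Pattern.compile(namesAndRegexes[i + 1]));
      return rules;
   }

   public static void main(String[] args)
   {
      String func = "sin|cos|exp|ln|sqrt";
      String id = "[a-zA-Z][a-zA-Z0-9_]*";

      // FUNC listed first: sin is reported as a function name.
      System.out.println(firstToken(rules("FUNC", func, "ID", id),
                                    "sin(x)")); // FUNC
      // ID listed first: the same text is reported as an identifier.
      System.out.println(firstToken(rules("ID", id, "FUNC", func),
                                    "sin(x)")); // ID
   }
}
```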

Compiling and running LexanDemo

The source download for this article includes the lexan.zip archive, which contains all the distribution files for Lexan. Unzip this archive and set the current directory to the lexan home directory’s demos subdirectory.

If you’re using Windows, execute the following command to compile the demo’s source files:

javac -cp ..\library\lexan.jar *.java

Following a successful compilation, execute this command to run the demo:

java -cp ..\library\lexan.jar;. LexanDemo

You should observe the following output:

FUNC: sin
LPAREN: (
ID: x
RPAREN: )
TIMESDIV: *
LPAREN: (
INTEGER: 1
PLUSMIN: +
ID: var_12
RPAREN: )

ONE: 1
ZERO: 0
ONE: 1
ZERO: 0
ONE: 1

ONE: 1
ONE: 1
ZERO: 0
1 20
--^
Unexpected character in input: 20

no tokens

The Unexpected character in input: 20 message arises from a thrown LexException, which is caused by BinTokens not defining a Token constant whose regex matches 2. Note the exception handler’s output of the text being lexed and the location of the offending character. The no tokens message arises from a thrown LexanException because NoTokens defines no Token constants.

Behind the scenes

Lexan relies on the Lexan class as its engine. Check out Listing 4 to see how this class is implemented and how regexes contribute to the engine’s reusability.

Listing 4. Architecting a regex-based lexical analyzer

package ca.javajeff.lexan;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;

/**
 *  A lexical analyzer. You can use this class to transform an input stream of 
 *  characters into an output stream of tokens.
 *
 *  @author Jeff Friesen
 */

public final class Lexan
{
   private List<TokLex> tokLexes;

   private Token[] values;

   /**
    *  Initialize a lexical analyzer to a set of Token objects.
    *
    *  @param tokensClass the Class object of a class containing a set of Token
    *         objects
    *
    *  @throws LexanException unable to construct a Lexan object, possibly 
    *          because there are no Token objects in the class
    */

   public Lexan(Class<?> tokensClass) throws LexanException
   {
      try
      {
         tokLexes = new ArrayList<>();
         List<Token> _values = new ArrayList<>();
         Field[] fields = tokensClass.getDeclaredFields();
         for (Field field: fields)
            if (field.getType().getName().equals("ca.javajeff.lexan.Token"))
               _values.add((Token) field.get(null));
         values = _values.toArray(new Token[0]);
         if (values.length == 0)
            throw new LexanException("no tokens");
      }
      catch (IllegalAccessException iae)
      {
         throw new LexanException(iae.getMessage());
      }
   }

   /**
    *  Get this lexical analyzer's list of toklexes.
    *
    *  @return list of toklexes
    */

   public List<TokLex> getTokLexes()
   {
      return tokLexes;
   }

   /**
    *  Lex an input string into a list of toklexes.
    *
    *  @param str the string being lexed
    *
    *  @throws LexException unexpected character found in input
    */

   public void lex(String str) throws LexException
   {
      String s = str.trim(); // remove leading and trailing whitespace
      int index = str.indexOf(s); // count of leading whitespace characters
      tokLexes.clear();
      while (!s.equals(""))
      {
         boolean match = false;
         for (int i = 0; i < values.length; i++)
         {
            Token token = values[i];
            Matcher m = token.getPattern().matcher(s);
            if (m.find())
            {
               match = true;
               tokLexes.add(new TokLex(token, m.group().trim()));
               String t = s;
               s = m.replaceFirst("").trim(); // remove leading whitespace
               index += (t.length() - s.length());
               break;
            }
         }
         if (!match)
            throw new LexException("Unexpected character in input: " + s, str,
                                   index);
      }
   }
}

The code in the lex() method is based on the code presented in the blog post “Writing a Parser in Java: The Tokenizer” from Cogito Learning. Check out that post to learn more about how Lexan leverages the Regex API for code compilation.

In conclusion

Regular expressions are a useful tool that every developer needs to understand. Java’s Regex API makes it easy to integrate them into your applications and libraries. Now that you possess a basic understanding of regexes and this API, study java.util.regex‘s SDK documentation to learn even more about regexes and additional API methods.
