How do I get a substring between tags in a String?
Author: Deron Eriksson
Description: This Java example shows how to get a substring between tags in a String.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 2.0 (Eclipse 3.3.0)


The StringUtils class in the Commons Lang library can be used to extract a substring from between tags in a String. The opening and closing tags can be the same or different. StringUtils also has a method, substringsBetween(), that returns an array of Strings when multiple substrings are found.

The SubstringBetweenTest class illustrates this. First, it reads the test.html file into a String (using the readFileToString() method of the FileUtils class from Commons IO). It then extracts the title substring from between the opening and closing title tags. Next, it gets all of the substrings between the opening and closing td tags. Finally, it finds the substring between "head" and "head", which points out a limitation of trying to parse something like HTML using this method.

SubstringBetweenTest.java

package test;

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;

public class SubstringBetweenTest {

	public static void main(String[] args) throws Exception {

		File file = new File("test.html");
		String testHtml = FileUtils.readFileToString(file); // from commons io

		String title = StringUtils.substringBetween(testHtml, "<title>", "</title>");
		System.out.println("title:" + title); // good

		String[] tds = StringUtils.substringsBetween(testHtml, "<td>", "</td>");
		if (tds != null) { // substringsBetween() returns null when nothing matches
			for (String td : tds) {
				System.out.println("td value:" + td); // good
			}
		}

		String moreStuff = StringUtils.substringBetween(testHtml, "head");
		System.out.println("\n'head' to 'head':" + moreStuff); // not so good

	}

}
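
To make the matching behavior more concrete, here is a rough plain-JDK sketch of the same idea: find the opening tag with indexOf(), then search for the closing tag from that point. It is an illustration only, not the Commons Lang implementation. The SubstringBetweenSketch class and its two helper methods are hypothetical names invented for this example, and unlike StringUtils.substringsBetween(), which returns null when nothing matches, the list version here returns an empty list.

```java
import java.util.ArrayList;
import java.util.List;

public class SubstringBetweenSketch {

    // Hypothetical helper: text between the first 'open' and the next 'close',
    // or null if either one is missing.
    static String substringBetween(String str, String open, String close) {
        if (str == null || open == null || close == null) {
            return null;
        }
        int start = str.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        int end = str.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return str.substring(start, end);
    }

    // Hypothetical helper: every non-overlapping match, scanning left to right.
    // Returns an empty list (not null) when nothing matches.
    static List<String> substringsBetween(String str, String open, String close) {
        List<String> results = new ArrayList<String>();
        if (str == null || open == null || close == null
                || open.length() == 0 || close.length() == 0) {
            return results;
        }
        int pos = 0;
        while (true) {
            int start = str.indexOf(open, pos);
            if (start < 0) {
                break;
            }
            start += open.length();
            int end = str.indexOf(close, start);
            if (end < 0) {
                break;
            }
            results.add(str.substring(start, end));
            pos = end + close.length(); // continue after this closing tag
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<title>My Title</title><td>One</td><td>Two</td>";
        System.out.println("title:" + substringBetween(html, "<title>", "</title>"));
        for (String td : substringsBetween(html, "<td>", "</td>")) {
            System.out.println("td value:" + td);
        }
    }
}
```

Because the search is purely textual, it knows nothing about nesting, attributes, or case, which is why it only suits simple, predictable markup.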

The test.html file that SubstringBetweenTest reads is shown here.

test.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>My Title</title>
</head>
<body>
<table>
	<tr>
		<td>One</td>
		<td>Two</td>
	</tr>
	<tr>
		<td>Three</td>
		<td>Four</td>
	</tr>
</table>
</body>
</html>

The console output of executing SubstringBetweenTest is shown below.

Results

title:My Title
td value:One
td value:Two
td value:Three
td value:Four

'head' to 'head':>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<title>My Title</title>
</
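
The 'head' to 'head' result looks odd because the search is plain text matching: the first "head" found is the one inside <head>, and the next one is inside </head>, so the result starts with ">" and ends with "</". The short sketch below reproduces this with plain JDK calls on a shortened document. The SameTagSketch class and the betweenSameTag() helper are hypothetical names used only for illustration and are not part of Commons Lang.

```java
public class SameTagSketch {

    // Hypothetical helper: text between the first two occurrences of 'tag',
    // or null if 'tag' does not occur twice.
    static String betweenSameTag(String str, String tag) {
        int first = str.indexOf(tag);
        if (first < 0) {
            return null;
        }
        int start = first + tag.length();
        int second = str.indexOf(tag, start);
        if (second < 0) {
            return null;
        }
        return str.substring(start, second);
    }

    public static void main(String[] args) {
        String html = "<html><head><title>My Title</title></head><body></body></html>";
        // The match begins right after "head" in <head> and stops before "head" in </head>.
        System.out.println(betweenSameTag(html, "head"));
        // prints: ><title>My Title</title></
    }
}
```

Passing the complete tags ("<head>" and "</head>") as separate opening and closing arguments avoids the stray characters, but for anything beyond simple, well-known markup a real HTML parser is likely the safer choice.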