How do I convert web page content without tags to a String?
Author: Deron Eriksson
Description: This Java tutorial describes how to convert web page content to a String with the tags removed.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 1.5.1



The HTMLParser library, available at http://htmlparser.sourceforge.net/, provides a wide variety of classes for, you guessed it, parsing HTML. I downloaded the library (htmlparser.jar) and added the version number to the jar file name (htmlparser-1.6.jar) to make it easier to tell the version of the jar file in the future. I put the jar file in the lib/ directory of a test project and added it to the project's build path.

[Screen capture: 'testing' project]

Sometimes it can be useful to strip out all the HTML tags from an HTML document. For instance, if you write an application that pulls in data from a web page, analyzing the page content (depending on the nature of your application) can be simpler if you remove all the HTML tags and are left with the plain page content.

The ConvertUrlContentToString class performs this task using StringExtractor, an example application class from the HTMLParser library. You instantiate a StringExtractor with a web page URL and then call its extractStrings() method to get the content of the web page with the tags removed. The extractStrings() method takes a boolean parameter. If this parameter is false, the links in the web page are not included in the output; if it is true, the links in the web page are included.

ConvertUrlContentToString.java

package test;

import org.htmlparser.parserapplications.StringExtractor;
import org.htmlparser.util.ParserException;

public class ConvertUrlContentToString {

	public static void main(String[] args) {
		try {
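			// Point the extractor at the web page to read.
			StringExtractor se = new StringExtractor("http://www.google.com");
			// false: tags removed, links not included; true: tags removed, links included.
			String content = se.extractStrings(false);
			String contentWithLinks = se.extractStrings(true);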
			System.out.println(content);
			System.out.println("================================================");
			System.out.println(contentWithLinks);
		} catch (ParserException e) {
			e.printStackTrace();
		}
	}
}

Executing the ConvertUrlContentToString class generates the output shown in the screen capture below. First the content with no tags and no links is displayed. Then, the content with links is displayed.

[Screen capture: ConvertUrlContentToString execution]
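For comparison, the same idea (keep the text, drop the tags, optionally keep the links) can be sketched without the HTMLParser jar by using the HTML parser that ships with the JDK in javax.swing.text.html. This is not how StringExtractor works internally, and it is only a rough sketch: the class name JdkTagStripper, the method extractText(), and the "<url>" format used for links are assumptions made up for this illustration. The JDK parser is also built around an old HTML 3.2 DTD, so it may handle modern pages poorly.

```java
import java.io.IOException;
import java.io.StringReader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; not part of the HTMLParser library.
public class JdkTagStripper {

	// Returns the text of the given HTML with the tags removed.
	// If includeLinks is true, each link's href is appended after the
	// link text as <url> (a format chosen for this sketch).
	public static String extractText(String html, final boolean includeLinks) {
		final StringBuilder out = new StringBuilder();

		HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
			private String pendingHref;

			@Override
			public void handleText(char[] data, int pos) {
				// Called for each run of text between tags.
				out.append(data).append(' ');
			}

			@Override
			public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
				// Remember the href of an <a> tag until its closing tag.
				if (includeLinks && tag == HTML.Tag.A) {
					Object href = attrs.getAttribute(HTML.Attribute.HREF);
					pendingHref = (href == null) ? null : href.toString();
				}
			}

			@Override
			public void handleEndTag(HTML.Tag tag, int pos) {
				// Emit the remembered href right after the link text.
				if (tag == HTML.Tag.A && pendingHref != null) {
					out.append('<').append(pendingHref).append("> ");
					pendingHref = null;
				}
			}
		};

		try {
			// true = ignore any charset declared inside the document.
			new ParserDelegator().parse(new StringReader(html), callback, true);
		} catch (IOException e) {
			// Not expected when reading from an in-memory String.
			throw new IllegalStateException(e);
		}

		// Collapse runs of whitespace so the text chunks read as one line.
		return out.toString().replaceAll("\\s+", " ").trim();
	}

	public static void main(String[] args) {
		String html = "<html><body><h1>Title</h1>"
				+ "<p>See <a href=\"http://example.com/\">the site</a> now</p>"
				+ "</body></html>";
		System.out.println(extractText(html, false));
		System.out.println("================================================");
		System.out.println(extractText(html, true));
	}
}
```

Unlike StringExtractor, this sketch works on an HTML String that is already in memory, so fetching the page from a URL would be a separate step.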

(Continued on page 2)
