How do I convert web page content without tags to a String?
Author: Deron Eriksson
Description: This Java tutorial describes how to convert web page content to a String with the tags removed.
Tutorial created using: Windows XP || JDK 1.5.0_09 || Eclipse Web Tools Platform 1.5.1



(Continued from page 1)

StringExtractor is a demonstration application class in the HTMLParser library, so it is probably better to work at a slightly lower level, with the HTMLParser classes directly. Let's create a ConvertUrlContentToString2 class that, like ConvertUrlContentToString, outputs a web page's content first without links and then with links.

We create a Parser object, passing our web page URL string to its constructor. We then instantiate a StringBean object and pass it to the parser's visitAllNodesWith method. Once the parser has visited the nodes, we get the web page content from the stringBean by calling getStrings. Next, we call setLinks(true) to tell the stringBean that we'd like the HTML page's links included along with the content. The parser needs to be reset to the beginning of the HTML page; after that, we can pass the stringBean to the parser again and get the newly generated content from the stringBean.

ConvertUrlContentToString2.java

package test;

import org.htmlparser.Parser;
import org.htmlparser.beans.StringBean;
import org.htmlparser.util.ParserException;

public class ConvertUrlContentToString2 {

	public static void main(String[] args) {
		try {
			// create a parser for the web page at the given URL
			Parser parser = new Parser("http://www.google.com");
			// first pass: collect the page content without links
			StringBean stringBean = new StringBean();
			parser.visitAllNodesWith(stringBean);
			String content = stringBean.getStrings();
			System.out.println(content);
			System.out.println("================================================");
			System.out.println("================================================");
			// second pass: collect the page content with links included
			stringBean.setLinks(true);
			parser.reset(); // start the parsing from the beginning again
			parser.visitAllNodesWith(stringBean);
			String contentWithLinks = stringBean.getStrings();
			System.out.println(contentWithLinks);
		} catch (ParserException e) {
			e.printStackTrace();
		}
	}
}

The output from the execution of ConvertUrlContentToString2 is shown below.

[Figure: ConvertUrlContentToString2 execution]
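
The reason for resetting the parser before the second pass can be illustrated with a small, self-contained sketch. This is not HTMLParser's actual implementation. The Node, TextCollector and SimpleParser classes below are hypothetical stand-ins for a page node, StringBean and Parser, the example.com URL is made up, and the way a link is appended after its text is chosen only for illustration. The sketch assumes a parser that reads forward through the page once, so a second visit finds nothing unless reset is called first.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class VisitorSketch {

	/** Hypothetical page node: some text plus an optional link target (null if none). */
	static final class Node {
		final String text;
		final String href;

		Node(String text, String href) {
			this.text = text;
			this.href = href;
		}
	}

	/** Hypothetical stand-in for StringBean: a visitor that collects text. */
	static final class TextCollector {
		private final StringBuilder buffer = new StringBuilder();
		private boolean links;

		void setLinks(boolean links) {
			this.links = links;
		}

		void beginParsing() {
			buffer.setLength(0); // each traversal starts with an empty buffer
		}

		void visit(Node node) {
			buffer.append(node.text);
			if (links && node.href != null) {
				// illustrative format only: text followed by <url>
				buffer.append('<').append(node.href).append('>');
			}
			buffer.append('\n');
		}

		String getStrings() {
			return buffer.toString();
		}
	}

	/** Hypothetical stand-in for Parser: a forward-only cursor over the nodes. */
	static final class SimpleParser {
		private final List<Node> nodes;
		private Iterator<Node> cursor;

		SimpleParser(List<Node> nodes) {
			this.nodes = nodes;
			this.cursor = nodes.iterator();
		}

		void reset() {
			cursor = nodes.iterator(); // go back to the beginning of the page
		}

		void visitAllNodesWith(TextCollector visitor) {
			visitor.beginParsing();
			while (cursor.hasNext()) {
				visitor.visit(cursor.next());
			}
		}
	}

	public static void main(String[] args) {
		List<Node> page = Arrays.asList(
				new Node("Search the web", null),
				new Node("Images", "http://images.example.com/"));
		SimpleParser parser = new SimpleParser(page);
		TextCollector collector = new TextCollector();

		parser.visitAllNodesWith(collector);
		System.out.print(collector.getStrings()); // content without links

		collector.setLinks(true);
		parser.reset(); // without this, the second pass would see no nodes
		parser.visitAllNodesWith(collector);
		System.out.print(collector.getStrings()); // content with links
	}
}
```

In this sketch, leaving out the reset call makes the second pass produce an empty String, because the cursor is already at the end of the page.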