JavaFX: Parse HTML content with DOMParser

I have started an interesting project: a link crawler written in JavaFX Script. There is no particular reason for choosing JavaFX beyond improving my own knowledge of the language.

Here is some preliminary code that I have already written:

Import statement and package declaration –

———————————————————

package com.we4tech.linkcrawler.fx.script;

import java.lang.*;
import java.io.*;
import java.net.*;
import java.util.*;
import org.xml.sax.SAXException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.NamedNodeMap;
import com.sun.org.apache.xerces.internal.parsers.DOMParser;

Sample class skeleton –

————————————————————–

/**
 * Define structure for Crawler class.
 * @author hasan (hasan -AT- somewherein.net)
 */
public class Crawler {

    /**
     * Define the target url.
     */
    public attribute url: String;

    public attribute debug: Boolean;

    private attribute foundLinks: String*;

    /**
     * Start crawling process.
     */
    public operation start();

    private operation followupLink(link: String);

    /**
     * Return a list of all available links.
     */
    public function getLinks(): String*;

    /**
     * Print out debug output with a [LEVEL] prefix and a timestamp.
     */
    private operation debug(message: String);

    private operation error(message: String);
}

Implementation –

————————————————————————-

operation Crawler.start() {
    debug("Initiating crawler process..");
    debug("URL is set to - {url}");
    debug("Crawling just initiated.");
    followupLink(url);
}

operation Crawler.followupLink(link: String) {
    debug("Follow up link - {link}");
    if (link == null or link.length() == 0) {
        error("URL is empty");
        return;
    } else {
        try {
            // Run in background thread through EDT.
            do {
                var parse = new DOMParser();
                parse.parse(link);
                var document = parse.getDocument();
                var nodes = document.getElementsByTagName("a");
                for (i in [0..nodes.getLength()]) {
                    var node = nodes.item(i);
                    var attributes = node.getAttributes();
                    if (attributes <> null) {
                        for (j in [0..attributes.getLength()]) {
                            var attr = attributes.item(j);
                            if (attr <> null) {
                                var nodeName = attr.getNodeName();
                                if ("href" == nodeName) {
                                    var nodeValue = attr.getNodeValue();
                                    debug("{nodeName} link - {nodeValue}");
                                    insert nodeValue into foundLinks;
                                    // start new follow up process.
                                    // The original listing is cut off after "followupLink(";
                                    // nodeValue is the presumed argument.
                                    followupLink(nodeValue);
                                }
                            }
                        }
                    }
                }
            }
        } catch (e) {
            error("Error found during opening up new URLConnection - {e}");
        }
    }
}

operation Crawler.debug(message: String) {
    if (debug) {
        System.out.println("[DEBUG] - [{new Date()}] - {message}");
    }
}

operation Crawler.error(message: String) {
    System.out.println("[ERROR] - [{new Date()}] - {message}");
}

function Crawler.getLinks() {
    return foundLinks;
}
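
For readers who want to try the link-extraction step outside JavaFX Script, here is a minimal sketch of the same idea in plain Java: parse a document into a DOM and collect the href attribute of every "a" element. The class name LinkExtractor and the method extractLinks are made up for this illustration, and it uses the standard JAXP DocumentBuilder instead of the internal Xerces DOMParser from the listing. It assumes the input is well-formed XHTML; an XML parser will reject most real-world HTML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original crawler.
public class LinkExtractor {

    // Returns the href values of all <a> elements, in document order.
    // Assumes the markup is well-formed XML (for example XHTML).
    public static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document document =
                builder.parse(new InputSource(new StringReader(xhtml)));

        NodeList anchors = document.getElementsByTagName("a");
        List<String> links = new ArrayList<>();
        // getLength() is an exclusive upper bound in a Java for loop.
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            // Skip named anchors that carry no href attribute.
            if (anchor.hasAttribute("href")) {
                links.add(anchor.getAttribute("href"));
            }
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"http://example.com/a\">A</a>"
                + "<a name=\"top\">no link</a>"
                + "<a href=\"/b\">B</a>"
                + "</body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

Asking the element for its href directly replaces the inner loop over all attributes in the JavaFX listing; the result should be the same list of link values.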

NOTE: I have published this code in the hope that it is useful to people who have started learning JavaFX and want to know more about it.
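
One thing to watch when extending the crawler: the listing follows every href it finds, so pages that link to each other would be visited again and again, and relative links such as "/about" cannot be opened as they are. The sketch below, again in plain Java, shows one possible way to handle both. The class CrawlFrontier and its methods are hypothetical names chosen for this example and are not part of the code above.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: remembers which URLs were already queued.
public class CrawlFrontier {

    private final Set<String> seen = new LinkedHashSet<>();
    private final Deque<String> pending = new ArrayDeque<>();

    // Resolves href against the page it was found on and queues the
    // result if it is new. Returns true when the link was queued.
    // URI.create throws IllegalArgumentException for malformed input.
    public boolean offer(String pageUrl, String href) {
        String absolute = URI.create(pageUrl).resolve(href).toString();
        if (seen.add(absolute)) {
            pending.addLast(absolute);
            return true;
        }
        return false;
    }

    // Returns the next URL to crawl, or null when nothing is pending.
    public String next() {
        return pending.pollFirst();
    }

    public static void main(String[] args) {
        CrawlFrontier frontier = new CrawlFrontier();
        String page = "http://example.com/dir/page.html";
        frontier.offer(page, "/b");
        frontier.offer(page, "other.html");
        frontier.offer(page, "http://example.com/b"); // duplicate, ignored
        for (String url = frontier.next(); url != null; url = frontier.next()) {
            System.out.println(url);
        }
    }
}
```

A crawler built this way would call offer for each extracted href and keep fetching from next until it returns null, instead of recursing straight into every link.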

Best wishes,


One response to “JavaFX: Parse HTML content with DOMParser”

  1. GGner

    That’s what I’m looking for. I’m a JavaFX learner and I’m trying to write a JavaFX HTML parser.
    But I couldn’t see the definition of the followupLink() function in your code.
    Is that right?
    Thanks…
