Overview
jsoup is an open-source Java library used to parse data from HTML documents. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
jsoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. It can scrape and parse HTML from a URL, a file, or a String and build a DOM tree from it.
Example
Fetch the Google homepage, parse it to a DOM, and select all the anchor tags from it.
We will use the Spring Documentation Blog to showcase the features of the jsoup library.
Download Jsoup Library
jsoup is available in the Maven Central repository. Non-Maven users can download the jar from the jsoup site and add it to the project classpath.
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
Loading
The loading phase comprises fetching the HTML and parsing it into a Document. A document can be loaded from a URL, a file, or a String.
Let’s load a Document from the Spring Documentation Blog URL:
    String url = "https://spring.io/docs";
    // Fetch the page over HTTP and parse the response into a Document
    Document doc = Jsoup.connect(url).get();
Here we load a Document from an HTML String:
    // Build the HTML in memory; each append returns the builder, so the calls chain
    StringBuilder html = new StringBuilder()
        .append("<!DOCTYPE html>")
        .append("<html lang=\"en\">")
        .append("<head>")
        .append("<meta charset=\"UTF-8\" />")
        .append("<title>Hollywood Life</title>")
        .append("<meta name=\"description\" content=\"Helloworld\" />")
        .append("<meta name=\"keywords\" content=\"Helloworld\" />")
        .append("</head>")
        .append("<body>")
        .append("<div id='color'>Say, Hello World !</div>")
        .append("</body>")
        .append("</html>");

    // Parse the String into a Document
    Document doc = Jsoup.parse(html.toString());
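Loading from a file is mentioned above but not shown. The following is a minimal sketch, assuming the file is UTF-8 encoded; the class name `LoadFromFileExample`, the helper `loadFromFile`, and the temporary sample file are hypothetical and exist only to keep the example self-contained. It relies on the `Jsoup.parse(File, String)` overload, which takes the file and its charset name.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoadFromFileExample {

    // Parses a local HTML file, assuming (for this sketch) that it is UTF-8 encoded.
    static Document loadFromFile(File input) throws IOException {
        return Jsoup.parse(input, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample file, written to a temp location so the sketch runs anywhere.
        Path sample = Files.createTempFile("jsoup-sample", ".html");
        String html = "<html><head><title>Hello World</title></head>"
                + "<body><div id='color'>Say, Hello World !</div></body></html>";
        Files.write(sample, html.getBytes(StandardCharsets.UTF_8));

        Document doc = loadFromFile(sample.toFile());
        System.out.println(doc.title());
        System.out.println(doc.select("#color").text());
    }
}
```

If relative links in the file need to be resolved, there is also a three-argument overload that accepts a base URI.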
jsoup also lets you set the headers and other request parameters that a browser sends when it requests a URL.
    Connection connection = Jsoup.connect(url);
    connection.userAgent("Mozilla");
    connection.timeout(5000);                        // timeout in milliseconds
    connection.cookie("cookiename", "val234");
    connection.referrer("http://google.com");
    connection.header("headersecurity", "xyz123");   // arbitrary request header
    Document docCustomConn = connection.get();
Extraction of Data
The Document's select method takes a String containing a selector, written in the same syntax as CSS or jQuery selectors, and returns the matching Elements. The returned list can be empty but is never null.
    Elements links = doc.select("a");                       // all anchor tags in the document
    Elements logo = doc.select(".spring-logo");             // all elements with class "spring-logo"
    Elements idBasedSelection = doc.select("#first_name");  // all elements with id "first_name"
    Elements paragraphTableElements = doc.select("table[id=\"mp-left\"]"); // all tables with id "mp-left"
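To try these selectors without a network call, here is a small sketch that runs them against an in-memory page. The markup in `HTML`, the class name `SelectorExample`, and the helper `count` are made up for illustration; only `Jsoup.parse`, `select`, `size`, and `isEmpty` are jsoup or collection calls. The last lines illustrate that a selector with no matches yields an empty list rather than null.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectorExample {

    // Hypothetical markup standing in for a real page.
    static final String HTML =
            "<html><body>"
            + "<a class='spring-logo' href='/'>Home</a>"
            + "<a href='/docs'>Docs</a>"
            + "<input id='first_name' type='text'>"
            + "<table id='mp-left'><tr><td>cell</td></tr></table>"
            + "</body></html>";

    // Returns how many elements in the sample page match the given selector.
    static int count(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println("a              -> " + count("a"));
        System.out.println(".spring-logo   -> " + count(".spring-logo"));
        System.out.println("#first_name    -> " + count("#first_name"));
        System.out.println("table[id=...]  -> " + count("table[id=mp-left]"));

        // No <video> element exists in the sample, so the result is empty, not null.
        Elements none = Jsoup.parse(HTML).select("video");
        System.out.println("video is null? " + (none == null));
        System.out.println("video isEmpty? " + none.isEmpty());
    }
}
```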
Once we have the list of Elements, we can pick a specific element from it.
    Element firstLink = doc.select("a").first();
    Element lastLink = doc.select("a").last();
    Element secondLink = doc.select("a").get(1);   // index is zero-based
We can also iterate over the Elements and work with each Element individually.
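As a rough illustration of that iteration, the sketch below collects the text of every link in a page. `Elements` is a list of `Element`, so an enhanced for loop works directly. The class name `IterateElementsExample`, the helper `linkTexts`, and the sample markup are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IterateElementsExample {

    // Walks over every anchor in the given HTML and collects its visible text.
    static List<String> linkTexts(String html) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element link : doc.select("a")) {
            texts.add(link.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Hypothetical markup with two links.
        String html = "<a href='/guides'>Guides</a><a href='/projects'>Projects</a>";
        for (String text : linkTexts(html)) {
            System.out.println(text);
        }
    }
}
```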
In addition, we can read the attributes of a selected element.
    String href = firstLink.attr("href");
    String className = firstLink.attr("class");
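A few related attribute calls are sketched below on an in-memory link. The class name `AttributeExample`, the helper `firstLink`, the sample markup, and the base URI are assumptions for illustration. The sketch shows `attr` for reading a value, `hasAttr` for checking presence, and `absUrl` for resolving a relative `href` against the base URI passed to `Jsoup.parse`. Note that `attr` returns an empty string, not null, when the attribute is absent.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AttributeExample {

    // Parses the HTML with a base URI and returns the first anchor (null if there is none).
    static Element firstLink(String html, String baseUri) {
        return Jsoup.parse(html, baseUri).select("a").first();
    }

    public static void main(String[] args) {
        // Hypothetical link and base URI.
        Element link = firstLink("<a class='nav' href='/docs'>Docs</a>", "https://spring.io/");

        System.out.println("href as written : " + link.attr("href"));
        System.out.println("class           : " + link.attr("class"));
        System.out.println("has title?      : " + link.hasAttr("title"));
        System.out.println("missing attr    : '" + link.attr("title") + "'");
        System.out.println("absolute href   : " + link.absUrl("href"));
    }
}
```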
Examples
1. Get All Hyperlinks
    package com.codenuclear;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JSoupExamples {

        public static void main(String[] args) throws IOException {
            String url = "https://spring.io/docs";
            Document doc = Jsoup.connect(url).get();

            // Select every anchor tag and print its target and link text
            Elements links = doc.select("a");
            for (Element link : links) {
                System.out.println("\nLink :" + link.attr("href"));
                System.out.println("Title :" + link.text());
            }
        }
    }
2. Get All Images
    package com.codenuclear;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JSoupExamples {

        public static void main(String[] args) throws IOException {
            String url = "https://www.google.com";
            Document doc = Jsoup.connect(url).get();

            // Match <img> tags whose src contains .png, .jpg, .jpeg or .gif (case-insensitive)
            Elements images = doc.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
            for (Element image : images) {
                System.out.println("\nsrc : " + image.attr("src"));
                System.out.println("height : " + image.attr("height"));
                System.out.println("width : " + image.attr("width"));
                System.out.println("alt : " + image.attr("alt"));
            }
        }
    }
3. Select Elements based on Id/Class Name
    package com.codenuclear;

    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JSoupExamples {

        public static void main(String[] args) throws IOException {
            String url = "http://www.codenuclear.com/category/java/";
            Document doc = Jsoup.connect(url).get();

            /*** Selecting with Id ***/
            Elements breadCrumbs = doc.select("#flash-breadcrumbs");
            for (Element breadCrumb : breadCrumbs) {
                System.out.println(breadCrumb.text());
            }

            /*** Selecting with Class Name ***/
            Elements widgetTitles = doc.select(".widget-title");
            for (Element widgetTitle : widgetTitles) {
                // Print the current element's text, not the combined text of the whole list
                System.out.println(widgetTitle.text());
            }
        }
    }