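Table of contents
- 1. Overview
- 2. Notes
- 2.0 JS parsing problems
- 2.1 Turning off HtmlUnit logging
- 3. Usage
- 3.1 Scraping the IT之家 weekly ranking - single page
- 3.2 Scraping the ninth article of the IT之家 weekly ranking - two pages
- 3.3 Simulating user actions - (personally I find this feature close to useless: it only copes with very simple JS, while on most sites an action triggers a chain of complex JS, so for crawling I still recommend Selenium)
- 3.4 File download
- 3.5 Dialog handling
Note: results on pages such as Baidu Translate, Baidu Search, and Tencent Translate still cannot be scraped, and parsing of obfuscated ("encrypted") JS files mostly does not work. For pages with complex or obfuscated JS, Selenium is the recommended tool.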
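1. Overview
Official documentation: https://htmlunit.sourceforge.io/
Documentation with worked demos (best read alongside the official docs): https://www.scrapingbee.com/java-webscraping-book/
Purpose: a "GUI-less browser for Java programs". It models HTML documents and provides an API that lets you invoke pages, fill out forms, click links, and so on, just as you would in a "normal" browser.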
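2. Notes
2.0 JS parsing problems
According to the official documentation, the JS libraries HtmlUnit is known to handle are: htmx, jQuery, MochiKit, GWT, Sarissa, MooTools, Prototype, Ext, Dojo, and YUI. Obfuscated ("encrypted") JS files and other libraries are therefore likely to fail. That is why pages driven by obfuscated JS, such as Baidu Translate, Tencent Translate, and Youdao Translate, cannot be scraped this way. For those I suggest Selenium (Java): it is a heavier tool, but it works very well, because you simply scrape the rendered page and never have to analyze the browser's requests.
2.1 Turning off HtmlUnit logging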
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(Level.OFF);
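The one-liner above only has an effect when HtmlUnit's logging ends up in java.util.logging, which is an assumption about the classpath (no other logging backend bound). A minimal sketch of a slightly more defensive variant follows; the class name `QuietHtmlUnitLogging` and the method `silence` are made up for illustration. It keeps a strong reference to the logger, because java.util.logging may hold loggers only weakly and a level set on an unreferenced logger can be lost after garbage collection.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {

    // Strong reference so the configured level is not lost if the logger
    // would otherwise be garbage collected.
    private static final Logger HTMLUNIT_LOGGER = Logger.getLogger("com.gargoylesoftware");

    /** Turns off everything logged under the com.gargoylesoftware namespace. */
    static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Child loggers without their own level inherit OFF from the parent.
        Logger child = Logger.getLogger("com.gargoylesoftware.htmlunit.WebClient");
        System.out.println("SEVERE still loggable? " + child.isLoggable(Level.SEVERE));
    }
}
```

Calling `QuietHtmlUnitLogging.silence()` once before the first `WebClient` is created should be enough under the assumption stated above.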
3. Usage
Dependency: https://search.maven.org/artifact/net.sourceforge.htmlunit/htmlunit
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.58.0</version>
</dependency>
3.1 Scraping the IT之家 weekly ranking - single page
Scrape the contents of the IT之家 weekly ranking.
/**
 * IT之家
 */
@Test
@SneakyThrows
public void test10() {
    // Browser settings
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Open the page
    HtmlPage page = webClient.getPage("https://www.ithome.com/");

    // Hover the mouse over the weekly-ranking tab
    DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
    page = (HtmlPage) inputEle.mouseOver();
    DomElement ulElement = page.getFirstByXPath("//div[@id='rank']//ul[@id='d-2']");

    // Weekly-ranking contents
    System.out.println(ulElement.asNormalizedText());
}
The scrape succeeds.
3.2 Scraping the ninth article of the IT之家 weekly ranking - two pages
/**
 * Ninth article of the IT之家 weekly ranking
 */
@Test
@SneakyThrows
public void test11() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    HtmlPage page = webClient.getPage("https://www.ithome.com/");

    // Hover the mouse over the weekly-ranking tab
    DomElement inputEle = page.getFirstByXPath("//div[@id='rank']//li[@data-id='2']");
    page = (HtmlPage) inputEle.mouseOver();

    // Collect the article links
    List<DomElement> articleLinkElems = page.getByXPath("//div[@id='rank']//ul[@id='d-2']//a");
    if (CollUtil.isNotEmpty(articleLinkElems)) {
        // Ninth article (index 8); this assumes the ranking holds at least nine links
        page = articleLinkElems.get(8).click();
        DomElement articleDivElem = page.getFirstByXPath("//div[@id='dt']//div[@class='fl content']");
        System.out.println(articleDivElem.asNormalizedText());
    }
}
The scrape succeeds.
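Both tests above locate their elements with XPath. For readers unfamiliar with the expressions, here is a small sketch that evaluates the same two patterns with the JDK's built-in `javax.xml.xpath`, without HtmlUnit. The markup in `SAMPLE` is a hypothetical, heavily simplified stand-in and not the real ithome.com page; the class and method names are also made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical, heavily simplified stand-in for the ranking markup.
    static final String SAMPLE =
            "<div id='rank'>"
          + "<ul><li data-id='1'>Daily</li><li data-id='2'>Weekly</li></ul>"
          + "<ul id='d-2'>"
          + "<li><a href='/a1'>First article</a></li>"
          + "<li><a href='/a2'>Second article</a></li>"
          + "</ul>"
          + "</div>";

    /** Returns the text of the first node matched by the expression. */
    static String firstText(String expression) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The tab that is hovered in the tests above
        System.out.println(firstText("//div[@id='rank']//li[@data-id='2']"));
        // The second link inside the ranking list
        System.out.println(firstText("(//div[@id='rank']//ul[@id='d-2']//a)[2]"));
    }
}
```

Note that XPath positions are 1-based while `List.get` is 0-based, so `articleLinkElems.get(8)` corresponds to position `[9]` in an XPath expression.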
3.3 Simulating user actions - (personally I find this feature close to useless: it only copes with very simple JS, while on most sites an action triggers a chain of complex JS, so for crawling I still recommend Selenium)
Sample page
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function () {
        // Log in: post form 1 via ajax and append the JSON response to the page
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`);
                $("form").hide();
            }, "json");
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
Login endpoint code (Spring Boot). Note that the code below belongs to two separate files.
// File 1: SystemConfig.java
@Configuration
public class SystemConfig {

    // Allow cross-origin requests
    @Bean
    public CorsFilter corsFilter() {
        CorsConfiguration corsConfiguration = new CorsConfiguration();
        corsConfiguration.addAllowedOriginPattern("*");
        corsConfiguration.setAllowCredentials(true);
        corsConfiguration.addAllowedMethod("*");
        corsConfiguration.addAllowedHeader("*");
        UrlBasedCorsConfigurationSource configSource = new UrlBasedCorsConfigurationSource();
        configSource.registerCorsConfiguration("/**", corsConfiguration);
        return new CorsFilter(configSource);
    }
}

// File 2: LoginController.java
@Controller
@RequestMapping
@ResponseBody
public class LoginController {

    @PostMapping("login")
    public Map<String, Object> login(HttpServletRequest request) {
        // Echo the submitted parameters back and add one extra field
        Map<String, Object> parameterMap = new HashMap<>(request.getParameterMap());
        parameterMap.put("name", "嗯嗯*");
        return parameterMap;
    }
}
Simulating the user's form operations
/**
 * Simulate user input
 */
@Test
@SneakyThrows
public void test12() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Form 1: request submitted manually via ajax
    HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
    DomElement loginNameElem = page.getElementById("uname");
    loginNameElem.setAttribute("value", "root");
    DomElement passwordElem = page.getElementById("psw");
    passwordElem.setAttribute("value", "pswroot");

    // Submit form 1 by clicking its button
    DomElement startLoginBtnElem = page.getElementById("loginBtn");
    page = startLoginBtnElem.click();
    DomElement userInfoDivElem = page.getFirstByXPath("//h1");
    System.out.println(userInfoDivElem.asNormalizedText());

    // ==================================================
    // Form 2: regular form submission. The response is a JSON document rather
    // than an HTML page, so it has to be received as an UnexpectedPage.
    page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
    HtmlInput inputloginNameElem = (HtmlInput) page.getElementById("uname2");
    inputloginNameElem.setAttribute("value", "root2");
    HtmlInput inputpasswordElem = (HtmlInput) page.getElementById("psw2");
    inputpasswordElem.setAttribute("value", "pswroot2");

    // Submit form 2
    HtmlForm enclosingForm = inputloginNameElem.getEnclosingForm();
    UnexpectedPage page2 = webClient.getPage(enclosingForm.getWebRequest(null));

    // Read the response body
    System.out.println(page2.getWebResponse().getContentAsString(UTF_8));
}
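To make it easier to see what the second request carries, here is a rough sketch of how a form's name/value pairs are turned into an `application/x-www-form-urlencoded` body, which is the default encoding for a `method="post"` form such as form 2. It uses only the JDK; `FormBodySketch` and `encode` are names invented for this illustration, and the sketch does not claim to reproduce HtmlUnit's internal implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {

    /** Encodes name/value pairs as application/x-www-form-urlencoded text. */
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(encodePart(field.getKey()) + "=" + encodePart(field.getValue()));
        }
        return body.toString();
    }

    private static String encodePart(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        // Field names taken from form 2 of the sample page; insertion order is kept
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("uname", "root2");
        fields.put("psw", "pswroot2");
        System.out.println(encode(fields));
    }
}
```

On the server side, `request.getParameterMap()` decodes such a body back into the parameter map that the login endpoint echoes.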
3.4 File download
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        // Log in: post form 1 via ajax and append the JSON response to the page
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`);
                $("form").hide();
            }, "json");
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
File download endpoint
package work.linruchang.qq.htmlunitweb.controller;

import cn.hutool.core.util.StrUtil;
import lombok.SneakyThrows;
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.net.URLEncoder;
import java.util.HashMap;
import java.util.Map;

/**
 * Purpose:
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/09
 * @since 1.8
 **/
@Controller
@RequestMapping
@ResponseBody
public class HtmlUnitController {

    /**
     * File download test
     * http://127.0.0.1:8080/download
     * @param request
     * @param httpServletResponse
     * @return
     */
    @GetMapping("download")
    @SneakyThrows
    public ResponseEntity<FileSystemResource> download(HttpServletRequest request, HttpServletResponse httpServletResponse) {
        System.out.println(request.getSession().getId() + " download started");
        FileSystemResource fileSystemResource = new FileSystemResource("E:\\微信\\文件\\WeChat Files\\wxid_n7xzf77wr3wv22\\FileStorage\\File\\2022-02\\房东符金瑞名下楼栋需要批量处理.xlsx");

        HttpHeaders headers = new HttpHeaders();
        headers.add("Cache-Control", "no-cache, no-store, must-revalidate");
        // URL-encode the file name so the non-ASCII characters survive in the header
        headers.add("Content-Disposition", StrUtil.format("attachment; filename={}", URLEncoder.encode(fileSystemResource.getFilename(), "UTF-8")));
        headers.add("Pragma", "no-cache");
        headers.add("Expires", "0");

        return ResponseEntity.ok()
                .headers(headers)
                .contentLength(fileSystemResource.contentLength())
                .contentType(MediaType.parseMediaType("application/octet-stream"))
                .body(fileSystemResource);
    }
}
Testing the download with HtmlUnit
package work.linruchang.qq;

import cn.hutool.core.collection.CollUtil;
import cn.hutool.core.io.IoUtil;
import cn.hutool.core.lang.Console;
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;
import com.gargoylesoftware.htmlunit.javascript.host.event.KeyboardEvent;
import lombok.SneakyThrows;
import org.junit.Test;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.URLDecoder;
import java.util.List;
import java.util.logging.Level;

import static java.nio.charset.StandardCharsets.UTF_8;

/**
 * Purpose:
 *
 * @author LinRuChang
 * @version 1.0
 * @date 2022/02/08
 * @since 1.8
 **/
public class HtmlUnitTest {

    @Test
    @SneakyThrows
    public void test13() {
        WebClient webClient = new WebClient();
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(false);

        HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");
        //DomElement downloadBtn = page.getElementById("downloadBtn");
        DomElement downloadBtn = page.getElementById("downloadBtn2");

        // Trigger the download link
        Page clickPage = downloadBtn.click();

        // The following two statements are equivalent here
        //Page enclosedPage = webClient.getWebWindows().get(webClient.getWebWindows().size() - 1).getEnclosedPage();
        Page enclosedPage = clickPage.getEnclosingWindow().getEnclosedPage();

        // Read the file name from the Content-Disposition response header
        String responseHeaderValue = enclosedPage.getWebResponse().getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION);
        String documentName = responseHeaderValue.split(";")[1].split("=")[1].trim();
        documentName = URLDecoder.decode(documentName, "UTF-8");
        Console.log("File downloaded: {}", documentName);

        // Save the file to local disk; both streams are closed when the block exits
        try (InputStream contentAsStream = enclosedPage.getWebResponse().getContentAsStream();
             FileOutputStream out = new FileOutputStream("C:\\Users\\Administrator\\Desktop\\图片\\" + documentName)) {
            IoUtil.copy(contentAsStream, out);
        }
    }
}
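The file name above is extracted with `split(";")[1].split("=")[1]`, which works for the header produced by this particular endpoint but throws if the header is missing or has no second segment. A slightly more tolerant sketch, using only the JDK, might look like the following. `ContentDispositionSketch` and `extractFileName` are hypothetical names; the sketch assumes a plain `filename=` parameter that is URL-encoded in UTF-8 as in the controller above, and it does not handle the `filename*=` form or semicolons inside quoted names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Locale;

public class ContentDispositionSketch {

    private static final String KEY = "filename=";

    /** Returns the decoded filename parameter, or null when the header has none. */
    static String extractFileName(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        for (String part : headerValue.split(";")) {
            String trimmed = part.trim();
            if (trimmed.toLowerCase(Locale.ROOT).startsWith(KEY)) {
                String raw = trimmed.substring(KEY.length()).trim();
                // Strip optional surrounding double quotes
                if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                    raw = raw.substring(1, raw.length() - 1);
                }
                return decode(raw);
            }
        }
        return null;
    }

    private static String decode(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractFileName("attachment; filename=a%2Bb.xlsx"));
        System.out.println(extractFileName("attachment"));
    }
}
```

In the test above, the result of `getResponseHeaderValue(HttpHeader.CONTENT_DISPOSITION)` could be passed to such a helper, with a fallback name chosen when it returns null.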
3.5 Dialog handling
Sample page
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>HtmlUnit test</title>
</head>
<body>
    <form id="form" onclick="return false;">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark" required value="manual ajax submit">
            <label for="uname"><b>Account</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname" required>
            <label for="psw"><b>Password</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw" required>
            <button id="loginBtn" type="button">Log in</button>
        </div>
    </form>
    <form id="form2" method="post" action="http://127.0.0.1:8080/login">
        <div class="container">
            <input type="hidden" placeholder="Enter Username" name="mark" id="mark2" required value="form submit">
            <label for="uname2"><b>Account 2</b></label>
            <input type="text" placeholder="Enter Username" name="uname" id="uname2" required>
            <label for="psw2"><b>Password 2</b></label>
            <input type="password" placeholder="Enter Password" name="psw" id="psw2" required>
            <button id="loginBtn2" type="submit">Log in 2</button>
        </div>
    </form>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn">Download button (current page)</a>
    <br/>
    <a href="http://127.0.0.1:8080/download" id="downloadBtn2" target="_blank">Download button 2 (new page)</a>
    <br/>
    <button id="alertBtn">Show alert</button>
    <br/>
    <button id="promptBtn">Show prompt</button>
    <br/>
    <button id="confirmBtn">Show confirm</button>
</body>
<script src="file:///G:/VsCode/开源/jquery-3.5.1/jquery-3.5.1.min.js"></script>
<script>
    $(function() {
        var i = 0;
        $("#alertBtn").click(function() {
            alert("Alert triggered by click: #" + ++i);
        });
        var j = 0;
        $("#promptBtn").click(function() {
            prompt("Prompt triggered by click: #" + ++j, "default value 1111");
        });
        var k = 0;
        $("#confirmBtn").click(function() {
            confirm("Confirm triggered by click: #" + ++k);
        });

        // Log in: post form 1 via ajax and append the JSON response to the page
        function loginOperation() {
            $.post("http://127.0.0.1:8080/login", $("#form").serialize(), responseData => {
                $("body").append(`<h1>${JSON.stringify(responseData)}</h1>`);
                $("form").hide();
            }, "json");
            return false;
        }
        $("#loginBtn").click(loginOperation);
    })
</script>
</html>
Triggering the dialogs from HtmlUnit
// Besides the imports of HtmlUnitTest shown in 3.4, this test needs
// java.util.ArrayList and cn.hutool.core.util.StrUtil.
@Test
@SneakyThrows
public void test15() {
    WebClient webClient = new WebClient();
    webClient.setAjaxController(new NicelyResynchronizingAjaxController());
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setCssEnabled(true);
    webClient.getOptions().setJavaScriptEnabled(true);
    webClient.getOptions().setActiveXNative(false);

    // Alert handling: collect every alert message
    List<String> alertInfos = new ArrayList<>();
    webClient.setAlertHandler(new CollectingAlertHandler(alertInfos));

    // Prompt handling
    final List<String> promptInfos = new ArrayList<>();
    webClient.setPromptHandler(new PromptHandler() {
        @Override
        public String handlePrompt(Page page, String message, String defaultValue) {
            Console.log("Prompt info: {}, {}", message, defaultValue);
            promptInfos.add(message);
            // Answer with the message itself, or with the default value if the message is blank
            return StrUtil.blankToDefault(message, defaultValue);
        }
    });

    // Confirm handling
    final List<String> confirmInfos = new ArrayList<>();
    webClient.setConfirmHandler(new ConfirmHandler() {
        @Override
        public boolean handleConfirm(Page page, String message) {
            confirmInfos.add(message);
            // true = confirm, false = cancel the dialog
            return true;
        }
    });

    HtmlPage page = webClient.getPage("file:///C:/Users/Administrator/Desktop/index5.html");

    DomElement alertBtn = page.getElementById("alertBtn");
    page = alertBtn.click();

    DomElement promptBtn = page.getElementById("promptBtn");
    page = promptBtn.click();
    page = promptBtn.click();

    DomElement confirmBtn = page.getElementById("confirmBtn");
    page = confirmBtn.click();
    page = confirmBtn.click();
    page = confirmBtn.click();

    Console.log("Alert messages: {}", alertInfos);
    Console.log("Prompt messages: {}", promptInfos);
    Console.log("Confirm messages: {}", confirmInfos);
}