jsoup Crawler


Contents

    • 1. A Brief Introduction to jsoup
    • 2. The Code
        • 2.1 Importing the pom Dependencies
        • 2.2 Image Crawling
        • 2.3 Localizing Images
    • 3. Baidu Cloud Link Crawler

1. A Brief Introduction to jsoup

jsoup is a Java HTML parser that lets you extract and manipulate data in an HTML document through the DOM, CSS selectors, and jQuery-like methods.
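
For instance, parsing an HTML snippet and pulling out links with a CSS selector looks like the sketch below (the class name, HTML string and selector are made up purely for illustration):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupQuickStart {
    public static void main(String[] args) {
        // a made-up HTML fragment, just for illustration
        String html = "<div id='list'><a href='/post/1'>First post</a><a href='/post/2'>Second post</a></div>";
        Document doc = Jsoup.parse(html);
        // CSS selector, jQuery-style: all <a> elements inside #list
        Elements links = doc.select("#list a");
        for (Element link : links) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }
    }
}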

The two crawlers below involve the following points:

1. Fetching the page content with HttpClient
2. Parsing the page content with Jsoup (a minimal sketch of these first two steps follows this list)
3. For incremental crawling, using an Ehcache cache to de-duplicate URLs that have already been crawled
4. Storing the crawled data in the database
5. To get around hotlink protection on some sites, localizing the other site's static resources (only images are handled here)
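
Here is a minimal, self-contained sketch of steps 1 and 2 (fetch with HttpClient, parse with jsoup). The class name is made up, and the CSS selector is the cnblogs one used in BlogCrawlerStarter below; cnblogs' markup may have changed since this was written:

import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FetchAndParseDemo {
    public static void main(String[] args) throws Exception {
        // 1. fetch the page with HttpClient
        try (CloseableHttpClient httpClient = HttpClients.createDefault()) {
            HttpGet httpGet = new HttpGet("https://www.cnblogs.com/");
            httpGet.setConfig(RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build());
            try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
                if (response.getStatusLine().getStatusCode() != 200) {
                    return;
                }
                HttpEntity entity = response.getEntity();
                String html = EntityUtils.toString(entity, "utf-8");

                // 2. parse the HTML with jsoup and pull out the post links
                Document doc = Jsoup.parse(html);
                for (Element a : doc.select("#post_list .post_item .post_item_body h3 a")) {
                    System.out.println(a.attr("href"));
                }
            }
        }
    }
}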

2. The Code

2.1 Importing the pom Dependencies

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.zrh</groupId>
    <artifactId>T226_jsoup</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>T226_jsoup</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <!-- MySQL JDBC driver -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.44</version>
        </dependency>
        <!-- HttpClient support -->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.2</version>
        </dependency>
        <!-- jsoup support -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.10.1</version>
        </dependency>
        <!-- logging support (log4j) -->
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.16</version>
        </dependency>
        <!-- Ehcache support -->
        <dependency>
            <groupId>net.sf.ehcache</groupId>
            <artifactId>ehcache</artifactId>
            <version>2.10.3</version>
        </dependency>
        <!-- Commons IO support -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.5</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>
    </dependencies>
</project>

2.2 Image Crawling

Change the image URL below to the address of the image you want to crawl:

private static String URL = "http://www.yidianzhidao.com/UploadFiles/img_1_446119934_1806045383_26.jpg";

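The downloader itself was only shown as a screenshot in the original post, so here is a minimal sketch of what fetching that image with HttpClient and saving it with commons-io typically looks like; the class name and the target path D://blogCrawler/temp/ are made up for illustration:

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class ImageDownloadDemo {

    private static String URL = "http://www.yidianzhidao.com/UploadFiles/img_1_446119934_1806045383_26.jpg";

    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
                CloseableHttpResponse response = httpClient.execute(new HttpGet(URL))) {
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                // derive the file extension from the Content-Type header, e.g. image/jpeg -> jpeg
                String suffix = entity.getContentType().getValue().split("/")[1];
                // hypothetical target path; adjust to your own machine
                File target = new File("D://blogCrawler/temp/demo." + suffix);
                FileUtils.copyInputStreamToFile(entity.getContent(), target);
                System.out.println("saved to " + target.getAbsolutePath());
            }
        }
    }
}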

2.3 Localizing Images

crawler.properties

dbUrl=jdbc:mysql://localhost:3306/zrh?autoReconnect=true
dbUserName=root
dbPassword=123
jdbcName=com.mysql.jdbc.Driver
ehcacheXmlPath=C://blogCrawler/ehcache.xml
blogImages=D://blogCrawler/blogImages/
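
crawler.properties points Ehcache at C://blogCrawler/ehcache.xml, but that file itself is not shown in the article. A minimal sketch of what it could look like — the cache names cnblog and 8gli_movies match the names the crawler classes below look up; the sizes, expiry and persistence settings are placeholder assumptions to tune yourself:

<?xml version="1.0" encoding="UTF-8"?>
<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd">

    <!-- where Ehcache swaps cached entries to disk (placeholder path) -->
    <diskStore path="C:/blogCrawler/ehcache" />

    <defaultCache maxElementsInMemory="10000" eternal="false"
        timeToIdleSeconds="120" timeToLiveSeconds="120" />

    <!-- URL de-duplication cache used by BlogCrawlerStarter and PanZhaoZhaoCrawler3 -->
    <cache name="cnblog" maxElementsInMemory="100000" eternal="true">
        <persistence strategy="localTempSwap" />
    </cache>

    <!-- URL de-duplication cache used by MovieCrawlerStarter -->
    <cache name="8gli_movies" maxElementsInMemory="100000" eternal="true">
        <persistence strategy="localTempSwap" />
    </cache>
</ehcache>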

log4j.properties

# Console appender
log4j.rootLogger=INFO, stdout, D

log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target=System.out
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=[%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n

# Rolling file appender
log4j.appender.D = org.apache.log4j.RollingFileAppender
log4j.appender.D.File = C://blogCrawler/bloglogs/log.log
log4j.appender.D.MaxFileSize=100KB
log4j.appender.D.MaxBackupIndex=100  
log4j.appender.D.Append = true
log4j.appender.D.layout = org.apache.log4j.PatternLayout
log4j.appender.D.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss}  [ %t:%r ] - [ %p ]  %m%n 

DbUtil.java

package com.zrh.util;

import java.sql.Connection;
import java.sql.DriverManager;

/**
 * Database utility class.
 * @author user
 */
public class DbUtil {

    /**
     * Open a connection.
     * @return
     * @throws Exception
     */
    public Connection getCon() throws Exception {
        Class.forName(PropertiesUtil.getValue("jdbcName"));
        Connection con = DriverManager.getConnection(PropertiesUtil.getValue("dbUrl"),
                PropertiesUtil.getValue("dbUserName"), PropertiesUtil.getValue("dbPassword"));
        return con;
    }

    /**
     * Close a connection.
     * @param con
     * @throws Exception
     */
    public void closeCon(Connection con) throws Exception {
        if (con != null) {
            con.close();
        }
    }

    public static void main(String[] args) {
        DbUtil dbUtil = new DbUtil();
        try {
            dbUtil.getCon();
            System.out.println("数据库连接成功");
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("数据库连接失败");
        }
    }
}

PropertiesUtil.java

package com.zrh.util;

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/**
 * Properties utility class.
 * @author user
 */
public class PropertiesUtil {

    /**
     * Look up a value by key from crawler.properties.
     * @param key
     * @return
     */
    public static String getValue(String key) {
        Properties prop = new Properties();
        // try-with-resources ensures the stream is closed
        try (InputStream in = PropertiesUtil.class.getResourceAsStream("/crawler.properties")) {
            prop.load(in);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return prop.getProperty(key);
    }
}

Now for the most important code:
BlogCrawlerStarter.java (the core class)

package com.zrh.crawler;
import java.io.File;
import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

import org.apache.commons.io.FileUtils;
import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DateUtil;
import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

/**
 * @author Administrator
 */
public class BlogCrawlerStarter {

    private static Logger logger = Logger.getLogger(BlogCrawlerStarter.class);
    // https://www.csdn.net/nav/newarticles
    private static String HOMEURL = "https://www.cnblogs.com/";
    private static CloseableHttpClient httpClient;
    private static Connection con;
    private static CacheManager cacheManager;
    private static Cache cache;

    /**
     * Fetch the home page with HttpClient and parse its content.
     */
    public static void parseHomePage() {
        logger.info("开始爬取首页:" + HOMEURL);

        cacheManager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = cacheManager.getCache("cnblog");

        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(HOMEURL);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(HOMEURL + ":爬取无响应");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String homePageContent = EntityUtils.toString(entity, "utf-8");
                // System.out.println(homePageContent);
                parseHomePageContent(homePageContent);
            }
        } catch (ClientProtocolException e) {
            logger.error(HOMEURL + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(HOMEURL + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                logger.error(HOMEURL + "-IOException", e);
            }
        }

        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        cacheManager.shutdown();
        logger.info("结束爬取首页:" + HOMEURL);
    }

    /**
     * Parse the home page content with jsoup and extract the data we want (the blog post links).
     *
     * @param homePageContent
     */
    private static void parseHomePageContent(String homePageContent) {
        Document doc = Jsoup.parse(homePageContent);
        // #feedlist_id .list_con .title h2 a
        Elements aEles = doc.select("#post_list .post_item .post_item_body h3 a");
        for (Element aEle : aEles) {
            // a single blog link URL from the home page post list
            String blogUrl = aEle.attr("href");
            if (null == blogUrl || "".equals(blogUrl)) {
                logger.info("该博客未内容,不再爬取插入数据库!");
                continue;
            }
            if (cache.get(blogUrl) != null) {
                logger.info("该数据已经被爬取到数据库中,数据库不再收录!");
                continue;
            }
            // System.out.println("*****" + blogUrl + "*****");
            parseBlogUrl(blogUrl);
        }
    }

    /**
     * Fetch a single blog page by URL to get its title and content.
     *
     * @param blogUrl
     */
    private static void parseBlogUrl(String blogUrl) {
        logger.info("开始爬取博客网页:" + blogUrl);
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(blogUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(blogUrl + ":爬取无响应");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String blogContent = EntityUtils.toString(entity, "utf-8");
                parseBlogContent(blogContent, blogUrl);
            }
        } catch (ClientProtocolException e) {
            logger.error(blogUrl + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(blogUrl + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(blogUrl + "-IOException", e);
            }
        }
        logger.info("结束爬取博客网页:" + HOMEURL);
    }

    /**
     * Parse a blog page and extract its title and full body.
     *
     * @param blogContent
     */
    private static void parseBlogContent(String blogContent, String link) {
        Document doc = Jsoup.parse(blogContent);
        if (!link.contains("ansion2014")) {
            System.out.println(blogContent);
        }
        Elements titleEles = doc
                // #mainBox main .blog-content-box .article-header-box .article-header .article-title-box h1
                .select("#topics .post h1 a");
        System.out.println(titleEles.toString());
        if (titleEles.size() == 0) {
            logger.info("博客标题为空,不插入数据库!");
            return;
        }
        String title = titleEles.get(0).html();

        Elements blogContentEles = doc.select("#cnblogs_post_body ");
        if (blogContentEles.size() == 0) {
            logger.info("博客内容为空,不插入数据库!");
            return;
        }
        String blogContentBody = blogContentEles.get(0).html();

//      Elements imgEles = doc.select("img");
//      List<String> imgUrlList = new LinkedList<String>();
//      if(imgEles.size() > 0) {
//          for (Element imgEle : imgEles) {
//              imgUrlList.add(imgEle.attr("src"));
//          }
//      }
//
//      if(imgUrlList.size() > 0) {
//          Map<String, String> replaceUrlMap = downloadImgList(imgUrlList);
//          blogContent = replaceContent(blogContent,replaceUrlMap);
//      }

        String sql = "insert into `t_jsoup_article` values(null,?,?,null,now(),0,0,null,?,0,null)";
        try {
            PreparedStatement pst = con.prepareStatement(sql);
            pst.setObject(1, title);
            pst.setObject(2, blogContentBody);
            pst.setObject(3, link);
            if (pst.executeUpdate() == 0) {
                logger.info("爬取博客信息插入数据库失败");
            } else {
                cache.put(new net.sf.ehcache.Element(link, link));
                logger.info("爬取博客信息插入数据库成功");
            }
        } catch (SQLException e) {
            logger.error("数据异常-SQLException:", e);
        }
    }

    /**
     * Rewrite the blog content, replacing the original image URLs with the local ones.
     * @param blogContent
     * @param replaceUrlMap
     * @return
     */
    private static String replaceContent(String blogContent, Map<String, String> replaceUrlMap) {
        for (Map.Entry<String, String> entry : replaceUrlMap.entrySet()) {
            blogContent = blogContent.replace(entry.getKey(), entry.getValue());
        }
        return blogContent;
    }

    /**
     * Localize images hosted on the other site's server.
     * @param imgUrlList
     * @return
     */
    private static Map<String, String> downloadImgList(List<String> imgUrlList) {
        Map<String, String> replaceMap = new HashMap<String, String>();
        for (String imgUrl : imgUrlList) {
            CloseableHttpClient httpClient = HttpClients.createDefault();
            HttpGet httpGet = new HttpGet(imgUrl);
            RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
            httpGet.setConfig(config);
            CloseableHttpResponse response = null;
            try {
                response = httpClient.execute(httpGet);
                if (response == null) {
                    logger.info(HOMEURL + ":爬取无响应");
                } else {
                    if (response.getStatusLine().getStatusCode() == 200) {
                        HttpEntity entity = response.getEntity();
                        String blogImagesPath = PropertiesUtil.getValue("blogImages");
                        String dateDir = DateUtil.getCurrentDatePath();
                        String uuid = UUID.randomUUID().toString();
                        String subfix = entity.getContentType().getValue().split("/")[1];
                        String fileName = blogImagesPath + dateDir + "/" + uuid + "." + subfix;

                        FileUtils.copyInputStreamToFile(entity.getContent(), new File(fileName));
                        replaceMap.put(imgUrl, fileName);
                    }
                }
            } catch (ClientProtocolException e) {
                logger.error(imgUrl + "-ClientProtocolException", e);
            } catch (IOException e) {
                logger.error(imgUrl + "-IOException", e);
            } catch (Exception e) {
                logger.error(imgUrl + "-Exception", e);
            } finally {
                try {
                    if (response != null) {
                        response.close();
                    }
                } catch (IOException e) {
                    logger.error(imgUrl + "-IOException", e);
                }
            }
        }
        return replaceMap;
    }

    public static void start() {
        while (true) {
            DbUtil dbUtil = new DbUtil();
            try {
                con = dbUtil.getCon();
                parseHomePage();
            } catch (Exception e) {
                logger.error("数据库连接失败!");
            } finally {
                try {
                    if (con != null) {
                        con.close();
                    }
                } catch (SQLException e) {
                    logger.error("数据关闭异常-SQLException:", e);
                }
            }
            try {
                Thread.sleep(1000 * 60);
            } catch (InterruptedException e) {
                logger.error("主线程休眠异常-InterruptedException:", e);
            }
        }
    }

    public static void main(String[] args) {
        start();
    }
}
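
The insert statement above writes into a t_jsoup_article table with eleven positional columns, but the article never shows the table definition. A hedged sketch of a schema that matches that column order — apart from the three bound parameters (title, content, link), the column names and types are guesses:

CREATE TABLE `t_jsoup_article` (
  `id`            BIGINT NOT NULL AUTO_INCREMENT,  -- 1st value: null (auto increment)
  `title`         VARCHAR(255),                    -- 2nd value: bound parameter (blog title)
  `content`       LONGTEXT,                        -- 3rd value: bound parameter (blog body HTML)
  `summary`       VARCHAR(500),                    -- 4th value: null (guess)
  `created_time`  DATETIME,                        -- 5th value: now()
  `view_count`    INT DEFAULT 0,                   -- 6th value: 0 (guess)
  `comment_count` INT DEFAULT 0,                   -- 7th value: 0 (guess)
  `author`        VARCHAR(100),                    -- 8th value: null (guess)
  `url`           VARCHAR(500),                    -- 9th value: bound parameter (source link)
  `status`        INT DEFAULT 0,                   -- 10th value: 0 (guess)
  `remark`        VARCHAR(255),                    -- 11th value: null (guess)
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;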

Run it, and you can check that the crawled rows have been inserted into the database.

3. Baidu Cloud Link Crawler

PanZhaoZhaoCrawler3.java

package com.zrh.crawler;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedList;
import java.util.List;
import java.util.UUID;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

public class PanZhaoZhaoCrawler3 {

    private static Logger logger = Logger.getLogger(PanZhaoZhaoCrawler3.class);
    private static String URL = "http://www.13910.com/daren/";
    private static String PROJECT_URL = "http://www.13910.com";
    private static Connection con;
    private static CacheManager manager;
    private static Cache cache;
    private static CloseableHttpClient httpClient;
    private static long total = 0;

    /**
     * Fetch the front page content with HttpClient.
     */
    public static void parseHomePage() {
        logger.info("开始爬取:" + URL);

        manager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = manager.getCache("cnblog");

        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(URL);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("链接超时!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parsePageContent(pageContent);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(URL + "-解析异常-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(URL + "-解析异常-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
                if (httpClient != null) {
                    httpClient.close();
                }
            } catch (IOException e) {
                logger.error(URL + "-解析异常-IOException", e);
            }
        }

        // finally flush the cached URLs to disk
        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        manager.shutdown();
        logger.info("结束爬取:" + URL);
    }

    /**
     * Parse the front page content with Jsoup.
     * @param pageContent
     */
    private static void parsePageContent(String pageContent) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select(".showtop .key-right .darenlist .list-info .darentitle a");
        for (Element aEle : aEles) {
            String aHref = aEle.attr("href");
            logger.info("提取个人代理分享主页:" + aHref);
            String panZhaoZhaoUserShareUrl = PROJECT_URL + aHref;
            List<String> panZhaoZhaoUserShareUrls = getPanZhaoZhaoUserShareUrls(panZhaoZhaoUserShareUrl);
            for (String singlePanZhaoZhaoUserShareUrl : panZhaoZhaoUserShareUrls) {
                // System.out.println("*****" + singlePanZhaoZhaoUserShareUrl + "*****");
                // continue;
                parsePanZhaoZhaoUserShareUrl(singlePanZhaoZhaoUserShareUrl);
            }
        }
    }

    /**
     * Collect the first 15 listing pages of a user's share home page.
     * @param panZhaoZhaoUserShareUrl
     * @return
     */
    private static List<String> getPanZhaoZhaoUserShareUrls(String panZhaoZhaoUserShareUrl) {
        List<String> list = new LinkedList<String>();
        list.add(panZhaoZhaoUserShareUrl);
        for (int i = 2; i < 16; i++) {
            list.add(panZhaoZhaoUserShareUrl + "page-" + i + ".html");
        }
        return list;
    }

    /**
     * Parse the user URL as rewritten by 13910.com.
     * Original: http://yun.baidu.com/share/home?uk=1949795117
     * Rewritten: http://www.13910.com/u/1949795117/share/
     * @param panZhaoZhaoUserShareUrl the rewritten URL
     */
    private static void parsePanZhaoZhaoUserShareUrl(String panZhaoZhaoUserShareUrl) {
        logger.info("开始爬取个人代理分享主页::" + panZhaoZhaoUserShareUrl);
        HttpGet httpGet = new HttpGet(panZhaoZhaoUserShareUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("链接超时!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    parsePanZhaoZhaoUserSharePageContent(pageContent, panZhaoZhaoUserShareUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(panZhaoZhaoUserShareUrl + "-解析异常-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(panZhaoZhaoUserShareUrl + "-解析异常-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(panZhaoZhaoUserShareUrl + "-解析异常-IOException", e);
            }
        }
        logger.info("结束爬取个人代理分享主页::" + URL);
    }

    /**
     * From the content of a user's share page, collect all of the rewritten share links.
     * @param pageContent
     * @param panZhaoZhaoUserShareUrl the rewritten user share page URL
     */
    private static void parsePanZhaoZhaoUserSharePageContent(String pageContent, String panZhaoZhaoUserShareUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select("#flist li a");
        if (aEles.size() == 0) {
            logger.info("没有爬取到百度云地址");
            return;
        }
        for (Element aEle : aEles) {
            String ahref = aEle.attr("href");
            parseUserHandledTargetUrl(PROJECT_URL + ahref);
        }
        // System.out.println("*****" + aEles.size() + "*****");
    }

    /**
     * Parse the target page; it contains the rewritten Baidu Cloud address.
     * @param handledTargetUrl
     */
    private static void parseUserHandledTargetUrl(String handledTargetUrl) {
        logger.info("开始爬取blog::" + handledTargetUrl);
        if (cache.get(handledTargetUrl) != null) {
            logger.info("数据库已存在该记录");
            return;
        }
        HttpGet httpGet = new HttpGet(handledTargetUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("链接超时!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    // System.out.println("*****" + pageContent + "*****");
                    parseHandledTargetUrlPageContent(pageContent, handledTargetUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(handledTargetUrl + "-解析异常-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(handledTargetUrl + "-解析异常-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(handledTargetUrl + "-解析异常-IOException", e);
            }
        }
        logger.info("结束爬取blog::" + URL);
    }

    /**
     * Parse the page behind the rewritten Baidu Cloud address.
     * @param pageContent
     * @param handledTargetUrl the rewritten Baidu Cloud address
     */
    private static void parseHandledTargetUrlPageContent(String pageContent, String handledTargetUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select(".fileinfo .panurl a");
        if (aEles.size() == 0) {
            logger.info("没有爬取到百度云地址");
            return;
        }
        String ahref = aEles.get(0).attr("href");
        // System.out.println("*****" + ahref + "*****");
        getUserBaiduYunUrl(PROJECT_URL + ahref);
    }

    /**
     * Fetch the content behind the processed Baidu Cloud link.
     * @param handledBaiduYunUrl the processed Baidu Cloud link
     */
    private static void getUserBaiduYunUrl(String handledBaiduYunUrl) {
        logger.info("开始爬取blog::" + handledBaiduYunUrl);
        HttpGet httpGet = new HttpGet(handledBaiduYunUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("链接超时!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "utf-8");
                    // System.out.println("*****" + pageContent + "*****");
                    parseHandledBaiduYunUrlPageContent(pageContent, handledBaiduYunUrl);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(handledBaiduYunUrl + "-解析异常-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(handledBaiduYunUrl + "-解析异常-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(handledBaiduYunUrl + "-解析异常-IOException", e);
            }
        }
        logger.info("结束爬取blog::" + URL);
    }

    /**
     * Extract the real Baidu Cloud link and store it.
     * @param pageContent
     * @param handledBaiduYunUrl
     */
    private static void parseHandledBaiduYunUrlPageContent(String pageContent, String handledBaiduYunUrl) {
        Document doc = Jsoup.parse(pageContent);
        Elements aEles = doc.select("#check-result-no a");
        if (aEles.size() == 0) {
            logger.info("没有爬取到百度云地址");
            return;
        }
        String ahref = aEles.get(0).attr("href");
        if ((!ahref.contains("yun.baidu.com")) && (!ahref.contains("pan.baidu.com"))) {
            return;
        }
        logger.info("*****" + "爬取到第" + (++total) + "个目标对象:" + ahref + "*****");
        // System.out.println("爬取到第" + (++total) + "个目标对象:" + ahref);

        String sql = "insert into `t_jsoup_article` values(null,?,?,null,now(),0,0,null,?,0,null)";
        try {
            PreparedStatement pst = con.prepareStatement(sql);
            // pst.setObject(1, UUID.randomUUID().toString());
            pst.setObject(1, "测试类容");
            pst.setObject(2, ahref);
            pst.setObject(3, ahref);
            if (pst.executeUpdate() == 0) {
                logger.info("爬取链接插入数据库失败!!!");
            } else {
                cache.put(new net.sf.ehcache.Element(handledBaiduYunUrl, handledBaiduYunUrl));
                logger.info("爬取链接插入数据库成功!!!");
            }
        } catch (SQLException e) {
            logger.error(ahref + "-解析异常-SQLException", e);
        }
    }

    public static void start() {
        DbUtil dbUtil = new DbUtil();
        try {
            con = dbUtil.getCon();
            parseHomePage();
        } catch (Exception e) {
            logger.error("数据库创建失败", e);
        }
    }

    public static void main(String[] args) {
        start();
    }
}

This crawls the share links listed under http://www.13910.com/daren/ (the URL hard-coded in the class above).

Next, crawl the movies you want:
MovieCrawlerStarter.java

package com.zrh.crawler;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.LinkedList;
import java.util.List;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.zrh.util.DbUtil;
import com.zrh.util.PropertiesUtil;

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Status;

public class MovieCrawlerStarter {

    private static Logger logger = Logger.getLogger(MovieCrawlerStarter.class);
    private static String URL = "http://www.8gw.com/";
    private static String PROJECT_URL = "http://www.8gw.com";
    private static Connection con;
    private static CacheManager manager;
    private static Cache cache;
    private static CloseableHttpClient httpClient;
    private static long total = 0;

    /**
     * The 52 listing pages waiting to be crawled.
     *
     * @return
     */
    private static List<String> getUrls() {
        List<String> list = new LinkedList<String>();
        list.add("http://www.8gw.com/8gli/index8.html");
        for (int i = 2; i < 53; i++) {
            list.add("http://www.8gw.com/8gli/index8_" + i + ".html");
        }
        return list;
    }

    /**
     * Fetch the main content of a listing URL.
     *
     * @param url
     */
    private static void parseUrl(String url) {
        logger.info("开始爬取系列列表::" + url);
        HttpGet httpGet = new HttpGet(url);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info("链接超时!");
            } else {
                if (response.getStatusLine().getStatusCode() == 200) {
                    HttpEntity entity = response.getEntity();
                    String pageContent = EntityUtils.toString(entity, "GBK");
                    parsePageContent(pageContent, url);
                }
            }
        } catch (ClientProtocolException e) {
            logger.error(url + "-解析异常-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(url + "-解析异常-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(url + "-解析异常-IOException", e);
            }
        }
        logger.info("结束爬取系列列表::" + url);
    }

    /**
     * Extract the individual movie links from the current listing page.
     * @param pageContent
     * @param url
     */
    private static void parsePageContent(String pageContent, String url) {
        // System.out.println("*****" + url + "*****");
        Document doc = Jsoup.parse(pageContent);
        Elements liEles = doc.select(".span_2_800 #list_con li");
        for (Element liEle : liEles) {
            String movieUrl = liEle.select(".info a").attr("href");
            if (null == movieUrl || "".equals(movieUrl)) {
                logger.info("该影片未内容,不再爬取插入数据库!");
                continue;
            }
            if (cache.get(movieUrl) != null) {
                logger.info("该数据已经被爬取到数据库中,数据库不再收录!");
                continue;
            }
            parseSingleMovieUrl(PROJECT_URL + movieUrl);
        }
    }

    /**
     * Fetch a single movie page.
     * @param movieUrl
     */
    private static void parseSingleMovieUrl(String movieUrl) {
        logger.info("开始爬取影片网页:" + movieUrl);
        httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(movieUrl);
        RequestConfig config = RequestConfig.custom().setConnectTimeout(5000).setSocketTimeout(8000).build();
        httpGet.setConfig(config);
        CloseableHttpResponse response = null;
        try {
            response = httpClient.execute(httpGet);
            if (response == null) {
                logger.info(movieUrl + ":爬取无响应");
                return;
            }
            if (response.getStatusLine().getStatusCode() == 200) {
                HttpEntity entity = response.getEntity();
                String blogContent = EntityUtils.toString(entity, "GBK");
                parseSingleMovieContent(blogContent, movieUrl);
            }
        } catch (ClientProtocolException e) {
            logger.error(movieUrl + "-ClientProtocolException", e);
        } catch (IOException e) {
            logger.error(movieUrl + "-IOException", e);
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                logger.error(movieUrl + "-IOException", e);
            }
        }
        logger.info("结束爬取影片网页:" + movieUrl);
    }

    /**
     * Parse the main content of a movie page (name, description, poster image).
     * @param pageContent
     * @param movieUrl
     */
    private static void parseSingleMovieContent(String pageContent, String movieUrl) {
        // System.out.println("*****" + movieUrl + "*****");
        Document doc = Jsoup.parse(pageContent);
        Elements divEles = doc.select(".wrapper .main .moviedteail");
        // .wrapper .main .moviedteail .moviedteail_tt h1
        // .wrapper .main .moviedteail .moviedteail_list .moviedteail_list_short a
        // .wrapper .main .moviedteail .moviedteail_img img

        Elements h1Eles = divEles.select(".moviedteail_tt h1");
        if (h1Eles.size() == 0) {
            logger.info("影片名字为空,不插入数据库!");
            return;
        }
        String mname = h1Eles.get(0).html();

        Elements aEles = divEles.select(".moviedteail_list .moviedteail_list_short a");
        if (aEles.size() == 0) {
            logger.info("影片描述为空,不插入数据库!");
            return;
        }
        String mdesc = aEles.get(0).html();

        Elements imgEles = divEles.select(".moviedteail_img img");
        if (imgEles.size() == 0) {
            logger.info("影片图片为空,不插入数据库!");
            return;
        }
        String mimg = imgEles.attr("src");

        String sql = "insert into movie(mname,mdesc,mimg,mlink) values(?,?,?,99)";
        try {
            System.out.println("*****" + mname + "*****");
            System.out.println("*****" + mdesc + "*****");
            System.out.println("*****" + mimg + "*****");
            PreparedStatement pst = con.prepareStatement(sql);
            pst.setObject(1, mname);
            pst.setObject(2, mdesc);
            pst.setObject(3, mimg);
            if (pst.executeUpdate() == 0) {
                logger.info("爬取影片信息插入数据库失败");
            } else {
                cache.put(new net.sf.ehcache.Element(movieUrl, movieUrl));
                logger.info("爬取影片信息插入数据库成功");
            }
        } catch (SQLException e) {
            logger.error("数据异常-SQLException:", e);
        }
    }

    public static void main(String[] args) {
        manager = CacheManager.create(PropertiesUtil.getValue("ehcacheXmlPath"));
        cache = manager.getCache("8gli_movies");
        httpClient = HttpClients.createDefault();
        DbUtil dbUtil = new DbUtil();
        try {
            con = dbUtil.getCon();
            List<String> urls = getUrls();
            for (String url : urls) {
                try {
                    parseUrl(url);
                } catch (Exception e) {
                    // urls.add(url);
                }
            }
        } catch (Exception e1) {
            logger.error("数据库连接失败!");
        } finally {
            try {
                if (httpClient != null) {
                    httpClient.close();
                }
                if (con != null) {
                    con.close();
                }
            } catch (IOException e) {
                logger.error("网络连接关闭异常-IOException:", e);
            } catch (SQLException e) {
                logger.error("数据关闭异常-SQLException:", e);
            }
        }

        // finally flush the cached URLs to disk
        if (cache.getStatus() == Status.STATUS_ALIVE) {
            cache.flush();
        }
        manager.shutdown();
    }
}
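
The movie insert above names its columns (mname, mdesc, mimg, mlink), so the required table is easier to pin down. A hedged sketch — the surrogate key and the column types are assumptions:

CREATE TABLE `movie` (
  `mid`   INT NOT NULL AUTO_INCREMENT,  -- assumed surrogate key
  `mname` VARCHAR(255),                 -- movie title
  `mdesc` VARCHAR(1000),                -- movie description
  `mimg`  VARCHAR(500),                 -- poster image URL
  `mlink` VARCHAR(500),                 -- link column; the crawler always inserts the literal 99
  PRIMARY KEY (`mid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;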
