Seimi Basics Series 2: Integrating SeimiCrawler with MyBatis to Store Data

26 Jul 2016

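Quite a few people have recently been asking about integrating SeimiCrawler with MyBatis, so this post offers a starting point. If you are not yet familiar with SeimiCrawler, it also serves as a brief introduction.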

About SeimiCrawler

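SeimiCrawler is an agile, independently deployable Java crawler framework with distributed support. It aims to lower the barrier for newcomers to build a crawler system that is highly usable and performs reasonably well, and to make crawler development more efficient. In the world of SeimiCrawler, most people only need to write the business logic for crawling; Seimi takes care of the rest. Its design is inspired by Scrapy, the Python crawler framework, while incorporating the characteristics of the Java language and the features of Spring. It also hopes to make XPath, a more efficient way to parse HTML, easier and more common to use in China, so SeimiCrawler's default HTML parser is JsoupXpath (an independent extension project, not part of jsoup), and HTML data extraction is done with XPath by default (you can of course choose another parser for data processing). Combined with SeimiAgent, it also handles rendering and crawling of complex dynamic pages.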

Project source code

Hosted on GitHub

Now for the MyBatis integration itself, using MySQL as the example database.

Dependencies

<dependency>
	<groupId>cn.wanghaomiao</groupId>
	<artifactId>SeimiCrawler</artifactId>
	<version>1.2.0</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-dbcp2</artifactId>
	<version>2.1.1</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-pool2</artifactId>
	<version>2.4.2</version>
</dependency>
<dependency>
	<groupId>mysql</groupId>
	<artifactId>mysql-connector-java</artifactId>
	<version>5.1.37</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis-spring</artifactId>
	<version>1.3.0</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis</artifactId>
	<version>3.4.1</version>
</dependency>

Table schema

Assume a database named xiaohuo exists, containing the following table:

CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

The corresponding model object

package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * For XPath syntax, see http://jsoupxpath.wanghaomiao.cn/
 * @since 2015/10/27.
 */
public class BlogContent {
    private Integer id;

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    // Alternatively: @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

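    @Override
    public String toString() {
        // Truncate only a local copy for readable logs; overwriting the field
        // would cause the truncated text to be saved to the database later.
        String preview = content;
        if (StringUtils.isNotBlank(preview) && preview.length() > 100) {
            preview = StringUtils.substring(preview, 0, 100) + "...";
        }
        return new ToStringBuilder(this)
                .append("id", id)
                .append("title", title)
                .append("content", preview)
                .toString();
    }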
}

Configuration files for the MyBatis integration

Some basic global settings (mybatis-config.xml):

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>
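
The mapUnderscoreToCamelCase setting above tells MyBatis to map column names such as A_COLUMN to bean properties such as aColumn when query results are turned into objects. The insert used in this article does not depend on it, but a query reading update_time into an updateTime property would. The following is only a rough sketch of that naming rule in plain Java; the class and method names are made up for illustration and this is not MyBatis's own implementation.

```java
// Illustrative only: mimics the column-to-property naming rule that
// mapUnderscoreToCamelCase enables. Not MyBatis's actual code.
final class UnderscoreToCamelCaseSketch {

    /** Converts a column name such as "update_time" to "updateTime". */
    static String toCamelCase(String column) {
        StringBuilder out = new StringBuilder(column.length());
        boolean upperNext = false;
        for (char c : column.toCharArray()) {
            if (c == '_') {
                upperNext = true; // drop the underscore, capitalize the next letter
            } else if (upperNext) {
                out.append(Character.toUpperCase(c));
                upperNext = false;
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCamelCase("update_time"));
        System.out.println(toCamelCase("title"));
    }
}
```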
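
The Spring configuration that wires up the data source, the SqlSessionFactory, and the mapper scanner:

<?xml version="1.0" encoding="UTF-8"?>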
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">


    <context:annotation-config />
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${database.driverClassName}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>

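Placeholders such as ${database.driverClassName} appear in the configuration because the SeimiCrawler demo project also has dynamic-configuration settings. You can just as well hard-code the values here instead of reading them from another configuration.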
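
To make the placeholders concrete, here is a small sketch of which values they might resolve to and how a ${key} substitution works in principle. The driver class matches the mysql-connector-java 5.1.x dependency listed earlier; the host, port, user name, and password are assumptions for a local setup, and the PlaceholderSketch class is hypothetical rather than part of Spring or SeimiCrawler.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: shows example values and a naive ${key} substitution.
final class PlaceholderSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Assumed values for a local MySQL instance; adjust to your environment. */
    static Properties exampleProperties() {
        Properties p = new Properties();
        p.setProperty("database.driverClassName", "com.mysql.jdbc.Driver");
        p.setProperty("database.url",
                "jdbc:mysql://127.0.0.1:3306/xiaohuo?useUnicode=true&characterEncoding=UTF-8");
        p.setProperty("database.username", "root");      // assumed
        p.setProperty("database.password", "change-me"); // assumed
        return p;
    }

    /** Replaces each ${key} with its property value; unknown keys are left untouched. */
    static String resolve(String text, Properties props) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = props.getProperty(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("${database.url}", exampleProperties()));
    }
}
```

Hard-coding, as mentioned above, simply means writing these values directly into the value attributes of the data source bean.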

package cn.wanghaomiao.dao.mybatis;

import cn.wanghaomiao.model.BlogContent;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;

/**
 * @since 2016/7/27.
 */
public interface MybatisStoreDAO {

    @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
    @Options(useGeneratedKeys = true, keyProperty = "blog.id")
    int save(@Param("blog") BlogContent blog);
}

At this point, the MyBatis side is ready.
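
In the DAO above, save returns the number of affected rows, and because of useGeneratedKeys with keyProperty = "blog.id", the auto-increment key is written back into the bean's id after the insert. The sketch below imitates that contract with an in-memory fake, which can be handy for exercising crawler logic without a MySQL instance. BlogStub, BlogSaver, and InMemoryBlogSaver are hypothetical stand-ins invented for this example, not classes from the demo project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Minimal stand-in for BlogContent (hypothetical, only the fields needed here). */
final class BlogStub {
    Integer id;
    String title;
    String content;

    BlogStub(String title, String content) {
        this.title = title;
        this.content = content;
    }
}

/** Mirrors the shape of MybatisStoreDAO.save, without the MyBatis annotations. */
interface BlogSaver {
    int save(BlogStub blog);
}

/** In-memory fake that assigns ids the way an AUTO_INCREMENT column would. */
final class InMemoryBlogSaver implements BlogSaver {
    private final AtomicInteger nextId = new AtomicInteger(1);
    private final List<BlogStub> rows = new ArrayList<>();

    @Override
    public synchronized int save(BlogStub blog) {
        // Imitates useGeneratedKeys + keyProperty: the key is written back to the bean.
        blog.id = nextId.getAndIncrement();
        rows.add(blog);
        return 1; // one row affected
    }

    synchronized int size() {
        return rows.size();
    }
}
```

Before save is called the id is still null, which matches the id=<null> printed when the bean is logged ahead of the insert.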

Usage

package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * Stores the parsed data directly in the database, using the MyBatis integration
 *
 * @author 汪浩淼 [et.tw@163.com]
 * @since 2016/07/27.
 */
@Crawler(name = "mybatis")
public class DatabaseMybatisDemo extends BaseSeimiCrawler {
    @Autowired
    private MybatisStoreDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            logger.error("failed to extract blog links", e);
        }
    }

    public void renderBean(Response response) {
        try {
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            // Store to the DB through the MyBatis DAO
            int changeNum = storeToDbDAO.save(blog);
            int blogId = blog.getId();
            logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
        } catch (Exception e) {
            logger.error("failed to render or store blog, url={}", response.getUrl(), e);
        }
    }
}

Next, start it up:

import cn.wanghaomiao.seimi.core.Seimi;

public class Boot {
    public static void main(String[] args){
        Seimi s = new Seimi();
        s.start("mybatis");
    }
}

You should see log output like the following:

00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定义Bahavior特效及其源码分析CoordinatorLayout自定义Bahavior特效及其源码分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中可以算是最重要的一个东西,运用它可以做出一些不错的特效...],url=http://www.cnblogs.com/soaringEveryday/p/5711545.html
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1

Integration complete!

Afterword

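For packaging, deploying, and starting the project in production, the maven-seimicrawler-plugin packaging plugin is recommended. For details, see maven-seimicrawler-plugin or "Seimi Basics Series 1: Using the SeimiCrawler Packaging and Deployment Tool".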

Full demo project

Complete demo