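Overview:
Boilerpipe is the main-content extraction tool we need. The basic idea behind its algorithm is to train a classifier that separates the text we want from the surrounding page boilerplate. It ships with several extraction strategies; see CommonExtractors for the full list.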
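To make the classifier idea concrete, here is a minimal sketch of classifying text blocks by shallow features. It is a toy illustration, not Boilerpipe's actual implementation or API: the class BlockClassifierSketch, the two features (word count and link density), and both thresholds are assumptions picked for the example, and the whitespace-based word count would need replacing for Chinese text.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: NOT Boilerpipe's real classifier or API.
class BlockClassifierSketch {

    /** A text block described by two shallow features. */
    static class Block {
        final String text;
        final int linkedWords; // number of words that sit inside <a> tags

        Block(String text, int linkedWords) {
            this.text = text;
            this.linkedWords = linkedWords;
        }

        int wordCount() {
            String trimmed = text.trim();
            return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        }

        double linkDensity() {
            int words = wordCount();
            return words == 0 ? 0.0 : (double) linkedWords / words;
        }
    }

    // Hypothetical thresholds chosen for illustration, not learned values.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** A block counts as content if it is long enough and not mostly links. */
    static boolean isContent(Block b) {
        return b.wordCount() >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the blocks classified as content, one per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                sb.append(b.text.trim()).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<Block>();
        blocks.add(new Block("Home News Finance Sports", 4)); // navigation bar
        blocks.add(new Block("The central bank said on Friday that it would keep "
                + "interest rates unchanged for the rest of the year.", 0));
        blocks.add(new Block("Copyright notice", 0)); // footer
        System.out.print(extract(blocks));
    }
}
```

In this sketch the navigation bar and the footer are dropped because they are short or link-heavy, and only the article sentence is kept. A trained classifier would learn such decision rules from labeled pages rather than hard-coding them.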
Environment:
jdk1.6
boilerpipe-1.2.0
Demo code for extracting the body text of a news article:
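import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.CommonExtractors;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
public class BoilerpipeDemo { // class name is illustrative
    public static void main(String[] args) throws Exception {
        String url = "http://finance.people.com.cn/n/2013/1011/c66323-23157265.html";
        // Parse the page into Boilerpipe's block-based document model.
        TextDocument doc = new BoilerpipeSAXInput(new InputSource(new URL(url).openStream()))
                .getTextDocument();
        // Mark boilerplate blocks so that only the article text remains as content.
        BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
        extractor.process(doc);
        System.out.println("title:" + doc.getTitle());
        System.out.println("content:" + doc.getContent());
    }
}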
See the attachment for the required library dependencies.
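The demo above hands the raw byte stream to InputSource and leaves charset detection to the parser. If the target page is served in a legacy Chinese encoding such as GB2312 or GBK (an assumption about the page, not something verified here), the extracted text may come out garbled. One possible workaround, sketched below with only JDK classes, is to wrap the stream in a Reader with an explicit charset. The class CharsetInputSourceSketch and its helper methods are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.xml.sax.InputSource;

class CharsetInputSourceSketch {

    // Hypothetical helper: builds an InputSource whose character stream
    // decodes the given bytes with an explicitly named charset.
    static InputSource toInputSource(InputStream in, String charsetName) throws IOException {
        return new InputSource(new InputStreamReader(in, charsetName));
    }

    // Reads the whole character stream of the source; used here only to
    // show that the bytes were decoded as intended.
    static String readAll(InputSource source) throws IOException {
        Reader reader = source.getCharacterStream();
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a GBK-encoded page in memory instead of fetching a URL.
        byte[] gbk = "<title>\u8d22\u7ecf</title>".getBytes("GBK");
        InputSource source = toInputSource(new ByteArrayInputStream(gbk), "GBK");
        System.out.println(readAll(source));
    }
}
```

In the demo, the result of toInputSource(new URL(url).openStream(), "GB2312") could be passed to the BoilerpipeSAXInput constructor in place of the plain InputSource, provided the charset name matches what the page actually uses.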