
How a Web Crawler Is Implemented in Java


import java.util.*;
import java.net.*;
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

/**
 * This class implements a reusable spider.
 */
public class Spider {

  /**
   * A collection of URLs that resulted in an error.
   */
  protected Collection<URL> workloadError = new ArrayList<URL>(3);

  /**
   * A collection of URLs that are waiting to be processed.
   */
  protected Collection<URL> workloadWaiting = new ArrayList<URL>(3);

  /**
   * A collection of URLs that were processed.
   */
  protected Collection<URL> workloadProcessed = new ArrayList<URL>(3);

  /**
   * The class that the spider should report its URLs to.
   */
  protected ISpiderReportable report;

  /**
   * A flag that indicates whether this process should be canceled.
   */
  protected boolean cancel = false;

  /**
   * The constructor.
   *
   * @param report A class that implements the ISpiderReportable
   *               interface, that will receive information that
   *               the spider finds.
   */
  public Spider(ISpiderReportable report) {
    this.report = report;
  }

  /**
   * Get the URLs that resulted in an error.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadError() {
    return workloadError;
  }

  /**
   * Get the URLs that are waiting to be processed. You should add
   * one URL to this collection to begin the spider.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadWaiting() {
    return workloadWaiting;
  }

  /**
   * Get the URLs that were processed by this spider.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadProcessed() {
    return workloadProcessed;
  }

  /**
   * Clear all of the workloads.
   */
  public void clear() {
    getWorkloadError().clear();
    getWorkloadWaiting().clear();
    getWorkloadProcessed().clear();
  }

  /**
   * Set a flag that will cause the begin
   * method to return before it is done.
   */
  public void cancel() {
    cancel = true;
  }
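The cancel flag only works cooperatively: the begin method (not shown in this excerpt) is expected to poll it between URLs and return early once it is set. A minimal, self-contained sketch of that polling pattern, with illustrative names that are not part of the original class:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CancelLoopSketch {
    // volatile so a cancel() call from another thread is seen promptly
    private volatile boolean cancel = false;
    private final Deque<String> waiting = new ArrayDeque<>();

    public void add(String url) { waiting.add(url); }
    public void cancel()        { cancel = true; }

    /** Drain the queue, checking the flag before each item; returns how many were handled. */
    public int begin() {
        int processed = 0;
        while (!waiting.isEmpty() && !cancel) {
            waiting.remove();   // a real spider would fetch and parse the URL here
            processed++;
        }
        return processed;
    }
}
```

Because the flag is only checked between items, cancellation takes effect after the current URL finishes, not in the middle of a download.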

  /**
   * Add a URL for processing. The URL is ignored if it is already
   * in one of the workloads.
   *
   * @param url The URL to add.
   */
  public void addURL(URL url) {
    if ( getWorkloadWaiting().contains(url) )
      return;
    if ( getWorkloadError().contains(url) )
      return;
    if ( getWorkloadProcessed().contains(url) )
      return;
    log("Adding to workload: " + url);
    getWorkloadWaiting().add(url);
  }
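Each contains() call above scans an ArrayList linearly, so addURL costs O(n) per lookup as the workloads grow. A sketch of the same three-workload deduplication backed by HashSets, where each membership test is O(1); the class and method names here are illustrative, not from the original code:

```java
import java.util.HashSet;
import java.util.Set;

public class Workloads {
    private final Set<String> waiting   = new HashSet<>();
    private final Set<String> processed = new HashSet<>();
    private final Set<String> errors    = new HashSet<>();

    /** Queue a URL unless any workload has already seen it; report whether it was queued. */
    public boolean addUrl(String url) {
        if (waiting.contains(url) || processed.contains(url) || errors.contains(url))
            return false;
        return waiting.add(url);
    }
}
```

Note the sketch keys on the URL string; java.net.URL's own equals() can trigger DNS lookups, which is one reason real crawlers avoid using URL objects as set keys.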

  /**
   * Called internally to process a URL.
   *
   * @param url The URL to be processed.
   */
  public void processURL(URL url) {
    try {
      log("Processing: " + url);
      // get the URL's contents
      URLConnection connection = url.openConnection();
      if ( (connection.getContentType() != null) &&
           !connection.getContentType().toLowerCase().startsWith("text/") ) {
        getWorkloadWaiting().remove(url);
        getWorkloadProcessed().add(url);
        log("Not processing because content type is: " +
            connection.getContentType());
        return;
      }
      // read the URL
      InputStream is = connection.getInputStream();
      Reader r = new InputStreamReader(is);
      // parse the URL
      HTMLEditorKit.Parser parse = new HTMLParse().getParser();
      parse.parse(r, new Parser(url), true);
    } catch ( IOException e ) {
      getWorkloadWaiting().remove(url);
      getWorkloadError().add(url);
      log("Error: " + url);
      report.spiderURLError(url);
      return;
    }
    // mark URL as complete
    getWorkloadWaiting().remove(url);
    getWorkloadProcessed().add(url);
    log("Complete: " + url);
  }

  /**
   * A simple logging method. (The rest of the class, including the
   * begin method that drives the spider, is not shown in this excerpt.)
   */
  protected void log(String entry) {
    System.out.println(entry);
  }
}
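The first branch of processURL filters by content type: the resource is parsed only when the server reports no content type at all, or one in the text/* family such as text/html. The same decision in isolation, as a small sketch (the helper name is illustrative):

```java
public class ContentTypeFilter {
    /** Mirror processURL's rule: parse unless a non-text content type is reported. */
    public static boolean shouldParse(String contentType) {
        // No Content-Type header: optimistically try to parse, as processURL does
        if (contentType == null)
            return true;
        // Accept text/html, text/plain, etc.; skip images, archives, and the like
        return contentType.toLowerCase().startsWith("text/");
    }
}
```

Skipped URLs are still moved to the processed workload, so the spider will not download the same image or binary twice.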
