
How a Web Crawler Is Implemented in Java


import java.util.*;
import java.net.*;
import java.io.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

/**
 * This class implements a reusable spider.
 */
public class Spider {

  /**
   * A collection of URLs that resulted in an error.
   */
  protected Collection<URL> workloadError = new ArrayList<URL>(3);

  /**
   * A collection of URLs that are waiting to be processed.
   */
  protected Collection<URL> workloadWaiting = new ArrayList<URL>(3);

  /**
   * A collection of URLs that were processed.
   */
  protected Collection<URL> workloadProcessed = new ArrayList<URL>(3);

  /**
   * The class that the spider should report its URLs to.
   */
  protected ISpiderReportable report;

  /**
   * A flag that indicates whether this process should be canceled.
   */
  protected boolean cancel = false;

  /**
   * The constructor.
   *
   * @param report A class that implements the ISpiderReportable
   *               interface, that will receive information that
   *               the spider finds.
   */
  public Spider(ISpiderReportable report) {
    this.report = report;
  }

  /**
   * Get the URLs that resulted in an error.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadError() {
    return workloadError;
  }

  /**
   * Get the URLs that are waiting to be processed. You should add
   * one URL to this collection to begin the spider.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadWaiting() {
    return workloadWaiting;
  }

  /**
   * Get the URLs that were processed by this spider.
   *
   * @return A collection of URLs.
   */
  public Collection<URL> getWorkloadProcessed() {
    return workloadProcessed;
  }

  /**
   * Clear all of the workloads.
   */
  public void clear() {
    getWorkloadError().clear();
    getWorkloadWaiting().clear();
    getWorkloadProcessed().clear();
  }

  /**
   * Set a flag that will cause the begin
   * method to return before it is done.
   */
  public void cancel() {
    cancel = true;
  }
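The cancel flag only works cooperatively: the begin method (not shown in this excerpt) is expected to poll it between URLs and return early once it is set. A minimal, self-contained sketch of that polling pattern, with illustrative names that are not part of the original class:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CancelLoopSketch {
    // volatile so a cancel() call from another thread is seen promptly
    private volatile boolean cancel = false;
    private final Deque<String> waiting = new ArrayDeque<>();

    public void add(String url) { waiting.add(url); }
    public void cancel()        { cancel = true; }

    /** Drain the queue, checking the flag before each item; returns how many were handled. */
    public int begin() {
        int processed = 0;
        while (!waiting.isEmpty() && !cancel) {
            waiting.remove();   // a real spider would fetch and parse the URL here
            processed++;
        }
        return processed;
    }
}
```

Because the flag is only checked between items, cancellation takes effect after the current URL finishes, not in the middle of a download.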

  /**
   * Add a URL for processing. The URL is ignored if it is already
   * in one of the workloads.
   *
   * @param url The URL to add.
   */
  public void addURL(URL url) {
    if ( getWorkloadWaiting().contains(url) )
      return;
    if ( getWorkloadError().contains(url) )
      return;
    if ( getWorkloadProcessed().contains(url) )
      return;
    log("Adding to workload: " + url);
    getWorkloadWaiting().add(url);
  }
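Each contains() call above scans an ArrayList linearly, so addURL costs O(n) per lookup as the workloads grow. A sketch of the same three-workload deduplication backed by HashSets, where each membership test is O(1); the class and method names here are illustrative, not from the original code:

```java
import java.util.HashSet;
import java.util.Set;

public class Workloads {
    private final Set<String> waiting   = new HashSet<>();
    private final Set<String> processed = new HashSet<>();
    private final Set<String> errors    = new HashSet<>();

    /** Queue a URL unless any workload has already seen it; report whether it was queued. */
    public boolean addUrl(String url) {
        if (waiting.contains(url) || processed.contains(url) || errors.contains(url))
            return false;
        return waiting.add(url);
    }
}
```

Note the sketch keys on the URL string; java.net.URL's own equals() can trigger DNS lookups, which is one reason real crawlers avoid using URL objects as set keys.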

  /**
   * Called internally to process a URL.
   *
   * @param url The URL to be processed.
   */
  public void processURL(URL url) {
    try {
      log("Processing: " + url);
      // get the URL's contents
      URLConnection connection = url.openConnection();
      if ( (connection.getContentType() != null) &&
           !connection.getContentType().toLowerCase().startsWith("text/") ) {
        getWorkloadWaiting().remove(url);
        getWorkloadProcessed().add(url);
        log("Not processing because content type is: " +
            connection.getContentType());
        return;
      }
      // read the URL
      InputStream is = connection.getInputStream();
      Reader r = new InputStreamReader(is);
      // parse the URL
      HTMLEditorKit.Parser parse = new HTMLParse().getParser();
      parse.parse(r, new Parser(url), true);
    } catch ( IOException e ) {
      getWorkloadWaiting().remove(url);
      getWorkloadError().add(url);
      log("Error: " + url);
      report.spiderURLError(url);
      return;
    }
    // mark URL as complete
    getWorkloadWaiting().remove(url);
    getWorkloadProcessed().add(url);
    log("Complete: " + url);
  }

  /**
   * A simple logging method. (The rest of the class, including the
   * begin method that drives the spider, is not shown in this excerpt.)
   */
  protected void log(String entry) {
    System.out.println(entry);
  }
}
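The first branch of processURL filters by content type: the resource is parsed only when the server reports no content type at all, or one in the text/* family such as text/html. The same decision in isolation, as a small sketch (the helper name is illustrative):

```java
public class ContentTypeFilter {
    /** Mirror processURL's rule: parse unless a non-text content type is reported. */
    public static boolean shouldParse(String contentType) {
        // No Content-Type header: optimistically try to parse, as processURL does
        if (contentType == null)
            return true;
        // Accept text/html, text/plain, etc.; skip images, archives, and the like
        return contentType.toLowerCase().startsWith("text/");
    }
}
```

Skipped URLs are still moved to the processed workload, so the spider will not download the same image or binary twice.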
