Improving Support Vector Clustering with Ensembles



Wilfredo J. Puma-Villanueva, George B. Bezerra, Clodoaldo A. M. Lima, Fernando J. Von Zuben

LBiC/DCA/FEEC/Unicamp

C.P. 6101

Campinas/SP, Brazil, 13083-970

+55 19 3788-3885

{wilfredo, bezerra, moraes, vonzuben}@dca.fee.unicamp.br

Abstract: Support Vector Clustering (SVC) is a recently proposed clustering methodology with promising performance for high-dimensional and noisy data sets, and for clusters with arbitrary shape. However, setting the parameters of the SVC algorithm is a challenging task. Instead of searching for a single optimal configuration, our proposal involves the generation, selection, and combination of distinct clustering solutions, leading to a consensus clustering. The purpose is to deal with a wide variety of clustering problems without the need to search for a single, dedicated high-performance solution.

I. PROBLEM STATEMENT

Support Vector Machines (SVM) are high-performance supervised learning machines based on Vapnik's Statistical Learning Theory, subsequently extended by a number of researchers to deal with clustering problems. The SVM variants are generally competitive with each other, even when they differ in formulation, solution strategy, and/or choice of kernel function. Given the availability of multiple learning machines, there are many theoretical and empirical reasons to implement an ensemble.

Ensembles involve the generation, selection, and linear/nonlinear combination of a set of individual components designed to cope simultaneously with the same task. This is typically done by varying some configuration parameters and/or employing different training procedures, such as bagging and boosting. Such ensembles should properly integrate the knowledge embedded in the components, and have frequently produced more accurate and robust models. The effectiveness of the ensemble depends strongly on the diversity and accuracy of the learning machines taken as components.

For a sample of size N composed of p-dimensional real-valued vectors, clustering is a procedure that divides the vectors into m disjoint groups, such that data points within each group are more similar to each other than to any data point in other groups.
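As a minimal, purely illustrative sketch of this definition (not the SVC procedure itself), the following assigns each p-dimensional vector to the nearest of m given centroids, producing disjoint groups; the centroids and data points below are invented for the example:

```python
def cluster(points, centroids):
    """Assign each p-dimensional point to its nearest centroid,
    producing m disjoint groups (a single assignment step of a
    centroid-based procedure such as k-means)."""
    groups = [[] for _ in centroids]
    for x in points:
        # squared Euclidean distance from x to each centroid
        d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        groups[d.index(min(d))].append(x)
    return groups

# Two well-separated groups in the plane (p = 2, m = 2)
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
groups = cluster(points, centroids=[(0.0, 0.0), (5.0, 5.0)])
```

The groups are disjoint by construction, since each point is appended to exactly one list.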

Each clustering procedure may produce diverse solutions depending on its parameter setup. In cases where no a priori knowledge is available, it becomes quite difficult to attest the consistency of a single solution. Cluster boundaries tend to be fuzzy, and clustering results vary significantly at transitory regions.

The resulting diversity among clustering proposals can be explored to synthesize an ensemble of clustering solutions. The main aspects to be explored are:

§ Reuse of the knowledge implicit in each clustering solution.

§ Clustering over distributed datasets in case where the data cannot be directly shared or grouped together because of restrictions due to ownership, privacy, and storage.

§ Attribution of a confidence level to each cluster.

The ensemble proposed here combines partitions produced by SVC (Support Vector Clustering) [1] [2]. SVC maps the data set into a higher-dimensional feature space using a Gaussian kernel and then searches for the minimal enclosing sphere. When the sphere is mapped back to the original data space, it automatically separates the data into clusters. The SVC methodology can generate clusters with arbitrary shape and size. Moreover, it has a unique mechanism to deal with outliers, making it especially adequate for noisy data sets.
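In SVC, the feature-space distance from a point's image to the sphere center a = Σ_j β_j Φ(x_j) is given by R²(x) = K(x,x) − 2 Σ_j β_j K(x_j, x) + Σ_{i,j} β_i β_j K(x_i, x_j); points with R(x) at most the sphere radius map inside the sphere. A minimal sketch of this computation follows, assuming the β coefficients have already been obtained by solving the dual problem (the support vectors and coefficients below are invented for illustration, not taken from the paper):

```python
import math

def gaussian_kernel(x, y, q):
    """K(x, y) = exp(-q * ||x - y||^2), the Gaussian kernel used by SVC."""
    return math.exp(-q * sum((a - b) ** 2 for a, b in zip(x, y)))

def sphere_distance2(x, support_vectors, betas, q):
    """Squared feature-space distance from Phi(x) to the sphere center
    a = sum_j beta_j * Phi(x_j):
        R^2(x) = K(x,x) - 2 * sum_j beta_j K(x_j, x)
                 + sum_{i,j} beta_i beta_j K(x_i, x_j)
    The beta coefficients would come from the dual problem; here they
    are assumed given."""
    kxx = 1.0  # Gaussian kernel: K(x, x) = exp(0) = 1
    cross = sum(b * gaussian_kernel(sv, x, q)
                for sv, b in zip(support_vectors, betas))
    center = sum(bi * bj * gaussian_kernel(si, sj, q)
                 for si, bi in zip(support_vectors, betas)
                 for sj, bj in zip(support_vectors, betas))
    return kxx - 2.0 * cross + center

# Hypothetical support vectors and dual coefficients
svs = [(0.0, 0.0), (1.0, 0.0)]
betas = [0.5, 0.5]
d_in = sphere_distance2((0.5, 0.0), svs, betas, q=1.0)   # between the SVs
d_out = sphere_distance2((3.0, 0.0), svs, betas, q=1.0)  # far from the SVs
```

A point lying between the support vectors maps closer to the sphere center than a distant one, which is what the cluster-boundary test relies on.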

Yang et al. [20] proposed a mechanism to improve the performance of the original SVC [1] by adopting proximity-graph tools at the cluster assignment stage, thus increasing accuracy and providing scalability to large data sets through a considerable reduction of the required processing time.
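Given an adjacency relation between points (in the original SVC, a pair is adjacent when every sampled point on the segment joining them stays inside the sphere; Yang et al. restrict these tests to the edges of a proximity graph), the cluster assignment stage reduces to finding connected components. A sketch follows, with the segment test abstracted as a predicate rather than implemented:

```python
def assign_clusters(n, adjacent):
    """Label n points with cluster ids via connected components of the
    adjacency relation (breadth-first traversal). `adjacent(i, j)`
    stands in for the SVC segment test: True if the segment between
    points i and j stays inside the minimal enclosing sphere."""
    labels = [-1] * n  # -1 means "not yet assigned"
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        queue = [start]
        labels[start] = current
        while queue:
            i = queue.pop()
            for j in range(n):
                if labels[j] == -1 and adjacent(i, j):
                    labels[j] = current
                    queue.append(j)
        current += 1
    return labels

# Hypothetical adjacency: points 0-1 connected, points 2-3 connected
edges = {(0, 1), (2, 3)}
labels = assign_clusters(4, lambda i, j: (min(i, j), max(i, j)) in edges)
```

With a proximity graph, the inner loop would scan only the graph neighbors of i instead of all n points, which is where the reported speed-up comes from.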

II. CURRENT RESEARCH

Research involving the combination of multiple clusterings is relatively scarce in the machine learning community. Park et al. [16] adopted several values for the width parameter of the Gaussian kernel in SVC, aiming at obtaining various adjacency matrices, and then combined them via Spectral Graph Partitioning to obtain one consensus adjacency matrix. The following works, though relevant to the intended application, did not apply kernel-based approaches to perform clustering. Fisher [5] analyzed methods for iteratively improving an initial set of hierarchical clustering solutions. Fayyad et al. [4] obtained multiple approximate k-means solutions in main memory after making a single pass through a database.
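A simplified sketch of the combination step: the adjacency matrices obtained from several kernel-width settings are averaged into a consensus matrix. Note that Park et al. apply Spectral Graph Partitioning to the averaged matrix; the majority-vote threshold below is a deliberate simplification for illustration, not their method:

```python
def consensus_adjacency(matrices, threshold=0.5):
    """Combine several 0/1 adjacency matrices (one per kernel-width
    setting) into a single consensus matrix: an edge is kept when it
    appears in at least `threshold` of the individual solutions."""
    k = len(matrices)
    n = len(matrices[0])
    # average the individual adjacency matrices entry by entry
    avg = [[sum(m[i][j] for m in matrices) / k for j in range(n)]
           for i in range(n)]
    # majority vote: keep edges supported by enough solutions
    return [[1 if avg[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]

# Three hypothetical adjacency matrices over 3 points
m1 = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
m2 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
m3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
C = consensus_adjacency([m1, m2, m3])
```

Edge (0, 1) appears in two of the three solutions and survives; edge (1, 2) appears in only one and is dropped.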
