Data Analysis with R: Outlier Detection by Clustering

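Clustering can also be used to detect outliers: after the data are partitioned into clusters, the points that belong to no cluster are treated as outliers. In density-based clustering, for example, two objects are placed in the same cluster if they are density-reachable from one another. Objects that end up in no cluster are therefore isolated from the rest of the data, and these isolated objects are regarded as outliers.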
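The density-based idea can be sketched in base R without any extra package. This is a minimal illustration of how a DBSCAN-style method flags noise points, not a full clustering implementation. The values of `eps` and `min_pts` are assumptions chosen for the example and do not come from the article.

```r
# Minimal base-R sketch of DBSCAN-style noise detection (illustrative only).
# eps and min_pts are assumed values, not taken from the article.
iris2 <- iris[, 1:4]
eps <- 0.5
min_pts <- 5

d <- as.matrix(dist(iris2))               # pairwise Euclidean distances
neighbors <- d <= eps                     # eps-neighborhoods (each point includes itself)
is_core <- rowSums(neighbors) >= min_pts  # core points have a dense neighborhood

# A point can join a cluster only if it is a core point or lies within eps
# of some core point; everything else is isolated and is reported as noise.
near_core <- rowSums(neighbors[, is_core, drop = FALSE]) > 0
noise <- unname(which(!near_core))

print(noise)             # row indices of the candidate outliers
print(iris2[noise, ])
```

Smaller values of `eps` or larger values of `min_pts` make the density requirement stricter, so more points are reported as noise.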

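The k-means algorithm can also be used to detect outliers. k-means partitions the data into k clusters, assigning each point to the cluster whose center is nearest. We then compute the distance between each object and the center of its own cluster, and take the objects with the largest distances as outliers. The example below applies this approach to the iris dataset.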

> iris2 <- iris[,1:4]

> kmeans.result <- kmeans(iris2, centers = 3)

> # cluster centers

> kmeans.result$centers

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1     5.006000    3.428000     1.462000    0.246000
2     6.850000    3.073684     5.742105    2.071053
3     5.901613    2.748387     4.393548    1.433871

> kmeans.result$cluster

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [37] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
 [73] 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2
[109] 2 2 2 2 2 3 3 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2
[145] 2 2 3 2 2 3
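kmeans() chooses its initial centers at random, so the cluster numbering, and occasionally the partition itself, can differ from run to run and from the output shown above. A minimal sketch of one way to make the run repeatable follows. The seed value and `nstart = 25` are assumptions for illustration, not settings used in the article.

```r
# Sketch: a repeatable k-means fit (seed and nstart are assumed values).
set.seed(42)                  # fix the random initial centers
iris2 <- iris[, 1:4]

# nstart = 25 tries 25 random starts and keeps the fit with the
# smallest total within-cluster sum of squares.
km <- kmeans(iris2, centers = 3, nstart = 25)

print(km$size)                # number of observations in each cluster
print(km$tot.withinss)        # total within-cluster sum of squares
```

Even with a fixed seed, the labels 1, 2 and 3 are arbitrary, so it is safer to compare fits by their centers or cluster sizes than by label.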

> # compute the distance from each object to the center of its own cluster

> centers <- kmeans.result$centers[kmeans.result$cluster,]

> distances <- sqrt(rowSums((iris2-centers)^2))

> # take the five objects with the largest distances as outliers

> outliers <- order(distances, decreasing = TRUE)[1:5]

> print(outliers)

[1] 99 58 94 61 119

> print(iris2[outliers,])

    Sepal.Length Sepal.Width Petal.Length Petal.Width
99           5.1         2.5          3.0         1.1
58           4.9         2.4          3.3         1.0
94           5.0         2.3          3.3         1.0
61           5.0         2.0          3.5         1.0
119          7.7         2.6          6.9         2.3
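The distance computation above can be wrapped in a small reusable function. The sketch below is one possible way to do this. The helper name `top_outliers` and its arguments are hypothetical and are not part of the article or of any package.

```r
# Hypothetical helper (name and arguments are illustrative, not from the article):
# rank observations by Euclidean distance to their own cluster center.
top_outliers <- function(x, fit, n = 5) {
  x <- as.matrix(x)
  # Repeat each cluster's center once per observation assigned to it.
  assigned <- fit$centers[fit$cluster, , drop = FALSE]
  distances <- sqrt(rowSums((x - assigned)^2))
  idx <- head(order(distances, decreasing = TRUE), n)
  data.frame(row = idx, cluster = fit$cluster[idx], distance = distances[idx])
}

iris2 <- iris[, 1:4]
fit <- kmeans(iris2, centers = 3)
res <- top_outliers(iris2, fit, n = 5)
print(res)
```

Returning the cluster label alongside each distance shows which cluster every candidate outlier was measured against.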

> # scatter plot of the clusters

> plot(iris2[, c("Sepal.Length","Sepal.Width")], pch="o", col=kmeans.result$cluster,cex=0.3)

> # mark the cluster centers with asterisks

> points(kmeans.result$centers[,c("Sepal.Length","Sepal.Width")], col=1:3, pch=8, cex=1.5)

> # mark the outliers with "+"

> points(iris2[outliers, c("Sepal.Length","Sepal.Width")], pch="+", col=4, cex=1.5)
