R Visualization: Exploring the iris Dataset

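Preface

Kaggle hosts a classic exploratory analysis example that visualizes the iris dataset in a variety of ways, using plots to give an intuitive picture of how the features relate to the label. A Python implementation is available on Kaggle at the following link: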

https://www.kaggle.com/benham...

This article reproduces that notebook's code in R.

Code

library(tidyr)
library(dplyr)
library(ggplot2)
library(grid)
library(GGally)

Let's see what's in the iris data

head(iris)

Let's see how many examples we have of each species

summary(iris$Species)
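
The same per-species counts can also be obtained with dplyr, which keeps the result as a data frame that can be piped into later steps. This is a minimal sketch; the name `species_counts` is illustrative and not part of the original notebook.

```r
library(dplyr)

# Count the rows of the built-in iris data for each species
species_counts <- iris %>% count(Species)
species_counts
```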

Make scatter plot of Sepal.Length and Sepal.Width

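p.scatter <- ggplot(iris) + geom_point(aes(x=Sepal.Length, y=Sepal.Width))
p.scatter

Box plot of each feature by species, one panel per feature

p.box.facet <- ggplot(iris %>% gather(feature_name, feature_value, one_of(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")))) + geom_boxplot(aes(x=Species, y=feature_value)) + facet_wrap(~feature_name)
p.box.facet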
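
Since the goal is to relate the features to the label, one possible variation on the Sepal.Length / Sepal.Width scatter plot is to map colour to Species. This is only a sketch; the object name `p.scatter.species` is an assumption introduced here for illustration.

```r
library(ggplot2)

# Same axes as the sepal scatter plot, with colour mapped to Species
p.scatter.species <- ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, colour = Species))
p.scatter.species
```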

Parallel coordinate graph & Andrews Curve

Adapted from: http://cos.name/2009/03/parallel-coordinates-and-andrews-curve/

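The idea behind a parallel coordinates plot (also called a profile plot) is simple and intuitive: place n points along the horizontal axis, one for each indicator (i.e. variable); the vertical axis gives the value of each indicator (or its standardized value); then connect the points that belong to each observation in order.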

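The idea behind Andrews curves is closely related to the Fourier transform: a trigonometric transformation maps each point in n-dimensional space to a curve in the two-dimensional plane, where x ranges over [-pi, pi].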
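
For reference, the Andrews curve of a point (x_1, ..., x_n) is commonly written as below. The symbol t is the curve parameter (the quantity the text above calls x); this notation is the usual textbook form and is not taken from the original post.

```latex
f(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots,
\qquad -\pi \le t \le \pi
```

For iris, n = 4, so each flower's curve uses only the first four terms.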

Another multivariate visualization technique pandas has is parallel_coordinates

Parallel coordinates plots each feature on a separate column & then draws lines connecting the features for each data sample

p.paral <- ggplot(iris %>% mutate(id=1:nrow(iris)) %>% gather(feature_name, feature_value, one_of(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")))) + geom_line(aes(x=feature_name, y=feature_value, group=id, colour=Species))
p.paral
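
The description of parallel coordinates mentions plotting standardized values. A short sketch of that variant uses `ggparcoord` from GGally, which is already loaded. Selecting columns 1 to 4 as the features and column 5 (Species) as the grouping column is an assumption about the standard iris layout, and `p.paral.std` is an illustrative name.

```r
library(GGally)

# columns = 1:4 selects the four measurements; groupColumn = 5 is Species.
# scale = "std" standardizes each column (subtract the mean, divide by the sd).
p.paral.std <- ggparcoord(iris, columns = 1:4, groupColumn = 5, scale = "std")
p.paral.std
```

With standardized axes, features measured on different ranges (for example Petal.Width versus Sepal.Length) become directly comparable within one plot.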

A cooler, more sophisticated technique pandas has available is called Andrews Curves

Andrews Curves involve using attributes of samples as coefficients for Fourier series and then plotting these

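andrews_curve <- function(data, label, theta = seq(-pi, pi, length.out = 100)) {
  # Assumed reconstruction using the standard Andrews curve:
  # f(theta) = x1/sqrt(2) + x2*sin(theta) + x3*cos(theta) + x4*sin(2*theta) + ...
  basis <- sapply(seq_len(ncol(data)), function(j) {
    if (j == 1) rep(1 / sqrt(2), length(theta)) else if (j %% 2 == 0) sin(j %/% 2 * theta) else cos(j %/% 2 * theta)
  })
  curves <- as.data.frame(as.matrix(data) %*% t(basis))
  names(curves) <- as.character(theta)
  curves %>% mutate(label = label, id = seq_len(nrow(data))) %>%
    gather(x, y, -label, -id, convert = TRUE)
}

iris.andrew <- andrews_curve(iris[, 1:4], iris$Species)
ggplot(iris.andrew) + geom_line(aes(x = x, y = y, group = id, colour = label))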

Keywords: r, ggplot2

