R Visualization: Exploring the iris Dataset

2016-07-30 23:46:00

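Introduction

Kaggle's data mining competitions include a classic exploratory analysis example that visualizes the iris dataset in several different ways, helping readers understand the relationship between the features and the label through intuitive graphics. A Python implementation is available on the Kaggle site at the following link: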

https://www.kaggle.com/benham...

This post reproduces that notebook's code in R.

Code

library(tidyr)
library(dplyr)
library(ggplot2)
library(grid)
library(GGally)

Let's see what's in the iris data

head(iris)

Let's see how many examples we have of each species

summary(iris$Species)

Make scatter plot of Sepal.Length and Sepal.Width

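# Minimal reconstruction: the scatter plot definition was truncated in the source; the aesthetics are assumed from the heading above
p.scatter <- ggplot(iris) + geom_point(aes(x=Sepal.Length, y=Sepal.Width))
p.scatter

Box plot of each feature by species, one facet per feature

p.box.facet <- ggplot(iris %>% gather(feature_name, feature_value, one_of(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")))) + geom_boxplot(aes(x=Species, y=feature_value)) + facet_wrap(~feature_name)
p.box.facet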

Parallel coordinate graph & Andrews Curve

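Adapted from: http://cos.name/2009/03/parallel-coordinates-and-andrews-curve/

The idea of a parallel coordinates (profile) plot is simple and intuitive: take n points along the horizontal axis, one for each indicator (that is, each variable); the vertical axis gives the value of each indicator (or its standardized value); then connect the points belonging to each observation in order.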
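The construction described above, including the optional standardization step, can be sketched in base R. This is only an illustration and not code from the original post: the choice of `scale()` for standardization and of `matplot()` for drawing are assumptions made here for brevity.

```r
# Standardize each feature to mean 0 and standard deviation 1 so that all
# four axes share a comparable scale (an illustrative choice).
features <- scale(iris[, 1:4])

# matplot() draws one line per matrix column, so transpose the data:
# rows become the four features (x positions), columns the 150 flowers.
matplot(t(features), type = "l", lty = 1,
        col = as.integer(iris$Species),
        xaxt = "n", xlab = "feature", ylab = "standardized value")
axis(1, at = 1:4, labels = colnames(features))
legend("topright", legend = levels(iris$Species), col = 1:3, lty = 1)
```

Without standardization, features measured on larger scales would dominate the vertical range of the plot.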

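The idea behind the Andrews curve (harmonic curve) plot is very similar to the Fourier transform:

Using trigonometric functions, each point in n-dimensional space is mapped to a curve in the two-dimensional plane, where x ranges over [-pi, pi].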
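For reference, the mapping usually given for Andrews curves can be written as below. The symbols v_1, ..., v_n for the coordinates of a data point are notation chosen here, and assigning the four iris measurements to v_1, ..., v_4 in their column order is an assumption rather than something the text specifies.

```latex
f_{v}(x) = \frac{v_1}{\sqrt{2}} + v_2 \sin x + v_3 \cos x
         + v_4 \sin 2x + v_5 \cos 2x + \cdots,
\qquad -\pi \le x \le \pi

% With four features (as in iris) the series stops after the fourth term:
f_{v}(x) = \frac{v_1}{\sqrt{2}} + v_2 \sin x + v_3 \cos x + v_4 \sin 2x
```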

Another multivariate visualization technique pandas has is parallel_coordinates

Parallel coordinates plots each feature on a separate column and then draws lines connecting the features for each data sample

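# The start of this statement was truncated in the source; the row id is added before gathering so each flower keeps one id across all four features
p.paral <- ggplot(iris %>% mutate(id=1:nrow(iris)) %>% gather(feature_name, feature_value, one_of(c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")))) + geom_line(aes(x=feature_name, y=feature_value, group=id, colour=Species))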
p.paral

One cool more sophisticated technique pandas has available is called Andrews Curves

Andrews Curves involve using attributes of samples as coefficients for Fourier series and then plotting these

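# Minimal reconstruction: the function body was truncated in the source.
# Only the final gather() call is original; the rest is an assumed sketch for 4 features.
andrews_curve <- function(data, label, n = 101) {
  theta <- seq(-pi, pi, length.out = n)
  basis <- cbind(1 / sqrt(2), sin(theta), cos(theta), sin(2 * theta))
  curves <- as.matrix(data) %*% t(basis)
  colnames(curves) <- theta
  data.frame(label = label, id = seq_len(nrow(data)), curves, check.names = FALSE) %>%
    gather(x, y, -label, -id, convert = TRUE)
}

iris.andrew <- andrews_curve(iris[, 1:4], iris$Species)
ggplot(iris.andrew) + geom_line(aes(x = x, y = y, group = id, colour = label))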
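The Fourier-series idea can also be checked numerically for a single flower. The sketch below uses only base R; `andrews_value` is a hypothetical helper name introduced here for illustration, not a function from the original post or from any package, and it assumes the usual ordering of terms (constant, sin x, cos x, sin 2x, cos 2x, ...).

```r
# Hypothetical helper (not from the original post): evaluate the Andrews
# curve of one observation `v` at the angles `x`.
andrews_value <- function(v, x) {
  v <- as.numeric(unlist(v))              # accept a vector or a one-row data frame
  out <- rep(v[1] / sqrt(2), length(x))   # constant term v1 / sqrt(2)
  for (k in seq_along(v)[-1]) {
    freq <- k %/% 2                       # frequencies 1, 1, 2, 2, 3, 3, ...
    term <- if (k %% 2 == 0) sin(freq * x) else cos(freq * x)
    out <- out + v[k] * term
  }
  out
}

# Curve of the first iris flower at three angles in [-pi, pi]
andrews_value(iris[1, 1:4], c(-pi, 0, pi))
```

Because the loop derives the frequency from the position of each coefficient, the same helper works for any number of features, not only the four in iris.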

Keywords: r, ggplot2
