Welcome to R Square

重塑一个数据集

楚新元 / 2025-07-09


本文是《R 语言实战》(第 2 版)中数据整合与重构的一个例子,书中用的是老旧的 reshape2 对数据进行了融合和重铸,有点像《变脸》里唱的那样,文化领域百花齐放,百家争鸣是对的,可是一张表表达的意思是一个意思,不同的人做出来就可能千差万别,如果你不能做到源头控制,那你就得掌握如何通过各种变换变成你需要的那张脸。

加载相关 R 包

library(tidyr)
library(dplyr)
library(knitr)

原始数据集

mydata = data.frame(
  ID = c(1, 1, 2, 2),
  Time = c(1, 2, 1, 2),
  X1 = c(5, 3, 6, 2),
  X2 = c(6, 5, 1, 4)
)
kable(mydata, align = "c")
ID Time X1 X2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4

原表转化为长表

mydata |> 
  pivot_longer(
    X1:X2,
    names_to = "variable",
    values_to = "value"
  ) -> md
kable(md, align = "c")
ID Time variable value
1 1 X1 5
1 1 X2 6
1 2 X1 3
1 2 X2 5
2 1 X1 6
2 1 X2 1
2 2 X1 2
2 2 X2 4

执行整合

转化会导致信息损失,就像麦子变成面粉一样,这就意味着转换后的表无法逆回到之前的状态。

长表转化为表 a

md |> 
  summarise(
    value_mean = mean(value),
    .by = c(ID, variable)
  ) |> 
  pivot_wider(
    names_from = variable,
    values_from = value_mean
  ) -> table_a
kable(table_a, align = "c")
ID X1 X2
1 4 5.5
2 4 2.5

长表转化为表 b

md |> 
  summarise(
    value_mean = mean(value),
    .by = c(Time, variable)
  ) |> 
  pivot_wider(
    names_from = variable,
    values_from = value_mean
  ) -> table_b
kable(table_b, align = "c")
Time X1 X2
1 5.5 3.5
2 2.5 4.5

长表转化为表 c

md |> 
  summarise(
    value_mean = mean(value),
    .by = c(ID, Time)
  ) |> 
  pivot_wider(
    names_from = Time,
    names_prefix = "Time",
    values_from = value_mean
  ) -> table_c
kable(table_c, align = "c")
ID Time1 Time2
1 5.5 4
2 3.5 3

不执行整合

这就意味着信息必须是无损的,只是做了形式转换,转换后可以逆回到原来的状态。

长表转为表 d

md |> 
  pivot_wider(
    names_from = variable,
    values_from = value
  ) -> table_d
kable(table_d, align = "c")
ID Time X1 X2
1 1 5 6
1 2 3 5
2 1 6 1
2 2 2 4

表 d 逆回长表

table_d |> 
  pivot_longer(
    X1:X2,
    names_to = "variable",
    values_to = "value"
  ) |> 
  kable(align = "c")
ID Time variable value
1 1 X1 5
1 1 X2 6
1 2 X1 3
1 2 X2 5
2 1 X1 6
2 1 X2 1
2 2 X1 2
2 2 X2 4

长表转为表 e

md |> 
  pivot_wider(
    names_from = Time,
    names_prefix = "Time",
    values_from = value
  ) -> table_e
kable(table_e, align = "c")
ID variable Time1 Time2
1 X1 5 3
1 X2 6 5
2 X1 6 2
2 X2 1 4

表 e 逆回长表

table_e |> 
  pivot_longer(
    cols = starts_with("Time"), 
    names_to = "Time",
    names_transform = list(Time = readr::parse_number),
    values_to = "value"
  ) |> 
  kable(align = "c")
ID variable Time value
1 X1 1 5
1 X1 2 3
1 X2 1 6
1 X2 2 5
2 X1 1 6
2 X1 2 2
2 X2 1 1
2 X2 2 4

长表转为表 f

md |> 
  pivot_wider(
    names_from = c(variable, Time), 
    names_glue = "{variable}_Time{Time}",
    values_from = value
  ) -> table_f
kable(table_f, align = "c")
ID X1_Time1 X2_Time1 X1_Time2 X2_Time2
1 5 6 3 5
2 6 1 2 4

表 f 逆回长表

table_f |> 
  pivot_longer(
    -ID,
    names_to = c("variable", "Time"),
    names_sep = "_Time",
    names_transform = list(Time = as.numeric),
    values_to = "value"
  ) |> 
  kable(align = "c")
ID variable Time value
1 X1 1 5
1 X2 1 6
1 X1 2 3
1 X2 2 5
2 X1 1 6
2 X2 1 1
2 X1 2 2
2 X2 2 4