02 Frequently Used R Features (read.csv, merge, distinct, function import, factor, group by)

KyeongRok Kim 2020. 7. 12. 13:46

Loading a specific column as a string when reading data

df <- read.csv('~/Desktop/mafra_total.csv', colClasses=c(data_id = 'character'))

This is useful because a very long number is otherwise read in as a floating-point value, and its trailing digits can be lost.
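
As a small illustration of the truncation problem, the sketch below reads the same two-row CSV twice: once with default type guessing and once with colClasses. The sample IDs and the inline text= input are made up for the demonstration; only the data_id column name comes from the code above.

```r
# Hypothetical inline CSV standing in for mafra_total.csv
csv_text <- "data_id,value\n12345678901234567890,1\n00012345678901234567,2"

# Default: data_id is guessed as numeric, so digits beyond roughly 15-16
# significant figures are not preserved and leading zeros are dropped
df_num <- read.csv(text = csv_text)

# With colClasses, data_id is kept as the exact text from the file
df_chr <- read.csv(text = csv_text, colClasses = c(data_id = "character"))

class(df_num$data_id)  # "numeric"
df_chr$data_id         # the original strings, unchanged
```

Reading IDs as character is usually the safer choice whenever the column is an identifier rather than a quantity.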

Merging (joining)

df3 = merge(df_link, df2, by = 'data_id')

This merges (joins) df_link and df2 on the 'data_id' column. Use it when you want to join two tables on a shared key.
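
By default merge() keeps only the keys found in both tables (an inner join). The sketch below shows that behavior next to a left join, using two small made-up tables; the link and score columns and their values are assumptions for illustration.

```r
# Hypothetical toy tables sharing a data_id key
df_link <- data.frame(data_id = c("a1", "a2", "a3"), link = c("x", "y", "z"))
df2 <- data.frame(data_id = c("a2", "a3", "a4"), score = c(10, 20, 30))

# Default merge() is an inner join: only a2 and a3 appear in both tables
inner <- merge(df_link, df2, by = "data_id")

# all.x = TRUE keeps every row of the first table (left join);
# a1 has no match in df2, so its score is NA
left <- merge(df_link, df2, by = "data_id", all.x = TRUE)
```

Comparing nrow() before and after a merge is a quick way to notice rows that were dropped or duplicated by the join.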

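Sampling n rows and saving them

  file_name = sprintf("datas_for_develop/%s_Pop.move.txt", year)
  print(file_name)
  pop.2015 <- read.table(file_name, header=TRUE, sep="\t", encoding = 'utf-8')
  # draw 1000 row numbers at random, without replacement
  sim_ran_sam <- sample(1:nrow(pop.2015), 1000)
  hello <- pop.2015[sim_ran_sam, ]

  length(sim_ran_sam)  # number of sampled rows (dim() of a vector is NULL)

  length(pop.2015)     # number of columns
  nrow(pop.2015)       # number of rows
  dim(pop.2015)        # rows and columns

  #View(hello)
  write.table(hello, sprintf('%shello.txt',year))

sim_ran_sam <- sample(1:nrow(pop.2015), 1000) uses sample() to draw 1000 numbers between 1 and the number of rows.
Those numbers are then used as row indices to take the sample, which is written to a file. The code assumes year is already defined.

When you have to process millions of rows, every run is slow. Sampling lets you develop the logic on a small subset and run on the full data only for the real run.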
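
When developing against a sample, it often helps if the same rows come back on every run. A minimal sketch, assuming a made-up table named pop and an arbitrary seed value:

```r
# Hypothetical stand-in for the full table
pop <- data.frame(id = 1:5000, value = rnorm(5000))

set.seed(42)                          # fix the seed so the same rows are drawn each run
n <- min(1000, nrow(pop))             # sample() errors if asked for more rows than exist
idx <- sample(seq_len(nrow(pop)), n)  # row numbers, drawn without replacement
dev_sample <- pop[idx, ]

nrow(dev_sample)  # 1000
```

seq_len(nrow(pop)) is used instead of 1:nrow(pop) because it also behaves sensibly when the table has zero rows.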

Viewing the entire first row

df2[1, ]

In R, a data frame is indexed by [row, column], unlike a plain vector. Selecting row 1 and leaving the position after the comma empty, as above, selects every column.
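
The common [row, column] forms can be sketched on a tiny made-up data frame (the name d and its columns are assumptions for illustration):

```r
# Hypothetical 3 x 3 data frame
d <- data.frame(a = 1:3, b = c(1.5, 2.5, 3.5), c = c(TRUE, FALSE, TRUE))

d[1, ]                # first row, all columns (still a data frame)
d[, 2]                # second column, returned as a plain vector
d[1:2, c("a", "c")]   # rows 1 and 2 of columns a and c
d[, 2:ncol(d)]        # column 2 through the last column
```

Selecting a single column with d[, 2] drops the data frame structure; d[, 2, drop = FALSE] keeps it.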

Viewing columns 1 through 10 of the first row

df2[1, 1:10]

Selecting from column 2 through the last column

result <- pop.2015[sim_ran_sam, 2:ncol(pop.2015)]

Saving without the index (row names)

write.csv(result, sprintf('datas_for_develop/%s_pop_csv.csv',year), row.names=FALSE)

Applying distinct while keeping the other fields

View(owner_origin %>% filter(career==0) %>% distinct(code, .keep_all = TRUE))

This deduplicates on code: one row is kept per code value, with all of its columns.
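
To see what .keep_all changes, the sketch below runs distinct() both ways on a small made-up table. The name column and all values are assumptions for illustration, and the dplyr package must be installed.

```r
library(dplyr)

# Hypothetical table with a repeated code
owner <- data.frame(
  code = c("A", "A", "B"),
  career = c(0, 0, 0),
  name = c("kim", "lee", "park")
)

# Without .keep_all, only the code column is returned
codes_only <- owner %>% distinct(code)

# With .keep_all = TRUE, the first row found for each code keeps all its columns
dedup <- owner %>% filter(career == 0) %>% distinct(code, .keep_all = TRUE)
```

Because the first matching row wins, sort the data beforehand if it matters which duplicate survives.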

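Creating a column named toddler and initializing it to 0

Using $

df$toddler <- 0

Result: df gains a toddler column with 0 in every row.

Using dplyr::mutate()

df2 <- mutate(df, kinder=1)

Result: df2 is a copy of df with a kinder column holding 1 in every row.

Using dplyr::mutate() to apply a function named add_3 to every row and create a column named kinder

library(dplyr)
df <- read.csv('turn_farm_2.csv')

# add_3 returns its argument plus 3
add_3 <- function(aa){
  aa + 3
}

# add_3(2) evaluates to 5, so every row gets kinder = 5;
# pass a column of df instead of 2 to transform that column row by row
df2 <- mutate(df, kinder=add_3(2))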
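
The call above passes the constant 2, so kinder is 5 in every row. To compute a different value per row, pass a column instead. A minimal sketch, assuming a made-up numeric column named age (not part of the original data):

```r
library(dplyr)

add_3 <- function(aa) {
  aa + 3
}

# Hypothetical data; `age` is an assumed column name
df <- data.frame(age = c(1, 2, 3))

# add_3() is vectorized, so passing the column transforms every row at once
df2 <- mutate(df, kinder = add_3(age))
df2$kinder  # 4 5 6
```

This works without an explicit loop because + operates on whole vectors in R.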

Writing a function and importing it

If you develop a function inside a single R script, the df is already loaded, and re-running a large dataset every time is slow. This approach lets you develop the function on its own.

Name the file calc_toddler.R, then import it with source() wherever you need it.

hello <- function (aa){
  1
}

The file contains only the skeleton of a function, as shown above.

To load it:

source('functions/calc_toddler.R')

After this call, you can call hello() directly.
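
The same round trip can be tried without creating a project folder. The sketch below writes the function to a temporary file, which stands in for functions/calc_toddler.R, and then loads it; the temporary path is an assumption made only so the example is self-contained.

```r
# Write a function definition to a temporary file
# (a stand-in for functions/calc_toddler.R)
script_path <- tempfile(fileext = ".R")
writeLines(c("hello <- function(aa) {", "  1", "}"), script_path)

# source() evaluates the file, so hello() becomes available in the session
source(script_path)
hello(10)  # 1
```

Re-running source() after editing the file picks up the new definition, which makes the edit-and-test loop short.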

Converting a column of 1s and 2s into a factor labeled "A" and "B"

df$drug <- factor(df$drug, levels=c(1, 2), labels = c('A', 'B'))
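
A quick way to check the mapping is to run it on a few made-up values. The data below is an assumption for illustration.

```r
# Hypothetical drug column coded as 1 and 2
df <- data.frame(drug = c(1, 2, 2, 1))

# levels lists the raw codes, labels gives the name shown for each code
df$drug <- factor(df$drug, levels = c(1, 2), labels = c("A", "B"))

levels(df$drug)        # "A" "B"
as.character(df$drug)  # "A" "B" "B" "A"
table(df$drug)         # counts per label
```

Any value that is not listed in levels becomes NA, so table() with useNA = "ifany" is a useful sanity check on real data.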

How to use group by

library(dplyr)
getwd()
setwd("C:/git/python/data_class/")  # change to the folder that holds the CSV
df <- read.csv(file="Floating_Population_2008.csv",
               header = TRUE,
               fileEncoding = 'utf-8'
               # encoding="utf-8"
               )
# total 유동인구수 (floating population) per 군구 (district)
gr <- df %>% group_by(군구) %>%
  select(군구, 유동인구수) %>%
  summarise(sum_유동 = sum(유동인구수))

# sort by sum_유동 in descending order
gr <- gr[order(-gr$sum_유동), ]

dim(gr)
View(gr)
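
The same group, sum, and sort pattern can be tried without the CSV. The sketch below uses a made-up table whose district and visitors columns stand in for 군구 and 유동인구수, and sorts with dplyr's arrange(desc()) as an alternative to order(). It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(
  district = c("north", "south", "north", "east"),
  visitors = c(10, 5, 20, 7)
)

gr <- df %>%
  group_by(district) %>%
  summarise(sum_visitors = sum(visitors)) %>%
  arrange(desc(sum_visitors))  # largest total first
```

Keeping the sort inside the pipeline avoids the separate reassignment step and reads top to bottom.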
