
4

Editing data


Editing data here does not mean altering the original data (that would be
a serious offence, a form of scientific fraud that cannot be accepted); it
only means organizing the data so that R can analyse them efficiently. In
statistical analysis we often need to pool data into one group, split them
into separate groups, or convert character values to numeric ones to make
computation easier. In this chapter I will go through some basic commands
for editing data.

We return to the chol data from example 1. To make the discussion easier to
follow, recall that we read the data into an R data set named chol from a
text file named chol.txt:

> setwd("c:/works/stats")
> chol <- read.table("chol.txt", header=TRUE)
> attach(chol)

4.1 Checking for missing values

In research, for many reasons, data cannot always be collected for every
subject, and not every variable can always be measured on a given subject.
In that case the absent data are called missing values. R represents
missing values as NA. Some statistical tests require missing values to be
removed before the analysis (because nothing can be computed from them). R
has a very useful command for this, na.omit, which is used as follows:

> chol.new <- na.omit(chol)

In the command above, we ask R to drop every row of the data.frame chol
that contains a missing value and to put the remaining complete rows into a
new data.frame named chol.new. Note that this command is only an
illustration, because the chol data contain no missing values.
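Before dropping rows it can help to see where the missing values are. The sketch below is only an illustration: the data frame demo and its values are made up (they are not the chol data), and it uses base R functions only.

```r
# Hypothetical data frame with a few missing values (not the chol data)
demo <- data.frame(id  = 1:5,
                   age = c(57, NA, 64, 60, NA),
                   tc  = c(4.0, 3.5, NA, 7.7, 5.0))

# Number of missing values in each column
n.missing <- colSums(is.na(demo))
print(n.missing)

# TRUE for rows that have no missing value in any column
complete <- complete.cases(demo)

# na.omit keeps exactly those complete rows
demo.new <- na.omit(demo)
print(dim(demo.new))
```

Counting the missing values first shows how many rows na.omit is about to discard, which matters when only one or two variables are incomplete.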

4.2 Splitting data: subset

If, for some reason, we want to analyse men separately, we can split chol
into two data.frames, say nam (men) and nu (women). To do this we use the
command subset(data, cond), where data is the data.frame we want to split
and cond is the condition. For example:

> nam <- subset(chol, sex=="Nam")
> nu <- subset(chol, sex=="Nu")

After these two commands we have 2 new data sets (two data.frames) named
nam and nu. Note that in the conditions sex == "Nam" and sex == "Nu" we use
== rather than = to test for exact equality.

Of course, we can also split the data into several data.frames using
conditions on other variables. For example, the following command creates a
new data.frame named old containing the patients aged 60 or over:

> old <- subset(chol, age>=60)
> dim(old)
[1] 25 8

Or a new data.frame containing the male patients aged 60 or over:

> n60 <- subset(chol, age>=60 & sex=="Nam")
> dim(n60)
[1] 9 8
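subset can also keep only some columns through its select argument, and the same selection can be written with logical indexing. The following is a minimal sketch on a made-up data frame (demo is hypothetical and merely stands in for chol).

```r
# Hypothetical stand-in for chol with only three variables
demo <- data.frame(id  = 1:6,
                   sex = c("Nam", "Nu", "Nu", "Nam", "Nam", "Nu"),
                   age = c(57, 64, 60, 65, 47, 71))

# Men aged 60 or over, keeping only the id and age columns
n60 <- subset(demo, age >= 60 & sex == "Nam", select = c(id, age))

# The same selection written with logical indexing
n60.alt <- demo[demo$age >= 60 & demo$sex == "Nam", c("id", "age")]

print(n60)
```

The two forms differ when the condition evaluates to NA: subset drops those rows, whereas logical indexing returns them as rows filled with NA.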


4.3 Extracting data from a data.frame

The chol data set has 8 variables. We can extract from chol only the
variables we need, such as the identifier (id), age (age) and total
cholesterol (tc). The output of names(chol) shows that id is column 1, age
is column 3 and tc is column 7. We can use the following command:

> data2 <- chol[, c(1,3,7)]

Here we tell R that we want columns 1, 3 and 7, and that all the data in
these three columns should go into a new data.frame named data2. Note that
we use square brackets [] rather than parentheses (), because chol is not a
function. The comma before c means that we select all rows of the
data.frame chol.

If we only want the first 10 rows, this time taking columns 1, 2 and 7 (id,
sex and tc), the command is:

> data3 <- chol[1:10, c(1,2,7)]
> print(data3)
   id sex  tc
1   1 Nam 4.0
2   2  Nu 3.5
3   3  Nu 4.7
4   4 Nam 7.7
5   5 Nam 5.0
6   6  Nu 4.2
7   7 Nam 5.9
8   8 Nam 6.1
9   9 Nam 5.9
10 10  Nu 4.0

Note that the command print(arg) simply lists all the data in the
data.frame arg. In fact, typing data3 alone gives exactly the same result
as print(data3).
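Counting column positions by hand is error-prone, so columns can also be selected by name. The sketch below uses a made-up data frame (demo is hypothetical, not chol) to show that selecting by name and by position give the same result.

```r
# Hypothetical stand-in for chol with four variables
demo <- data.frame(id  = 1:4,
                   sex = c("Nam", "Nu", "Nu", "Nam"),
                   age = c(57, 64, 60, 65),
                   tc  = c(4.0, 3.5, 4.7, 7.7))

# Look up column positions by name instead of counting them by hand
pos <- match(c("id", "age", "tc"), names(demo))
print(pos)

# Selecting by name gives the same result as selecting by position
by.name <- demo[, c("id", "age", "tc")]
by.pos  <- demo[, pos]

# First two rows only
print(head(by.name, 2))
```

Selecting by name keeps working if columns are later added or reordered, which is not true of numeric positions.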


4.4 Merging two data.frames into one: merge

Suppose our data are stored in two data.frames. The first, named d1, has 3
columns, id, sex and tc, as follows:

id sex tc
1 Nam 4.0
2 Nu 3.5
3 Nu 4.7
4 Nam 7.7
5 Nam 5.0
6 Nu 4.2
7 Nam 5.9
8 Nam 6.1
9 Nam 5.9
10 Nu 4.0

The second, named d2, has 3 columns, id, sex and tg, as follows:

id sex tg
1 Nam 1.1
2 Nu 2.1
3 Nu 0.8
4 Nam 1.1
5 Nam 2.1
6 Nu 1.5
7 Nam 2.6
8 Nam 1.5
9 Nam 5.4
10 Nu 1.9
11 Nu 1.7

The two data sets share two variables, id and sex. However, d1 has 10 rows
whereas d2 has 11 rows. We can combine them into a single data.frame with
the merge command as follows:

> d <- merge(d1, d2, by="id", all=TRUE)
> d
id sex.x tc sex.y tg
1 1 Nam 4.0 Nam 1.1
2 2 Nu 3.5 Nu 2.1
3 3 Nu 4.7 Nu 0.8
4 4 Nam 7.7 Nam 1.1
5 5 Nam 5.0 Nam 2.1
6 6 Nu 4.2 Nu 1.5
7 7 Nam 5.9 Nam 2.6
8 8 Nam 6.1 Nam 1.5
9 9 Nam 5.9 Nam 5.4
10 10 Nu 4.0 Nu 1.9
11 11 <NA> NA Nu 1.7

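In the merge command, we ask R to combine d1 and d2 into a new data.frame
named d, using the variable id as the key. The option all=TRUE keeps rows
that appear in only one of the two data sets. Because sex is present in
both data sets but is not part of the key, R keeps both copies as sex.x and
sex.y. Note that patient 11 has no value for tc, so R reports it as NA
(short for "not available").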
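The duplicated sex.x and sex.y columns can be avoided by matching on both shared variables. The sketch below rebuilds d1 and d2 from the two tables above; using by = c("id", "sex") is one possible choice, not the only one.

```r
# d1 and d2 rebuilt from the two tables above
d1 <- data.frame(id  = 1:10,
                 sex = c("Nam", "Nu", "Nu", "Nam", "Nam",
                         "Nu", "Nam", "Nam", "Nam", "Nu"),
                 tc  = c(4.0, 3.5, 4.7, 7.7, 5.0, 4.2, 5.9, 6.1, 5.9, 4.0))
d2 <- data.frame(id  = 1:11,
                 sex = c("Nam", "Nu", "Nu", "Nam", "Nam",
                         "Nu", "Nam", "Nam", "Nam", "Nu", "Nu"),
                 tg  = c(1.1, 2.1, 0.8, 1.1, 2.1, 1.5, 2.6, 1.5, 5.4, 1.9, 1.7))

# Matching on both shared variables keeps a single sex column
d <- merge(d1, d2, by = c("id", "sex"), all = TRUE)
print(names(d))

# Without all = TRUE only the ids present in both data sets are kept
d.inner <- merge(d1, d2, by = c("id", "sex"))
print(dim(d.inner))
```

With all = TRUE patient 11 is kept with tc set to NA; without it that patient is dropped, so the choice depends on whether incomplete records are wanted in the analysis.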


4.5 Data coding

When processing epidemiological data, we often need to convert a continuous
variable into a categorical one. For example, in the diagnosis of
osteoporosis, women whose bone mineral density (BMD) T-score is at or below
-2.5 are classified as having osteoporosis, those with a BMD between -2.5
and -1.0 as having osteopenia, and those above -1.0 as normal. Suppose we
have BMD values from 10 patients as follows:

-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11

To enter these data into R we can use the function c as follows:

bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)

To classify the patients into the 3 groups osteoporosis, osteopenia and
normal, we can use the codes 1, 2 and 3. In other words, we want to create
another variable (call it diagnosis) that takes these 3 values according to
the value of bmd. To do this, we use the commands:

# temporarily set diagnosis equal to bmd
> diagnosis <- bmd

# recode bmd into diagnosis
> diagnosis[bmd <= -2.5] <- 1
> diagnosis[bmd > -2.5 & bmd <= -1.0] <- 2
> diagnosis[bmd > -1.0] <- 3

# combine into a data frame
> data <- data.frame(bmd, diagnosis)

# list the data to check that the commands worked
> data
bmd diagnosis
1 -0.92 3
2 0.21 3
3 0.17 3
4 -3.21 1
5 -1.80 2
6 -2.60 1
7 -2.00 2
8 1.71 3
9 2.12 3
10 -2.11 2
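The same coding can be written in a single expression with nested ifelse calls. This is only an alternative sketch of the rule stated above (1 for bmd at or below -2.5, 2 for bmd above -2.5 and up to -1.0, 3 otherwise), not a replacement for the step-by-step version.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# 1 = osteoporosis (bmd <= -2.5)
# 2 = osteopenia   (-2.5 < bmd <= -1.0)
# 3 = normal       (bmd > -1.0)
diagnosis <- ifelse(bmd <= -2.5, 1,
             ifelse(bmd <= -1.0, 2, 3))

print(data.frame(bmd, diagnosis))
```

Because the conditions are tested in order, the second ifelse only sees values that are already above -2.5, so the lower bound does not need to be repeated.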


4.5.1 Recoding data with replace

Another way to recode data is to use replace, although it looks a little
more verbose. Continuing the example above, we convert bmd into diagnosis
as follows:

> diagnosis <- bmd
> diagnosis <- replace(diagnosis, bmd <= -2.5, 1)
> diagnosis <- replace(diagnosis, bmd > -2.5 & bmd <= -1.0, 2)
> diagnosis <- replace(diagnosis, bmd > -1.0, 3)


4.5.2 Converting to a factor

In statistical analysis we distinguish between a factor (categorical)
variable and an ordinary continuous variable. A factor cannot be used in
arithmetic such as addition, subtraction, multiplication or division,
whereas a numeric variable can. In the bmd and diagnosis example above,
diagnosis is a factor, because the average of codes 1 and 2 has no
practical meaning, whereas bmd is a numeric variable.

At the moment, however, R treats diagnosis as a numeric variable. To turn
it into a factor, we use the function factor as follows:

> diag <- factor(diagnosis)
> diag
[1] 3 3 3 1 2 1 2 3 3 2
Levels: 1 2 3

Note that R now tells us that diag has 3 levels: 1, 2 and 3. If we ask R to
compute the mean of diag, R will not do so, because diag is not a numeric
variable:

> mean(diag)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in: mean.default(diag)

Of course, we can still compute the mean of diagnosis:

> mean(diagnosis)
[1] 2.3

but the result 2.3 has no practical meaning at all.
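A factor becomes easier to read when its levels carry descriptive labels. The sketch below assumes the codes 1, 2 and 3 from the example above; the English label names are chosen here for illustration only.

```r
# Diagnosis codes from the example above
diagnosis <- c(3, 3, 3, 1, 2, 1, 2, 3, 3, 2)

# Attach descriptive labels (illustrative names) to the three codes
diag <- factor(diagnosis, levels = c(1, 2, 3),
               labels = c("osteoporosis", "osteopenia", "normal"))

# Counting is the natural summary for a factor
counts <- table(diag)
print(counts)

# as.integer returns the position of each level, which equals the
# original codes here because the levels were given as 1, 2, 3 in order
codes <- as.integer(diag)
print(codes)
```

A frequency table answers the question a mean cannot: how many patients fall into each diagnostic group.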


4.6 Grouping with cut

A continuous variable can be divided into several groups with the function
cut. For example, suppose we have the variable age as follows:

> age <- c(17,19,22,43,14,8,12,19,20,51,8,12,27,31,44)

The lowest age is 8 and the highest is 51. If we want to divide the ages into 2 groups:

> cut(age, 2)

 [1] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5]
 [7] (7.96,29.5] (7.96,29.5] (7.96,29.5] (29.5,51]   (7.96,29.5] (7.96,29.5]
[13] (7.96,29.5] (29.5,51]   (29.5,51]
Levels: (7.96,29.5] (29.5,51]

cut divides age into 2 groups: group 1 with ages from 7.96 to 29.5, and
group 2 from 29.5 to 51. We can count the subjects in each age group with
the function table as follows:

> table(cut(age, 2))

(7.96,29.5]   (29.5,51]
         11           4

We can also divide age into 3 groups and give them the labels low, medium
and high:

> ageg <- cut(age, 3, labels=c("low", "medium", "high"))
> ageg
 [1] low    low    low    high   low    low    low    low    low    high
[11] low    low    medium medium high
Levels: low medium high

> table(ageg)
ageg
   low medium   high
    10      2      3
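Instead of a number of groups, cut also accepts explicit cut points through breaks. In the sketch below the cut points 18 and 40 and the group labels are hypothetical, chosen only to illustrate the call.

```r
age <- c(17, 19, 22, 43, 14, 8, 12, 19, 20, 51, 8, 12, 27, 31, 44)

# Hand-picked (hypothetical) cut points: up to 18, 19 to 40, over 40.
# By default each interval is open on the left and closed on the right.
ageg <- cut(age, breaks = c(0, 18, 40, Inf),
            labels = c("<=18", "19-40", ">40"))

print(table(ageg))
```

Explicit breaks are useful when the groups must follow clinically or administratively defined limits rather than equal-width intervals.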

Of course, we can also divide age into 4 groups (quartiles) by using the
quantiles at 0, 0.25, 0.50, 0.75 and 1 as break points:

cut(age,
    breaks=quantile(age, c(0, 0.25, 0.50, 0.75, 1)),
    labels=c("q1", "q2", "q3", "q4"),
    include.lowest=TRUE)
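To see what the quartile grouping produces, the break points and the group sizes can be printed. The sketch below uses the age vector from this section; the variable names breaks, ageq and n.lost are introduced here for illustration.

```r
age <- c(17, 19, 22, 43, 14, 8, 12, 19, 20, 51, 8, 12, 27, 31, 44)

# Break points at the minimum, the three quartiles and the maximum
breaks <- quantile(age, c(0, 0.25, 0.50, 0.75, 1))
print(breaks)

# include.lowest = TRUE puts the minimum value into the first group
ageq <- cut(age, breaks = breaks,
            labels = c("q1", "q2", "q3", "q4"),
            include.lowest = TRUE)
print(table(ageq))

# Without include.lowest the minimum falls outside every interval
n.lost <- sum(is.na(cut(age, breaks = breaks)))
print(n.lost)
```

The groups are only approximately equal in size because 15 observations cannot be split evenly into 4 groups and tied values stay together.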


4.7 Grouping data with cut2 (Hmisc)

The function cut above divides a variable according to its values, not
according to the number of observations, so the groups do not contain equal
numbers of observations. In statistical analysis, however, we sometimes
need to divide a continuous variable into groups based on its distribution
so that the groups have equal or roughly equal numbers of observations. For
the variable bmd, for example, we can cut the values into groups of roughly
equal size with the function cut2 (in the Hmisc package) as follows:

> # load the Hmisc package so that cut2 is available
> library(Hmisc)

> bmd <- c(-0.92,0.21,0.17,-3.21,-1.80,-2.60,-2.00,1.71,2.12,-2.11)

> # divide bmd into 2 groups and store the result in the object group
> group <- cut2(bmd, g=2)

> table(group)
group
[-3.21,-0.92) [-0.92, 2.12]
5 5

As the example shows, g = 2 means dividing into 2 groups (g = group). R
automatically forms group 1 from the bmd values between -3.21 and -0.92,
and group 2 from those between -0.92 and 2.12. Each group contains 5
values.

Of course, we can also divide the data into 3 groups with the command:

> group <- cut2(bmd, g=3)

The table command then shows the 3 groups: group 1 has 4 values, and groups
2 and 3 have 3 values each:

> table(group)
group
[-3.21,-1.80) [-1.80, 0.21) [ 0.21, 2.12]
4 3 3
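When Hmisc is not installed, groups of roughly equal size can be approximated in base R by ranking the values. This is a rough sketch and not the algorithm cut2 uses, so for other values of g or for tied data the group boundaries may differ from those of cut2; the names g and grp are introduced here for illustration.

```r
bmd <- c(-0.92, 0.21, 0.17, -3.21, -1.80, -2.60, -2.00, 1.71, 2.12, -2.11)

# Number of groups wanted
g <- 2

# Rank the values from 1 to n, then map the ranks onto groups 1 to g
grp <- ceiling(rank(bmd, ties.method = "first") * g / length(bmd))

# Number of observations in each group
print(table(grp))

# Smallest and largest bmd value within each group
print(tapply(bmd, grp, range))
```

For these 10 values and g = 2 the sketch gives two groups of 5, with -0.92 as the smallest value of the second group, in line with the cut2 output above.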
