Vertical Fragmentation

Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 1

Outline
§ Vertical fragmentation
• Grouping, Splitting

§ Information Requirements

§ Clustering algorithm
• Bond Energy Algorithm (BEA)

§ Partitioning algorithm

§ Checking for correctness
• Completeness, Reconstruction, Disjointness

Vertical Fragmentation
§ Vertical fragmentation of a relation R produces
fragments R1, R2,…, Rr, each of which contains a
subset of R’s attributes as well as the primary key of R.

§ Objective
• To partition a relation into a set of smaller relations so that
many of the user applications will run on only one
fragment.
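
As a rough illustration of the definition above, the following Python sketch projects a relation onto groups of attributes and copies the primary key into every fragment. The sample PROJ tuples, the chosen attribute groups, and the helper name `vertical_fragments` are assumptions made for this example; they are not taken from the slides.

```python
def vertical_fragments(rows, key, attribute_groups):
    """Build one fragment per attribute group; each fragment keeps the key."""
    fragments = []
    for group in attribute_groups:
        # Key columns first, then the group's own (non-key) attributes.
        columns = list(key) + [a for a in group if a not in key]
        fragments.append([{c: row[c] for c in columns} for row in rows])
    return fragments


# Hypothetical sample tuples for PROJ(PNO, PNAME, BUDGET, LOC).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "Database Develop.", "BUDGET": 135000, "LOC": "New York"},
]

# Two fragments: R1(PNO, BUDGET) and R2(PNO, PNAME, LOC).
R1, R2 = vertical_fragments(
    PROJ, key=["PNO"], attribute_groups=[["BUDGET"], ["PNAME", "LOC"]]
)
print(R1[0])
print(R2[0])
```

With this grouping, an application that only looks up BUDGET by PNO would run on R1 alone, which is the objective stated above.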


Vertical Fragmentation
§ Has been studied within the centralized context:
• design methodology
– Lets user queries deal with smaller relations, which reduces the
number of page accesses.
• physical clustering
– The most ‘active’ subrelations are identified and placed in a faster memory
subsystem.

§ More difficult than horizontal fragmentation, because more alternatives exist.

§ Two types of heuristic approaches:
• grouping
– Starts by assigning each attribute to its own fragment and, at each step, joins some of
the fragments until some criterion is satisfied.
• splitting
– Starts with the whole relation and decides on beneficial partitionings based on the
access behaviour of applications to the attributes.
Vertical Fragmentation
§ Grouping
• Results in overlapping fragments

§ Splitting
• Generates non-overlapping fragments (with respect to non-primary-key attributes)
• Fits the top-down design methodology

§ We do not consider the replicated key attributes to be overlapping.

§ Advantage of replicating the key: easier to enforce functional dependencies
(for integrity checking etc.)


VF – Information requirements
§ Application Information
• Attribute affinities
– a measure that indicates how closely related the attributes are
– This is obtained from more primitive usage data
• Attribute usage values (access frequencies)
– Given a set of queries Q = {q1, q2,…, qq} that will run on the
relation R [A1, A2,…, An],

$$
\mathrm{use}(q_i, A_j) =
\begin{cases}
1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\
0 & \text{otherwise}
\end{cases}
$$

use(qi,•) can be defined accordingly

VF – Definition of use(qi,Aj)
Consider the following 4 queries for relation PROJ

q1: SELECT BUDGET
    FROM PROJ
    WHERE PNO=Value

q2: SELECT PNAME, BUDGET
    FROM PROJ

q3: SELECT PNAME
    FROM PROJ
    WHERE LOC=Value

q4: SELECT SUM(BUDGET)
    FROM PROJ
    WHERE LOC=Value


Attribute Usage Matrix (Q Matrix)


Let A1= PNO, A2= PNAME, A3= BUDGET, A4= LOC

      A1   A2   A3   A4
q1     1    0    1    0
q2     0    1    1    0
q3     0    1    0    1
q4     0    0    1    1
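
The matrix above can be derived mechanically from the attributes each query references. The sketch below is one way to do it, assuming the attribute sets are read off the four PROJ queries by hand; the names `QUERY_ATTRS` and `usage_matrix` are hypothetical.

```python
ATTRIBUTES = ["PNO", "PNAME", "BUDGET", "LOC"]  # A1..A4

# Attributes referenced by each query (SELECT list plus WHERE clause).
QUERY_ATTRS = {
    "q1": {"PNO", "BUDGET"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}


def usage_matrix(query_attrs, attributes):
    """use(qi, Aj) = 1 if query qi references attribute Aj, else 0."""
    return {
        q: [1 if a in used else 0 for a in attributes]
        for q, used in query_attrs.items()
    }


for query, row in usage_matrix(QUERY_ATTRS, ATTRIBUTES).items():
    print(query, row)
```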

VF – Access Characteristics
§ Attribute usage values alone are not sufficient
• They do not reflect how frequently each application runs

§ Frequency-weighted measure
• aff(Ai, Aj)
• Measures the bond between two attributes according to
how they are accessed by applications


VF – Affinity Measure aff(Ai,Aj)


The attribute affinity measure between two attributes Ai
and Aj of a relation R [A1, A2, …, An] with respect to the
set of applications Q = {q1, q2, …, qq} is defined as
follows:

$$
\mathrm{aff}(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access})
$$

$$
\text{query access} = \sum_{\text{all sites}} \left( \text{access frequency of a query} \times \frac{\text{access}}{\text{execution}} \right)
$$

Attribute Affinity Matrix
§ This is a matrix that has the attributes of the relation
we are dealing with on both axes

§ To construct the matrix we need to know the
access frequency of each query (across all sites)

§ Then we can work out how often (relatively) each
pair of attributes is accessed together


Query Characteristics
§ Suppose there are three sites accessing the “PROJ” relation,
with these relative frequencies:

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Reading the table: q1, which uses attribute BUDGET, is issued from S3 10 times a day.

Access frequency matrix M: how frequently each query is issued from each site

VF – Calculation of aff(Ai, Aj)
Q Matrix:
      A1   A2   A3   A4
q1     1    0    1    0
q2     0    1    1    0
q3     0    1    0    1
q4     0    0    1    1

Access Frequencies:
      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Assume each query in the previous example accesses the attributes once during each
execution. Then

aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45

and the attribute affinity matrix is

      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78
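
The calculation on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming (as the slide does) that every query accesses its attributes once per execution; the function name `affinity_matrix` and the parameter `acc_per_exec` are hypothetical.

```python
# Attribute usage matrix Q: rows q1..q4, columns A1..A4.
USE = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]

# Access frequencies: rows q1..q4, columns S1..S3.
FREQ = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, freq, acc_per_exec=1):
    """aff(Ai, Aj): sum, over queries using both Ai and Aj, of their
    access frequency over all sites times accesses per execution."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for query_use, query_freq in zip(use, freq):
        query_access = sum(query_freq) * acc_per_exec
        for i in range(n):
            for j in range(n):
                if query_use[i] and query_use[j]:
                    aa[i][j] += query_access
    return aa


for row in affinity_matrix(USE, FREQ):
    print(row)
```

The diagonal entry aff(Ai, Ai) comes out as the total access frequency of all queries that use Ai, which matches the diagonal of the matrix on the slide.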

Class Exercise:
§ Given the following, construct the affinity matrix:

Q Matrix:
      A1   A2   A3   A4   A5   A6
q1     0    1    1    0    1    1
q2     1    1    0    1    0    0
q3     1    0    0    1    1    0
q4     0    0    1    0    0    1

Access Frequencies:
      S1   S2   S3
q1    60    0   45
q2     0    5    0
q3     5    7    2
q4    35   38   13

Answer
A1 A2 A3 A4 A5 A6

A1 19 5 0 19 14 0

A2 5 110 105 5 105 105

A3 0 105 191 0 105 191

A4 19 5 0 19 14 0

A5 14 105 105 14 119 105

A6 0 105 191 0 105 191

VF – Clustering Algorithm
§ The next step is to group together the attributes that have
high affinity for each other; this results in a clustered affinity
matrix
§ The relation is then split accordingly
§ The Bond Energy Algorithm (BEA) has been used for clustering of
entities. BEA finds an ordering of entities (in our case
attributes) such that the global affinity measure

$$
AM = \sum_{i} \sum_{j} (\text{affinity of } A_i \text{ and } A_j \text{ with their neighbors})
$$

is maximized.


Bond Energy Algorithm


Input: The AA (attribute affinity) matrix
Output: The clustered affinity matrix CA, which is a
permutation of the rows and columns of AA

1. Initialization: Place and fix one of the columns of AA in CA.
2. Iteration: Place the remaining n-i columns in the remaining i+1
positions in the CA matrix, where i is the number of columns already
placed. For each column, choose the placement
that makes the largest contribution to the global affinity measure.
3. Row order: Order the rows according to the column ordering.

Bond Energy Algorithm
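“Best” placement? Define the contribution of placing Ak between Ai and Aj:

$$
\mathrm{cont}(A_i, A_k, A_j) = 2\,\mathrm{bond}(A_i, A_k) + 2\,\mathrm{bond}(A_k, A_j) - 2\,\mathrm{bond}(A_i, A_j)
$$

where

$$
\mathrm{bond}(A_x, A_y) = \sum_{z=1}^{n} \mathrm{aff}(A_z, A_x)\,\mathrm{aff}(A_z, A_y)
$$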
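
The bond and contribution measures, together with the placement loop of BEA, can be sketched as follows. This is a minimal illustration rather than a reference implementation. It assumes the first two columns are placed as given, that a bond involving an empty boundary position (represented here by `None`) is 0, and that ties go to the leftmost position. The names `bea_order` and `clustered_matrix` are hypothetical, and the AA matrix is the one computed earlier for PROJ.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay)."""
    if x is None or y is None:  # empty boundary position
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, new, right):
    """Contribution of placing column `new` between `left` and `right`."""
    return (2 * bond(aa, left, new)
            + 2 * bond(aa, new, right)
            - 2 * bond(aa, left, right))


def bea_order(aa):
    """Return the column order (0-based indices) chosen by the sketch."""
    n = len(aa)
    order = list(range(min(2, n)))  # assumption: first two columns fixed
    for new in range(2, n):
        best_pos, best_gain = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            gain = cont(aa, left, new, right)
            if best_gain is None or gain > best_gain:
                best_pos, best_gain = pos, gain
        order.insert(best_pos, new)
    return order


def clustered_matrix(aa, order):
    """Reorder both columns and rows of AA according to `order`."""
    return [[aa[r][c] for c in order] for r in order]


order = bea_order(AA)
print(["A%d" % (i + 1) for i in order])
for row in clustered_matrix(AA, order):
    print(row)
```

For the PROJ matrix this sketch yields the order A1, A3, A2, A4. Different choices for the initial columns or for tie-breaking may give a different but equally valid ordering.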


BEA – Example
Consider the following AA matrix and the corresponding CA matrix where A1
and A2 have been placed. Place A3:

AA =
      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78

CA =
      A1   A2
A1    45    0
A2     0   80
A3    45    5
A4     0   75

A0 and A5 denote the empty positions to the left and right of the placed
columns; any bond involving them is 0.

Ordering (0-3-1):
cont(A0,A3,A1) = 2bond(A0, A3) + 2bond(A3, A1) – 2bond(A0, A1)
= 2*0 + 2*4410 – 2*0 = 8820
Ordering (1-3-2):
cont(A1,A3,A2) = 2bond(A1, A3) + 2bond(A3, A2) – 2bond(A1, A2)
= 2*4410 + 2*890 – 2*225 = 10150 (highest contribution, so A3 is placed between A1 and A2)
Ordering (2-3-5):
cont(A2,A3,A5) = 1780
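
The bond values used in this example are stated without derivation. The worked arithmetic below shows where they come from, assuming the AA matrix given on this slide and a bond of 0 for the empty boundary positions A0 and A5.

```latex
\begin{align*}
\mathrm{bond}(A_3, A_1) &= 45\cdot 45 + 5\cdot 0 + 53\cdot 45 + 3\cdot 0 = 2025 + 2385 = 4410 \\
\mathrm{bond}(A_3, A_2) &= 45\cdot 0 + 5\cdot 80 + 53\cdot 5 + 3\cdot 75 = 400 + 265 + 225 = 890 \\
\mathrm{bond}(A_1, A_2) &= 45\cdot 0 + 0\cdot 80 + 45\cdot 5 + 0\cdot 75 = 225 \\
\mathrm{cont}(A_2, A_3, A_5) &= 2\cdot 890 + 2\cdot 0 - 2\cdot 0 = 1780
\end{align*}
```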
BEA – Example

Therefore, the CA matrix has the form


CA =
      A1   A3   A2
A1    45   45    0
A2     0    5   80
A3    45   53    5
A4     0    3   75


BEA – Example

When A4 is placed, the final form of the CA matrix is

CA =
      A1   A3   A2   A4
A1    45   45    0    0
A2     0    5   80   75
A3    45   53    5    3
A4     0    3   75   78

Note that the placement of A4 is also chosen by maximum contribution.

BEA – Example
Row organization: reorder the rows to match the column order.

CA before reordering rows:
      A1   A3   A2   A4
A1    45   45    0    0
A2     0    5   80   75
A3    45   53    5    3
A4     0    3   75   78

CA after reordering rows:
      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78

How to split the attributes into clusters?

Figure callout: "Only in these cases applications would access tuples from different sites."

VF – Algorithm
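How can you divide a set of clustered attributes {A1, A2,
…, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1,
…, An} such that there are no (or minimal) applications
that access both (or more than one) of the sets?

[Figure: the clustered affinity matrix with attributes A1 … An on both axes. A split point on the diagonal after Ai separates the top-left block TA (attributes A1 … Ai) from the bottom-right block BA (attributes Ai+1 … An).]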


Vertical Splitting

§ Three possibilities to split the ordered set of four attributes into two fragments

§ For each
• Determine what would be the result
§ By computing
• How many accesses are made to attributes from only one of the two
fragments (good) and how many to attributes from both fragments (bad).

VF – Algorithm
Define
TQ = set of applications that access only TA
BQ= set of applications that access only BA
OQ= set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications
that access only TA
CBQ = total number of accesses to attributes by applications
that access only BA
COQ = total number of accesses to attributes by applications
that access both TA and BA
Then find the point along the diagonal that maximizes

$$
CTQ \times CBQ - COQ^{2}
$$
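
The search for the best split point can be sketched as below for the PROJ example. This is an illustrative sketch only. It assumes the clustered order A1, A3, A2, A4, the per-query totals obtained by summing the access frequencies over all sites (45, 5, 75, 3), and one access per execution; the names `split_quality` and `best_split` are hypothetical.

```python
ORDER = ["A1", "A3", "A2", "A4"]  # clustered attribute order

# query -> (attributes used, total accesses over all sites)
QUERIES = {
    "q1": ({"A1", "A3"}, 45),
    "q2": ({"A2", "A3"}, 5),
    "q3": ({"A2", "A4"}, 75),
    "q4": ({"A3", "A4"}, 3),
}


def split_quality(order, queries, point):
    """CTQ * CBQ - COQ^2 for TA = order[:point], BA = order[point:]."""
    top, bottom = set(order[:point]), set(order[point:])
    ctq = cbq = coq = 0
    for attrs, total in queries.values():
        if attrs <= top:        # query touches only TA
            ctq += total
        elif attrs <= bottom:   # query touches only BA
            cbq += total
        else:                   # query touches both fragments
            coq += total
    return ctq * cbq - coq ** 2


def best_split(order, queries):
    """Split point (1 .. n-1) with the highest split quality."""
    return max(range(1, len(order)),
               key=lambda p: split_quality(order, queries, p))


for p in range(1, len(ORDER)):
    print(ORDER[:p], ORDER[p:], split_quality(ORDER, QUERIES, p))
print("best split point:", best_split(ORDER, QUERIES))
```

Under these assumptions the split between A3 and A2 scores highest, giving the attribute groups {A1, A3} and {A2, A4}; the primary key would then be added to each fragment.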


Split Quality
§ Compare candidate splits by computing the split quality
• A positive contribution for the good case
• A negative contribution for the bad case
§ Computing the number is simple, given our
access model
• For each query q1 – q4
– Determine whether it accesses attributes from only one fragment or
from both fragments (by inspecting matrix Q)
– Add up, over all sites, the number of accesses
it makes, taken from Matrix M

Vertical Splitting
[Figure: the split quality computed at each of the three candidate split points; the split with the highest split quality is selected.]


VF – Correctness
A relation R, defined over attribute set A and key K, generates the
vertical partitioning FR = {R1, R2, …, Rr}.
§ Completeness
• Every attribute of R appears in at least one of the
fragments R1, R2, …, Rr

§ Reconstruction
• The initial relation R can be reconstructed from the fragments
R1, R2, …, Rr by joining them on the key K

§ Disjointness
• Each non-primary-key attribute is found in only one of the fragments
• Duplicated keys are not considered to be overlapping. Disjointness is
only on the non-primary key attributes
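
The three rules can be checked mechanically on the attribute sets of a proposed fragmentation. The sketch below is one possible check, not a complete verification procedure. It assumes fragments are given as lists of attribute names and approximates reconstruction by requiring every fragment to contain the key K, so that a join on K is possible. The function name `check_vertical_fragmentation` is hypothetical.

```python
from itertools import combinations


def check_vertical_fragmentation(attributes, key, fragments):
    """Check completeness, reconstruction and disjointness on attribute sets."""
    attributes, key = set(attributes), set(key)
    frags = [set(f) for f in fragments]

    # Completeness: every attribute of R appears in some fragment.
    complete = set().union(*frags) == attributes

    # Reconstruction (approximation): each fragment carries the key,
    # so the fragments can be joined on K.
    reconstructible = all(key <= f for f in frags)

    # Disjointness: fragments may share key attributes only.
    disjoint = all((a & b) <= key for a, b in combinations(frags, 2))

    return {
        "complete": complete,
        "reconstructible": reconstructible,
        "disjoint": disjoint,
    }


result = check_vertical_fragmentation(
    attributes=["PNO", "PNAME", "BUDGET", "LOC"],
    key=["PNO"],
    fragments=[["PNO", "BUDGET"], ["PNO", "PNAME", "LOC"]],
)
print(result)
```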
VF - Exercise
§ Consider the following Attribute Usage Matrix (Q) and Access Frequencies
(M) for a relation R{A1,A2,A3,A4}, which has 4 queries (Q1,Q2,Q3,Q4)
running at 3 sites (S1,S2,S3):

§ Find the Attribute Affinity Matrix (AA)
§ Use the Bond Energy Algorithm to find the Clustered Affinity Matrix (CA)
§ Perform vertical splitting to obtain 2 fragments. Assume that the primary key
of relation R is A1
