Vertical Fragmentation Updated

Vertical Fragmentation
Distributed DBMS © 1998 M. Tamer Özsu & Patrick Valduriez Page 5. 1
Outline
§ Vertical fragmentation
• Grouping, Splitting
§ Information Requirements
§ Clustering algorithm
• Bond Energy Algorithm (BEA)
§ Partitioning algorithm
§ Checking for correctness
• Completeness, Reconstruction, Disjointness
§ Vertical fragmentation of a relation R produces
fragments R1, R2,…, Rr, each of which contains a
subset of R’s attributes as well as the primary key of R.
§ Objective
• To partition a relation into a set of smaller relations so that
many of the user applications will run on only one
fragment.
§ Has been studied within the centralized context:
• design methodology
– Allows user queries to deal with smaller relations, resulting in fewer page accesses.
• physical clustering
– The most ‘active’ subrelations are identified and placed in a faster memory subsystem.
§ More difficult than horizontal fragmentation, because more alternatives exist.

Two types of heuristic approaches:
• grouping
– Starts by assigning each attribute to its own fragment and, at each step, joins some of the fragments until some criterion is satisfied.
• splitting
– Starts with the whole relation and decides on beneficial partitionings based on the access behaviour of applications to the attributes.
§ Grouping
• Results in overlapping fragments
§ Splitting
• Generates non-overlapping fragments (non-primary key attributes)
• Fits the top-down design methodology
§ We do not consider the replicated key attributes to be overlapping.
§ Advantage of replicating the key: easier to enforce functional dependencies (for integrity checking, etc.)
VF – Information requirements
§ Application Information
• Attribute affinities
– a measure that indicates how closely related the attributes are
– This is obtained from more primitive usage data
• Attribute usage values (access frequencies)
– Given a set of queries Q = {q1, q2, …, qq} that will run on the relation R[A1, A2, …, An],

$$
use(q_i, A_j) =
\begin{cases}
1 & \text{if attribute } A_j \text{ is referenced by query } q_i \\
0 & \text{otherwise}
\end{cases}
$$

use(qi,•) can be defined accordingly
VF – Definition of use(qi,Aj)
Consider the following 4 queries for relation PROJ

q1: SELECT BUDGET
    FROM PROJ
    WHERE PNO=Value

q2: SELECT PNAME, BUDGET
    FROM PROJ

q3: SELECT PNAME
    FROM PROJ
    WHERE LOC=Value

q4: SELECT SUM(BUDGET)
    FROM PROJ
    WHERE LOC=Value

Attribute Usage Matrix (Q Matrix)

Let A1= PNO, A2= PNAME, A3= BUDGET, A4= LOC
A1 A2 A3 A4
q1 1 0 1 0
q2 0 1 1 0
q3 0 1 0 1
q4 0 0 1 1
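
The usage matrix above can be derived mechanically from the attributes each query references. The following is a minimal sketch, assuming each query is summarised by hand as the set of attributes in its SELECT list and WHERE clause; the names `QUERY_ATTRIBUTES` and `usage_matrix` are illustrative and do not come from the slides.

```python
# Attributes of PROJ in the order A1..A4 used on the slide.
ATTRIBUTES = ["PNO", "PNAME", "BUDGET", "LOC"]

# Attributes referenced by each query (SELECT list plus WHERE clause),
# transcribed by hand from q1..q4; no SQL parsing is attempted here.
QUERY_ATTRIBUTES = {
    "q1": {"BUDGET", "PNO"},
    "q2": {"PNAME", "BUDGET"},
    "q3": {"PNAME", "LOC"},
    "q4": {"BUDGET", "LOC"},
}


def usage_matrix(query_attributes, attributes):
    """Return use(qi, Aj) as one 0/1 row per query."""
    return {
        query: [1 if attr in used else 0 for attr in attributes]
        for query, used in query_attributes.items()
    }


USE = usage_matrix(QUERY_ATTRIBUTES, ATTRIBUTES)
for query, row in USE.items():
    print(query, row)
```

Each printed row should agree with the corresponding row of the Q matrix.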
VF – Access Characteristics
§ Attribute usage values are not sufficient
• They do not reflect how frequently each application is run
§ Frequency measure
• aff(Ai, Aj)
• Measures the bond between two attributes according to
how they are accessed by applications
VF – Affinity Measure aff(Ai,Aj)

The attribute affinity measure between two attributes Ai and Aj of a relation R[A1, A2, …, An] with respect to the set of applications Q = {q1, q2, …, qq} is defined as follows:

$$
aff(A_i, A_j) = \sum_{\text{all queries that access } A_i \text{ and } A_j} (\text{query access})
$$

$$
\text{query access} = \sum_{\text{all sites}} \left( \text{access frequency of a query} \times \frac{\text{access}}{\text{execution}} \right)
$$
Attribute Affinity Matrix
§ This is a matrix that has the attributes of the relation we are dealing with on both axes
§ To construct the matrix we need to know the frequency of accesses (across all sites) to each query
§ Then we can work out how often (relatively) each pair of attributes is accessed together
Query Characteristics
§ Suppose there are three sites accessing the “PROJ” relation, with these relative frequencies:

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Matrix S describing which type of query is accessed how frequently from each site
(Callout on the q1/S3 entry: queries using attribute BUDGET are issued from S3 10 times a day.)
VF – Calculation of aff(Ai, Aj)
Q Matrix:

      A1   A2   A3   A4
q1     1    0    1    0
q2     0    1    1    0
q3     0    1    0    1
q4     0    0    1    1

Access Frequencies:

      S1   S2   S3
q1    15   20   10
q2     5    0    0
q3    25   25   25
q4     3    0    0

Assume each query in the previous example accesses the attributes once during each execution. Then
aff(A1, A3) = 15*1 + 20*1 + 10*1 = 45
and the attribute affinity matrix is

      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78
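
The calculation above can be checked with a few lines of code. This is a sketch under the slide's assumption that every query accesses its attributes once per execution; the function name `affinity_matrix` and the parameter `acc_per_exec` are illustrative choices, not part of the source.

```python
# use(qi, Aj) for q1..q4 over A1..A4 (the Q matrix).
USE = [
    [1, 0, 1, 0],  # q1
    [0, 1, 1, 0],  # q2
    [0, 1, 0, 1],  # q3
    [0, 0, 1, 1],  # q4
]

# Access frequencies of q1..q4 at sites S1, S2, S3.
FREQ = [
    [15, 20, 10],
    [5, 0, 0],
    [25, 25, 25],
    [3, 0, 0],
]


def affinity_matrix(use, freq, acc_per_exec=1):
    """Sum, for every attribute pair, the accesses of the queries using both."""
    n = len(use[0])
    aa = [[0] * n for _ in range(n)]
    for query_row, site_freqs in zip(use, freq):
        query_access = sum(site_freqs) * acc_per_exec  # summed over all sites
        for i in range(n):
            for j in range(n):
                if query_row[i] and query_row[j]:
                    aa[i][j] += query_access
    return aa


AA = affinity_matrix(USE, FREQ)
for row in AA:
    print(row)
# [45, 0, 45, 0]
# [0, 80, 5, 75]
# [45, 5, 53, 3]
# [0, 75, 3, 78]
```

The diagonal entry for an attribute is simply the total access count of all queries that reference it.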
Class Exercise:
§ Given the following, construct the affinity matrix:
Q Matrix:

      A1   A2   A3   A4   A5   A6
q1     0    1    1    0    1    1
q2     1    1    0    1    0    0
q3     1    0    0    1    1    0
q4     0    0    1    0    0    1

Access Frequencies:

      S1   S2   S3
q1    60    0   45
q2     0    5    0
q3     5    7    2
q4    35   38   13

Answer
A1 A2 A3 A4 A5 A6
A1 19 5 0 19 14 0
A2 5 110 105 5 105 105
A3 0 105 191 0 105 191
A4 19 5 0 19 14 0
A5 14 105 105 14 119 105
A6 0 105 191 0 105 191
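
The answer can be verified by writing the affinity as a weighted sum of products of usage values, aff(Ai, Aj) = Σk total(qk) · use(qk, Ai) · use(qk, Aj). The sketch below assumes one access per execution, as in the earlier example; the variable names are illustrative.

```python
# Q matrix of the class exercise: rows q1..q4, columns A1..A6.
USE = [
    [0, 1, 1, 0, 1, 1],  # q1
    [1, 1, 0, 1, 0, 0],  # q2
    [1, 0, 0, 1, 1, 0],  # q3
    [0, 0, 1, 0, 0, 1],  # q4
]

# Access frequencies of q1..q4 at sites S1, S2, S3.
FREQ = [
    [60, 0, 45],
    [0, 5, 0],
    [5, 7, 2],
    [35, 38, 13],
]

# Total accesses per query over all sites: [105, 5, 14, 86].
TOTALS = [sum(row) for row in FREQ]

n = len(USE[0])
AA = [
    [sum(t * u[i] * u[j] for t, u in zip(TOTALS, USE)) for j in range(n)]
    for i in range(n)
]

for row in AA:
    print(row)
```

Identical rows (A1 and A4, A3 and A6) arise because those attributes are used by exactly the same queries.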
VF – Clustering Algorithm
§ The next step is to group together the attributes that have high affinity for each other; this results in a clustered affinity matrix
§ The relation is then split accordingly
§ The Bond Energy Algorithm (BEA) has been used for clustering of entities. BEA finds an ordering of entities (in our case attributes) such that the global affinity measure

$$
AM = \sum_{i} \sum_{j} (\text{affinity of } A_i \text{ and } A_j \text{ with their neighbors})
$$

is maximized.
Bond Energy Algorithm

Input: The AA (attribute affinity) matrix
Output: The clustered affinity matrix CA, which is a permutation of the rows and columns of AA
1. Initialization: Place and fix one of the columns of AA in CA.
2. Iteration: Place the remaining n-i columns in the remaining i+1 positions in the CA matrix. For each column, choose the placement that makes the most contribution to the global affinity measure.
3. Row order: Order the rows according to the column ordering.
Bond Energy Algorithm
“Best” placement? Define the contribution of placing Ak between Ai and Aj:

$$
cont(A_i, A_k, A_j) = 2\,bond(A_i, A_k) + 2\,bond(A_k, A_j) - 2\,bond(A_i, A_j)
$$

where

$$
bond(A_x, A_y) = \sum_{z=1}^{n} aff(A_z, A_x)\, aff(A_z, A_y)
$$
BEA – Example
Consider the following AA matrix and the corresponding CA matrix where A1 and A2 have been placed. Place A3:

AA =
      A1   A2   A3   A4
A1    45    0   45    0
A2     0   80    5   75
A3    45    5   53    3
A4     0   75    3   78

CA =
      A1   A2
A1    45    0
A2     0   80
A3    45    5
A4     0   75

Here A0 and A5 denote the empty positions to the left and right of the matrix; every bond involving them is 0.
Ordering (0-3-1):
cont(A0,A3,A1) = 2bond(A0,A3) + 2bond(A3,A1) – 2bond(A0,A1)
= 2*0 + 2*4410 – 2*0 = 8820
Ordering (1-3-2):
cont(A1,A3,A2) = 2bond(A1,A3) + 2bond(A3,A2) – 2bond(A1,A2)
= 2*4410 + 2*890 – 2*225 = 10150 (highest contribution, so this placement is chosen)
Ordering (2-3-5):
cont(A2,A3,A5) = 1780
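
The bond and contribution values in this example can be reproduced directly from the AA matrix. The sketch below uses 1-based attribute indices to match the slide and treats any index outside 1..n (such as A0 or A5) as an empty boundary position with bond 0; that boundary handling is an assumption inferred from the arithmetic shown, and the function names are illustrative.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """bond(Ax, Ay) = sum over z of aff(Az, Ax) * aff(Az, Ay); 1-based indices."""
    n = len(aa)
    if not (1 <= x <= n and 1 <= y <= n):
        return 0  # boundary position: no column there, so no bond
    return sum(aa[z][x - 1] * aa[z][y - 1] for z in range(n))


def cont(aa, i, k, j):
    """Net contribution of placing Ak between Ai and Aj."""
    return 2 * bond(aa, i, k) + 2 * bond(aa, k, j) - 2 * bond(aa, i, j)


print(bond(AA, 1, 3), bond(AA, 3, 2), bond(AA, 1, 2))  # 4410 890 225
print(cont(AA, 0, 3, 1))  # 8820
print(cont(AA, 1, 3, 2))  # 10150
print(cont(AA, 2, 3, 5))  # 1780
```

The largest of the three contributions corresponds to placing A3 between A1 and A2.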
BEA – Example
Therefore, the CA matrix has the form

CA =
      A1   A3   A2
A1    45   45    0
A2     0    5   80
A3    45   53    5
A4     0    3   75
BEA – Example
When A4 is placed, the final form of the CA matrix is

CA =
      A1   A3   A2   A4
A1    45   45    0    0
A2     0    5   80   75
A3    45   53    5    3
A4     0    3   75   78

Note that the placement of A4 is also calculated w.r.t. maximum contribution
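
Putting the steps together, the column ordering and the row reordering can be sketched as follows. This assumes, as in the example, that A1 and A2 are placed first and that empty boundary positions contribute a bond of 0; indices are 0-based here, `None` marks a boundary, and the names `bea_order` and `clustered` are illustrative rather than taken from the slides.

```python
AA = [
    [45, 0, 45, 0],
    [0, 80, 5, 75],
    [45, 5, 53, 3],
    [0, 75, 3, 78],
]


def bond(aa, x, y):
    """Bond between columns x and y (0-based); None is an empty boundary."""
    if x is None or y is None:
        return 0
    return sum(row[x] * row[y] for row in aa)


def cont(aa, left, k, right):
    """Contribution of placing column k between columns left and right."""
    return 2 * bond(aa, left, k) + 2 * bond(aa, k, right) - 2 * bond(aa, left, right)


def bea_order(aa):
    """Greedy BEA column ordering, starting from the first two columns."""
    order = [0, 1]
    for k in range(2, len(aa)):
        best_pos, best_cont = 0, None
        for pos in range(len(order) + 1):
            left = order[pos - 1] if pos > 0 else None
            right = order[pos] if pos < len(order) else None
            c = cont(aa, left, k, right)
            if best_cont is None or c > best_cont:
                best_pos, best_cont = pos, c
        order.insert(best_pos, k)
    return order


def clustered(aa, order):
    """Apply the same ordering to columns and rows of AA."""
    return [[aa[r][c] for c in order] for r in order]


ORDER = bea_order(AA)
CA = clustered(AA, ORDER)
print(["A%d" % (i + 1) for i in ORDER])  # ['A1', 'A3', 'A2', 'A4']
for row in CA:
    print(row)
```

For this matrix, A4 obtains its largest contribution at the rightmost position, next to A2.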
BEA – Example
Row organization: the rows of CA are reordered to match the column ordering

CA (before reordering rows) =
      A1   A3   A2   A4
A1    45   45    0    0
A2     0    5   80   75
A3    45   53    5    3
A4     0    3   75   78

CA (after reordering rows) =
      A1   A3   A2   A4
A1    45   45    0    0
A3    45   53    5    3
A2     0    5   80   75
A4     0    3   75   78

How to split the attributes into clusters?
(Figure callout: Only in these cases would applications access tuples from different sites.)
VF – Algorithm
How can you divide a set of clustered attributes {A1, A2, …, An} into two (or more) sets {A1, A2, …, Ai} and {Ai+1, …, An} such that there are no (or minimal) applications that access both (or more than one) of the sets?
[Figure: the clustered affinity matrix over A1 … An, divided at a point on the diagonal between Ai and Ai+1; the upper-left block over A1 … Ai is TA and the lower-right block over Ai+1 … An is BA.]
Vertical Splitting
§ Three possibilities to split the set of attributes into two fragments
§ For each
• Determine what would be the result
§ By computing
• How many accesses are made to attributes from only one of the two fragments (good) and how many to attributes from both fragments (bad).
VF – Algorithm
Define
TQ = set of applications that access only TA
BQ = set of applications that access only BA
OQ = set of applications that access both TA and BA
and
CTQ = total number of accesses to attributes by applications that access only TA
CBQ = total number of accesses to attributes by applications that access only BA
COQ = total number of accesses to attributes by applications that access both TA and BA
Then find the point along the diagonal that maximizes

$$
CTQ * CBQ - COQ^2
$$
Split Quality
§ Compare by computing split quality
• A positive contribution for the good case
• A negative contribution for the bad case
§ Computation of the number is simple, given our access model
• For each query q1 – q4
– Determine whether it accesses attributes from only one fragment or from both fragments (by inspecting matrix Q)
– For each case, add up over all sites the total number of accesses, taken from the access frequency matrix (Matrix M)
Vertical Splitting
[Figure: split quality CTQ ∗ CBQ − COQ² evaluated at each candidate split point; the split with the highest value is chosen.]
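
The three candidate splits for the PROJ example can be evaluated as follows. This is a sketch that assumes the clustered order A1, A3, A2, A4 and per-query access totals obtained by summing the site frequencies (45, 5, 75 and 3); the function name `split_quality` and the data layout are illustrative.

```python
# Clustered attribute order produced by the BEA example.
ORDER = ["A1", "A3", "A2", "A4"]

# For each query: the attributes it uses and its total accesses over all sites.
QUERIES = {
    "q1": ({"A1", "A3"}, 15 + 20 + 10),
    "q2": ({"A2", "A3"}, 5),
    "q3": ({"A2", "A4"}, 25 + 25 + 25),
    "q4": ({"A3", "A4"}, 3),
}


def split_quality(order, queries, point):
    """CTQ * CBQ - COQ^2 for TA = order[:point] and BA = order[point:]."""
    ta, ba = set(order[:point]), set(order[point:])
    ctq = cbq = coq = 0
    for attrs, accesses in queries.values():
        if attrs <= ta:
            ctq += accesses  # query touches only the top fragment
        elif attrs <= ba:
            cbq += accesses  # query touches only the bottom fragment
        else:
            coq += accesses  # query touches both fragments
    return ctq * cbq - coq ** 2


QUALITY = {p: split_quality(ORDER, QUERIES, p) for p in range(1, len(ORDER))}
BEST = max(QUALITY, key=QUALITY.get)
print(QUALITY)                      # {1: -2025, 2: 3311, 3: -6084}
print(ORDER[:BEST], ORDER[BEST:])   # ['A1', 'A3'] ['A2', 'A4']
```

Only the middle split yields a positive value, because it is the only one for which both TQ and BQ are non-empty.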
VF – Correctness
A relation R, defined over attribute set A and key K, generates the vertical partitioning FR = {R1, R2, …, Rr}.
§ Completeness
• Each attribute of the relation appears in at least one of the fragments R1, R2, …, Rr
§ Reconstruction
• The initial relation R can be reconstructed from the fragments R1, R2, …, Rr by joining them on the key K
§ Disjointness
• Each non-key attribute is found in only one of the fragments
• Duplicated keys are not considered to be overlapping. Disjointness applies only to the non-primary key attributes
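
The three correctness rules can be checked on a small instance. The sketch below assumes PROJ is split into {PNO, BUDGET} and {PNO, PNAME, LOC} with the key PNO replicated in both fragments; the sample tuples are made up purely for illustration, and all function names are hypothetical.

```python
KEY = "PNO"
ATTRIBUTES = ["PNO", "PNAME", "BUDGET", "LOC"]
FRAGMENT_ATTRS = [["PNO", "BUDGET"], ["PNO", "PNAME", "LOC"]]

# Made-up sample tuples, used only to exercise the checks.
PROJ = [
    {"PNO": "P1", "PNAME": "Alpha", "BUDGET": 100, "LOC": "X"},
    {"PNO": "P2", "PNAME": "Beta", "BUDGET": 250, "LOC": "Y"},
]


def project(rows, attrs):
    """Vertical fragment: keep only the listed attributes of every tuple."""
    return [{a: row[a] for a in attrs} for row in rows]


def join_on_key(left, right, key):
    """Natural join of two fragments on the replicated key."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]


def is_complete(fragment_attrs, attributes):
    """Every attribute of R appears in at least one fragment."""
    return set().union(*fragment_attrs) == set(attributes)


def is_disjoint(fragment_attrs, key):
    """No non-key attribute appears in more than one fragment."""
    seen = set()
    for attrs in fragment_attrs:
        non_key = set(attrs) - {key}
        if seen & non_key:
            return False
        seen |= non_key
    return True


FRAGMENTS = [project(PROJ, attrs) for attrs in FRAGMENT_ATTRS]
REBUILT = join_on_key(FRAGMENTS[0], FRAGMENTS[1], KEY)

print(is_complete(FRAGMENT_ATTRS, ATTRIBUTES))  # True
print(is_disjoint(FRAGMENT_ATTRS, KEY))         # True
print(REBUILT == PROJ)                          # True
```

Reconstruction works here because the key is present in every fragment, which is why replicated keys are exempt from the disjointness rule.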
VF - Exercise
§ Consider the following Attribute Usage Matrix (Q) and Access Frequencies (M) for a relation R{A1,A2,A3,A4}, which has 4 queries (Q1,Q2,Q3,Q4) running at 3 sites (S1,S2,S3):
§ Find the Attribute Affinity Matrix (AA)
§ Use the Bond Energy Algorithm to find the Clustered Affinity Matrix (CA)
§ Perform vertical splitting to obtain 2 fragments. Assume that the primary key of relation R is A1
