
Efficient and Scalable MetaFeature-based Document Classification using Massively Parallel Computing

Sergio Canuto, Marcos André Gonçalves, Wisllay Santos, Thierson Rosa, Wellington Martins
sergiodaniel@dcc.ufmg.br

Automatic Text Classification (ATC)

- Automatically assign documents (news articles, webpages, tweets, etc.) to predefined categories.
- ATC goal: learn a function F : X → Y, where X = R^d and Y = {1, 2, ..., m}.
- Given:
  - A set of training examples {x_i | x_i ∈ R^d}.
  - For each training instance, its category (y_i, with y_i ∈ {1, 2, ..., m}).

ATC with Meta-level Features

- Transform the original feature space X (bag-of-words) into a new one.
- The new space M is potentially smaller and more informed.
- Our goal changes to finding a function F : M → Y.

Distance-based Meta-features

- Global information: distance between a test example and a class centroid.
- Local information: distance between a test example and each one of its k nearest neighbors.
- They use Cosine, Euclidean and Manhattan distances [1] (a sketch of these distances follows below).

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010.
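A minimal, hypothetical sketch (plain C++, assuming dense tf-idf vectors; not the authors' code) of how one triple of these distance-based meta-features could be computed between a test example and a class centroid; the same three distances would also be computed against each of the k nearest neighbors.

#include <cmath>
#include <vector>

// Cosine, Euclidean and Manhattan distances between a test example x and a
// class centroid c (dense vectors of equal length, e.g. tf-idf weights).
std::vector<double> distance_meta_features(const std::vector<double>& x,
                                           const std::vector<double>& c) {
    double dot = 0.0, nx = 0.0, nc = 0.0, eu = 0.0, man = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        dot += x[j] * c[j];
        nx  += x[j] * x[j];
        nc  += c[j] * c[j];
        eu  += (x[j] - c[j]) * (x[j] - c[j]);
        man += std::fabs(x[j] - c[j]);
    }
    double cosine = (nx > 0.0 && nc > 0.0)
                        ? 1.0 - dot / (std::sqrt(nx) * std::sqrt(nc))
                        : 1.0;                      // empty vector: maximal distance
    return {cosine, std::sqrt(eu), man};            // one meta-feature per distance
}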


Meta-feature Generation Problems

- An efficient meta-feature generator is very important, since the meta-features have to be generated at classification time.
- However, kNN has a slow classification time, and kNN-based meta-features inherit this poor performance.
- For textual datasets the performance problem is aggravated, since the kNN algorithm has to run on high-dimensional data.

GPU-based Meta-feature Generation

- Both the kNN library used in [1] and the state-of-the-art parallel kNN implementation use a D × V matrix, with D training documents and V features.
- Our kNN GPU implementation considers the high dimensionality and heterogeneity of the representation of text documents.
  - It takes advantage of Zipf's law.
- The parallel implementation makes the generation of meta-features feasible for big datasets.

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010.

Inverted Index Implementation

- Compact representation of the inverted index in the GPU memory (a host-side sketch follows the figure):

[Figure: a toy document collection d1..d5 over terms t1..t5 and the arrays stored in GPU memory: E (the entries, one document-term pair per posting), count (the document frequency of each term), index (prefix-sum offsets marking where each term's postings start) and invertedIndex (the postings reordered and grouped by term).]
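A host-side sketch of the compact, CSR-like layout suggested by the figure, assuming each training document is given as a list of its distinct term ids and n_terms ≥ 1 (names such as InvertedIndex and build_index are illustrative, not the authors' actual GTkNN code):

#include <vector>

struct InvertedIndex {
    std::vector<int> count;          // count[t]: number of documents containing term t
    std::vector<int> index;          // index[t]: offset where term t's postings start
    std::vector<int> invertedIndex;  // document ids, grouped by term
};

InvertedIndex build_index(const std::vector<std::vector<int>>& docs, int n_terms) {
    InvertedIndex idx;
    idx.count.assign(n_terms, 0);
    for (const auto& d : docs)                        // first pass: document frequencies
        for (int t : d) ++idx.count[t];
    idx.index.assign(n_terms, 0);
    for (int t = 1; t < n_terms; ++t)                 // exclusive prefix sum of counts
        idx.index[t] = idx.index[t - 1] + idx.count[t - 1];
    idx.invertedIndex.resize(idx.index[n_terms - 1] + idx.count[n_terms - 1]);
    std::vector<int> cursor = idx.index;              // second pass: scatter doc ids
    for (int d = 0; d < static_cast<int>(docs.size()); ++d)
        for (int t : docs[d]) idx.invertedIndex[cursor[t]++] = d;
    return idx;
}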

Calculating the Distances

- For each query, we generate a reduced logical array from the full inverted index (sketched below).

[Figure: for a query containing terms t1, t3 and t4, the postings of those terms are copied in parallel from the full invertedIndex into compact per-query arrays: dfq (the document frequencies restricted to the query terms), indexq (the corresponding offsets) and Eq (the reduced logical array of postings, e.g. d1 d3 d5 d1 d5 d1).]
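A sequential sketch of that per-query reduction, reusing the hypothetical InvertedIndex structure from the earlier sketch; on the GPU this copy is performed in parallel, one thread per posting, but the resulting arrays are the same:

#include <vector>

// Restrict the full inverted index to the query terms, producing the compact
// per-query arrays dfq, indexq and Eq shown in the figure.
void build_query_arrays(const InvertedIndex& idx,
                        const std::vector<int>& query_terms,  // e.g. {t1, t3, t4}
                        std::vector<int>& dfq,
                        std::vector<int>& indexq,
                        std::vector<int>& Eq) {
    int offset = 0;
    for (int t : query_terms) {
        dfq.push_back(idx.count[t]);       // document frequency of this query term
        indexq.push_back(offset);          // where this term's postings start in Eq
        for (int p = idx.index[t]; p < idx.index[t] + idx.count[t]; ++p)
            Eq.push_back(idx.invertedIndex[p]);  // copy the term's postings (doc ids)
        offset += idx.count[t];
    }
}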

Calculating the Distances

- The elements of the logical array are evenly distributed among the GPU cores, so the distance calculations themselves are distributed (an illustrative kernel follows).
- To sort the distances, we use a truncated bitonic sort.
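One way this per-posting parallelism could look as a CUDA kernel (an illustrative sketch with hypothetical array names, not the actual GTkNN implementation): each thread handles one posting of the reduced array and accumulates that training document's partial dot product with an atomic add; the per-document scores are then turned into distances and the k best are selected with a truncated bitonic sort (not shown).

#include <cuda_runtime.h>

// eq_doc[i]: training document of posting i; eq_w[i]: its term weight;
// q_w[i]: weight of the same term in the query (expanded per posting);
// dot[d]: partial similarity accumulated for training document d.
__global__ void accumulate_partial_dot(const int* eq_doc, const float* eq_w,
                                       const float* q_w, float* dot,
                                       int n_postings) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_postings)
        atomicAdd(&dot[eq_doc[i]], eq_w[i] * q_w[i]);
}

// Launch sketch (dot[] zero-initialised, one slot per training document):
//   accumulate_partial_dot<<<(n_postings + 255) / 256, 256>>>(
//       eq_doc, eq_w, q_w, dot, n_postings);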

Experimental Setup

- Evaluation: efficiency measured by the wall-clock time of each experiment, and effectiveness by MicroF1 and MacroF1.
- Software: Liblinear 1.92, Ubuntu Server 12.04.
- Hardware: Intel Core i7-870, 16GB RAM, NVIDIA Tesla K40.
- General information on the datasets:

Dataset    Classes   # attributes   # documents   Density    Space in Disk
4UNI             7         40,194         8,274    140.325           14MB
20NG            20         61,049        18,766    130.780           30MB
ACM             11         59,990        24,897     38.805          8.5MB
REUT90          90         19,589        13,327    78.1646           13MB
MED              7        803,358       861,454     31.805          327MB
RCV1Uni        103        134,932       804,427     79.133          884MB

Experimental Results - Effectiveness

- Bag-of-words presented good results on our two big datasets.
- Big datasets allow SVM to deal better with the high-dimensional data.
- The combination of meta-features and bag-of-words improves the results on our big datasets.

Dataset    Meta-features    Bag              Bag + Meta-features
4UNI       62.50 ± 2.27     54.55 ± 1.64     62.93 ± 2.03
20NG       89.26 ± 0.23     87.08 ± 0.33     90.11 ± 0.30
ACM        63.83 ± 2.05     53.62 ± 1.12     63.58 ± 1.15
REUT90     38.96 ± 1.04     29.13 ± 2.03     37.36 ± 1.31
MED        74.33 ± 0.17     75.15 ± 0.18     79.90 ± 0.20
RCV1Uni    55.77 ± 0.92     55.32 ± 0.66     57.21 ± 0.32

Table: MacroF1 of the meta-features, bag-of-words, and the combination of meta-features and bag-of-words.

Experimental Results - Execution Time

- On the small datasets:
  - High speedup in relation to the non-parallel ANN.
  - The parallel BF-CUDA does not optimize the distance calculations to deal with textual documents.
  - Low speedup on REUT90.
- GTkNN was the only implementation able to generate meta-features for the larger datasets.

                    Execution Time                            Speedup
Dataset    GTkNN          BF-CUDA        ANN                  BF-CUDA   ANN
4UNI       40 ± 1         259 ± 46       1590 ± 29                6.4   39.6
20NG       187 ± 4        2004 ± 17      10947 ± 1323            10.7   68.7
ACM        112 ± 3        1760 ± 91      13589 ± 1539            15.7   141.3
REUT90     625 ± 12       2242 ± 5       3024 ± 303               3.6   4.8
MED        4637 ± 43      *              *                        *     *
RCV1Uni    33884 ± 111    *              *                        *     *

Table: Average time in seconds to generate meta-features using different kNN strategies.

Experimental Results - Execution Time

[Figure: execution time in seconds (y-axis) versus number of training samples (x-axis, 200 to 2000) for BF-CUDA, ANN and GTkNN.]

Figure: Time to generate meta-features for one example with different sample sizes from the MED dataset. GTkNN keeps a very low execution time (up to 0.005 seconds); the other two approaches slow down dramatically as the training dataset grows in size.

Experimental Results - Memory and Efficiency of the Literature Implementations

- Our inverted index structure provides a very compact way to represent the documents.
- The traditional data representation, a D × V matrix with D training documents and V features, is not a good choice.
  - It is infeasible for large datasets with many documents per class.

Memory Consumption
Dataset    GTkNN    BF-CUDA    ANN
4UNI       92       1697       945
20NG       93       1257       395
ACM        90       2541       2487
REUT90     90       909        494
MED        339      1859104    2857048
RCV1Uni    120      43245      69328

Table: Memory consumption in Megabytes.
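As a rough illustration of why the dense D × V matrix does not scale (a back-of-the-envelope estimate assuming 4-byte single-precision entries, not a figure taken from the paper): for MED, 861,454 documents × 803,358 features × 4 bytes ≈ 2.8 × 10^12 bytes, i.e. on the order of terabytes, which is the same order of magnitude reported for BF-CUDA and ANN in the table above.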

Conclusions and Future Work

- We provide an efficient and scalable way to generate meta-features.
- We analyse the behaviour of meta-features on big datasets.
- As future work, we intend to explore other classification and ranking tasks.

sergiodaniel@dcc.ufmg.br
