
Efficient and Scalable MetaFeature-based Document Classification using Massively Parallel Computing

Sergio Canuto, Marcos André Gonçalves, Wisllay Santos, Thierson Rosa, Wellington Martins
sergiodaniel@dcc.ufmg.br

Automatic Text Classification (ATC)

- Automatically assign documents (news articles, webpages, tweets, etc.) to predefined categories.
- ATC goal: learn a function F : X → Y, where X = R^d and Y = {1, 2, ..., m}.
- Given:
  - A set of training examples {x_i | x_i ∈ R^d}.
  - For each training instance, its category (y_i, with y_i ∈ {1, 2, ..., m}).

ATC with Meta-level Features

- Transform the original feature space X (bag-of-words) into a new one.
- The new space M is potentially smaller and more informed.
- Our goal changes to finding a function F : M → Y.

Distance-based Meta-features

- Global information: distance between a test example and a class centroid.
- Local information: distance between a test example and each one of its k nearest neighbors.
- They use Cosine, Euclidean and Manhattan distances [1] (a sketch of these distances follows below).

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010.
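A minimal, hypothetical sketch (plain C++, assuming dense tf-idf vectors; not the authors' code) of how one triple of these distance-based meta-features could be computed between a test example and a class centroid; the same three distances would also be computed against each of the k nearest neighbors.

#include <cmath>
#include <vector>

// Cosine, Euclidean and Manhattan distances between a test example x and a
// class centroid c (dense vectors of equal length, e.g. tf-idf weights).
std::vector<double> distance_meta_features(const std::vector<double>& x,
                                           const std::vector<double>& c) {
    double dot = 0.0, nx = 0.0, nc = 0.0, eu = 0.0, man = 0.0;
    for (std::size_t j = 0; j < x.size(); ++j) {
        dot += x[j] * c[j];
        nx  += x[j] * x[j];
        nc  += c[j] * c[j];
        eu  += (x[j] - c[j]) * (x[j] - c[j]);
        man += std::fabs(x[j] - c[j]);
    }
    double cosine = (nx > 0.0 && nc > 0.0)
                        ? 1.0 - dot / (std::sqrt(nx) * std::sqrt(nc))
                        : 1.0;                      // empty vector: maximal distance
    return {cosine, std::sqrt(eu), man};            // one meta-feature per distance
}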


Meta-feature Generation Problems

- An efficient meta-feature generator is very important, since the meta-features have to be generated at classification time.
- However, kNN has a slow classification time, and kNN-based meta-features inherit this poor performance.
- For textual datasets the performance problem is aggravated, since the kNN algorithm has to run on high-dimensional data.

GPU-based Meta-feature Generation

- Both the kNN library used in [1] and the state-of-the-art parallel kNN implementation use a D × V matrix, with D training documents and V features.
- Our kNN GPU implementation considers the high dimensionality and heterogeneity of the representation of text documents.
  - It takes advantage of Zipf's law.
- The parallel implementation makes the generation of meta-features feasible for big datasets.

[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315–322, 2010.

Inverted Index Implementation

- Compact representation of the inverted index in the GPU memory (a host-side sketch follows the figure):

[Figure: a toy document collection d1..d5 over terms t1..t5 and the arrays stored in GPU memory: E (the entries, one document-term pair per posting), count (the document frequency of each term), index (prefix-sum offsets marking where each term's postings start) and invertedIndex (the postings reordered and grouped by term).]
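A host-side sketch of the compact, CSR-like layout suggested by the figure, assuming each training document is given as a list of its distinct term ids and n_terms ≥ 1 (names such as InvertedIndex and build_index are illustrative, not the authors' actual GTkNN code):

#include <vector>

struct InvertedIndex {
    std::vector<int> count;          // count[t]: number of documents containing term t
    std::vector<int> index;          // index[t]: offset where term t's postings start
    std::vector<int> invertedIndex;  // document ids, grouped by term
};

InvertedIndex build_index(const std::vector<std::vector<int>>& docs, int n_terms) {
    InvertedIndex idx;
    idx.count.assign(n_terms, 0);
    for (const auto& d : docs)                        // first pass: document frequencies
        for (int t : d) ++idx.count[t];
    idx.index.assign(n_terms, 0);
    for (int t = 1; t < n_terms; ++t)                 // exclusive prefix sum of counts
        idx.index[t] = idx.index[t - 1] + idx.count[t - 1];
    idx.invertedIndex.resize(idx.index[n_terms - 1] + idx.count[n_terms - 1]);
    std::vector<int> cursor = idx.index;              // second pass: scatter doc ids
    for (int d = 0; d < static_cast<int>(docs.size()); ++d)
        for (int t : docs[d]) idx.invertedIndex[cursor[t]++] = d;
    return idx;
}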

Calculating the Distances

- For each query, we generate a reduced logical array from the full inverted index (sketched below).

[Figure: for a query containing terms t1, t3 and t4, the postings of those terms are copied in parallel from the full invertedIndex into compact per-query arrays: dfq (the document frequencies restricted to the query terms), indexq (the corresponding offsets) and Eq (the reduced logical array of postings, e.g. d1 d3 d5 d1 d5 d1).]
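A sequential sketch of that per-query reduction, reusing the hypothetical InvertedIndex structure from the earlier sketch; on the GPU this copy is performed in parallel, one thread per posting, but the resulting arrays are the same:

#include <vector>

// Restrict the full inverted index to the query terms, producing the compact
// per-query arrays dfq, indexq and Eq shown in the figure.
void build_query_arrays(const InvertedIndex& idx,
                        const std::vector<int>& query_terms,  // e.g. {t1, t3, t4}
                        std::vector<int>& dfq,
                        std::vector<int>& indexq,
                        std::vector<int>& Eq) {
    int offset = 0;
    for (int t : query_terms) {
        dfq.push_back(idx.count[t]);       // document frequency of this query term
        indexq.push_back(offset);          // where this term's postings start in Eq
        for (int p = idx.index[t]; p < idx.index[t] + idx.count[t]; ++p)
            Eq.push_back(idx.invertedIndex[p]);  // copy the term's postings (doc ids)
        offset += idx.count[t];
    }
}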

Calculating the Distances

- The elements of the logical array are evenly distributed among the GPU cores, so the distance calculations themselves are distributed (an illustrative kernel follows).
- To sort the distances, we use a truncated bitonic sort.
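One way this per-posting parallelism could look as a CUDA kernel (an illustrative sketch with hypothetical array names, not the actual GTkNN implementation): each thread handles one posting of the reduced array and accumulates that training document's partial dot product with an atomic add; the per-document scores are then turned into distances and the k best are selected with a truncated bitonic sort (not shown).

#include <cuda_runtime.h>

// eq_doc[i]: training document of posting i; eq_w[i]: its term weight;
// q_w[i]: weight of the same term in the query (expanded per posting);
// dot[d]: partial similarity accumulated for training document d.
__global__ void accumulate_partial_dot(const int* eq_doc, const float* eq_w,
                                       const float* q_w, float* dot,
                                       int n_postings) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_postings)
        atomicAdd(&dot[eq_doc[i]], eq_w[i] * q_w[i]);
}

// Launch sketch (dot[] zero-initialised, one slot per training document):
//   accumulate_partial_dot<<<(n_postings + 255) / 256, 256>>>(
//       eq_doc, eq_w, q_w, dot, n_postings);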

Experimental Setup

- Evaluation: efficiency measured by the wall-clock time of each experiment, and effectiveness by MicroF1 and MacroF1.
- Software: Liblinear 1.92, Ubuntu Server 12.04.
- Hardware: Intel Core i7-870, 16GB RAM, NVIDIA Tesla K40.
- General information on the datasets:

Dataset    Classes   # attributes   # documents   Density    Space in Disk
4UNI             7         40,194         8,274    140.325           14MB
20NG            20         61,049        18,766    130.780           30MB
ACM             11         59,990        24,897     38.805          8.5MB
REUT90          90         19,589        13,327    78.1646           13MB
MED              7        803,358       861,454     31.805          327MB
RCV1Uni        103        134,932       804,427     79.133          884MB

Experimental Results - Effectiveness

- Bag-of-words presented good results on our two big datasets.
- Big datasets allow SVM to deal better with the high-dimensional data.
- The combination of meta-features and bag-of-words improves the results on our big datasets.

Dataset    Meta-features    Bag              Bag + Meta-features
4UNI       62.50 ± 2.27     54.55 ± 1.64     62.93 ± 2.03
20NG       89.26 ± 0.23     87.08 ± 0.33     90.11 ± 0.30
ACM        63.83 ± 2.05     53.62 ± 1.12     63.58 ± 1.15
REUT90     38.96 ± 1.04     29.13 ± 2.03     37.36 ± 1.31
MED        74.33 ± 0.17     75.15 ± 0.18     79.90 ± 0.20
RCV1Uni    55.77 ± 0.92     55.32 ± 0.66     57.21 ± 0.32

Table: MacroF1 of the meta-features, bag-of-words, and the combination of meta-features and bag-of-words.

Experimental Results - Execution Time

- On the small datasets:
  - High speedup in relation to the non-parallel ANN.
  - The parallel BF-CUDA does not optimize the distance calculations to deal with textual documents.
  - Low speedup on REUT90.
- GTkNN was the only implementation able to generate meta-features for the larger datasets.

                    Execution Time                            Speedup
Dataset    GTkNN          BF-CUDA        ANN                  BF-CUDA   ANN
4UNI       40 ± 1         259 ± 46       1590 ± 29                6.4   39.6
20NG       187 ± 4        2004 ± 17      10947 ± 1323            10.7   68.7
ACM        112 ± 3        1760 ± 91      13589 ± 1539            15.7   141.3
REUT90     625 ± 12       2242 ± 5       3024 ± 303               3.6   4.8
MED        4637 ± 43      *              *                        *     *
RCV1Uni    33884 ± 111    *              *                        *     *

Table: Average time in seconds to generate meta-features using different kNN strategies.

Experimental Results - Execution Time

[Figure: execution time in seconds (y-axis) versus number of training samples (x-axis, 200 to 2000) for BF-CUDA, ANN and GTkNN.]

Figure: Time to generate meta-features for one example with different sample sizes from the MED dataset. GTkNN keeps a very low execution time (up to 0.005 seconds); the other two approaches slow down dramatically as the training dataset grows in size.

Experimental Results - Memory and Efficiency of the Literature Implementations

- Our inverted index structure provides a very compact way to represent the documents.
- The traditional data representation, a D × V matrix with D training documents and V features, is not a good choice.
  - It is infeasible for large datasets with many documents per class.

Memory Consumption
Dataset    GTkNN    BF-CUDA    ANN
4UNI       92       1697       945
20NG       93       1257       395
ACM        90       2541       2487
REUT90     90       909        494
MED        339      1859104    2857048
RCV1Uni    120      43245      69328

Table: Memory consumption in Megabytes.
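As a rough illustration of why the dense D × V matrix does not scale (a back-of-the-envelope estimate assuming 4-byte single-precision entries, not a figure taken from the paper): for MED, 861,454 documents × 803,358 features × 4 bytes ≈ 2.8 × 10^12 bytes, i.e. on the order of terabytes, which is the same order of magnitude reported for BF-CUDA and ANN in the table above.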

Conclusions and Future Work

- We provide an efficient and scalable way to generate meta-features.
- We analyse the behaviour of meta-features on big datasets.
- As future work, we intend to explore other classification and ranking tasks.

sergiodaniel@dcc.ufmg.br
