ATC goal:
F : X → Y, X = R^d, Y = {1, 2, ..., m}
Given:
Distance-based Meta-features
Global information:
Local information:
[1] S. Gopal and Y. Yang. Multilabel classification with meta-level features. In Proc. SIGIR, pages 315-322, 2010.
Both the kNN library used in [1] and the state-of-the-art parallel kNN
implementation use a D x V matrix, with D training documents
and V features.
Our GPU kNN implementation instead takes into account the high
dimensionality and heterogeneity of text document
representations.
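To make the cost of a full D x V matrix concrete, the sketch below compares its size with a list of non-zero (document, term, weight) entries. The collection sizes, the 4-byte values, the 12-byte entry layout, and both function names are hypothetical and chosen only for illustration; they are not taken from any of the implementations discussed here.

```python
def dense_matrix_bytes(num_docs, num_features, bytes_per_value=4):
    """Bytes needed to store every cell of a D x V matrix."""
    return num_docs * num_features * bytes_per_value


def sparse_entries_bytes(num_entries, bytes_per_entry=12):
    """Bytes needed to store only the non-zero entries.

    12 bytes assumes two 4-byte ids and one 4-byte weight (hypothetical layout).
    """
    return num_entries * bytes_per_entry


# Hypothetical collection: 10,000 documents, 50,000 features,
# and about 100 distinct terms per document on average.
num_docs, num_features, terms_per_doc = 10_000, 50_000, 100

dense = dense_matrix_bytes(num_docs, num_features)
sparse = sparse_entries_bytes(num_docs * terms_per_doc)
print(f"dense: {dense / 1e6:.0f} MB, sparse: {sparse / 1e6:.0f} MB")
```

Under these assumed sizes almost all cells of the matrix are zero, so storing only the non-zero entries is far smaller.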
Figure: building the inverted index from the document collection.
E (entries), positions 0-9:
  doc:  d1 d1 d1 d2 d2 d3 d4 d5 d5 d5
  term: t1 t3 t4 t2 t5 t1 t2 t1 t3 t5
count, index: one value per term (t1 t2 t3 t4 t5)
invertedIndex, positions 0-9:
  term: t1 t1 t1 t2 t2 t3 t3 t4 t5 t5
  doc:  d1 d3 d5 d2 d4 d1 d5 d1 d2 d5
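The construction shown in the figure (E, count, index, invertedIndex) can be sketched as a count, an exclusive prefix sum, and a scatter. This is a minimal sequential Python sketch, not the GPU implementation: the function name `build_inverted_index`, the integer encoding of documents and terms, and the `next_slot` helper are assumptions made for illustration.

```python
from itertools import accumulate


def build_inverted_index(entries, num_terms):
    """entries: list of (doc_id, term_id) pairs, term ids in 0..num_terms-1."""
    # count[t] = number of entries (documents) that contain term t
    count = [0] * num_terms
    for _, term in entries:
        count[term] += 1

    # index[t] = first slot of term t in the inverted index
    # (exclusive prefix sum over count)
    index = list(accumulate([0] + count[:-1]))

    # Scatter every entry into the next free slot of its term's segment.
    next_slot = list(index)
    inverted = [None] * len(entries)
    for doc, term in entries:
        inverted[next_slot[term]] = (term, doc)
        next_slot[term] += 1
    return count, index, inverted


# Toy collection from the figure, encoding d1..d5 as 1..5 and t1..t5 as 0..4.
E = [(1, 0), (1, 2), (1, 3), (2, 1), (2, 4),
     (3, 0), (4, 1), (5, 0), (5, 2), (5, 4)]
count, index, inverted = build_inverted_index(E, num_terms=5)
print(count)     # [3, 2, 2, 1, 2]
print(index)     # [0, 3, 5, 7, 8]
print(inverted)
```

The resulting `inverted` list groups the entries by term in the same order as the invertedIndex row of the figure. On a GPU the counting and scatter steps would typically need atomic updates or a sort, which this sketch does not model.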
Figure: extracting the entries of a query from the inverted index.
query: t1 t3 t4
Other labels in the figure: df, index, dfq, indexq, startq, Logical array, Parallel copy
Eq, positions 0-5:
  term: t1 t1 t1 t3 t3 t4
  doc:  d1 d3 d5 d1 d5 d1
invertedIndex, positions 0-9:
  term: t1 t1 t1 t2 t2 t3 t3 t4 t5 t5
  doc:  d1 d3 d5 d2 d4 d1 d5 d1 d2 d5
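The query step in the figure (df, dfq, indexq, Eq) can be sketched as follows: look up the document frequency of each query term, prefix-sum those frequencies to find where each term's segment starts in Eq, and copy the segments out of the inverted index. This is a sequential Python sketch under stated assumptions: the function name `extract_query_entries` is hypothetical, and treating indexq as an exclusive prefix sum is an assumption, since the figure residue does not make its exact contents clear.

```python
from itertools import accumulate


def extract_query_entries(query_terms, df, index, inverted):
    """Copy the inverted-index segments of the query terms into one array."""
    # dfq[i] = document frequency of the i-th query term
    dfq = [df[t] for t in query_terms]

    # indexq[i] = first slot of the i-th query term in Eq
    # (assumed to be an exclusive prefix sum over dfq)
    indexq = list(accumulate([0] + dfq[:-1]))

    eq = [None] * sum(dfq)
    for i, term in enumerate(query_terms):
        for j in range(dfq[i]):
            # Each (i, j) copy is independent of the others, so on a GPU
            # every copy could be handled by a separate thread.
            eq[indexq[i] + j] = inverted[index[term] + j]
    return dfq, indexq, eq


# Arrays for the toy collection, encoding d1..d5 as 1..5 and t1..t5 as 0..4.
df = [3, 2, 2, 1, 2]
index = [0, 3, 5, 7, 8]
inverted = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 4),
            (2, 1), (2, 5), (3, 1), (4, 2), (4, 5)]

# Query with terms t1, t3, t4.
dfq, indexq, eq = extract_query_entries([0, 2, 3], df, index, inverted)
print(dfq)     # [3, 2, 1]
print(indexq)  # [0, 3, 5]
print(eq)
```

Only documents that share at least one term with the query appear in `eq`, so any later distance computation can skip the remaining documents.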
Experimental Setup

Dataset   Classes  # attributes  # documents  Density   Space on disk
4UNI      7        40,194        8,274        140.325   14MB
20NG      20       61,049        18,766       130.780   30MB
ACM       11       59,990        24,897       38.805    8.5MB
REUT90    90       19,589        13,327       78.1646   13MB
MED       7        803,358       861,454      31.805    327MB
RCV1Uni   103      134,932       804,427      79.133    884MB
Dataset   Meta-features   Bag            Bag + Meta-features
4UNI      62.50 ± 2.27    54.55 ± 1.64   62.93 ± 2.03
20NG      89.26 ± 0.23    87.08 ± 0.33   90.11 ± 0.30
ACM       63.83 ± 2.05    53.62 ± 1.12   63.58 ± 1.15
REUT90    38.96 ± 1.04    29.13 ± 2.03   37.36 ± 1.31
MED       74.33 ± 0.17    75.15 ± 0.18   79.90 ± 0.20
RCV1Uni   55.77 ± 0.92    55.32 ± 0.66   57.21 ± 0.32
          Execution Time                                Speedup
Dataset   GTkNN         BF-CUDA      ANN             BF-CUDA  ANN
4UNI      40 ± 1        259 ± 46     1590 ± 29       6.4      39.6
20NG      187 ± 4       2004 ± 17    10947 ± 1323    10.7     68.7
ACM       112 ± 3       1760 ± 91    13589 ± 1539    15.7     141.3
REUT90    625 ± 12      2242 ± 5     3024 ± 303      3.6      4.8
MED       4637 ± 43     *            *               *        *
RCV1Uni   33884 ± 111   *            *               *        *
Figure: Time (in seconds) to generate meta-features for one example with different sample sizes
from the MED dataset. GTkNN keeps a very low execution time (up to 0.005 seconds).
The other two approaches slow down dramatically as the training dataset grows in size.
          Memory Consumption
Dataset   GTkNN   BF-CUDA   ANN
4UNI      92      1697      945
20NG      93      1257      395
ACM       90      2541      2487
REUT90    90      909       494
MED       339     1859104   2857048
RCV1Uni   120     43245     69328
sergiodaniel@dcc.ufmg.br