Weka Tutorials Spanish

Practical Data Mining
Tutorial 1: Introduction to the WEKA Explorer
Mark Hall, Eibe Frank and Ian H. Witten
May 5, 2011
© 2006-2012 University of Waikato
1 Getting started
This tutorial introduces the main graphical user interface for accessing WEKA’s facilities, called the WEKA Explorer. We will work with WEKA 3.6 (although almost everything is the same with other versions), and we assume that it is installed on your system.
Invoke WEKA from the Windows START menu (on Linux or the Mac, double-click weka.jar or weka.app). This starts up the WEKA GUI Chooser. Click the Explorer button to enter the WEKA Explorer.
Just in case you are wondering about the other buttons in the GUI Chooser: Experimenter is a user interface for comparing the predictive performance of learning algorithms; KnowledgeFlow is a component-based interface with functionality similar to that of the Explorer; and Simple CLI opens a command-line interface that emulates a terminal and lets you interact with WEKA in this fashion.
2 The panels in the Explorer
The user interface to the Explorer consists of six panels, invoked by the tabs at the top of the window. The Preprocess panel is the one that is open when the Explorer is first started. This tutorial will introduce you to two others as well: Classify and Visualize. (The remaining three panels are explained in later tutorials.) Here’s a brief description of the functions that these three panels perform.
Preprocess is where you load and preprocess data. Once a dataset has been loaded, the panel displays information about it. The dataset can be modified, either by editing it manually or by applying a filter, and the modified version can be saved. As an alternative to loading a pre-existing dataset, an artificial one can be created by using a generator. It is also possible to load data from a URL or from a database.
Classify is where you invoke the classification methods in WEKA. Several options for the classification process can be set, and the result of the classification can be viewed. The training dataset used for classification is the one loaded (or generated) in the Preprocess panel.
Visualize is where you can visualize the dataset loaded in the Preprocess panel as two-dimensional scatter plots. You can select the attributes for the x and y axes.
3 The Preprocess panel
Preprocess is the panel that opens when the WEKA Explorer is started.
3.1 Loading a dataset
Before changing to any other panel, the Explorer must have a dataset to work with. To load one, click the Open file... button in the top left corner of the panel. Browse to the folder containing datasets, and locate a file named weather.nominal.arff (this file is in the data folder that is supplied when WEKA is installed). This contains the nominal version of the standard “weather” dataset. Open this file. Now your screen will look like Figure 1.
The weather data is a small dataset with only 14 examples for learning. Because each row is an independent example, the rows/examples are called “instances.” The instances of the weather dataset have 5 attributes, with names ‘outlook’, ‘temperature’, ‘humidity’, ‘windy’ and ‘play’. If you click on the name of an attribute in the left sub-panel, information about the selected attribute will be shown on the right. You can see the values of the attribute and how many times an instance in the dataset has a particular value. This information is also shown in the form of a histogram.
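The per-value counts shown in the right-hand sub-panel amount to a frequency table of one attribute. The following is a minimal sketch of that idea in Python, not WEKA's own code; the function name `value_counts` and the `colour` column are hypothetical, and the values are made up for illustration (they are not taken from the weather data).

```python
from collections import Counter

def value_counts(column):
    """Return how many instances take each value of a nominal attribute."""
    return Counter(column)

# Hypothetical attribute column, one entry per instance (illustrative only).
colour = ["red", "green", "red", "blue", "red", "green"]

counts = value_counts(colour)
for value, count in sorted(counts.items()):
    # A crude text histogram: one '#' per instance with this value.
    print(f"{value:<6} {count:>2} {'#' * count}")
```

The counts always sum to the number of instances, which is a quick sanity check when reading the histogram in the Explorer.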
Figure 1: The Explorer’s Preprocess panel.
All attributes in this dataset are “nominal,” i.e. they have a predefined finite set of values. Each instance describes a weather forecast for a particular day and whether to play a certain game on that day. It is not really clear what the game is, but let us assume that it is golf. The last attribute, ‘play’, is the “class” attribute: it classifies the instance. Its value can be ‘yes’ or ‘no’: ‘yes’ means that the weather conditions are OK for playing golf, and ‘no’ means they are not OK.
3.2 Exercises
To familiarize yourself with the functions discussed so far, please do the following two exercises. The solutions to these and other exercises in this tutorial are given at the end.
Ex. 1: What are the values that the attribute ‘temperature’ can have?
Ex. 2: Load a new dataset. Press the ‘Open file’ button and select the file iris.arff. How many instances does this dataset have? How many attributes? What is the range of possible values of the attribute ‘petallength’?
3.3 The dataset editor
It is possible to view and edit an entire dataset from within WEKA. To experiment with this, load the file weather.nominal.arff again. Click the Edit... button from the row of buttons at the top of the Preprocess panel. This opens a new window called Viewer, which lists all instances of the weather data (see Figure 2).
3.3.1 Exercises
Ex. 3: What is the function of the first column in the Viewer?
Ex. 4: Considering the weather data, what is the class value of instance number 8?
Ex. 5: Load the iris data and open it in the editor. How many numeric and how many nominal attributes does this dataset have?
3.4 Applying a filter
In WEKA, “filters” are methods that can be used to modify datasets in a systematic fashion; that is, they are data preprocessing tools. WEKA has several filters for different tasks. Reload the weather.nominal dataset, and let’s remove an attribute from it. The appropriate filter is called Remove; its full name is:
weka.filters.unsupervised.attribute.Remove
Figure 2: The data viewer.
Examine this name carefully. Filters are organized into a hierarchical structure whose root is weka. Those in the unsupervised category don’t require a class attribute to be set; those in the supervised category do. Filters are further divided into ones that operate primarily on attributes/columns (the attribute category) and ones that operate primarily on instances/rows (the instance category).
If you click the Choose button in the Preprocess panel, a hierarchical editor opens in which you select a filter by following the path corresponding to its full name. Use the path given in the full name above to select the Remove filter. Once it is selected, the text “Remove” will appear in the field next to the Choose button.
Click on the field containing this text. A window called the GenericObjectEditor opens, which is used throughout WEKA to set parameter values for all of the tools. It contains a short explanation of the Remove filter; click More to get a fuller description. Underneath there are two fields in which the options of the filter can be set. The first option is a list of attribute numbers. The second option, InvertSelection, is a switch. If it is ‘false’, the specified attributes are removed; if it is ‘true’, these attributes are NOT removed and all the others are removed instead.
Enter “3” into the attributeIndices field and click the OK button. The window with the filter options closes. Now click the Apply button on the right, which runs the data through the filter. The filter removes the attribute with index 3 from the dataset, and you can see that the set of attributes has been reduced. This change does not affect the dataset in the file; it only applies to the data held in memory. The changed dataset can be saved to a new ARFF file by pressing the Save... button and entering a file name. The action of the filter can be undone by pressing the Undo button. Again, this applies to the version of the data held in memory.
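The effect of the attributeIndices and InvertSelection options can be illustrated with a small Python sketch. This is not WEKA's implementation; the function `remove_attributes` is hypothetical, the single data row is made up, and the sketch assumes that attribute indices are counted from 1, as the example with index 3 suggests.

```python
def remove_attributes(names, rows, indices, invert_selection=False):
    """Mimic the described behaviour of Remove on an in-memory table.

    `indices` are 1-based attribute numbers (an assumption of this sketch).
    With invert_selection=False the listed attributes are dropped;
    with invert_selection=True only the listed attributes are kept.
    """
    selected = {i - 1 for i in indices}
    if invert_selection:
        keep = [i for i in range(len(names)) if i in selected]
    else:
        keep = [i for i in range(len(names)) if i not in selected]
    new_names = [names[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_names, new_rows

names = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [["v1", "v2", "v3", "v4", "v5"]]  # one made-up row, for illustration

reduced_names, reduced_rows = remove_attributes(names, rows, [3])
print(reduced_names)  # the third attribute is gone
kept_names, _ = remove_attributes(names, rows, [3], invert_selection=True)
print(kept_names)     # only the third attribute remains
```

The function returns new lists and leaves `names` and `rows` untouched, which loosely mirrors the point that the original file is not changed until the data is saved.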
What we have described illustrates how filters in WEKA are applied to data. However, in the particular case of Remove, there is a simpler way of achieving the same effect. Instead of invoking the Remove filter, attributes can be selected using the small boxes in the Attributes sub-panel and removed using the Remove button that appears at the bottom, below the list of attributes.
3.4.1 Exercises
Ex. 6: Ensure that the weather.nominal dataset is loaded. Use the filter weka.filters.unsupervised.instance.RemoveWithValues to remove all instances in which the ‘humidity’ attribute has the value ‘high’. To do this, first make the field next to the Choose button show the text ‘RemoveWithValues’. Then click on it to get the GenericObjectEditor window and figure out how to change the filter settings appropriately.
Ex. 7: Undo the change to the dataset that you just performed, and verify that the data is back in its original state.
4 The Visualize panel
We now take a look at WEKA’s data visualization facilities. These work best with numeric data, so we use the iris data.
First, load iris.arff. This data contains flower measurements. Each instance is classified as one of three types: iris-setosa, iris-versicolor and iris-virginica. The dataset has 50 examples of each type: 150 instances in all.
Click the Visualize tab to bring up the visualization panel. It shows a grid containing 25 two-dimensional scatter plots, with every possible combination of the five attributes of the iris data on the x and y axes. Clicking the first plot in the second row opens up a window showing an enlarged plot using the selected axes. Instances are shown as little crosses whose color depends on the instance’s class. The x axis shows the ‘sepallength’ attribute, and the y axis shows ‘petalwidth’.
Clicking on one of the crosses opens up an Instance Info window, which lists the values of all attributes for the selected instance. Close the Instance Info window again.
The selection fields at the top of the window that contains the scatter plot can be used to change the attributes used for the x and y axes. Try changing the x axis to ‘petalwidth’ and the y axis to ‘petallength’. The field showing “Colour: class (Num)” can be used to change the color coding.
Each of the colorful little bar-like plots to the right of the scatter plot window represents a single attribute. Clicking a bar uses that attribute for the x axis of the scatter plot. Right-clicking a bar does the same for the y axis. Try to change the x and y axes back to ‘sepallength’ and ‘petalwidth’ using these bars.
The Jitter slider displaces the cross for each instance randomly from its true position, and can reveal situations where instances lie on top of one another. Experiment a little by moving the slider.
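The idea behind jitter can be sketched in a few lines of Python. How WEKA actually computes the displacement is not described in this tutorial, so the uniform noise, the `amount` parameter and the `jitter` function below are assumptions made purely for illustration.

```python
import random

def jitter(points, amount, seed=0):
    """Displace each (x, y) point by a random offset of at most `amount`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [(x + rng.uniform(-amount, amount),
             y + rng.uniform(-amount, amount)) for x, y in points]

# Three hypothetical instances that share one position and overlap when plotted.
points = [(1.0, 2.0), (1.0, 2.0), (1.0, 2.0)]
jittered = jitter(points, amount=0.1)

print(len(set(points)), "distinct position(s) before jitter")
print(len(set(jittered)), "distinct position(s) after jitter")
```

Without jitter the three instances would be drawn as a single cross; after jittering they are spread over a small neighbourhood of the true position and become visible individually.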
The Select Instance button and the Reset, Clear and Save buttons let you change the dataset. Certain instances can be selected and the others removed. Try the Rectangle option: select an area by left-clicking and dragging the mouse. The Reset button now changes into a Submit button. Click it, and all instances outside the rectangle are deleted. You could use Save to save the modified dataset to a file, while Reset restores the original dataset.
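The Rectangle selection described above keeps the instances whose plotted coordinates fall inside the dragged area. A minimal Python sketch of that logic follows; the function `select_rectangle` and the coordinate values are hypothetical, and treating the border as inside the rectangle is an assumption of the sketch.

```python
def select_rectangle(points, x_min, x_max, y_min, y_max):
    """Keep only the points inside the rectangle (borders included)."""
    return [(x, y) for x, y in points
            if x_min <= x <= x_max and y_min <= y <= y_max]

# Made-up (x, y) positions of four instances in the scatter plot.
original = [(5.1, 0.2), (7.0, 1.4), (6.3, 2.5), (4.9, 0.2)]

# "Submit": instances outside the rectangle are dropped.
current = select_rectangle(original, 4.5, 5.5, 0.0, 0.5)
print(current)

# "Reset": go back to the full, original set of instances.
restored = list(original)
print(restored)
```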
5 The Classify panel
Now you know how to load a dataset from a file and visualize it as two-dimensional plots. In this section we apply a classification algorithm (called a “classifier” in WEKA) to the data. The classifier builds (“learns”) a classification model from the data.
In WEKA, all schemes for predicting the value of a single attribute based on the values of some other attributes are called “classifiers,” even if they are used to predict a numeric target, whereas other people often describe such situations as “numeric prediction” or “regression.” The reason is that, in the context of machine learning, numeric prediction has historically been called “classification with continuous classes.”
Before getting started, load the weather data again. Go to the Preprocess panel, click the Open file button, and select weather.nominal.arff from the data directory. Then switch to the classification panel by clicking the Classify tab at the top of the window. The result is shown in Figure 3.
Figure 3: The Classify panel.
5.1 Using the C4.5 classifier
A popular machine learning method for data mining is the C4.5 algorithm, which builds decision trees. In WEKA, it is implemented in a classifier called “J48.” Choose the J48 classifier by clicking the Choose button near the top of the Classifier tab. A dialogue window appears showing various types of classifier. Click the trees entry to reveal its subentries, and click J48 to choose the J48 classifier. Note that classifiers, like filters, are organized in a hierarchy: J48 has the full name weka.classifiers.trees.J48.
The classifier is shown in the text box next to the Choose button: it now reads J48 -C 0.25 -M 2. The text after “J48” gives the default parameter settings for this classifier. We can ignore these, because they rarely require changing to obtain good performance from C4.5.
Decision trees are a special type of classification model. Ideally, models should be able to predict the class values of new, previously unseen instances with high accuracy. In classification tasks, accuracy is often measured as the percentage of correctly classified instances. Once a model has been learned, we should test it to see how accurate it is when classifying instances.
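Accuracy as defined above is simply the share of instances whose predicted class matches the actual class, expressed as a percentage. A minimal Python sketch, with a hypothetical function name and made-up labels:

```python
def accuracy(predicted, actual):
    """Percentage of instances whose predicted class equals the actual class."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Made-up class labels for four test instances.
actual = ["yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "yes"]

result = accuracy(predicted, actual)
print(result)  # 3 of the 4 predictions are correct
```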
One option in WEKA is to evaluate performance on the training set, that is, the data that was used to build the classifier. This is NOT generally a good idea because it leads to unrealistically optimistic performance estimates. You can easily get 100% accuracy on the training data by simple rote learning, but this tells us nothing about performance on new data that might be encountered when the model is applied in practice. Nevertheless, for illustrative purposes it is instructive to consider performance on the training data.
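The point about rote learning can be made concrete with a toy example. The sketch below is a hypothetical "classifier" that only memorizes its training instances; the class `RoteLearner`, the helper `accuracy_of` and all data values are invented for illustration and have nothing to do with WEKA's classifiers.

```python
class RoteLearner:
    """Memorizes training instances; guesses the majority class otherwise."""

    def fit(self, instances, labels):
        self.table = dict(zip(instances, labels))
        # Fallback for unseen instances: the most frequent training class.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, instance):
        return self.table.get(instance, self.default)

def accuracy_of(model, instances, labels):
    """Percentage of instances the model classifies correctly."""
    correct = sum(model.predict(x) == y for x, y in zip(instances, labels))
    return 100.0 * correct / len(labels)

# Made-up training data and made-up previously unseen data.
train_x = [("a", 1), ("b", 2), ("c", 3), ("d", 4)]
train_y = ["yes", "no", "yes", "yes"]
new_x = [("e", 5), ("f", 6)]
new_y = ["no", "no"]

model = RoteLearner().fit(train_x, train_y)
train_acc = accuracy_of(model, train_x, train_y)
new_acc = accuracy_of(model, new_x, new_y)
print(train_acc, new_acc)
```

The memorizer is perfect on the data it has seen and, in this contrived case, wrong on every new instance, which is why training-set accuracy says little about performance in practice.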
In WEKA, the data that is loaded using the Preprocess panel is the “training data.” To evaluate on the training set, choose Use training set from the Test options panel in the Classify panel. Once the test strategy has been set, the classifier is built and evaluated by pressing the Start button. This processes the training set using the currently selected learning algorithm, C4.5 in this case. Then it classifies all the instances in the training data (because this is the evaluation option that has been chosen) and outputs performance statistics. These are shown in Figure 4.
5.2 Interpreting the output
The outcome of training and testing appears in the Classifier output box on the right. You can scroll through the text to examine it. First, look at the part that describes the decision tree that was generated:
J48 pruned tree
------------------

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8

Figure 4: Output after building and testing the classifier.
This represents the decision tree that was built, including the number of instances that fall under each leaf. The textual representation is clumsy to interpret, but WEKA can generate an equivalent graphical representation. You may have noticed that each time the Start button is pressed and a new classifier is built and evaluated, a new entry appears in the Result List panel in the lower left corner of Figure 4. To see the tree, right-click on the trees.J48 entry that has just been added to the result list, and choose Visualize tree. A window pops up that shows the decision tree in the form illustrated in Figure 5. Right-click a blank spot in this window to bring up a new menu enabling you to auto-scale the view, or force the tree to fit into view. You can pan around by dragging the mouse.
11
Figure 5: The decision tree that has been built.
This tree is used to classify test instances. The first condition is the one in the so-called “root” node at the top. In this case, the ‘outlook’ attribute is tested at the root node and, depending on the outcome, testing continues down one of the three branches. If the value is ‘overcast’, testing ends and the predicted class is ‘yes’. The rectangular nodes are called “leaf” nodes, and give the class that is to be predicted. Returning to the root node, if the ‘outlook’ attribute has value ‘sunny’, the ‘humidity’ attribute is tested, and if ‘outlook’ has value ‘rainy’, the ‘windy’ attribute is tested. No paths through this particular tree have more than two tests.
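The way the tree is read can also be written down as nested tests. The following is a minimal Python sketch of that reading, not anything WEKA itself produces: the function name classify_weather and the dictionary representation of an instance are assumptions made for illustration, and the ‘windy’ value is represented as a Python boolean.

```python
def classify_weather(instance):
    """Follow the weather tree from the root node down to a leaf.

    `instance` is a dict of attribute values (a hypothetical representation).
    """
    outlook = instance["outlook"]          # test at the root node
    if outlook == "overcast":
        return "yes"                       # leaf: no further test needed
    if outlook == "sunny":
        # second test on this branch: humidity
        return "no" if instance["humidity"] == "high" else "yes"
    if outlook == "rainy":
        # second test on this branch: windy
        return "no" if instance["windy"] else "yes"
    raise ValueError(f"unknown outlook value: {outlook!r}")


example = {"outlook": "rainy", "temperature": "mild",
           "humidity": "normal", "windy": False}
print(classify_weather(example))  # rainy, then windy = FALSE, so "yes"
```

Note that ‘temperature’ is never consulted: the tree does not test it on any path.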
Now let us consider the remainder of the information in the Classifier output area. The next two parts of the output report on the quality of the classification model based on the testing option we have chosen.
The following states how many and what proportion of test instances have been correctly classified:
Correctly Classified Instances 14 100%
This is the accuracy of the model on the data used for testing. It is completely accurate (100%), which is often the case when the training set is used for testing. There are some other performance measures in the text output area, which we won’t discuss here.
At the bottom of the output is the confusion matrix:
=== Confusion Matrix ===
a b <-- classified as
9 0 | a = yes
0 5 | b = no
Each element in the matrix is a count of instances. Rows represent the true classes, and columns represent the predicted classes. As you can see, all 9 ‘yes’ instances have been predicted as yes, and all 5 ‘no’ instances as no.
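To make the link between the confusion matrix and the accuracy figure concrete, here is a small Python sketch that builds such a matrix from lists of true and predicted labels. The function names are assumptions for illustration; the label lists simply reproduce the counts reported above (9 ‘yes’ and 5 ‘no’, all predicted correctly).

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are the true classes, columns the predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


actual = ["yes"] * 9 + ["no"] * 5
predicted = list(actual)                 # every instance classified correctly
matrix = confusion_matrix(actual, predicted, ["yes", "no"])
print(matrix)            # [[9, 0], [0, 5]]
print(accuracy(matrix))  # 1.0
```

Any misclassified instance would show up as a non-zero count off the diagonal.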
5.2.1 Exercise
Ex. 8: How would the following instance be classified using the decision tree?
outlook = sunny, temperature = cool, humidity = high, windy = TRUE
5.3 Setting the testing method
When the Start button is pressed, the selected learning algorithm is started and the dataset that was loaded in the Preprocess panel is used to train a model. A model built from the full training set is then printed into the Classifier output area: this may involve running the learning algorithm one final time.
The remainder of the output in the Classifier output area depends on the test protocol that was chosen using Test options. The Test options box gives several possibilities for evaluating classifiers:
Use training set: Uses the same dataset that was used for training (the one that was loaded in the Preprocess panel). This is the option we used above. It is generally NOT recommended because it gives over-optimistic performance estimates.
Supplied test set: Lets you select a file containing a separate dataset that is used exclusively for testing.
Cross-validation: This is the default option, and the most commonly used one. It first splits the training set into disjoint subsets called “folds.” The number of subsets can be entered in the Folds field. Ten is the default, and in general gives better estimates than other choices. Once the data has been split into folds of (approximately) equal size, all but one of the folds are used for training and the remaining, left-out, one is used for testing. This involves building a new model from scratch from the corresponding subset of data and evaluating it on the left-out fold. Once this has been done for the first test fold, a new fold is selected for testing and the remaining folds are used for training. This is repeated until all folds have been used for testing. In this way each instance in the full dataset is used for testing exactly once, and an instance is only used for testing when it is not used for training. WEKA’s cross-validation is a stratified cross-validation, which means that the class proportions are preserved when dividing the data into folds: each class is represented by roughly the same number of instances in each fold. This gives slightly improved performance estimates compared to unstratified cross-validation.
Percentage split: Shuffles the data randomly and then splits it into a training and a test set according to the proportion specified. In practice, this is a good alternative to cross-validation if the size of the dataset makes cross-validation too slow.
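The fold construction described under Cross-validation can be sketched in a few lines of Python. This is only an illustration of the idea of stratification, under the assumption that dealing each class's instances out to the folds in turn is an acceptable way to preserve class proportions; it is not a description of WEKA's internal procedure, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=1):
    """Split instance indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(n_folds)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for i in indices:                 # deal instances out like cards
            folds[position % n_folds].append(i)
            position += 1
    return folds


def cross_validation_splits(labels, n_folds, seed=1):
    """Yield (train, test) index lists; each fold is the test set once."""
    folds = stratified_folds(labels, n_folds, seed)
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


labels = ["a"] * 50 + ["b"] * 50 + ["c"] * 50   # three equally sized classes
train, test = next(cross_validation_splits(labels, 10))
print(len(train), len(test))  # 135 15
```

Every instance appears in exactly one test fold, and never in the training set of the split in which it is tested.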
The first two testing methods, evaluation on the training set and using a supplied test set, involve building a model only once. Cross-validation involves building a model N + 1 times, where N is the chosen number of folds. The first N times, a fraction (N − 1)/N (90% for ten-fold cross-validation) of the data is used for training, and the final time the full dataset is used. The percentage split method involves building the model twice, once on the reduced dataset and again on the full dataset.
5.3.1 Exercise
Ex. 9: Load the iris data using the Preprocess panel. Evaluate C4.5 on this data using (a) the training set and (b) cross-validation. What is the estimated percentage of correct classifications for (a) and (b)? Which estimate is more realistic?
5.4 Visualizing classification errors
WEKA’s Classify panel provides a way of visualizing classification errors. To do this, right-click the trees.J48 entry in the result list and choose Visualize classifier errors. A scatter plot window pops up. Instances that have been classified correctly are marked by little crosses, whereas ones that have been classified incorrectly are marked by little squares.
5.4.1 Exercise
Ex. 10: Use the Visualize classifier errors function to find the wrongly classified test instances for the cross-validation performed in Exercise 9. What can you say about the location of the errors?
6 Answers To Exercises
1. Hot, mild and cool.
2. The iris dataset has 150 instances and 5 attributes. So far we have only seen nominal values, but the attribute ‘petallength’ is a numeric attribute and contains numeric values. In this dataset the values for this attribute lie between 1.0 and 6.9 (see Minimum and Maximum in the right panel).
3. The first column is the number given to an instance when it is loaded from the ARFF file. It corresponds to the order of the instances in the file.
4. The class value of this instance is ‘no’. The row with the number 8 in the first column is the instance with instance number 8.
5. This can be easily seen in the Viewer window. The iris dataset has four numeric attributes and one nominal attribute. The nominal attribute is the class attribute.
6. Select the RemoveWithValues filter after clicking the Choose button. Click on the field that is located next to the Choose button and set the field attributeIndex to 3 and the field nominalIndices to 1. Press OK and Apply.
7. Click the Undo button.
8. The test instance would be classified as ‘no’.
9. Percent correct on the training data is 98%. Percent correct under cross-validation is 96%. The cross-validation estimate is more realistic.
10. The errors are located at the class boundaries.
Tutorial 2: Nearest Neighbor Learning and Decision Trees
Eibe Frank and Ian H. Witten
May 5, 2011
2006-2012
1 Introduction
In this tutorial you will experiment with nearest neighbor classification and decision tree learning. For most of it we use a real-world forensic glass classification dataset.
We begin by taking a preliminary look at this dataset. Then we examine the effect of selecting different attributes for nearest neighbor classification. Next we study class noise and its impact on predictive performance for the nearest neighbor method. Following that we vary the training set size, both for nearest neighbor classification and decision tree learning. Finally, you are asked to interactively construct a decision tree for an image segmentation dataset.
Before continuing with this tutorial you should review in your mind some aspects of the classification task:
• How is the accuracy of a classifier measured?
• What are irrelevant attributes in a data set, and can additional attributes be harmful?
• What is class noise, and how would you measure its effect on learning?
• What is a learning curve?
• If you, personally, had to invent a decision tree classifier for a particular dataset, how would you go about it?
2 The glass dataset
The glass dataset glass.arff from the US Forensic Science Service contains data on six types of glass. Glass is described by its refractive index and the chemical elements it contains, and the aim is to classify different types of glass based on these features. This dataset is taken from the UCI data sets, which have been collected by the University of California at Irvine and are freely available on the World Wide Web. They are often used as a benchmark for comparing data mining algorithms.
Find the dataset glass.arff and load it into the WEKA Explorer. For your own information, answer the following questions, which review material covered in Tutorial 1.
Ex. 1: How many attributes are there in the glass dataset? What are their names? What is the class attribute?
Run the classification algorithm IBk (weka.classifiers.lazy.IBk). Use cross-validation to test its performance, leaving the number of folds at the default value of 10. Recall that you can examine the classifier options in the GenericObjectEditor window that pops up when you click the text beside the Choose button. The default value of the KNN field is 1: this sets the number of neighboring instances to use when classifying.
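The role of the number of neighbors can be illustrated with a bare-bones k-nearest-neighbor classifier. This is a simplified Python sketch, assuming plain Euclidean distance and an unweighted majority vote; IBk has further options that are not modelled here, and the function name and toy data are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(train, query, k=1):
    """Predict the majority class among the k training instances nearest to query.

    `train` is a list of (feature_tuple, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


train = [((1.0, 1.0), "a"), ((1.2, 0.9), "a"),
         ((3.0, 3.1), "b"), ((3.2, 2.9), "b"), ((2.9, 3.3), "b")]
query = (1.1, 1.0)
print(knn_classify(train, query, k=1))  # a
print(knn_classify(train, query, k=5))  # b
```

With k = 5 every training instance votes, so the larger class wins even though the query lies right next to the two ‘a’ instances. This is one way in which the choice of k can change a prediction.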
Ex. 2: What is the accuracy of IBk (given in the Classifier output box)?
Run IBk again, but increase the number of neighboring instances to k = 5 by entering this value in the KNN field. Here and throughout this tutorial, continue to use cross-validation as the evaluation method.
Ex. 3: What is the accuracy of IBk with 5 neighboring instances (k = 5)?
3 Attribute selection for glass classification
Now we find what subset of attributes produces the best cross-validated classification accuracy for the IBk nearest neighbor algorithm with k = 1 on the glass dataset. WEKA contains automated attribute selection facilities, which we examine in a later tutorial, but it is instructive to do this manually.
Performing an exhaustive search over all possible subsets of the attributes is infeasible (why?), so we apply a procedure called “backwards selection.” To do this, first consider dropping each attribute individually from the full dataset consisting of nine attributes (plus the class), and run a cross-validation for each reduced version. Once you have determined the best 8-attribute dataset, repeat the procedure with this reduced dataset to find the best 7-attribute dataset, and so on.
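The backwards selection procedure can be summarized as a loop. The Python sketch below assumes that some function evaluate(subset) returns the cross-validated accuracy for a given attribute subset; in the tutorial that number comes from running IBk in the Explorer by hand. The attribute names a1 to a4 and the scoring function toy_score are invented purely so that the sketch runs on its own.

```python
def backward_selection(attributes, evaluate):
    """Repeatedly drop the attribute whose removal gives the best score.

    Returns one (subset, score) pair per subset size, largest subset first.
    """
    current = list(attributes)
    history = [(tuple(current), evaluate(current))]
    while current:
        candidates = []
        for dropped in current:
            subset = [a for a in current if a != dropped]
            candidates.append((evaluate(subset), subset))
        best_score, best_subset = max(candidates, key=lambda c: c[0])
        current = best_subset
        history.append((tuple(current), best_score))
    return history


def toy_score(subset):
    """Hypothetical stand-in for a cross-validation run: a1 and a3 help, others hurt."""
    useful = {"a1", "a3"}
    return (sum(1.0 for a in subset if a in useful)
            - 0.1 * sum(1 for a in subset if a not in useful))


for subset, score in backward_selection(["a1", "a2", "a3", "a4"], toy_score):
    print(len(subset), subset, round(score, 2))
```

Each pass of the loop corresponds to one row of a table of results by subset size: the best subset of that size found by the greedy search, and its score.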
Ex. 4: Record in Table 1 the best attribute set and the greatest accuracy obtained in each iteration.
Table 1: Accuracy obtained using IBk, for different attribute subsets

Subset size Attributes in “best” subset Classification accuracy
9 attributes
8 attributes
7 attributes
6 attributes
5 attributes
4 attributes
3 attributes
2 attributes
1 attribute
0 attributes
The best accuracy obtained in this process is quite a bit higher than the accuracy obtained on the full dataset.
Ex. 5: Is this best accuracy an unbiased estimate of accuracy on future data? Be sure to explain your answer.
(Hint: to obtain an unbiased estimate of accuracy on future data, we must not look at the test data at all when producing the classification model for which we want to obtain the estimate.)
4 Class noise and nearest-neighbor learning
Nearest-neighbor learning, like other techniques, is sensitive to noise in the training data. In this section we inject varying amounts of class noise into the training data and observe the effect on classification performance.
You can flip a certain percentage of class labels in the data to a randomly chosen other value using an unsupervised attribute filter called AddNoise, in weka.filters.unsupervised.attribute. However, for our experiment it is important that the test data remains unaffected by class noise.
Filtering the training data without filtering the test data is a common requirement, and is achieved using a “meta” classifier called FilteredClassifier, in weka.classifiers.meta. This meta classifier should be configured to use IBk as the classifier and AddNoise as the filter. The FilteredClassifier applies the filter to the data before running the learning algorithm. This is done in two batches: first the training data and then the test data. The AddNoise filter only adds noise to the first batch of data it encounters, which means that the test data passes through unchanged.
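The effect of adding class noise to the training data only can be mimicked in a few lines of Python. This is a rough sketch of the idea, not the AddNoise implementation: the function name, its parameters, and the toy label lists are assumptions for illustration.

```python
import random


def add_class_noise(labels, percent, classes, seed=1):
    """Return a copy of labels in which `percent` percent are flipped
    to a randomly chosen different class value."""
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(len(labels) * percent / 100)
    for i in rng.sample(range(len(labels)), n_flip):
        others = [c for c in classes if c != noisy[i]]
        noisy[i] = rng.choice(others)      # always a different value
    return noisy


classes = ["a", "b", "c"]
train_labels = ["a", "b", "c", "a", "b", "c", "a", "b", "c", "a"]
test_labels = ["a", "b", "c"]

# Noise is applied to the training labels only; the test labels are left alone.
noisy_train = add_class_noise(train_labels, 30, classes)
changed = sum(old != new for old, new in zip(train_labels, noisy_train))
print(changed)  # 3
```

Keeping the test labels clean is what makes the measured accuracy reflect the damage done to the learned model rather than damage done to the evaluation itself.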
Table 2: Effect of class noise on IBk, for different neighborhood sizes

Percent noise k=1 k=3 k=5
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Ex. 6: Reload the original glass dataset, and record in Table 2 the cross-validated accuracy estimate of IBk for 10 different percentages of class noise and neighborhood sizes k = 1, k = 3, k = 5 (determined by the value of k in the k-nearest-neighbor classifier).
Ex. 7: What is the effect of increasing the amount of class noise?
Ex. 8: What is the effect of altering the value of k?
5 Varying the amount of training data
In this section we consider “learning curves,” which show the effect of gradually increasing the amount of training data. Again we use the glass data, but this time with both IBk and the C4.5 decision tree learner, implemented in WEKA as J48.
To obtain learning curves, use the FilteredClassifier again, this time in conjunction with weka.filters.unsupervised.instance.Resample, which extracts a certain specified percentage of a given dataset and returns the reduced dataset (see footnote 1). Again this is done only for the first batch to which the filter is applied, so the test data passes unmodified through the FilteredClassifier before it reaches the classifier.
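The subsampling step behind a learning curve can be sketched as follows. This is a minimal Python illustration, assuming sampling with replacement as the footnote to this section describes; the function name resample and the use of plain integers as stand-ins for training instances are assumptions, not part of WEKA.

```python
import random


def resample(instances, percent, seed=1):
    """Draw `percent` percent of the instances, sampling with replacement."""
    rng = random.Random(seed)
    size = round(len(instances) * percent / 100)
    return [rng.choice(instances) for _ in range(size)]


training_data = list(range(200))          # stand-in for 200 training instances
for percent in range(10, 101, 10):
    reduced = resample(training_data, percent)
    # A classifier would be trained on `reduced` and evaluated on the
    # untouched test data; here we only report the training set size.
    print(percent, len(reduced))
```

Plotting the accuracy obtained at each percentage against the training set size gives the learning curve.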
Ex. 9: Record in Table 3 the data for learning curves for both the one-nearest-neighbor classifier (i.e., IBk with k = 1) and J48.
1 This filter performs sampling with replacement, rather than sampling without replacement, but the effect is minor and we will ignore it here.
Table 3: Effect of training set size on IBk and J48
Percentage of training set IBk J48
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Ex. 10: What is the effect of increasing the amount of training data?
Ex. 11: Is this effect more pronounced for IBk or J48?
6 Interactive decision tree construction
One of WEKA’s classifiers is interactive: it lets the user (i.e., you!) construct your own classifier. Here’s a competition: let’s see who can build a classifier with the highest predictive accuracy!
Load the file segment-challenge.arff (in the data folder that comes with the WEKA distribution). This dataset has 20 attributes and 7 classes. It is an image segmentation problem, and the task is to classify images into seven different groups based on properties of the pixels.
Set the classifier to UserClassifier, in the weka.classifiers.trees
package. We will use a supplied test set (performing cross-validation with
the user classifier is incredibly tedious!). In the Test options box,
choose the Supplied test set option and click the Set... button. A small
window appears in which you choose the test set. Click Open file... and
browse to the file segment-test.arff (also in the WEKA distribution’s data
folder). On clicking Open, the small window updates to show the number of
instances (810) and attributes (20); close it.
Click Start. The behaviour of UserClassifier differs from all other
classifiers. A special window appears and WEKA waits for you to use it to
build your own classifier. The tabs at the top of the window switch between
two views of the classifier. The Tree visualizer view shows the current
state of your tree, and each node shows how many instances of each class it
holds. The aim is to come up with a tree where the leaf nodes are as pure
as possible. To begin with, the tree has just one node, the root node,
containing all the data. More nodes will appear when you proceed to split
the data in the Data visualizer view.
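The tutorial does not define purity numerically. One common way to quantify
it, shown here as a minimal sketch, is the share of a node's instances that
belong to its majority class. The function names and the class labels below
are illustrative assumptions, not part of WEKA or of segment-challenge.arff.

```python
from collections import Counter

def class_counts(labels):
    """Count how many instances of each class reach a node."""
    return Counter(labels)

def purity(labels):
    """Fraction of instances at a node that belong to its majority class."""
    counts = class_counts(labels)
    if not counts:
        return 0.0
    return max(counts.values()) / sum(counts.values())

# Hypothetical root node holding a mix of classes (labels are made up).
root = ["sky"] * 4 + ["grass"] * 3 + ["path"] * 3
# Hypothetical leaf produced by a good split.
leaf = ["sky"] * 4

print(purity(root))  # majority share at the root
print(purity(leaf))  # a pure leaf has purity 1.0
```

A leaf whose purity is close to 1.0 predicts its majority class with few
errors on the training data, which is what the hand-built tree aims for.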
Click the Data visualizer tab to see a 2D plot in which the data points are
colour coded by class. You can change the attributes used for the axes
either with the X and Y drop-down menus at the top, or by left-clicking
(for X) or right-clicking (for Y) the horizontal strips to the right of the
plot area. These strips show the spread of instances along each particular
attribute.
You need to try different combinations of X and Y axes to get the clearest
separation you can find between the colours. Having found a good
separation, you then need to select a region in the plot: this will create
a branch in your tree. Here is a hint to get you started: plot
region-centroid-row on the X-axis and intensity-mean on the Y-axis. You
will see that the red class (‘sky’) is nicely separated from the rest of
the classes at the top of the plot.
There are three tools for selecting regions in the graph, chosen using the
drop-down menu below the Y-axis selector:
1. Rectangle allows you to select points by dragging a rectangle around
them.
2. Polygon allows you to select points by drawing a free-form polygon.
Left-click to add vertices; right-click to complete the polygon. The
polygon will be closed off by connecting the first and last points.
3. Polyline allows you to select points by drawing a free-form polyline.
Left-click to add vertices; right-click to complete the shape. The
resulting shape is open, as opposed to the polygon which is closed.
When you have selected an area using any of these tools, it turns gray.
Clicking the Clear button cancels the selection without affecting the
classifier. When you are happy with the selection, click Submit. This
creates two new nodes in the tree, one holding all the instances covered by
the selection and the other holding all remaining instances. These nodes
correspond to a binary split that performs the chosen geometric test.
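To make the idea of a geometric test concrete, here is a minimal sketch of
a binary split based on a closed polygon. It uses the standard ray-casting
test for point-in-polygon. This is an illustration under assumptions, not
WEKA's implementation: the function names, the data points and the
selection are all hypothetical.

```python
def point_in_polygon(x, y, vertices):
    """Ray-casting test: True if (x, y) lies inside the closed polygon."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # the last vertex connects back to the first
        # Does a horizontal ray going right from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def split_by_selection(instances, vertices):
    """Binary split: instances covered by the selection versus all the rest."""
    covered, remaining = [], []
    for inst in instances:
        x, y, _label = inst
        (covered if point_in_polygon(x, y, vertices) else remaining).append(inst)
    return covered, remaining

# Hypothetical (x, y, class) points and a selection drawn near the top of the plot.
data = [(1.0, 9.0, "sky"), (2.0, 8.5, "sky"), (1.5, 2.0, "grass"), (3.0, 1.0, "path")]
selection = [(0.0, 8.0), (4.0, 8.0), (4.0, 10.0), (0.0, 10.0)]

covered, remaining = split_by_selection(data, selection)
print([label for _, _, label in covered])
print([label for _, _, label in remaining])
```

Each call to split_by_selection corresponds to one Submit: the covered
instances go to one child node and the remaining instances to the other.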
Switch back to the Tree visualizer view to examine the change in the tree.
Clicking on different nodes alters the subset of data that is shown in the
Data visualizer section. Continue adding nodes until you obtain a good
separation of the classes, that is, until the leaf nodes in the tree are
mostly pure. Remember, however, that you do not want to overfit the data,
because your tree will be evaluated on a separate test set.
When you are satisfied with the tree, right-click any blank space in the
Tree visualizer view and choose Accept The Tree. WEKA evaluates your tree
against the test set and outputs statistics that show how well you did.
You are competing for the best accuracy score of a hand-built
UserClassifier produced on the ‘segment-challenge’ dataset and tested on
the ‘segment-test’ set. Try as many times as you like. A good score is
anything close to 90% correct or better. Run J48 on the data to see how
well an automatic decision tree learner performs on the task.
Ex. 12: When you think you have a good score, right-click the corresponding
entry in the Result list, save the output using Save result buffer, and
copy it into your answer for this tutorial.
Tutorial 3: Classification Boundaries
Eibe Frank and Ian H. Witten
May 5, 2011
2008-2012
1 Introduction
In this tutorial you will look at the classification boundaries that are
produced by different types of models. To do this, we use WEKA’s
BoundaryVisualizer. This is not part of the WEKA Explorer that we have been
using so far. Start up the WEKA GUI Chooser as usual from the Windows START
menu (on Linux or the Mac, double-click weka.jar or weka.app). From the
Visualization menu at the top, select BoundaryVisualizer.
The boundary visualizer shows two-dimensional plots of the data, and is
most appropriate for datasets with two numeric attributes. We will use a
version of the iris data without the first two attributes. To create this
from the standard iris data, start up the Explorer, load iris.arff using
the Open file button and remove the first two attributes (‘sepallength’ and
‘sepalwidth’) by selecting them and clicking the Remove button that appears
at the bottom. Then save the modified dataset to a file (using Save)
called, say, iris.2D.arff.
Now leave the Explorer and open this file for visualization using the
boundary visualizer’s Open File... button. Initially, the plot just shows
the data in the dataset.1
2 Visualizing 1R
Just plotting the data is nothing new. The real purpose of the boundary
visualizer is to show the predictions of a given model for each location in
space. The points representing the data are color coded based on the
prediction the model generates. We will use this functionality to
investigate the decision boundaries that different classifiers generate for
the reduced iris dataset.
1 There is a bug in the initial visualization. To get a true plot of the
data, select a different attribute for either the x or y axis by clicking
the appropriate button.
We start with the 1R rule learner. Use the Choose button of the boundary
visualizer to select weka.classifiers.rules.OneR. Make sure you tick Plot
training data, otherwise only the predictions will be plotted. Then hit the
Start button. The program starts plotting predictions in successive scan
lines. Hit the Stop button once the plot has stabilized (as soon as you
like, in this case), and the training data will be superimposed on the
boundary visualization.
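The scan-line plotting described above can be sketched as a loop that asks
a model for its prediction at every pixel location. The following is a
minimal text-mode illustration, not the BoundaryVisualizer's own code. The
rule, its thresholds and the class names "A", "B" and "C" are hypothetical
stand-ins for a model that tests a single attribute.

```python
def one_attribute_rule(x, y):
    """Hypothetical rule that tests only the y attribute (thresholds are made up)."""
    if y < 1.0:
        return "A"
    if y < 2.0:
        return "B"
    return "C"

def render_boundaries(predict, x_range, y_range, width=12, height=6):
    """Ask the model for a prediction at the centre of every grid cell."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    rows = []
    for r in range(height):
        # Row 0 is the top of the plot, so y decreases as r grows.
        y = y_max - (r + 0.5) * (y_max - y_min) / height
        row = ""
        for c in range(width):
            x = x_min + (c + 0.5) * (x_max - x_min) / width
            row += predict(x, y)
        rows.append(row)
    return rows

# One character per "pixel"; each line printed is one scan line.
for line in render_boundaries(one_attribute_rule, (0.0, 7.0), (0.0, 3.0)):
    print(line)
```

Any function that maps an (x, y) location to a class can be passed as
predict, so the same loop applies to every classifier in this tutorial.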
Ex. 1: Explain the plot based on what you know about 1R. (Hint: use the
Explorer to look at the rule set that 1R generates for this data.)
Ex. 2: Study the effect of the minBucketSize parameter on the classifier by
regenerating the plot with values of 1, and then 20, and then some critical
values in between. Describe what you see, and explain it. (Hint: you could
speed things up by using the Explorer to look at the rule sets.)
3 Visualizing nearest-neighbor learning
Now we look at the classification boundaries created by the nearest
neighbor method. Use the boundary visualizer’s Choose... button to select
the IBk classifier (weka.classifiers.lazy.IBk) and plot its decision
boundaries for the reduced iris data.
In WEKA, OneR’s predictions are categorical: for each instance they predict
one of the three classes. In contrast, IBk outputs probability estimates
for each class, and these are used to mix the colors red, green, and blue
that correspond to the three classes. IBk estimates class probabilities by
counting the number of instances of each class in the set of nearest
neighbors of a test case and uses the resulting relative frequencies as
probability estimates. With k = 1, which is the default value, you might
expect there to be only one instance in the set of nearest neighbors of
each test case (i.e. pixel location). Looking at the plot, this is indeed
almost always the case, because the estimated probability is one for almost
all pixels, resulting in a pure color. There is no mixing of colors because
one class gets probability one and the others probability zero.
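The relative-frequency estimate and the color mixing can be sketched as
follows. This is a simplified illustration rather than IBk itself: it
always takes exactly k neighbors by sorted Euclidean distance, the training
points are hypothetical rather than real iris measurements, and the mapping
of classes to red, green and blue is an assumption.

```python
import math
from collections import Counter

CLASSES = ["setosa", "versicolor", "virginica"]  # assumed order: red, green, blue

def knn_probabilities(query, training, k=1):
    """Relative class frequencies among the k training points nearest to query."""
    by_distance = sorted(training, key=lambda inst: math.dist(query, inst[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return [votes[c] / k for c in CLASSES]

def mix_color(probabilities):
    """Blend red, green and blue in proportion to the class probabilities."""
    return tuple(round(255 * p) for p in probabilities)

# Hypothetical 2D training points (not the real iris measurements).
training = [
    ((1.4, 0.2), "setosa"),
    ((1.5, 0.3), "setosa"),
    ((4.5, 1.5), "versicolor"),
    ((4.9, 1.6), "versicolor"),
    ((5.1, 1.9), "virginica"),
    ((5.8, 2.2), "virginica"),
]

pixel = (5.0, 1.7)
print(mix_color(knn_probabilities(pixel, training, k=1)))  # one neighbor: a pure color
print(mix_color(knn_probabilities(pixel, training, k=3)))  # three neighbors: a blend
```

With k = 1 a single class receives all of the probability mass, so the
pixel gets a pure color; larger k lets several classes share it.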
Ex. 3: Nevertheless, there is a small area in the plot where two colors are
in fact mixed. Explain this. (Hint: look carefully at the data using the
Visualize panel in the Explorer.)
Ex. 4: Experiment with different values for k, say 5 and 10. Describe what
happens as k increases.
4 Visualizing naive Bayes
Turn now to the naive Bayes classifier. This assumes that attributes are
conditionally independent given a particular class value. This means that
the overall class probability is obtained by simply multiplying the
per-attribute conditional probabilities together. In other words, with two
attributes, if you know the class probabilities along the x-axis and along
the y-axis, you can calculate the value for any point in space by
multiplying them together. This is easier to understand if you visualize it
as a boundary plot.
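As a small worked example of this multiplication, the sketch below combines
per-attribute conditional probabilities for a single pixel location. In
standard naive Bayes the product also includes the class prior and is then
normalized so that the class probabilities sum to one; both steps are
included here. All numbers and class names are hypothetical.

```python
def naive_bayes_posterior(prior, p_x_given_class, p_y_given_class):
    """Multiply the prior and per-attribute conditionals, then normalise."""
    scores = {
        c: prior[c] * p_x_given_class[c] * p_y_given_class[c]
        for c in prior
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Hypothetical probabilities for one pixel location.
prior = {"red": 0.5, "green": 0.5}
p_x = {"red": 0.8, "green": 0.2}  # P(x interval | class)
p_y = {"red": 0.5, "green": 0.5}  # P(y interval | class)

posterior = naive_bayes_posterior(prior, p_x, p_y)
print(posterior)
```

Because p_y is the same for both classes in this example, only the x
attribute influences the result, which is why the posterior matches the
ratio of the p_x values.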
Plot the predictions of naive Bayes. But first, you need to discretize the
attribute values. By default, NaiveBayes assumes that the attributes are
normally distributed given the class (i.e., they follow a bell-shaped
distribution). You should override this by setting
useSupervisedDiscretization to true using the GenericObjectEditor. This
will cause NaiveBayes to discretize the numeric attributes in the data
using a supervised discretization technique.3
In almost all practical applications of NaiveBayes, supervised
discretization works better than the default method, and that is why we
consider it here. It also produces a more comprehensible visualization.
Ex. 5: The plot that is generated by visualizing the predicted class
probabilities of naive Bayes for each pixel location is quite different
from anything we have seen so far. Explain the patterns in it.
5 Visualizing decision trees and rule sets
Decision trees and rule sets are similar to nearest-neighbor learning in
the sense that they are also quasi-universal: in principle, they can
approximate any decision boundary arbitrarily closely. In this section, we
look at the boundaries generated by JRip and J48.
Generate a plot for JRip, with default options.
Ex. 6: What do you see? Relate the plot to the output of the rules that you
get by processing the data in the Explorer.
Ex. 7: The JRip output assumes that the rules will be executed in the
correct sequence. Write down an equivalent set of rules that achieves the
same effect regardless of the order in which they are executed.
3 The technique used is “supervised” because it takes the class labels of
the instances into account to find good split points for the discretization
intervals.
Generate a plot for J48, with default options.
Ex. 8: What do you see? Relate the plot to the output of the tree that you
get by processing the data in the Explorer.
One way to control how much pruning J48 performs before it outputs its tree
is to adjust the minimum number of instances required in a leaf, minNumObj.
Ex. 9: Suppose you want to generate trees with 3, 2, and 1 leaf nodes
respectively. What are the exact ranges of values for minNumObj that
achieve this, given default values for all other parameters?
6 Messing with the data
With the BoundaryVisualizer you can modify the data by adding or removing
points.
Ex. 10: Introduce some “noise” into the data and study the effect on the
learning algorithms we looked at above. What kind of behavior do you
observe for each algorithm as you introduce more noise?
7 1R revisited
Return to the 1R rule learner on the reduced iris dataset used in Section 2
(not the noisy version you just created). The following questions will
require you to think about the internal workings of 1R. (Hint: it will
probably be fastest to use the Explorer to look at the rule sets.)
Ex. 11: You saw in Section 2 that the plot always has three regions. But
why aren’t there more for small bucket sizes (e.g., 1)? Use what you know
about 1R to explain this apparent anomaly.
Ex. 12: Can you set minBucketSize to a value that results in fewer than
three regions? What is the smallest possible number of regions? What is the
smallest value for minBucketSize that gives you this number of regions?
Explain the result based on what you know about the iris data.
Tutorial 4: Preprocessing and Parameter Tuning
May 5, 2011
2008-2012
1 Introduction
Data preprocessing is often necessary to get data ready for learning. It
may also improve the outcome of the learning process and lead to more
accurate and concise models. The same is true for parameter tuning methods.
In this tutorial we will look at some useful preprocessing techniques,
which are implemented as WEKA filters, as well as a method for automatic
parameter tuning.
2 Discretization
Numeric attributes can be converted into discrete ones by splitting their
ranges into numeric intervals, a process known as discretization. There are
two types of discretization techniques: unsupervised ones, which are “class
blind,” and supervised ones, which take the class value of the instances
into account when creating intervals. The aim with supervised techniques is
to create intervals that are as consistent as possible with respect to the
class labels.
The main unsupervised technique for discretizing numeric attributes in WEKA
is weka.filters.unsupervised.attribute.Discretize. It implements two
straightforward methods: equal-width and equal-frequency discretization.
The first simply splits the numeric range into equal intervals. The second
chooses the width of the intervals so that they contain (approximately) the
same number of instances. The default is to use equal width.
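The two methods can be compared on a small example. The sketch below is a
simplified illustration, not the code of WEKA's Discretize filter: the
function names are hypothetical, the attribute values are made up, and the
exact placement of cut points in WEKA may differ.

```python
def equal_width_cuts(values, bins):
    """Cut points that divide [min, max] into intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def equal_frequency_cuts(values, bins):
    """Cut points placed so each interval holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, bins):
        pos = i * n // bins
        # Put the cut halfway between the two neighbouring values.
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2)
    return cuts

def bin_counts(values, cuts):
    """Number of values falling into each interval defined by the cuts."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[sum(v > c for c in cuts)] += 1
    return counts

# Hypothetical skewed attribute: most values are small, one is large.
values = [1, 2, 2, 3, 3, 4, 5, 6, 30]

for name, cuts in [("equal width", equal_width_cuts(values, 3)),
                   ("equal frequency", equal_frequency_cuts(values, 3))]:
    print(name, cuts, bin_counts(values, cuts))
```

On this skewed attribute the equal-width intervals are dominated by the
single large value, while the equal-frequency intervals adapt their widths
to where the data actually lie.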
Find the glass dataset glass.arff and load it into the Explorer. Apply the
unsupervised discretization filter in the two different modes discussed
above.
Ex. 1: What do you observe when you compare the histograms obtained? Why is
the one for equal-frequency discretization quite skewed for some
attributes?
The main supervised technique for discretizing numeric attributes in WEKA
is weka.filters.supervised.attribute.Discretize. Locate the iris data, load
it in, apply the supervised discretization scheme, and look at the
histograms obtained. Supervised discretization attempts to create intervals
such that the class distributions differ between intervals but are
consistent within intervals.
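One way to make "consistent within intervals" measurable is class entropy.
The sketch below finds the single cut point that minimizes the weighted
class entropy of the two resulting intervals. It is a deliberately reduced
illustration with hypothetical data and function names: a full supervised
discretization method would apply such splits repeatedly and needs a
stopping criterion, which is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return the cut point that minimises the weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut is possible between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score

# Hypothetical attribute values and class labels.
values = [1.0, 1.2, 1.4, 4.0, 4.5, 5.0]
labels = ["a", "a", "a", "b", "b", "b"]

print(best_split(values, labels))
```

A weighted entropy of zero means that every interval contains a single
class, which is the ideal outcome described in the paragraph above.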
Ex. 2: Based on the histograms obtained, which of the discretized
attributes would you consider the most predictive ones?
Reload the glass data and apply supervised discretization to it.
Ex. 3: There is only a single bar in the histograms for some of the
attributes. What does that mean?
Discretized attributes are normally coded as nominal attributes, with one
value per range. However, because the ranges are ordered, a discretized
attribute is actually on an ordinal scale. Both filters also have the
ability to create binary attributes rather than multi-valued ones, by
setting the option makeBinary to true.
Ex. 4: Choose one of the filters and apply it to create binary attributes.
Compare to the output generated when makeBinary is false. What do the
binary attributes represent?
3 More on Discretization
Here we examine the effect of discretization when building a J48 decision tree
for the data in ionosphere.arff. This dataset contains information about radar
signals returned from the ionosphere. “Good” samples are those showing
evidence of some type of structure in the ionosphere, while for “bad” ones the
signals pass directly through the ionosphere. For more details, take a look at
the comments in the ARFF file. Begin with unsupervised discretization.
Ex. 5: Compare the cross-validated accuracy of J48 and the size of the trees
generated for (a) the raw data, (b) data discretized by the unsupervised
discretization method in default mode, (c) data discretized by the same method
with binary attributes.
Now turn to supervised discretization. Here a subtle issue arises. If we
simply repeated the previous exercise using a supervised discretization
method, the result would be over-optimistic. In effect, since cross-validation
is used for evaluation, the data in the test set has been taken into account
when determining the discretization intervals. This does not give a fair
estimate of performance on fresh data.
To evaluate supervised discretization in a fair fashion, we use the
FilteredClassifier from WEKA’s meta-classifiers. This builds the filter model
from the training data only, before evaluating it on the test data using the
discretization intervals computed for the training data. After all, that is
how you would have to process fresh data in practice.
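The "training data only" idea can be sketched in a few lines of plain Python. This is not WEKA's supervised discretization method. As a stand-in, the hypothetical fit_cut_point below picks the single threshold that leaves the fewest training rows outside the majority class on each side. The toy rows are also assumptions.

```python
from collections import Counter


def side_errors(rows):
    """Count rows that do not belong to the majority class of this side."""
    if not rows:
        return 0
    counts = Counter(label for _, label in rows)
    return len(rows) - max(counts.values())


def fit_cut_point(train):
    """Choose a class-aware threshold using the training rows only."""
    values = sorted({x for x, _ in train})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return min(
        candidates,
        key=lambda cut: side_errors([r for r in train if r[0] <= cut])
        + side_errors([r for r in train if r[0] > cut]),
    )


def apply_cut_point(rows, cut):
    """Discretize rows with a threshold that was fitted elsewhere."""
    return [("low" if x <= cut else "high", label) for x, label in rows]


# Hypothetical (value, class) pairs: "g" = good, "b" = bad.
train = [(0.1, "b"), (0.2, "b"), (0.4, "b"), (0.6, "g"), (0.9, "g")]
test = [(0.3, "b"), (0.7, "g")]

cut = fit_cut_point(train)             # class labels of the test rows are never used
test_binned = apply_cut_point(test, cut)
print(cut, test_binned)
```

Fitting the threshold on train and test together would let the test labels influence the intervals, which is exactly the over-optimism described above.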
Ex. 6: Compare the cross-validated accuracy and the size of the trees
generated using the FilteredClassifier and J48 for (d) supervised
discretization in default mode, (e) supervised discretization with binary
attributes.
Ex. 7: Compare these with the results for the raw data ((a) above). Can you
think of a reason why decision trees generated from discretized data can
potentially be more accurate predictors than those built from raw numeric
data?
4 Automatic Attribute Selection
In most practical applications of supervised learning not all attributes are
equally useful for predicting the target. Depending on the learning scheme
employed, redundant and/or irrelevant attributes can result in less accurate
models being generated. The task of manually identifying useful attributes in
a dataset can be tedious, as you have seen in the second tutorial, but there
are automatic attribute selection methods that can be applied.
They can be broadly divided into those that rank individual attributes (e.g.,
based on their information gain) and those that search for a good subset of
attributes by considering the combined effect of the attributes in the subset.
The latter methods can be further divided into so-called filter and wrapper
methods. Filter methods apply a computationally efficient heuristic to measure
the quality of a subset of attributes. Wrapper methods measure the quality of
an attribute subset by building and evaluating an actual classification model
from it, usually based on cross-validation. This is more expensive, but often
delivers superior performance.
In the WEKA Explorer, you can use the Select attributes panel to apply an
attribute selection method on a dataset. The default is CfsSubsetEval.
However, if we want to rank individual attributes, we need to use an attribute
evaluator rather than a subset evaluator, e.g., the InfoGainAttributeEval.
Attribute evaluators need to be applied with a special “search” method, namely
the Ranker.
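Ranking by information gain can be sketched as follows for nominal attributes. This is a plain-Python illustration of the measure, not the code behind InfoGainAttributeEval. The toy rows and the attribute names outlook, windy and play are made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Class entropy minus the weighted class entropy after splitting on attribute."""
    labels = [row[target] for row in rows]
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical data: outlook determines the class, windy is irrelevant.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

# Sort the attributes from most to least informative, as a ranker would.
ranking = sorted(
    ["outlook", "windy"],
    key=lambda a: information_gain(rows, a, "play"),
    reverse=True,
)
print(ranking)
```

Each attribute is scored on its own here, so two perfectly redundant attributes would receive the same score and both rank highly.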
Ex. 8: Apply this technique to the labour negotiations data in labor.arff.
What are the four most important attributes based on information gain?¹

¹ Note that most attribute evaluators, including InfoGainAttributeEval,
discretize numeric attributes using WEKA’s supervised discretization method
before they are evaluated. This is also the case for CfsSubsetEval.
WEKA’s default attribute selection method, CfsSubsetEval, uses a heuristic
attribute subset evaluator in a filter search method. It aims to identify a
subset of attributes that are highly correlated with the target while not
being strongly correlated with each other. By default, it searches through the
space of possible attribute subsets for the “best” one using the BestFirst
search method.³ You can choose others, like a genetic algorithm or even an
exhaustive search. In fact, choosing GreedyStepwise and setting
searchBackwards to true gives “backwards selection,” the search method you
used manually in the second tutorial.
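The trade-off between correlation with the target and correlation among the attributes can be written as a single score. The sketch below uses a commonly cited form of the correlation-based merit heuristic. Treat it as an assumption that CfsSubsetEval computes exactly this quantity. The function name and the correlation values are invented for illustration.

```python
from math import sqrt


def cfs_merit(class_correlations, mean_inter_correlation):
    """Merit of a subset: high class correlation helps, redundancy hurts.

    class_correlations: one attribute-class correlation per attribute in the subset.
    mean_inter_correlation: average correlation between pairs of attributes.
    """
    k = len(class_correlations)
    mean_class = sum(class_correlations) / k
    return k * mean_class / sqrt(k + k * (k - 1) * mean_inter_correlation)


# Two attributes, each moderately correlated with the class (made-up numbers).
independent_pair = cfs_merit([0.5, 0.5], mean_inter_correlation=0.0)
redundant_pair = cfs_merit([0.5, 0.5], mean_inter_correlation=1.0)
print(independent_pair, redundant_pair)
```

With these numbers the redundant pair scores no better than a single attribute with correlation 0.5, while the independent pair scores higher.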
To use the wrapper method rather than a filter method like CfsSubsetEval, you
need to select WrapperSubsetEval. You can configure this by choosing a
learning algorithm to apply. You can also set the number of folds for the
cross-validation that is used to evaluate the model on each subset of
attributes.
Ex. 9: On the same data, run CfsSubsetEval for correlation-based selection,
using BestFirst search. Then run the wrapper method with J48 as the base
learner, again using BestFirst search. Examine the attribute subsets that are
output. Which attributes are selected by both methods? How do they relate to
the output generated by ranking using information gain?
5 More on Automatic Attribute Selection
The Select attributes panel allows us to gain insight into a dataset by
applying attribute selection methods to it. However, using this information to
reduce a dataset becomes problematic if we use some of the reduced data for
testing the model (as in cross-validation).
³ This is a standard search method from AI.
The reason is that, as with supervised discretization, we have actually looked
at the class labels in the test data while selecting attributes; the “best”
attributes were chosen by peeking at the test data. As we already know (see
Tutorial 2), using the test data to influence the construction of a model
biases the accuracy estimates obtained: measured accuracy is likely to be
greater than what will be obtained when the model is deployed on fresh data.
To avoid this, we can manually divide the data into training and test sets and
apply the attribute selection panel to the training set only.
A more convenient method is to use the AttributeSelectedClassifier, one of
WEKA’s meta-classifiers. This allows us to specify an attribute selection
method and a learning algorithm as part of a classification scheme. The
AttributeSelectedClassifier ensures that the chosen set of attributes is
selected based on the training data only, in order to give unbiased accuracy
estimates.
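The pattern behind such a meta-classifier is to repeat the selection inside every training fold. The following plain-Python sketch shows the pattern only. It is not WEKA's implementation. The selector is a deliberately simple stand-in that keeps the one attribute whose values best predict the class on the training rows, and the data and all function names are assumptions.

```python
from collections import Counter, defaultdict


def majority_table(train, attribute):
    """Map each attribute value to the majority class seen in the training rows."""
    counts = defaultdict(Counter)
    for row in train:
        counts[row[attribute]][row["class"]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def training_accuracy(train, attribute):
    table = majority_table(train, attribute)
    return sum(table[r[attribute]] == r["class"] for r in train) / len(train)


def select_attribute(train, candidates):
    """Stand-in selector: keep the single attribute that fits the training rows best."""
    return max(candidates, key=lambda a: training_accuracy(train, a))


def cross_validate(rows, candidates, k):
    correct = 0
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if i % k != fold]
        test = [r for i, r in enumerate(rows) if i % k == fold]
        chosen = select_attribute(train, candidates)  # test rows are never seen here
        table = majority_table(train, chosen)
        default = Counter(r["class"] for r in train).most_common(1)[0][0]
        correct += sum(table.get(r[chosen], default) == r["class"] for r in test)
    return correct / len(rows)


# Hypothetical data: attribute "a" determines the class, "b" does not.
rows = [{"a": i % 2, "b": i % 3, "class": "yes" if i % 2 else "no"} for i in range(12)]
accuracy = cross_validate(rows, ["a", "b"], k=3)
print(accuracy)
```

Because select_attribute only ever receives the training rows of the current fold, the accuracy estimate is not influenced by the test labels.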
Now we try the attribute selection methods introduced above in conjunction
with NaiveBayes. Naive Bayes assumes (conditional) independence of attributes,
so it can be affected if attributes are redundant, and attribute selection can
be very helpful.
You can see the effect of redundant attributes on naive Bayes by adding copies
of an existing attribute to a dataset using the unsupervised filter class
weka.filters.unsupervised.attribute.Copy in the Preprocess panel. Each copy is
obviously perfectly correlated with the original.
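A small calculation shows why copies matter to naive Bayes. Because the attributes are treated as independent, every copy multiplies the same likelihood into the product again. The sketch below uses made-up probabilities for a two-class problem, and the function name is an assumption.

```python
def posterior_positive(prior, like_pos, like_neg, occurrences):
    """Posterior of the positive class when one attribute appears several times.

    like_pos and like_neg are the likelihoods of the observed attribute value
    given each class. occurrences counts the original attribute plus its copies.
    """
    pos = prior * like_pos ** occurrences
    neg = (1 - prior) * like_neg ** occurrences
    return pos / (pos + neg)


# Made-up numbers: the value is only slightly more likely under the positive class.
posteriors = [posterior_positive(0.5, 0.6, 0.4, n) for n in range(1, 5)]
print([round(p, 3) for p in posteriors])
```

The same weak evidence is counted repeatedly, so the model becomes more confident with each copy even though no new information has been added.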
Ex. 10: Load the diabetes classification data in diabetes.arff and start
adding copies of the first attribute in the data, measuring the performance of
naive Bayes (with useSupervisedDiscretization turned on) using
cross-validation after you have added each copy. What do you observe?
Let us now check whether the three attribute selection methods from above,
used in conjunction with AttributeSelectedClassifier and NaiveBayes,
successfully eliminate the redundant attributes. The methods are:
• InfoGainAttributeEval with Ranker (8 attributes)
• CfsSubsetEval with BestFirst
• WrapperSubsetEval with NaiveBayes and BestFirst.
Run each method from within AttributeSelectedClassifier to see the effect on
cross-validated accuracy and check the attribute subset selected by each
method. Note that you need to specify the number of ranked attributes to use
for the Ranker method. Set this to eight, because the original diabetes data
contains eight attributes (excluding the class). Note also that you should
specify NaiveBayes as the classifier to be used inside the wrapper method,
because this is the classifier that we want to select a subset for.
Ex. 11: What can you say regarding the performance of the three attribute
selection methods? Do they succeed in eliminating redundant copies? If not,
why not?
6 Automatic parameter tuning
Many learning algorithms have parameters that can affect the outcome of
learning. For example, the decision tree learner C4.5 (J48 in WEKA) has two
parameters that influence the amount of pruning that it does (we saw one, the
minimum number of instances required in a leaf, in the last tutorial). The
k-nearest-neighbor classifier IBk has one that sets the neighborhood size. But
manually tweaking parameter settings is tedious, just like manually selecting
attributes, and presents the same problem: the test data must not be used when
selecting parameters, otherwise the performance estimates will be biased.
WEKA has a “meta” classifier, CVParameterSelection, that automatically
searches for the “best” parameter settings by optimizing cross-validated
accuracy on the training data. By default, each setting is evaluated using
10-fold cross-validation. The parameters to optimize are specified using the
CVParameters field in the GenericObjectEditor. For each one, we need to give
(a) a string that names it using its letter code, (b) a numeric range of
values to evaluate, and (c) the number of steps to try in this range. (Note
that the parameter is assumed to be numeric.) Click on the More button in the
GenericObjectEditor for more information, and an example.
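To see what a letter code, a range and a number of steps amount to, the sketch below expands one such specification into the list of values that would be tried. The layout of the string (code, lower bound, upper bound, steps, separated by spaces) and the even spacing of the values are assumptions of this example. The More button gives the authoritative description.

```python
def expand_cv_parameter(spec):
    """Turn a spec such as 'K 1 10 10' into (letter code, list of candidate values)."""
    name, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps == 1:
        return name, [lower]
    increment = (upper - lower) / (steps - 1)
    # Rounding only tidies floating-point noise such as 0.30000000000000004.
    return name, [round(lower + i * increment, 10) for i in range(steps)]


name, candidates = expand_cv_parameter("K 1 10 10")
print(name, candidates)
```

Ten steps between 1 and 10 therefore cover every integer neighborhood size in that range.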
For the diabetes data used in the previous section, use CVParameterSelection
in conjunction with IBk to select the “best” value for the neighborhood size,
ranging from 1 to 10 in ten steps. The letter code for the neighborhood size
is K. The cross-validated accuracy of the parameter-tuned version of IBk is
directly comparable with its accuracy using default settings, because tuning
is performed by applying inner cross-validation runs to find the best
parameter setting for each training set occurring in the outer
cross-validation, and the latter yields the final performance estimate.
Ex. 12: What accuracy is obtained in each case? What value is selected for the
parameter-tuned version based on cross-validation on the full training set?
(Note: this value is output in the Classifier output text area.)
Now consider parameter tuning for J48. We can use CVParameterSelection to
perform a grid search on both pruning parameters simultaneously by adding
multiple parameter strings in the CVParameters field. The letter code for the
pruning confidence parameter is C, and you should evaluate values from 0.1 to
0.5 in five steps. The letter code for the minimum leaf size parameter is M,
and you should evaluate values from 1 to 10 in ten steps.
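The grid described here is simply every combination of the two value lists. A minimal sketch, assuming evenly spaced values within each range (the helper name evenly_spaced is hypothetical):

```python
from itertools import product


def evenly_spaced(lower, upper, steps):
    """Return `steps` values from lower to upper inclusive, evenly spaced."""
    increment = (upper - lower) / (steps - 1)
    return [round(lower + i * increment, 10) for i in range(steps)]


confidence_values = evenly_spaced(0.1, 0.5, 5)                # candidates for C
min_leaf_values = [int(v) for v in evenly_spaced(1, 10, 10)]  # candidates for M

# Every (C, M) pair is one point of the grid to be scored by cross-validation.
grid = list(product(confidence_values, min_leaf_values))
print(len(grid), grid[0], grid[-1])
```

Each grid point requires its own cross-validation run, so the cost grows with the product of the two step counts.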
Ex. 13: Run CVParameterSelection to find the best parameter values in the
resulting grid. Compare the output you get to that obtained from J48 with
default parameters. Has accuracy changed? What about tree size? What parameter
values were selected by CVParameterSelection for the model built from the full
training set?
Tutorial 5: Document Classification
May 5, 2011
2008-2012
1 Introduction
Text classification is a popular application of machine learning. You may even
have used it: email spam filters are classifiers that divide email messages,
which are just short documents, into two groups: junk and not junk. So-called
“Bayesian” spam filters are trained on messages that have been manually
labeled, perhaps by putting them into appropriate folders (e.g. “ham” vs
“spam”).
In this tutorial we look at how to perform document classification using tools
in WEKA. The raw data is text, but most machine learning algorithms expect
examples that are described by a fixed set of attributes. Hence we first
convert the text data into a form suitable for learning. This is usually done
by creating a dictionary of terms from all the documents in the training
corpus and making a numeric attribute for each term. Then, for a particular
document, the value of each attribute is based on the frequency of the
corresponding term in the document. There is also the class attribute, which
gives the document’s label.
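The conversion just described can be sketched in plain Python. This is an illustration of the idea rather than WEKA's filter: the tokenizer (lower-casing and splitting on whitespace), the function names and the two sample documents are assumptions of this example.

```python
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lower-case the text and split on whitespace."""
    return text.lower().split()


def build_dictionary(documents):
    """Collect every distinct term in the training documents, in sorted order."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def to_vector(document, dictionary):
    """One numeric attribute per dictionary term: how often the term occurs."""
    counts = Counter(tokenize(document))
    return [counts[term] for term in dictionary]


# Hypothetical training corpus.
training = ["crude oil is in short supply", "olive oil and cooking oil"]
dictionary = build_dictionary(training)
vectors = [to_vector(doc, dictionary) for doc in training]
print(dictionary)
print(vectors)
```

Every document becomes a vector of the same length, which is what a learning algorithm that expects a fixed set of attributes needs.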
2 Data with string attributes
WEKA’s unsupervised attribute filter StringToWordVector can be used to convert
raw text into term-frequency-based attributes. The filter assumes that the
text of the documents is stored in an attribute of type String, which is a
nominal attribute without a pre-specified set of values. In the filtered data,
this string attribute is replaced by a fixed set of numeric attributes, and
the class attribute is put at the beginning, as the first attribute.
To perform document classification, we first need to create an ARFF file with
a string attribute that holds the documents’ text, declared in the header of
the ARFF file using @attribute document string, where document is the name of
the attribute. We also need a nominal attribute that holds the document’s
classification.
Document text                                         Classification
The price of crude oil has increased significantly    yes
Demand of crude oil outstrips supply                  yes
Some people do not like the flavor of olive oil       no
The food was very oily                                no
Crude oil is in short supply                          yes
Use a bit of cooking oil in the frying pan            no
Table 1: Training “documents”.
Document text                                 Classification
Oil platforms extract crude oil               Unknown
Canola oil is supposed to be healthy          Unknown
Iraq has significant oil reserves             Unknown
There are different types of cooking oil      Unknown
Table 2: Test “documents”.
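An ARFF file of the kind described above, with a string attribute named document followed by a nominal class attribute, can be written by hand or generated. The sketch below builds the file contents as text for a few rows from Tables 1 and 2. The relation names, the attribute name class and the helper functions are assumptions made for this example. A question mark stands for a missing class label.

```python
def quote(text):
    """Wrap a string value in single quotes, escaping backslashes and quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def to_arff(relation, rows, classes):
    """Build ARFF text for (document text, label) rows; a label of None becomes '?'."""
    lines = [
        f"@relation {relation}",
        "",
        "@attribute document string",
        "@attribute class {" + ",".join(classes) + "}",
        "",
        "@data",
    ]
    for text, label in rows:
        lines.append(f"{quote(text)},{label if label is not None else '?'}")
    return "\n".join(lines) + "\n"


train_rows = [
    ("The price of crude oil has increased significantly", "yes"),
    ("The food was very oily", "no"),
]
test_rows = [("Oil platforms extract crude oil", None)]

train_arff = to_arff("oil_train", train_rows, ["yes", "no"])
test_arff = to_arff("oil_test", test_rows, ["yes", "no"])
print(train_arff)
print(test_arff)
```

Saving each string to a file with the .arff extension gives a dataset that can be opened in the Preprocess panel.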
Ex. 1: To get a feeling for how this works, make an ARFF file from the labeled
mini-“documents” in Table 1 and run StringToWordVector with default options on
this data. How many attributes are generated? Now change the value of the
option minTermFreq to 2. What attributes are generated now?
Ex. 2: Build a J48 decision tree from the last version of the data you
generated. Give the tree in textual form.
Usually, the purpose of a classifier is to classify new documents. Let’s
classify the ones given in Table 2, based on the decision tree generated from
the documents in Table 1. To apply the same filter to both training and test
documents, we can use the FilteredClassifier, specifying the
StringToWordVector filter and the base classifier that we want to apply (i.e.,
J48).
Ex. 3: Create an ARFF file from Table 2, using question marks for the missing
class labels. Configure the FilteredClassifier using default options for
StringToWordVector and J48, and specify your new ARFF file as the test set.
Make sure that you select Output predictions under More options... in the
Classify panel. Look at the model and the predictions it generates, and verify
that they are consistent. What are the predictions (in the order in which the
documents are listed in Table 2)?
3 Classifying actual short text documents
There is a standard collection of newswire articles that is widely used for
evaluating document classifiers. ReutersCorn-train.arff and
ReutersGrain-train.arff are sets of training data derived from this
collection; ReutersCorn-test.arff and ReutersGrain-test.arff are corresponding
test sets. The actual documents in the corn and grain data are the same; just
the labels differ. In the first dataset, articles that talk about corn-related
issues have a class value of 1 and the others have 0; the aim is to build a
classifier that can be used to identify articles that talk about corn. In the
second, the analogous labeling is performed with respect to grain-related
issues, and the aim is to identify these articles in the test set.
Ex. 4: Build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. What percentage of correct classifications is obtained in the four scenarios? Based on your results, which classifier would you choose?
The percentage of correct classifications is not the only evaluation metric used for document classification. WEKA includes several other per-class evaluation statistics that are often used to evaluate information retrieval systems like search engines. These are tabulated under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), number of false positives (FP), number of true negatives (TN), and number of false negatives (FN) in the test data. A true positive is a test instance that is classified correctly as belonging to the target class concerned, while a false positive is a (negative) instance that is incorrectly assigned to the target class. FN and TN are defined analogously. The statistics output by WEKA are computed as follows:
• TP Rate: TP / (TP + FN)
• FP Rate: FP / (FP + TN)
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-Measure: the harmonic mean of precision and recall (2/F = 1/precision + 1/recall).
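As a minimal sketch of the formulas above, the following Python function computes the five statistics from the four counts. The function names and the example counts are hypothetical (they are not WEKA output or Reuters results), and returning NaN for a zero denominator is an assumption of this sketch rather than documented WEKA behaviour.

```python
import math


def _ratio(num, den):
    """Return num/den, or NaN when the denominator is zero (sketch assumption)."""
    return num / den if den else math.nan


def per_class_stats(tp, fp, tn, fn):
    """Compute the per-class statistics listed above from the four counts."""
    tp_rate = _ratio(tp, tp + fn)
    fp_rate = _ratio(fp, fp + tn)
    precision = _ratio(tp, tp + fp)
    recall = tp_rate  # Recall uses the same formula as TP Rate
    # Harmonic mean: 2/F = 1/precision + 1/recall  =>  F = 2PR / (P + R)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return {
        "TP Rate": tp_rate,
        "FP Rate": fp_rate,
        "Precision": precision,
        "Recall": recall,
        "F-Measure": f_measure,
    }


# Hypothetical counts for one target class
stats = per_class_stats(tp=8, fp=2, tn=85, fn=5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Because Recall and TP Rate share a formula, the two entries always agree; they are listed separately only because the two names come from different fields.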
Ex. 5: Based on the formulas, what are the best possible values for each of the statistics in this list? Describe in English when these values are attained.
The Classifier Output table also gives the ROC area, which differs from the other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Most classifiers in WEKA can produce probabilities in addition to actual classifications.) The ROC area (which is also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.
The best outcome is that all positive examples are ranked above all negative examples. In that case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC that is at least 0.5, otherwise our classifier has not learned anything from the training data.
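The probabilistic definition of AUC given above can be checked directly by comparing every positive instance with every negative one. The sketch below does this by brute force; the function name and the scores are hypothetical, and counting a tied pair as half a win is a common convention assumed here, not something stated in this tutorial.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive instance
    receives the higher score; tied pairs count as 0.5 (assumed convention)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
best = pairwise_auc([0.9, 0.8, 0.7, 0.3, 0.2, 0.1], labels)   # positives on top
worst = pairwise_auc([0.1, 0.2, 0.3, 0.7, 0.8, 0.9], labels)  # negatives on top
flat = pairwise_auc([0.5] * 6, labels)                        # uninformative scores
print(best, worst, flat)
```

The three calls reproduce the three cases described in the text: a perfect ranking, a completely inverted ranking, and a ranking that carries no information.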
Ex. 6: Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
Ex. 7: Interpret in your own words the difference between the confusion matrices for the two classifiers.
There is a close relationship between ROC Area and the trade-off between TP Rate and FP Rate. Rather than just obtaining a single pair of values for the true and false positive rates, a whole range of value pairs can be obtained by imposing different classification thresholds on the probabilities predicted by the classifier.
By default, an instance is classified as “positive” if the predicted probability for the positive class is greater than 0.5; otherwise it is classified as negative. (This is because an instance is more likely to be positive than negative if the predicted probability for the positive class is greater than 0.5.) Suppose we change this threshold from 0.5 to some other value between 0 and 1, and recompute TP Rate and FP Rate. Repeating this with different thresholds produces what is called an ROC curve. You can show it in WEKA by right-clicking on an entry in the result list and selecting Visualize threshold curve.
When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from: “Area Under the Curve.”
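To make the threshold sweep concrete, here is a small sketch that builds the (FP Rate, TP Rate) points for a handful of hypothetical scores and then measures the area under the connected points with the trapezoid rule. The function names and data are illustrative only; the sketch predicts "positive" when the score is at least the threshold (so that the lowest threshold yields the point (1, 1)), which is a simplification and not necessarily how WEKA generates its threshold curve.

```python
def roc_points(scores, labels):
    """Return (FP Rate, TP Rate) points, one per distinct threshold, plus (0, 0)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep thresholds from strictest to most lenient
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid-rule area under a curve given as x-sorted (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities for the positive class
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
auc = area_under(pts)
print(pts)
print(auc)
```

For these scores, 8 of the 9 positive/negative pairs are ranked correctly, and the trapezoid area comes out to the same fraction, which illustrates the link between the curve and the ranking interpretation of AUC.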
Ex. 8: For the Reuters dataset that produced the most extreme difference in Exercise 6 above, look at the ROC curves for class 1. Make a very rough estimate of the area under each curve, and explain it in words.
Ex. 9: What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch, or a description in words, is sufficient)?
Using the threshold curve GUI, you can also plot other types of curves, e.g. a precision/recall curve, with Recall on the x axis and Precision on the y axis. This plots precision against recall for each probability threshold evaluated.
Ex. 10: Change the axes to obtain a precision/recall curve. What shape does the ideal precision/recall curve corresponding to perfect performance have (again a rough sketch or verbal description is sufficient)?
4 Exploring the StringToWordVector filter
By default, the StringToWordVector filter simply makes the attribute value in the transformed dataset 1 or 0 for all raw single-word terms, depending on whether the word appears in the document or not. However, there are many options that can be changed, e.g.:
• outputWordCounts causes actual word counts to be output.
• IDFTransform and TFTransform: when both are set to true, term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.
• stemmer allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.
• useStopList allows you to determine whether or not stop words are deleted. Stop words are uninformative common words (e.g. a, the).
• tokenizer allows you to choose a different tokenizer for generating terms, e.g. one that produces word n-grams instead of single words.
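The difference between the default 0/1 values, word counts, and TF × IDF weighting can be illustrated outside WEKA with a few lines of Python. This is only a sketch: the function names, the whitespace tokenizer, and the example documents are hypothetical, and the weighting log(1 + count) × log(N / df) is one common textbook variant assumed here, which may differ in detail from what StringToWordVector computes.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def word_vectors(docs, mode="binary"):
    """Turn documents into dicts mapping term -> value.

    mode="binary": 1 if the term occurs (absent terms are omitted, i.e. 0)
    mode="counts": raw term counts
    mode="tfidf":  log(1 + count) * log(N / df)  (assumed textbook variant)
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        if mode == "binary":
            vec = {t: 1 for t in c}
        elif mode == "counts":
            vec = dict(c)
        elif mode == "tfidf":
            vec = {t: math.log(1 + f) * math.log(n_docs / doc_freq[t])
                   for t, f in c.items()}
        else:
            raise ValueError(f"unknown mode: {mode}")
        vectors.append(vec)
    return vectors


# Hypothetical mini-corpus
docs = ["corn prices rise", "grain and corn exports", "oil prices fall and fall"]
for mode in ("binary", "counts", "tfidf"):
    print(mode, word_vectors(docs, mode)[2])
```

Note how a term that occurs twice in one document ("fall") is indistinguishable from a single occurrence in binary mode, while a term that appears in most documents receives a small TF × IDF weight.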
There are several other useful options. For more information, click on More in the GenericObjectEditor.
Ex. 11: Experiment with the options that are available. What options give you a good AUC value for the two datasets above, using NaiveBayesMultinomial as the classifier? (Note: an exhaustive search is not required.)
Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of an article. We can use WEKA’s AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try to eliminate attributes that are not so useful. As before, we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier.
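The idea of ranking attributes by information gain and keeping only the top few can be sketched in plain Python. Everything below is illustrative: the function names, the tiny 0/1 term matrix, and the `num_to_select` argument (named after the Ranker field used in the exercise that follows) are assumptions of this sketch, and it does not reproduce WEKA's implementation.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def info_gain(column, labels):
    """Information gain of one attribute column with respect to the class."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def top_attributes(rows, labels, names, num_to_select):
    """Rank attributes by information gain and keep the best num_to_select."""
    gains = []
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        gains.append((info_gain(column, labels), name))
    # Highest gain first; ties broken alphabetically for a stable result
    gains.sort(key=lambda g: (-g[0], g[1]))
    return [name for _, name in gains[:num_to_select]]


# Hypothetical 0/1 term-presence matrix: one row per document
names = ["corn", "the", "harvest"]
rows = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 1, 1]]
labels = [1, 1, 0, 0]
selected = top_attributes(rows, labels, names, num_to_select=1)
print(selected)
```

In this toy matrix the term "the" occurs in every document and "harvest" is split evenly across both classes, so neither carries information about the class, whereas "corn" separates the classes perfectly.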
Ex. 12: Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most-informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets above? What AUC values are the best you manage to obtain? (Again, an exhaustive search is not required.)
Tutorial 6: Mining Association Rules
May 5, 2011
2008-2012
1 Introduction
Association rule mining is one of the most prominent data mining techniques. In this tutorial, we will work with Apriori, the association rule mining algorithm that started it all. As you will see, it is not straightforward to extract useful information using association rule mining.
2 Association rule mining in WEKA
In WEKA’s Explorer, techniques for association rule mining are accessed using the Associate panel. Because this is a purely exploratory data mining technique, there are no evaluation options, and the structure of the panel is simple. The default method is Apriori, which we use in this tutorial. WEKA contains a couple of other techniques for learning associations from data, but they are probably more interesting to researchers than practitioners.
To get a feel for how to apply Apriori, we start by mining rules from the weather.nominal.arff data that we used in Tutorial 1. Note that this algorithm expects data that is purely nominal: numeric attributes must be discretized first. After loading the data in the Preprocess panel, hit the Start button in the Associate panel to run Apriori with default options. It outputs ten rules, ranked according to the confidence measure given in parentheses after each one. The number following a rule’s antecedent shows how many instances satisfy the antecedent; the number following the conclusion shows how many instances satisfy the entire rule (this is the rule’s “support”). Because both numbers are equal for all ten rules, the confidence of every rule is exactly one.
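The two numbers printed with each rule, and the confidence derived from them, can be reproduced by simple counting. The sketch below uses a small made-up set of instances (it is not the weather data) and hypothetical function names; confidence is computed as the number of instances satisfying the whole rule divided by the number satisfying the antecedent.

```python
def count_matching(instances, items):
    """Number of instances that contain every attribute=value pair in items."""
    return sum(1 for inst in instances
               if all(inst.get(a) == v for a, v in items.items()))


def rule_stats(instances, antecedent, consequent):
    """Return (antecedent count, whole-rule count, confidence) for a rule."""
    ant = count_matching(instances, antecedent)
    both = count_matching(instances, {**antecedent, **consequent})
    confidence = both / ant if ant else float("nan")
    return ant, both, confidence


# Hypothetical instances, for illustration only
instances = [
    {"outlook": "sunny", "windy": "FALSE", "play": "no"},
    {"outlook": "sunny", "windy": "TRUE", "play": "no"},
    {"outlook": "overcast", "windy": "FALSE", "play": "yes"},
    {"outlook": "overcast", "windy": "TRUE", "play": "yes"},
    {"outlook": "rainy", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy", "windy": "TRUE", "play": "no"},
]

print(rule_stats(instances, {"outlook": "overcast"}, {"play": "yes"}))
print(rule_stats(instances, {"windy": "FALSE"}, {"play": "yes"}))
```

The first rule holds for every instance that matches its antecedent, so the two counts are equal and the confidence is one; the second rule fails for one matching instance, so its confidence is lower.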
In practice, it is tedious to find minimum support and confidence values that give satisfactory results. Consequently, WEKA’s Apriori runs the basic algorithm several times. It uses the same user-specified minimum confidence value throughout, given by the minMetric parameter. The support level is expressed as a proportion of the total number of instances (14 in the case of the weather data), as a ratio between 0 and 1. The minimum support level starts at a certain value (upperBoundMinSupport, which should invariably be left at 1.0 to include the entire set of instances). In each iteration the support is decreased by a fixed amount (delta, default 0.05, 5% of the instances) until either a certain number of rules has been generated (numRules, default 10 rules) or the support reaches a certain “minimum minimum” level (lowerBoundMinSupport, default 0.1; typically rules are uninteresting if they apply to only 10% of the dataset or less). These four values can all be specified by the user.
This sounds pretty complicated, so let us examine what happens on the weather data. From the output in the Associator output text area, we see that the algorithm managed to generate ten rules. This is based on a minimum confidence level of 0.9, which is the default, and is also shown in the output. The Number of cycles performed, which is shown as 17, tells us that Apriori was actually run 17 times to generate these rules, with 17 different values for the minimum support. The final value, which corresponds to the output that was generated, is 0.15 (corresponding to 0.15 × 14 ≈ 2 instances).
By looking at the options in the GenericObjectEditor, we can see that the initial value for the minimum support (upperBoundMinSupport) is 1 by default, and that delta is 0.05. Now, 1 − 17 × 0.05 = 0.15, so this explains why a minimum support value of 0.15 is reached after 17 iterations. Note that upperBoundMinSupport is decreased by delta before the basic Apriori algorithm is run for the first time.
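The arithmetic behind the 17 cycles can be replayed with a short sketch that lists the minimum-support values tried in successive cycles. The function name is hypothetical and the sketch models only the support schedule described in the text (start at the upper bound, subtract delta before each run, stop below the lower bound); it does not model the other stopping condition, namely that enough rules have been found.

```python
def support_schedule(upper=1.0, delta=0.05, lower=0.1):
    """Minimum-support values tried in successive cycles, highest first."""
    values = []
    cycle = 1
    while True:
        # The bound is decreased before each run; rounding suppresses
        # floating-point noise such as 0.1499999999999999
        support = round(upper - cycle * delta, 10)
        if support < lower:
            break
        values.append(support)
        cycle += 1
    return values


schedule = support_schedule()
print(schedule[0])    # support used in the first cycle
print(schedule[16])   # support used in the 17th cycle
print(len(schedule))  # cycles available before the lower bound is passed
```

With the default settings the 17th value in the schedule is 0.15, matching the final minimum support reported for the weather data; the run stops there because the requested number of rules has been reached, not because the lower bound was hit.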
Minimum confidence    Minimum support    Number of rules
0.9                   0.3
0.9                   0.2
0.9                   0.1
0.8                   0.3
0.8                   0.2
0.8                   0.1
0.7                   0.3
0.7                   0.2
0.7                   0.1
Table 1: Total number of rules for different values of minimum confidence and support
The Associator output text area also shows the number of frequent item sets that were found, based on the last value of the minimum support that was tried (i.e. 0.15 in this example). We can see that, given a minimum support of two instances, there are 12 item sets of size one, 47 item sets of size two, 39 item sets of size three, and 6 item sets of size four. By setting outputItemSets to true before running the algorithm, all those different item sets and the number of instances that support them are shown. Try this.
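To see what "frequent item sets of size one, two, three" means in practice, the following sketch counts every item set that occurs in a tiny made-up dataset and keeps those supported by at least a given number of instances. It is a brute-force enumeration written for clarity, not the level-wise Apriori procedure, and the function name and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_item_sets(instances, min_count):
    """Brute force: count every item set occurring in the data and keep
    those supported by at least min_count instances."""
    counts = Counter()
    for inst in instances:
        items = sorted(inst.items())  # (attribute, value) pairs
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                counts[subset] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


# Hypothetical instances, for illustration only
instances = [
    {"outlook": "sunny", "windy": "FALSE", "play": "no"},
    {"outlook": "sunny", "windy": "TRUE", "play": "no"},
    {"outlook": "rainy", "windy": "FALSE", "play": "yes"},
    {"outlook": "rainy", "windy": "TRUE", "play": "no"},
]

frequent = frequent_item_sets(instances, min_count=2)
sizes = Counter(len(item_set) for item_set in frequent)
for size in sorted(sizes):
    print(f"size {size}: {sizes[size]} item sets")
```

Enumerating all subsets is exponential in the number of attributes, which is why Apriori instead builds larger candidate sets only from smaller sets that are already known to be frequent.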
Ex. 1: Based on the output, what is the support of the item set
outlook=rainy
humidity=normal
windy=FALSE
play=yes?
Ex. 2: Suppose we want to generate all rules with a certain confidence and minimum support. This can be done by choosing appropriate values for minMetric, lowerBoundMinSupport, and numRules. What is the total number of possible rules for the weather data for each combination of values in Table 1?
Apriori has some further parameters. If significanceLevel is set to a value between zero and one, the association rules are filtered based on a χ² test with the chosen significance level. However, applying a significance test in this context is problematic because of the so-called “multiple comparison problem”: if we perform a test hundreds of times for hundreds of association rules, it is likely that a significant effect will be found just by chance (i.e., an association seems to be statistically significant when really it is not). Also, the χ² test is inaccurate for small sample sizes (in this context, small support values).
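A quick calculation shows how fast the multiple comparison problem grows. The sketch below assumes, for illustration only, that the tests are independent and that none of the tested associations is real; under those assumptions the chance that at least one test comes out "significant" is 1 − (1 − α)^m for m tests at level α. The function name is hypothetical.

```python
def prob_spurious_finding(alpha, num_tests):
    """P(at least one test is significant purely by chance), assuming
    independent tests and no real association (illustrative assumptions)."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 10, 100):
    print(m, round(prob_spurious_finding(0.05, m), 3))
```

Even at the conventional 5% level, testing on the order of a hundred rules makes at least one spurious "significant" association almost certain under these assumptions.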
There are alternative measures for ranking rules. As well as Confidence, Apriori supports Lift, Leverage, and Conviction. These can be selected using metricType. More information is available by clicking More in the GenericObjectEditor.
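For orientation, the sketch below computes the four metrics from raw counts using their usual textbook definitions; these definitions are an assumption of this sketch, so consult the More button for the exact formulas WEKA uses. The function name and the counts are hypothetical.

```python
import math


def rule_metrics(n, n_ant, n_cons, n_both):
    """Textbook rule metrics from counts: n instances in total, n_ant matching
    the antecedent, n_cons matching the consequent, n_both matching both."""
    p_a = n_ant / n
    p_c = n_cons / n
    p_ac = n_both / n
    confidence = p_ac / p_a
    lift = confidence / p_c          # > 1: antecedent makes consequent more likely
    leverage = p_ac - p_a * p_c      # excess over what independence would give
    # Conviction is unbounded when the rule never fails (confidence == 1)
    failures = n_ant - n_both
    conviction = p_a * (1 - p_c) / (failures / n) if failures > 0 else math.inf
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Hypothetical counts
metrics = rule_metrics(n=100, n_ant=40, n_cons=50, n_both=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike confidence, the other three metrics take the overall frequency of the consequent into account, so a rule whose consequent is true for almost every instance does not automatically score well.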
Ex. 3: Run Apriori on the weather data with each of the four rule ranking metrics, and default settings otherwise. What is the top-ranked rule that is output for each metric?
3 Mining a real-world dataset
Now consider a real-world dataset, vote.arff, which gives the votes of 435 U.S. congressmen on 16 key issues gathered in the mid-1980s, and also includes their party affiliation as a binary attribute. This is a purely nominal dataset with some missing values (actually, abstentions). It is normally treated as a classification problem, the task being to predict party affiliation based on voting patterns. However, we can also apply association rule mining to this data and seek interesting associations. More information on the data appears in the comments in the ARFF file.
Ex. 4: Run Apriori on this data with default settings. Comment on the rules that are generated. Several of them are quite similar. How are their support and confidence values related?
Ex. 5: It is interesting to see that none of the rules in the default output involves Class=republican. Why do you think that is?
4 Market basket analysis
A popular application of association rule mining is market basket analysis: analyzing customer purchasing habits by seeking associations in the items they buy when visiting a store. To do market basket analysis in WEKA, each transaction is coded as an instance whose attributes represent the items in the store. Each attribute has only one value: if a particular transaction does not contain it (i.e., the customer did not buy that particular item), this is coded as a missing value.
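The encoding described above can be sketched as follows: every item becomes an attribute, a purchased item gets the attribute's single value, and everything else is written as a missing value. The function name, the item names, and the baskets are hypothetical, and using `t` as the single value is an assumption of this sketch (`?` is the ARFF notation for a missing value).

```python
def encode_baskets(baskets, items):
    """One row per transaction: 't' if the item was bought, '?' (missing)
    otherwise. The value 't' is an assumed placeholder for 'present'."""
    return [["t" if item in basket else "?" for item in items]
            for basket in baskets]


# Hypothetical store inventory and transactions
items = ["bread", "milk", "cheese"]
baskets = [{"bread", "milk"}, {"cheese"}, {"bread", "milk", "cheese"}]

rows = encode_baskets(baskets, items)
for row in rows:
    print(",".join(row))
```

Coding absence as missing rather than as a second value keeps the miner from producing rules about items that were not bought, which would otherwise dominate the output because most baskets contain only a small fraction of the store's items.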
Your job is to mine supermarket checkout data for associations. The data in supermarket.arff was collected from an actual New Zealand supermarket. Take a look at this file using a text editor to verify that you understand the structure. The main point of this exercise is to show you how difficult it is to find any interesting patterns in this type of data!
Ex. 6: Experiment with Apriori and investigate the effect of the various parameters discussed above. Write a brief report on your investigation and the main findings.
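When interpreting the rules that Apriori reports, it can help to compute support and confidence by hand on a tiny example. The sketch below uses the usual textbook definitions (support as the fraction of transactions containing every item in the rule, confidence as that count divided by the number of transactions containing the antecedent). The helper names and the toy baskets are hypothetical, and this is not WEKA's own implementation.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    n_both = support_count(transactions, antecedent | consequent)
    n_antecedent = support_count(transactions, antecedent)
    support = n_both / len(transactions)
    # Confidence is undefined when the antecedent never occurs; report 0.0.
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical baskets, made up for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "milk", "cheese"}),
    frozenset({"milk"}),
    frozenset({"bread"}),
]

support, confidence = rule_metrics(
    transactions, frozenset({"bread"}), frozenset({"milk"})
)
print(f"bread => milk: support={support:.2f}, confidence={confidence:.2f}")
```

Raising the minimum support or minimum confidence threshold removes rules whose values fall below it, which is one way to reason about the parameter changes explored in the exercise.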

