K-Nearest Neighbors is a lazy learning algorithm that classifies a new data point based on the majority class of its k nearest neighbors. It requires storing all training data and calculating distances between new and stored points, making it computationally expensive for large datasets. Preprocessing techniques like dimensionality reduction and attribute weighting can help address the "curse of dimensionality" issue KNN faces with high-dimensional data.
Day 4 Introduction

At its core, the algorithm says:
• Pick the number of neighbors k you want to use for classification or regression
• Choose a method to measure distances
• Keep a data set with records
• For every new point, identify its k nearest neighbors using the method you chose
• Let them vote if it is a classification problem, or take a mean/median for regression (a minimal sketch of this procedure follows the diagrams)

Diagrammatically:
• Here, k = 1: the new green point is labeled black, as its single nearest neighbor is black too
• Here, k = 3: the new green point is labeled white based on the voting of its three nearest neighbors
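To make the steps above concrete, here is a minimal from-scratch sketch in Python. The function name knn_predict and the toy points are our own illustrations, not from the notes; the two print calls reproduce the k = 1 and k = 3 diagrams.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, regression=False):
    """Classify (majority vote) or regress (mean) with the k nearest neighbors."""
    # Euclidean distance from the new point to every stored training record
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training records
    nearest = np.argsort(dists)[:k]
    if regression:
        return np.mean(y_train[nearest])                     # mean of neighbor targets
    return Counter(y_train[nearest]).most_common(1)[0][0]    # majority vote

# One black (0) neighbor nearby, two white (1) neighbors slightly farther away.
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.2, 1.8]])
y = np.array([0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.1]), k=1))  # 0: nearest neighbor is black
print(knn_predict(X, y, np.array([1.1, 1.1]), k=3))  # 1: white wins the vote 2-1
```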
From the algorithm, K-NN is clearly:
• Lazy: this is a technical term! All the techniques we have learned so far have a "training phase" that identifies a function from the training set and then applies that function to the test data; such learning is called "eager learning". K-NN, on the other hand, does not generalize and uses all the training data (or a subset) in the testing phase. This type of learning is called lazy learning or instance-based learning.
• K-NN requires more time, since all data points are needed to decide: predicting one query takes on the order of N·d distance computations, where N is the number of training examples and d is the dimension of each sample.
• It requires more memory, as all training data needs to be stored. So it is very expensive for large data sets and large dimensions.
• Hence, a lot of effort must be spent on reducing N and d. K-NN does suffer from the curse of dimensionality.

Attributes

Handling the curse of dimensionality
• K-NN is heavily impacted by a huge number of dimensions. Reduce the dimensions using:
  – Correlation, Principal Component Analysis
  – Gain ratio, information gain (filter approaches: we may lose some attributes that are important)
  – Wrapper methods (forward selection, backward elimination)
  – Weighting attributes
• Scale the attributes: attributes with a larger range can dominate. To understand this, consider the pair of data points (0.1, 20) and (0.9, 720). The distance is almost completely dominated by (720 - 20) = 700. To avoid this, we standardize attributes to force them onto a common value range (see the standardization sketch below). The common techniques include:
  – Taking logarithms when one variable varies over several orders of magnitude
  – Dividing by the highest value to bring a variable between 0 and 1
  – Standardizing, which brings most of the data between -3 and 3
  – Converting categorical and ordinal variables to numeric
• Handling missing values: K-NN is impacted heavily by missing values; imputation is one option (see the K-NN imputation sketch below)
• Handling overfitting: remove outliers (Wilson editing)
• Speeding up K-NN: condensation

Feature Engineering
• In practice K-NN is used through a library class rather than written by hand (see the pipeline example below)
• K-NN produces complex decision surfaces. As the complexity of the decision surface increases, accuracy decreases and we need more data; increase k to decrease the overfit.
• K-NN gives no explicability.
• K-NN is a distance method, so it works only with numeric variables; convert categorical/ordinal values into numeric ones.
• K-NN works well in batch mode, not in real time.
• K-NN fails when there are missing values (use K-NN imputation in data pre-processing to fill them).
• In K-NN, training is easy but predictions are expensive.
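To make the range-dominance example concrete, here is a small standardization sketch. The first two points are from the example above; the third point is our own addition so the column statistics are meaningful.

```python
import numpy as np

X = np.array([[0.1,  20.0],
              [0.9, 720.0],
              [0.5, 370.0]])   # third point added for illustration

# Raw distance: almost entirely due to the second attribute
raw_dist = np.linalg.norm(X[0] - X[1])
print(raw_dist)                # ~700.0: the first attribute barely matters

# Standardize each column to zero mean and unit variance (z-scores)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
std_dist = np.linalg.norm(X_std[0] - X_std[1])
print(std_dist)                # both attributes now contribute equally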
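Scaling, dimensionality reduction, and the library's K-NN class can be chained together. A sketch using scikit-learn, where the digits dataset and the choice of 16 PCA components are purely illustrative assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)            # 64-dimensional inputs
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(),        # common value range
                      PCA(n_components=16),    # reduce 64 -> 16 dimensions
                      KNeighborsClassifier(n_neighbors=3))
model.fit(X_tr, y_tr)      # scaler/PCA learn transforms; K-NN just stores points
print(model.score(X_te, y_te))                 # accuracy on held-out points
```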
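For the missing-value point, scikit-learn ships K-NN imputation as KNNImputer. A small sketch with made-up values:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.9, 2.8],
              [8.0, 9.0, 9.5]])

imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))   # NaN is replaced by the mean of the two
                                  # nearest complete rows: (3.0 + 2.8) / 2 = 2.9
```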