Distance-based filters

These filters flag a sample as noisy when its label is inconsistent with the labels of its neighbors under a chosen distance metric.

Overview

  • ENNFilter and ENNProbFilter use nearest-neighbor voting.
  • MultiEditFilter iteratively cleans the data in stratified blocks.
  • NCNEdit selects each sample's neighbors so that their centroid stays as close to the sample as possible.

AllKNN and TomekLinks are re-exported from imbalanced-learn when that dependency is available.

ENN

Bases: BaseEstimator

Edited nearest-neighbor noise filter.

Parameters:

n_neighbors : int, default 3
    Number of neighbors used to vote on each sample.

mode : "enn" or "menn", default "enn"
    "enn" uses the fixed k nearest neighbors. "menn" expands the candidate neighborhood until predictions stabilize.

metric : str, default "minkowski"
    Distance metric used by sklearn.neighbors.NearestNeighbors.

p : int, default 2
    Minkowski power parameter, only used when metric="minkowski".

tie_eps : float, default 1e-12
    Tolerance used to include neighbors tied at the kth distance.

action : "remove" or "detect", default "remove"
    Whether to drop noisy samples or only detect them.

n_jobs : int or None, default None
    Parallelism forwarded to the nearest-neighbor search.

candidate_strategy : "full" or "expansive", default "full"
    Strategy used to grow the candidate set in "menn" mode.

Notes

A sample is flagged as noisy when the neighborhood vote differs from the observed label.
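
The voting rule in the note above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the helper name `enn_flags` is hypothetical, distances are plain Euclidean, and the way `tie_eps` is applied (every neighbor within `tie_eps` of the kth distance also votes) is an assumption based on the parameter description.

```python
from collections import Counter
from math import dist


def enn_flags(X, y, n_neighbors=3, tie_eps=1e-12):
    """Return one boolean per sample: True when the neighborhood vote differs from y[i]."""
    flags = []
    for i, xi in enumerate(X):
        # Distances from sample i to every other sample, nearest first.
        others = sorted((dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        k = min(n_neighbors, len(others))
        if k == 0:
            flags.append(False)  # a lone sample has no neighbors to disagree with
            continue
        kth_distance = others[k - 1][0]
        # Assumption: neighbors tied with the kth distance (within tie_eps) vote too.
        voters = [j for d, j in others if d <= kth_distance + tie_eps]
        votes = Counter(y[j] for j in voters)
        # On equal counts, Counter returns the label encountered first.
        nn_pred = votes.most_common(1)[0][0]
        flags.append(nn_pred != y[i])
    return flags


X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (0.05, 0.05)]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # the last sample sits inside the class-0 cluster

flags = enn_flags(X, y)
keep_mask = [not f for f in flags]
noisy_fraction = sum(flags) / len(flags)
```

In this toy data only the last sample is flagged, because all of its voting neighbors carry label 0 while its observed label is 1.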

fit(X, y)

Fit the filter and cache nearest-neighbor predictions.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of an edited nearest-neighbor filtering run.

Attributes:

keep_mask : ndarray of bool
    Mask indicating which samples are kept after filtering.

noisy_fraction : float
    Fraction of samples flagged as noisy.

nn_pred : ndarray
    Majority-vote label predicted from the selected neighbors.

disagree_count : ndarray
    Number of neighbors whose label differs from the observed label.

neighbor_count_used : ndarray
    Number of neighbors actually used for each sample.

kth_distance : ndarray
    Distance to the kth neighbor used to define the neighborhood.

ENNProb

Bases: BaseEstimator

Probabilistic edited nearest-neighbor noise filter.

Parameters:

n_neighbors : int, default 3
    Number of neighbors used to build the local probability estimate.

mode : "prob" or "th", default "prob"
    "prob" flags only label disagreements. "th" also flags samples whose maximum local probability is below threshold.

threshold : float, default 0.5
    Minimum probability required to keep a sample when mode="th".

metric : str, default "minkowski"
    Distance metric used by sklearn.neighbors.NearestNeighbors.

p : int, default 2
    Minkowski power parameter, only used when metric="minkowski".

tie_eps : float, default 1e-12
    Tolerance used to include neighbors tied at the kth distance.

action : "remove" or "detect", default "remove"
    Whether to drop noisy samples or only detect them.

n_jobs : int or None, default None
    Parallelism forwarded to the nearest-neighbor search.

Notes

Local class probabilities are computed by weighting each neighbor with the inverse of its distance.
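
A minimal sketch of this inverse-distance weighting, together with the two flagging modes described in the parameters, might look as follows. The helper names `weighted_class_probs` and `is_noisy` are hypothetical, and the small `eps` added to each distance (to avoid dividing by zero for duplicate points) is an assumption of this sketch rather than documented behavior.

```python
from collections import defaultdict
from math import dist


def weighted_class_probs(x, neighbors, labels, eps=1e-12):
    """Inverse-distance-weighted class probabilities for x from its neighbors."""
    weights = defaultdict(float)
    for xj, yj in zip(neighbors, labels):
        # Closer neighbors contribute more; eps guards against zero distances.
        weights[yj] += 1.0 / (dist(x, xj) + eps)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


def is_noisy(probs, observed, mode="prob", threshold=0.5):
    """Apply the 'prob' or 'th' rule to one sample's class probabilities."""
    nn_pred = max(probs, key=probs.get)
    max_prob = probs[nn_pred]
    if nn_pred != observed:
        return True  # label disagreement is flagged in both modes
    # 'th' additionally flags low-confidence agreement.
    return mode == "th" and max_prob < threshold


# One neighbor of class "a" at distance 1, two of class "b" at distances 2 and 4.
probs = weighted_class_probs(
    (0.0, 0.0),
    [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)],
    ["a", "b", "b"],
)
```

Here class "a" receives weight 1 and class "b" receives 0.5 + 0.25, so "a" wins with probability of about 4/7 even though it is outnumbered. A sample labeled "a" passes in "prob" mode but would be flagged in "th" mode with a threshold of 0.6.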

fit(X, y)

Fit the filter and cache neighbor-weighted class probabilities.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of a probabilistic edited nearest-neighbor filtering run.

Attributes:

keep_mask : ndarray of bool
    Mask indicating which samples are kept after filtering.

noisy_fraction : float
    Fraction of samples flagged as noisy.

nn_pred : ndarray
    Label with the highest neighbor-weighted probability.

max_prob : ndarray
    Probability assigned to the predicted label for each sample.

class_probabilities : ndarray
    Weighted class-probability distribution computed from the neighborhood.

neighbor_count_used : ndarray
    Number of neighbors actually used for each sample.

kth_distance : ndarray
    Distance to the kth neighbor used to define the neighborhood.

noise_score : ndarray
    Product of the two noise factors.

MultiEdit

Bases: BaseEstimator

Multi-Edit noise filter.

Parameters:

n_neighbors : int, default 3
    Number of neighbors used by the block-wise KNN classifier.

n_blocks : int, default 10
    Number of stratified blocks built at each iteration.

metric : str, default "minkowski"
    Distance metric used by the internal KNN classifiers.

p : int, default 2
    Minkowski power parameter, only used when metric="minkowski".

action : "remove" or "detect", default "remove"
    Whether to drop noisy samples or only detect them.

random_state : int, default 33
    Seed used to shuffle the stratified blocks.

n_jobs : int or None, default None
    Parallelism forwarded to the internal KNN classifiers.

max_iter : int or None, default None
    Optional maximum number of refinement iterations.

Notes

Each iteration predicts one block from another block of the same stratified partition.
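
The block-wise loop can be sketched roughly as below. Several details are assumptions made for illustration and are not confirmed by this page: blocks are paired cyclically (block b is classified using block b + 1), stratification is done by shuffling each class and dealing samples round-robin, and the loop stops at the first pass that removes nothing. The helper names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from math import dist


def knn_predict(x, train, k):
    """Majority label among the k samples of `train` nearest to x."""
    nearest = sorted(train, key=lambda s: dist(x, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_blocks(indices, y, n_blocks, rng):
    """Shuffle indices within each class and deal them round-robin into blocks."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[y[i]].append(i)
    blocks = [[] for _ in range(n_blocks)]
    pos = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for i in members:
            blocks[pos % n_blocks].append(i)
            pos += 1
    return [b for b in blocks if b]


def multi_edit(X, y, n_neighbors=3, n_blocks=10, random_state=33, max_iter=None):
    """Return (keep_mask, n_iters) for a simplified Multi-Edit pass."""
    rng = random.Random(random_state)
    kept = list(range(len(X)))
    n_iters = 0
    while max_iter is None or n_iters < max_iter:
        blocks = stratified_blocks(kept, y, n_blocks, rng)
        if len(blocks) < 2:
            break  # nothing to compare against
        n_iters += 1
        survivors = []
        for b, block in enumerate(blocks):
            # Assumption: block b is classified using the next block, cyclically.
            ref = blocks[(b + 1) % len(blocks)]
            train = [(X[j], y[j]) for j in ref]
            survivors += [i for i in block
                          if knn_predict(X[i], train, n_neighbors) == y[i]]
        if len(survivors) == len(kept):
            break  # a full pass removed nothing
        kept = survivors
    kept_set = set(kept)
    return [i in kept_set for i in range(len(X))], n_iters


X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (0.1, 0.3), (0.3, 0.1),
     (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2), (5.1, 5.3), (5.3, 5.1),
     (0.1, 0.1)]
y = [0] * 6 + [1] * 6 + [1]  # the last sample is mislabeled

keep_mask, n_iters = multi_edit(X, y, n_neighbors=1, n_blocks=3)
```

With two well-separated clusters, the mislabeled sample is removed in the first pass. A clean sample can also be removed when its nearest sample in the reference block happens to be the mislabeled one, which is a known side effect of classifying from a single small block.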

fit(X, y)

Fit the filter and iteratively remove misclassified instances.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of a Multi-Edit filtering run.

Attributes:

keep_mask : ndarray of bool
    Mask indicating which samples are kept after filtering.

noisy_fraction : float
    Fraction of samples flagged as noisy.

removed_total : int
    Total number of removed samples.

n_iters : int
    Number of cleaning iterations performed.

n_blocks : int
    Number of stratified blocks used per iteration.

nn_pred : ndarray
    Block-wise nearest-neighbor prediction used for detection.

NCNEdit

Bases: BaseEstimator

Nearest-centroid-neighbor noise filter.

Parameters:

n_neighbors : int, default 3
    Number of neighbors used to build the centroid-based neighborhood.

metric : str, default "minkowski"
    Distance metric used by sklearn.neighbors.NearestNeighbors.

p : int, default 2
    Minkowski power parameter, only used when metric="minkowski".

action : "remove" or "detect", default "remove"
    Whether to drop noisy samples or only detect them.

n_jobs : int or None, default None
    Parallelism forwarded to the nearest-neighbor search.

candidate_strategy : "full" or "expansive", default "full"
    Strategy used to grow the candidate set while selecting the centroid-nearest neighbors.

Notes

The neighborhood is built recursively: the first neighbor is the sample's nearest neighbor, and each later step adds the candidate for which the centroid of the neighbors selected so far, together with that candidate, lies closest to the sample.
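
The greedy selection can be sketched in a few lines. This is an illustrative version under simple assumptions (Euclidean distance, every other sample is a candidate, ties broken by index order); `centroid` and `ncn_neighbors` are hypothetical helper names, not part of the documented API.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def ncn_neighbors(i, X, n_neighbors=3):
    """Greedily pick neighbors of X[i] whose running centroid stays closest to X[i]."""
    candidates = [j for j in range(len(X)) if j != i]
    chosen = []
    while candidates and len(chosen) < n_neighbors:
        # Score each candidate by the distance from X[i] to the centroid of
        # the already chosen neighbors plus that candidate.
        best = min(
            candidates,
            key=lambda j: dist(X[i], centroid([X[c] for c in chosen] + [X[j]])),
        )
        chosen.append(best)
        candidates.remove(best)
    return chosen


X = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (-1.5, 0.0), (0.0, 3.0)]

two_ncn = ncn_neighbors(0, X, n_neighbors=2)
three_ncn = ncn_neighbors(0, X, n_neighbors=3)
```

For sample 0, the two plain nearest neighbors are indices 1 and 2, both on the same side. The centroid criterion instead picks index 1 and then index 3, because a point on the opposite side pulls the centroid back toward the sample. This is what lets the neighborhood surround a sample rather than cluster on one side of it.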

fit(X, y)

Fit the filter and cache the NCN-based predictions.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of an NCNEdit filtering run.

Attributes:

keep_mask : ndarray of bool
    Mask indicating which samples are kept after filtering.

noisy_fraction : float
    Fraction of samples flagged as noisy.

ncn_pred : ndarray
    Prediction obtained from the centroid-based neighborhood vote.

disagree_count : ndarray
    Number of neighbors whose label differs from the observed label.

neighbor_count_used : ndarray
    Number of neighbors actually used for each sample.