Distance-based filters
These filters rely on neighborhood geometry and local consistency.
Overview
ENNFilter and ENNProbFilter use nearest-neighbor voting. MultiEditFilter iteratively cleans the data in stratified blocks. NCNEdit selects neighbors by minimizing the centroid distance.
AllKNN and TomekLinks are re-exported from imbalanced-learn when that dependency is available.
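
A conditional re-export of this kind is commonly written as a guarded import. The snippet below is only a sketch of that pattern: `imblearn.under_sampling` is the module where imbalanced-learn exposes both classes, while the `HAS_IMBLEARN` flag and the `None` placeholders are hypothetical names chosen for illustration, not something this package is known to define.

```python
# Sketch of a guarded re-export; the package itself may handle this differently.
try:
    from imblearn.under_sampling import AllKNN, TomekLinks
    HAS_IMBLEARN = True  # hypothetical flag, for illustration only
except ImportError:
    # imbalanced-learn is not installed: expose placeholders instead of
    # failing when the module is imported.
    AllKNN = None
    TomekLinks = None
    HAS_IMBLEARN = False
```

Callers can then check the flag (or test for `None`) before using either class.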
ENN
Bases: BaseEstimator
Edited nearest-neighbor noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_neighbors` | `int` | Number of neighbors used to vote on each sample. | `3` |
| `mode` | `(enn, menn)` | | `"enn"` |
| `metric` | `str` | Distance metric used by the nearest-neighbor search. | `"minkowski"` |
| `p` | `int` | Minkowski power parameter, only used when `metric="minkowski"`. | `2` |
| `tie_eps` | `float` | Tolerance used to include neighbors tied at the kth distance. | `1e-12` |
| `action` | `(remove, detect)` | Whether to drop noisy samples or only detect them. | `"remove"` |
| `n_jobs` | `int or None` | Parallelism forwarded to the nearest-neighbor search. | `None` |
| `candidate_strategy` | `(full, expansive)` | Strategy used to grow the candidate set in … | `"full"` |

Notes
A sample is flagged as noisy when the neighborhood vote differs from the observed label.
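
The voting rule in this note can be illustrated with a few lines of NumPy. This is a minimal sketch, not the filter's implementation: it assumes Euclidean distance, a brute-force search, no tie handling at the kth distance, and it breaks vote ties in favor of the smallest label. The function name `enn_keep_mask` is hypothetical.

```python
import numpy as np

def enn_keep_mask(X, y, n_neighbors=3):
    """True where a sample's label matches the majority vote of its neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances (brute force, fine for small data).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a sample never votes for itself
    nn_idx = np.argsort(dist, axis=1, kind="stable")[:, :n_neighbors]
    keep = np.empty(len(y), dtype=bool)
    for i, neighbors in enumerate(nn_idx):
        labels, counts = np.unique(y[neighbors], return_counts=True)
        # argmax returns the first maximum, so vote ties go to the smallest label.
        keep[i] = labels[np.argmax(counts)] == y[i]
    return keep

X = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3]]
y = [0, 0, 1, 0, 1, 1, 1, 1]  # the third sample sits among class-0 points
mask = enn_keep_mask(X, y)    # only the third sample is flagged (mask is False there)
```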
Summary of an edited nearest-neighbor filtering run.
Attributes:
| Name | Type | Description |
|---|---|---|
| `keep_mask` | `ndarray of bool` | Mask indicating which samples are kept after filtering. |
| `noisy_fraction` | `float` | Fraction of samples flagged as noisy. |
| `nn_pred` | `ndarray` | Majority-vote label predicted from the selected neighbors. |
| `disagree_count` | `ndarray` | Number of neighbors whose label differs from the observed label. |
| `neighbor_count_used` | `ndarray` | Number of neighbors actually used for each sample. |
| `kth_distance` | `ndarray` | Distance to the kth neighbor used to define the neighborhood. |

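
The interplay between `tie_eps`, `kth_distance` and `neighbor_count_used` can be pictured as follows. This is a hedged sketch of one plausible reading of the parameter description: every candidate whose distance is within `tie_eps` of the kth distance joins the neighborhood, so the number of neighbors used may exceed `n_neighbors`. The helper `neighbors_with_ties` is a hypothetical name.

```python
import numpy as np

def neighbors_with_ties(distances, k=3, tie_eps=1e-12):
    """Indices of the k nearest entries plus any entry tied at the kth distance."""
    distances = np.asarray(distances, dtype=float)
    kth_distance = np.sort(distances)[k - 1]
    # Everything within tie_eps of the kth distance is treated as a tie.
    selected = np.flatnonzero(distances <= kth_distance + tie_eps)
    return selected, kth_distance

idx, kth = neighbors_with_ties([0.5, 1.0, 1.0, 1.0, 2.0], k=3)
neighbor_count_used = len(idx)  # 4 rather than 3: two candidates tie at the kth distance
```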
ENNProb
Bases: BaseEstimator
Probabilistic edited nearest-neighbor noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_neighbors` | `int` | Number of neighbors used to build the local probability estimate. | `3` |
| `mode` | `(prob, th)` | | `"prob"` |
| `threshold` | `float` | Minimum probability required to keep a sample when `mode="th"`. | `0.5` |
| `metric` | `str` | Distance metric used by the nearest-neighbor search. | `"minkowski"` |
| `p` | `int` | Minkowski power parameter, only used when `metric="minkowski"`. | `2` |
| `tie_eps` | `float` | Tolerance used to include neighbors tied at the kth distance. | `1e-12` |
| `action` | `(remove, detect)` | Whether to drop noisy samples or only detect them. | `"remove"` |
| `n_jobs` | `int or None` | Parallelism forwarded to the nearest-neighbor search. | `None` |

Notes
Local class probabilities are computed by weighting each neighbor with the inverse of its distance.
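
As an illustration of inverse-distance weighting, the sketch below turns the labels and distances of one sample's neighbors into a class-probability vector. It is an assumption-laden example: the function name, the normalization to a sum of one, and the small `eps` added to avoid division by zero are all choices made here, not documented behavior of the filter.

```python
import numpy as np

def weighted_class_probabilities(neighbor_labels, neighbor_distances, classes, eps=1e-12):
    """Class probabilities where each neighbor votes with weight 1 / distance."""
    neighbor_labels = np.asarray(neighbor_labels)
    # eps is an assumption that keeps a zero distance from producing an infinite weight.
    weights = 1.0 / (np.asarray(neighbor_distances, dtype=float) + eps)
    scores = np.array([weights[neighbor_labels == c].sum() for c in classes])
    return scores / scores.sum()

# One close class-0 neighbor against two more distant class-1 neighbors.
probs = weighted_class_probabilities([0, 1, 1], [0.5, 1.0, 2.0], classes=[0, 1])
```

Here the weights are roughly 2, 1 and 0.5, so class 0 ends up with the larger probability even though it is outnumbered, which is where a weighted vote can differ from a plain majority vote.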
Summary of a probabilistic edited nearest-neighbor filtering run.
Attributes:
| Name | Type | Description |
|---|---|---|
| `keep_mask` | `ndarray of bool` | Mask indicating which samples are kept after filtering. |
| `noisy_fraction` | `float` | Fraction of samples flagged as noisy. |
| `nn_pred` | `ndarray` | Label with the highest neighbor-weighted probability. |
| `max_prob` | `ndarray` | Probability assigned to the predicted label for each sample. |
| `class_probabilities` | `ndarray` | Weighted class-probability distribution computed from the neighborhood. |
| `neighbor_count_used` | `ndarray` | Number of neighbors actually used for each sample. |
| `kth_distance` | `ndarray` | Distance to the kth neighbor used to define the neighborhood. |
| `noise_score` | `ndarray` | Product of the two noise factors. |

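
The relationship between `class_probabilities`, `nn_pred` and `max_prob` follows from their descriptions and can be sketched directly. The thresholded keep rule at the end is an assumption (keep a sample only if the predicted label matches and its probability reaches the threshold); the page does not spell out the exact rule, and all data below is made up for illustration.

```python
import numpy as np

classes = np.array(["a", "b"])
y = np.array(["a", "a", "b"])  # observed labels
class_probabilities = np.array([[0.90, 0.10],
                                [0.55, 0.45],
                                [0.70, 0.30]])

# Predicted label and its probability, per sample.
nn_pred = classes[class_probabilities.argmax(axis=1)]
max_prob = class_probabilities.max(axis=1)

threshold = 0.6
# Assumed rule: the vote must agree with the label AND be confident enough.
keep_mask = (nn_pred == y) & (max_prob >= threshold)
noisy_fraction = 1.0 - keep_mask.mean()
```

In this toy case the second sample is dropped for low confidence and the third for a disagreeing vote.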
MultiEdit
Bases: BaseEstimator
Multi-Edit noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_neighbors` | `int` | Number of neighbors used by the block-wise KNN classifier. | `3` |
| `n_blocks` | `int` | Number of stratified blocks built at each iteration. | `10` |
| `metric` | `str` | Distance metric used by the internal KNN classifiers. | `"minkowski"` |
| `p` | `int` | Minkowski power parameter, only used when `metric="minkowski"`. | `2` |
| `action` | `(remove, detect)` | Whether to drop noisy samples or only detect them. | `"remove"` |
| `random_state` | `int` | Seed used to shuffle the stratified blocks. | `33` |
| `n_jobs` | `int or None` | Parallelism forwarded to the internal KNN classifiers. | `None` |
| `max_iter` | `int or None` | Optional maximum number of refinement iterations. | `None` |

Notes
Each iteration predicts one block from another block of the same stratified partition.
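
A single pass of this scheme might look like the sketch below. It is not the filter's code: the pairing of block `b` with block `(b + 1) % n_blocks` follows the classic Multi-Edit formulation and is assumed here, as are the brute-force Euclidean KNN, the round-robin stratification, and the function names.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Majority vote among the k nearest training points (brute force)."""
    dist = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn_idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    preds = []
    for row in nn_idx:
        labels, counts = np.unique(y_train[row], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def multiedit_iteration(X, y, n_blocks=3, n_neighbors=1, seed=33):
    """One pass: block b is classified using block (b + 1) % n_blocks as reference."""
    rng = np.random.default_rng(seed)
    block = np.empty(len(y), dtype=int)
    # Stratified assignment: shuffle each class, then deal its samples round-robin.
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        block[members] = np.arange(len(members)) % n_blocks
    keep = np.ones(len(y), dtype=bool)
    for b in range(n_blocks):
        test = np.flatnonzero(block == b)
        train = np.flatnonzero(block == (b + 1) % n_blocks)
        if len(test) == 0 or len(train) == 0:
            continue  # nothing to classify, or nothing to classify with
        pred = knn_predict(X[train], y[train], X[test], min(n_neighbors, len(train)))
        keep[test] = pred == y[test]
    return keep

X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [1.0],
              [10.0], [10.1], [10.2], [10.3], [10.4], [10.5]])
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1])  # sample 6 looks mislabeled
keep = multiedit_iteration(X, y)
```

A full run would repeat this pass on the surviving samples until no sample is removed, or until a cap such as `max_iter` is reached.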
Summary of a Multi-Edit filtering run.
Attributes:
| Name | Type | Description |
|---|---|---|
keep_mask |
ndarray of bool
|
Mask indicating which samples are kept after filtering. |
noisy_fraction |
float
|
Fraction of samples flagged as noisy. |
removed_total |
int
|
Total number of removed samples. |
n_iters |
int
|
Number of cleaning iterations performed. |
n_blocks |
int
|
Number of stratified blocks used per iteration. |
nn_pred |
ndarray
|
Block-wise nearest-neighbor prediction used for detection. |
NCNEdit
Bases: BaseEstimator
Nearest-centroid-neighbor noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `n_neighbors` | `int` | Number of neighbors used to build the centroid-based neighborhood. | `3` |
| `metric` | `str` | Distance metric used by the nearest-neighbor search. | `"minkowski"` |
| `p` | `int` | Minkowski power parameter, only used when `metric="minkowski"`. | `2` |
| `action` | `(remove, detect)` | Whether to drop noisy samples or only detect them. | `"remove"` |
| `n_jobs` | `int or None` | Parallelism forwarded to the nearest-neighbor search. | `None` |
| `candidate_strategy` | `(full, expansive)` | Strategy used to grow the candidate set while selecting the centroid-nearest neighbors. | `"full"` |

Notes
The neighborhood is built recursively by adding the candidate that minimizes the distance between the sample and the centroid of the partial neighborhood.
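
The greedy construction described in this note can be sketched as follows. This is an illustrative version under stated assumptions: Euclidean distance, every other sample treated as a candidate (which may or may not correspond to the `"full"` strategy), and a hypothetical function name `ncn_neighbors`.

```python
import numpy as np

def ncn_neighbors(X, query_index, k=3):
    """Greedy nearest-centroid neighborhood of X[query_index]."""
    query = X[query_index]
    candidates = [i for i in range(len(X)) if i != query_index]
    chosen = []
    while candidates and len(chosen) < k:
        partial_sum = X[chosen].sum(axis=0) if chosen else np.zeros_like(query)
        # Centroid the neighborhood would have if each candidate were added.
        centroids = (partial_sum + X[candidates]) / (len(chosen) + 1)
        best = int(np.argmin(np.linalg.norm(centroids - query, axis=1)))
        chosen.append(candidates.pop(best))
    return chosen

X = np.array([[0.0, 0.0],    # query
              [1.0, 0.0],
              [1.1, 0.1],
              [-1.2, 0.0],
              [0.0, 3.0]])
neighborhood = ncn_neighbors(X, 0, k=3)
```

The first pick is the ordinary nearest neighbor (index 1). The second pick is index 3 rather than the closer index 2, because a point on the opposite side pulls the centroid back toward the query; this is the sense in which the neighborhood surrounds the sample instead of just being close to it.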
Summary of an NCNEdit filtering run.
Attributes:
| Name | Type | Description |
|---|---|---|
| `keep_mask` | `ndarray of bool` | Mask indicating which samples are kept after filtering. |
| `noisy_fraction` | `float` | Fraction of samples flagged as noisy. |
| `ncn_pred` | `ndarray` | Prediction obtained from the centroid-based neighborhood vote. |
| `disagree_count` | `ndarray` | Number of neighbors whose label differs from the observed label. |
| `neighbor_count_used` | `ndarray` | Number of neighbors actually used for each sample. |

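
To make these attributes concrete, the sketch below derives `disagree_count`, a majority-vote prediction and a keep mask from a small matrix of neighbor labels. The data is invented, every sample is assumed to use the same number of neighbors, labels are assumed to be non-negative integers, and the keep rule (prediction equals observed label) is the one stated for the ENN filter, applied here by analogy.

```python
import numpy as np

# Row i holds the labels of the neighbors selected for sample i (made-up data).
neighbor_labels = np.array([[0, 0, 1],
                            [1, 1, 0],
                            [0, 1, 1]])
y = np.array([0, 0, 1])  # observed labels

# Neighbors whose label differs from the observed one.
disagree_count = (neighbor_labels != y[:, None]).sum(axis=1)
# Majority vote per row; bincount requires non-negative integer labels.
ncn_pred = np.array([np.bincount(row).argmax() for row in neighbor_labels])
keep_mask = ncn_pred == y
neighbor_count_used = np.full(len(y), neighbor_labels.shape[1])
```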