Classifier-based filters
These filters compare the observed labels against predictions obtained from one or more base classifiers.
Overview
- `ClassificationFilter` uses a single classifier and out-of-fold predictions.
- `CVCFFilter` aggregates fold-wise committee votes.
- `EnsembleFiltering` compares several estimators.
- `INFFCFilter` iteratively fuses a heterogeneous committee.
- `IterativePartitioningFilter` repeatedly partitions the data and checks agreement.
ClassificationFilter
Bases: BaseEstimator
Cross-validated single-classifier noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimator` | estimator | Base learner cloned and trained on each fold. | required |
| `cv` | `int` | Number of stratified folds used to generate out-of-fold predictions. | `10` |
| `action` | (remove, detect) | Whether noisy samples are dropped or only detected. | `"remove"` |
| `random_state` | `int` | Seed used by the stratified splitter. | `33` |
Notes
A sample is flagged as noisy when its out-of-fold prediction differs from the observed label.
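The flagging rule in the note can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the class's actual implementation: the helper name `flag_noisy_oof` is hypothetical, and shuffling inside the splitter is assumed so that `random_state` has an effect.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def flag_noisy_oof(estimator, X, y, cv=10, random_state=33):
    """Hypothetical helper: True where the out-of-fold prediction
    differs from the observed label."""
    # Assumption: the splitter shuffles, so random_state matters.
    splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    # cross_val_predict clones the estimator for every fold, so each sample
    # is predicted by a model that never saw it during training.
    oof_pred = cross_val_predict(estimator, X, y, cv=splitter)
    return oof_pred != np.asarray(y)


# Two well-separated 1-D clusters with two deliberately flipped labels.
X = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
y[[3, 60]] = 1 - y[[3, 60]]

noisy_mask = flag_noisy_oof(LogisticRegression(), X, y, cv=5)

# Roughly what action="remove" would keep; action="detect" would only report the mask.
X_clean, y_clean = X[~noisy_mask], y[~noisy_mask]
```

On this toy data the mask marks only the two flipped samples, because every held-out prediction follows the cluster a point sits in.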
CVCFFilter
Bases: BaseEstimator
Cross-validated committees noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimator` | estimator | Base learner cloned for each fold of the committee. | `c45_like` |
| `cv` | `int` | Number of stratified folds used to build the committee. | `10` |
| `vote_rule` | (threshold, consensus) | Rule used to flag samples as noisy from the fold disagreements. | `"threshold"` |
| `threshold` | `float` | Minimum fraction of disagreeing folds required when `vote_rule="threshold"`. | `0.5` |
| `action` | (remove, detect) | Whether noisy samples are dropped or only detected. | `"remove"` |
| `random_state` | `int` | Seed used by the stratified splitter. | `33` |
Notes
Relabeling is not implemented yet; `action` accepts only `"remove"` and `"detect"`.
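The two `vote_rule` options in the parameter table can be sketched as follows. The helper names are hypothetical, and several details are assumptions rather than documented behavior: each committee member is trained on the other folds and votes on every sample, the threshold comparison is inclusive, and a depth-1 decision tree stands in for the `c45_like` default, whose definition is not shown in this section.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def committee_votes(estimator, X, y, cv=10, random_state=33):
    """Hypothetical helper: boolean matrix (n_folds, n_samples), True where
    a fold's model disagrees with the observed label."""
    y = np.asarray(y)
    splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    votes = []
    for train_idx, _ in splitter.split(X, y):
        # One committee member per fold, trained on the remaining folds.
        member = clone(estimator).fit(X[train_idx], y[train_idx])
        # Assumption: every member votes on every sample.
        votes.append(member.predict(X) != y)
    return np.array(votes)


def apply_vote_rule(votes, vote_rule="threshold", threshold=0.5):
    """Turn fold disagreements into a noisy-sample mask."""
    if vote_rule == "consensus":
        return votes.all(axis=0)  # every member must disagree
    if vote_rule == "threshold":
        # Assumption: "minimum fraction" is read as an inclusive bound.
        return votes.mean(axis=0) >= threshold
    raise ValueError(f"unknown vote_rule: {vote_rule!r}")


# Two well-separated 1-D clusters with two deliberately flipped labels.
X = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
y[[3, 60]] = 1 - y[[3, 60]]

stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # stand-in base learner
votes = committee_votes(stump, X, y, cv=5)
noisy_threshold = apply_vote_rule(votes, "threshold", threshold=0.5)
noisy_consensus = apply_vote_rule(votes, "consensus")
```

Consensus is the stricter rule: it can never flag a sample that the threshold rule (with `threshold` at most 1) leaves alone.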
EnsembleFiltering
Bases: BaseEstimator
Ensemble-based noise filter using multiple classifiers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimators` | sequence of estimators | Base learners combined in the ensemble committee. | required |
| `cv` | `int` | Number of stratified folds used to compute out-of-fold predictions. | `10` |
| `mode` | `str` | Decision rule used to flag samples as noisy. The current implementation accepts … | `"S"` |
| `threshold` | `float` | Minimum fraction of disagreeing estimators required when … | `0.5` |
| `action` | (remove, detect) | Whether noisy samples are dropped or only detected. | `"remove"` |
| `random_state` | `int` | Seed used by the stratified splitter. | `33` |
| `return_noisy_samples` | `bool` | Stored on the instance for compatibility; the current implementation does not branch on it. | `False` |
Notes
A sample is flagged as noisy when enough estimators disagree with its observed label.
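As a rough illustration of "enough estimators disagree", the sketch below computes, for each sample, the fraction of estimators whose out-of-fold prediction differs from the observed label and compares it with a threshold. The helper name is hypothetical, the inclusive comparison is an assumption, and the sketch does not try to reproduce the semantics of `mode`, which this section does not spell out.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def ensemble_disagreement(estimators, X, y, cv=10, random_state=33):
    """Hypothetical helper: per-sample fraction of estimators whose
    out-of-fold prediction differs from the observed label."""
    y = np.asarray(y)
    # A fixed seed makes every estimator see the same folds.
    splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    votes = [cross_val_predict(est, X, y, cv=splitter) != y for est in estimators]
    return np.mean(votes, axis=0)


# Two well-separated 1-D clusters with two deliberately flipped labels.
X = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
y[[3, 60]] = 1 - y[[3, 60]]

committee = [
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=1, random_state=0),
]
disagreement = ensemble_disagreement(committee, X, y, cv=5)

# Assumption: a sample is noisy when the fraction reaches the threshold.
noisy_mask = disagreement >= 0.5
```

Raising the threshold toward 1.0 makes the filter more conservative, since more committee members must reject a label before the sample is flagged.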
INFFCFilter
Bases: BaseEstimator
Iterative fusion-of-classifiers noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimators` | sequence of estimators or None | Base learners used by the committee. If … | `None` |
| `cv` | `int` | Number of stratified folds used inside each iteration. | `10` |
| `decision_rule` | (majority, consensus, threshold) | Rule used to flag a sample as noisy from the committee disagreements. | `"majority"` |
| `threshold` | `float` | Minimum disagreement fraction required when `decision_rule="threshold"`. | `0.5` |
| `action` | (remove, detect) | Whether noisy samples are dropped or only detected. | `"remove"` |
| `max_iter` | `int` | Maximum number of cleaning iterations. | `20` |
| `max_removed_frac` | `float` | Stop once this fraction of the original training set has been removed. | `0.5` |
| `random_state` | `int` | Seed used by the stratified splitter in each iteration. | `33` |
Notes
Relabeling is not implemented yet; `action` accepts only `"remove"` and `"detect"`.
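To show how `max_iter` and `max_removed_frac` could bound an iterative committee filter, here is a deliberately simplified loop. It does not attempt to reproduce the full INFFC procedure or this class's implementation. The function names are hypothetical, and the exact reading of each `decision_rule` (strict majority, unanimous consensus, inclusive threshold) is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def flag(votes, decision_rule="majority", threshold=0.5):
    """Map a (n_estimators, n_samples) disagreement matrix to a noisy mask."""
    frac = votes.mean(axis=0)
    if decision_rule == "majority":
        return frac > 0.5          # assumption: strict majority
    if decision_rule == "consensus":
        return votes.all(axis=0)   # all estimators disagree
    if decision_rule == "threshold":
        return frac >= threshold   # assumption: inclusive bound
    raise ValueError(f"unknown decision_rule: {decision_rule!r}")


def iterative_committee_filter(estimators, X, y, cv=10, decision_rule="majority",
                               threshold=0.5, max_iter=20, max_removed_frac=0.5,
                               random_state=33):
    """Hypothetical simplified loop: re-run the committee on the surviving
    samples until nothing is flagged or a stopping bound is hit."""
    y = np.asarray(y)
    keep = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        idx = np.flatnonzero(keep)
        splitter = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
        votes = np.array([
            cross_val_predict(est, X[idx], y[idx], cv=splitter) != y[idx]
            for est in estimators
        ])
        noisy = flag(votes, decision_rule, threshold)
        if not noisy.any():
            break  # the committee accepts every remaining label
        keep[idx[noisy]] = False
        if 1.0 - keep.mean() >= max_removed_frac:
            break  # removal budget, relative to the original set, is used up
    return ~keep  # True for samples flagged in any iteration


# Two well-separated 1-D clusters with two deliberately flipped labels.
X = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
y[[3, 60]] = 1 - y[[3, 60]]

committee = [
    LogisticRegression(),
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=1, random_state=0),
]
removed = iterative_committee_filter(committee, X, y, cv=5)
```

On this toy data the loop stops in its second iteration, once the cleaned set no longer produces any disagreement.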
IterativePartitioningFilter
Bases: BaseEstimator
Iterative partitioning noise filter.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `estimator` | estimator | Base learner fitted on each partition. | `c45_like` |
| `n_partitions` | `int` | Number of stratified partitions built at each iteration. | `10` |
| `vote_rule` | (majority, consensus) | Rule used to flag a sample as noisy from the partition disagreements. | `"majority"` |
| `action` | (remove, detect) | Whether noisy samples are dropped or only detected. | `"remove"` |
| `p_stop` | `float` | Patience threshold expressed as a fraction of the original dataset. | `0.01` |
| `k_patience` | `int` | Number of consecutive low-yield iterations tolerated before stopping. | `3` |
| `max_iter` | `int` | Maximum number of cleaning iterations. | `20` |
| `random_state` | `int` | Seed used by the stratified splitter in each iteration. | `33` |
Notes
Relabeling is not implemented yet; `action` accepts only `"remove"` and `"detect"`.
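The interaction between `p_stop`, `k_patience`, and `max_iter` is easier to see in code. The sketch below is a simplified reading of the parameter table, not the class's implementation. The function name is hypothetical. It assumes that one model is fitted per partition and votes on every remaining sample, that an iteration is "low-yield" when it removes fewer than `p_stop` times the original sample count, and that a depth-1 tree can stand in for the `c45_like` default.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier


def iterative_partitioning_filter(estimator, X, y, n_partitions=10,
                                  vote_rule="majority", p_stop=0.01,
                                  k_patience=3, max_iter=20, random_state=33):
    """Hypothetical sketch of a partition-and-vote cleaning loop."""
    y = np.asarray(y)
    n_original = len(y)
    keep = np.ones(n_original, dtype=bool)
    low_yield_streak = 0
    for _ in range(max_iter):
        idx = np.flatnonzero(keep)
        X_cur, y_cur = X[idx], y[idx]
        splitter = StratifiedKFold(n_splits=n_partitions, shuffle=True,
                                   random_state=random_state)
        votes = []
        # The test side of each split serves as one stratified partition.
        for _, part in splitter.split(X_cur, y_cur):
            model = clone(estimator).fit(X_cur[part], y_cur[part])
            votes.append(model.predict(X_cur) != y_cur)
        votes = np.array(votes)
        if vote_rule == "consensus":
            noisy = votes.all(axis=0)
        else:  # "majority", read here as a strict majority (assumption)
            noisy = votes.mean(axis=0) > 0.5
        keep[idx[noisy]] = False
        # Assumption: low-yield means fewer than p_stop * n_original removals.
        if noisy.sum() < p_stop * n_original:
            low_yield_streak += 1
        else:
            low_yield_streak = 0
        if low_yield_streak >= k_patience:
            break
    return ~keep  # True for samples flagged in any iteration


# Two well-separated 1-D clusters with two deliberately flipped labels.
X = np.concatenate([np.linspace(-6, -4, 50), np.linspace(4, 6, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)
y[[3, 60]] = 1 - y[[3, 60]]

stump = DecisionTreeClassifier(max_depth=1, random_state=0)  # stand-in base learner
removed = iterative_partitioning_filter(stump, X, y, n_partitions=5)
```

With the defaults, the loop tolerates three consecutive iterations that each remove less than 1% of the original data before it stops, so a single quiet iteration does not end the cleaning early.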