Classifier-based filters

These filters compare the observed labels against predictions obtained from one or more base classifiers.

Overview

  • ClassificationFilter uses a single classifier and out-of-fold predictions.
  • CVCFFilter aggregates fold-wise committee votes.
  • EnsembleFiltering flags samples on which enough of several estimators disagree with the observed label.
  • INFFCFilter iteratively fuses a heterogeneous committee.
  • IterativePartitioningFilter repeatedly partitions the data and checks agreement.

ClassificationFilter

Bases: BaseEstimator

Cross-validated single-classifier noise filter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimator | estimator | Base learner cloned and trained on each fold. | required |
| cv | int | Number of stratified folds used to generate out-of-fold predictions. | 10 |
| action | (remove, detect) | Whether noisy samples are dropped or only detected. | "remove" |
| random_state | int | Seed used by the stratified splitter. | 33 |

Notes

A sample is flagged as noisy when its out-of-fold prediction differs from the observed label.
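The flagging rule in this note can be sketched with plain scikit-learn. This is only an illustration of the idea, not the filter's actual code: the toy dataset, the decision tree used as base learner, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Toy binary dataset (assumed for illustration) with a few labels flipped on purpose.
X, y_clean = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    class_sep=2.0, random_state=0,
)
y = y_clean.copy()
flipped = np.arange(0, 200, 20)      # indices of deliberately corrupted labels
y[flipped] = 1 - y[flipped]

# Stratified folds, mirroring the cv and random_state parameters.
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)

# Each sample is predicted by a model that never saw it during training.
oof_pred = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=splitter)

# A sample is flagged when its out-of-fold prediction differs from its label.
noisy_mask = oof_pred != y

# Roughly what action="remove" would keep; action="detect" would only report the mask.
X_kept, y_kept = X[~noisy_mask], y[~noisy_mask]
```

The mask is computed against the observed (possibly corrupted) labels, so clean but hard-to-classify samples can be flagged as well.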

fit(X, y)

Fit the filter and cache out-of-fold predictions.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of a single-classifier noise filtering run.

CVCFFilter

Bases: BaseEstimator

Cross-validated committees noise filter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimator | estimator | Base learner cloned for each fold of the committee. | c45_like |
| cv | int | Number of stratified folds used to build the committee. | 10 |
| vote_rule | (threshold, consensus) | Rule used to flag samples as noisy from the fold disagreements. | "threshold" |
| threshold | float | Minimum fraction of disagreeing folds required when vote_rule="threshold". | 0.5 |
| action | (remove, detect) | Whether noisy samples are dropped or only detected. | "remove" |
| random_state | int | Seed used by the stratified splitter. | 33 |

Notes

A "relabel" action is not implemented yet; only "remove" and "detect" are available.
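The vote_rule and threshold parameters can be read as a rule over a per-sample list of committee disagreements. The sketch below is one plausible reading, not the library's code: the helper name flag_noisy, the input layout, and the inclusive comparison against threshold are assumptions.

```python
def flag_noisy(disagreements, vote_rule="threshold", threshold=0.5):
    """Flag samples from committee disagreements (hypothetical helper).

    disagreements[i][k] is True when committee member k predicts a label
    different from the observed label of sample i.
    """
    flags = []
    for row in disagreements:
        if vote_rule == "consensus":
            # Every committee member must disagree with the observed label.
            flags.append(all(row))
        elif vote_rule == "threshold":
            # At least `threshold` of the members must disagree (assumed inclusive).
            flags.append(sum(row) / len(row) >= threshold)
        else:
            raise ValueError(f"unknown vote_rule: {vote_rule!r}")
    return flags


# Three samples judged by a committee of four members.
votes = [
    [True, True, True, True],      # all members disagree
    [True, True, False, False],    # half of the members disagree
    [False, False, False, True],   # a single member disagrees
]
print(flag_noisy(votes))                         # [True, True, False]
print(flag_noisy(votes, vote_rule="consensus"))  # [True, False, False]
```

Consensus is the stricter rule: it never flags a sample that the threshold rule would keep.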

fit(X, y)

Fit the filter and cache fold-wise predictions and agreement scores.

fit_resample(X, y)

Fit the filter and return the filtered data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of a cross-validated committees filtering run.

EnsembleFiltering

Bases: BaseEstimator

Ensemble-based noise filter using multiple classifiers.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimators | sequence of estimators | Base learners combined in the ensemble committee. | required |
| cv | int | Number of stratified folds used to compute out-of-fold predictions. | 10 |
| mode | str | Decision rule used to flag samples as noisy. The current implementation accepts "threshold" and "consensus"; the signature default is kept for compatibility. | "S" |
| threshold | float | Minimum fraction of disagreeing estimators required when mode="threshold". | 0.5 |
| action | (remove, detect) | Whether noisy samples are dropped or only detected. | "remove" |
| random_state | int | Seed used by the stratified splitter. | 33 |
| return_noisy_samples | bool | Stored on the instance for compatibility; the current implementation does not branch on it. | False |

Notes

A sample is flagged as noisy when enough estimators disagree with its observed label.
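As a rough illustration of this note, the disagreement counts can be built from one column of out-of-fold predictions per estimator. The three scikit-learn classifiers, the toy data, and the inclusive threshold comparison are assumptions for the example, not the filter's actual internals.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data and committee, chosen only for illustration.
X, y = make_classification(n_samples=150, n_features=6, n_informative=4, random_state=1)
estimators = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=3),
    GaussianNB(),
]
splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)

# One column of out-of-fold predictions per estimator, all on the same folds.
oof = np.column_stack(
    [cross_val_predict(est, X, y, cv=splitter) for est in estimators]
)

# Number and fraction of estimators that disagree with each observed label.
disagree_count = (oof != y[:, None]).sum(axis=1)
disagree_frac = disagree_count / len(estimators)

noisy_threshold = disagree_frac >= 0.5                 # reading of mode="threshold"
noisy_consensus = disagree_count == len(estimators)    # reading of mode="consensus"
```

With three estimators and a threshold of 0.5, the threshold rule amounts to requiring at least two disagreeing estimators.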

fit(X, y)

Fit the filter and cache ensemble disagreement counts.

fit_resample(X, y)

Fit the filter and return the filtered data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of an ensemble-based noise filtering run.

INFFCFilter

Bases: BaseEstimator

Iterative fusion-of-classifiers noise filter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimators | sequence of estimators or None | Base learners used by the committee. If None, a default trio of C4.5-like tree, 1-NN and LDA is used. | None |
| cv | int | Number of stratified folds used inside each iteration. | 10 |
| decision_rule | (majority, consensus, threshold) | Rule used to flag a sample as noisy from the committee disagreements. | "majority" |
| threshold | float | Minimum disagreement fraction required when decision_rule="threshold". | 0.5 |
| action | (remove, detect) | Whether noisy samples are dropped or only detected. | "remove" |
| max_iter | int | Maximum number of cleaning iterations. | 20 |
| max_removed_frac | float | Stop once this fraction of the original training set has been removed. | 0.5 |
| random_state | int | Seed used by the stratified splitter in each iteration. | 33 |

Notes

A "relabel" action is not implemented yet; only "remove" and "detect" are available.
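The interplay of max_iter and max_removed_frac can be sketched as a simple loop over the indices still in play. This is a minimal sketch under stated assumptions: the function iterative_clean and its detect callback are hypothetical, and stopping early on a pass that flags nothing is an assumption rather than documented behavior.

```python
def iterative_clean(n_samples, detect, max_iter=20, max_removed_frac=0.5):
    """Iteratively drop flagged samples (hypothetical helper).

    detect(active) stands in for one committee pass and returns the subset
    of the active indices it considers noisy.
    """
    active = list(range(n_samples))
    removed = []
    for _ in range(max_iter):
        flagged = set(detect(active))
        if not flagged:
            break                      # assumed: nothing left to clean
        removed.extend(i for i in active if i in flagged)
        active = [i for i in active if i not in flagged]
        # Budget is measured against the original dataset size.
        if len(removed) / n_samples >= max_removed_frac:
            break
    return active, removed


def flag_last(active):
    """Toy detector: flags only the last remaining index on each pass."""
    return active[-1:]


kept, removed = iterative_clean(10, flag_last)
print(kept)     # [0, 1, 2, 3, 4]
print(removed)  # [9, 8, 7, 6, 5]
```

In this sketch the budget is checked after each pass, so the final pass may push the removed fraction slightly above max_removed_frac.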

fit(X, y)

Fit the filter and iteratively remove noisy samples.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of an INFFC filtering run.

Per-iteration diagnostics for INFFC.

IterativePartitioningFilter

Bases: BaseEstimator

Iterative partitioning noise filter.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimator | estimator | Base learner fitted on each partition. | c45_like |
| n_partitions | int | Number of stratified partitions built at each iteration. | 10 |
| vote_rule | (majority, consensus) | Rule used to flag a sample as noisy from the partition disagreements. | "majority" |
| action | (remove, detect) | Whether noisy samples are dropped or only detected. | "remove" |
| p_stop | float | Patience threshold expressed as a fraction of the original dataset. | 0.01 |
| k_patience | int | Number of consecutive low-yield iterations tolerated before stopping. | 3 |
| max_iter | int | Maximum number of cleaning iterations. | 20 |
| random_state | int | Seed used by the stratified splitter in each iteration. | 33 |

Notes

A "relabel" action is not implemented yet; only "remove" and "detect" are available.
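Taken together, p_stop and k_patience describe a patience-based stopping rule. One plausible reading, sketched below, is that the loop ends once k_patience consecutive iterations each remove fewer than p_stop times the original dataset size. The helper name should_stop and the strict comparison are assumptions, not confirmed behavior.

```python
def should_stop(removed_per_iter, n_original, p_stop=0.01, k_patience=3):
    """Patience check over the removal history (hypothetical helper).

    removed_per_iter lists how many samples each finished iteration removed.
    """
    if len(removed_per_iter) < k_patience:
        return False                   # not enough history yet
    limit = p_stop * n_original        # low-yield cutoff in number of samples
    recent = removed_per_iter[-k_patience:]
    return all(count < limit for count in recent)


# With 1000 original samples and p_stop=0.01, the cutoff is 10 samples.
print(should_stop([40, 12, 9, 5], n_original=1000))     # False: 12 is not low-yield
print(should_stop([40, 12, 9, 5, 3], n_original=1000))  # True: 9, 5 and 3 are all below 10
```

A check like this would run alongside max_iter, which caps the number of iterations regardless of yield.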

fit(X, y)

Fit the filter and iteratively partition the training data.

fit_resample(X, y)

Fit the filter and return the filtered or detected data.

get_filter_report()

Return a dictionary with the main fit diagnostics.

get_detection_report()

Return the stored detection report.

Summary of an iterative partitioning filtering run.

Per-iteration diagnostics for iterative partitioning filtering.