Feature Selection#

Variable Selection

Feature selection is the process of choosing which variables to use for modeling, or which to remove, when analyzing high-dimensional data.

Further Reading

Hands-on with Feature Selection Techniques: Filter Methods

Removing Variables with a Variance Threshold#

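This is the most basic approach to feature selection: remove every variable whose variance is below a threshold. A low variance means that most of the values in the variable are the same.

A typical example is a variable whose values are mostly 0.

A variance threshold does not take into account the relationship between the independent variables (x) and the dependent variable (y), so keep this in mind before removing variables on this basis alone.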

# Suppress warning messages
import warnings 
warnings.filterwarnings(action='ignore')

%matplotlib inline
import matplotlib.pyplot as plt
import IPython

import sys

rseed = 22
import random
random.seed(rseed)

import numpy as np
np.random.seed(rseed)
np.set_printoptions(precision=3)
np.set_printoptions(formatter={'float_kind': "{:.3f}".format})

import pandas as pd
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_columns', None) 
pd.set_option('display.max_colwidth', None)
pd.options.display.float_format = '{:,.5f}'.format

import sklearn

print(f"python ver={sys.version}")
print(f"pandas ver={pd.__version__}")
print(f"numpy ver={np.__version__}")
print(f"sklearn ver={sklearn.__version__}")

python ver=3.8.9 (default, Jun 12 2021, 23:47:44) 
[Clang 12.0.5 (clang-1205.0.22.9)]
pandas ver=1.2.4
numpy ver=1.19.5
sklearn ver=0.24.2

from sklearn import feature_selection

xs = np.array(
    [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]
)
print(xs)

selector = feature_selection.VarianceThreshold()
fit = selector.fit(xs)
print(fit.variances_)

xs_selected = selector.fit_transform(xs)
print(xs_selected.shape)
print(xs_selected)

[[0 2 0 3]
 [0 1 4 3]
 [0 1 1 3]]
[0.000 0.222 2.889 0.000]
(3, 2)
[[2 0]
 [1 4]
 [1 1]]
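
The example above uses the default threshold of 0, which only removes constant columns. As a minimal sketch of a non-zero threshold, the snippet below assumes binary (Bernoulli) features, whose variance is p(1 - p), and removes any column in which one value covers more than 80% of the samples. The toy array `xs_bin` and the cut-off `p = 0.8` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: three binary features with different shares of 1s.
xs_bin = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# For a binary feature, Var = p * (1 - p); p = 0.8 gives a threshold of 0.16.
p = 0.8
selector = VarianceThreshold(threshold=p * (1 - p))
xs_bin_selected = selector.fit_transform(xs_bin)

# Indices of the columns whose variance is above the threshold.
kept_columns = selector.get_support(indices=True)

print(selector.variances_)
print(kept_columns)
print(xs_bin_selected.shape)
```

The first column is 0 in five of six rows, so its variance (about 0.139) falls below 0.16 and it is dropped. Note that variance depends on the scale of each variable, so a single threshold is only meaningful when the variables are on comparable scales.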

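Removing Variables Based on the Relationship Between Independent and Dependent Variables#

For each of the high-dimensional independent variables, we compute its relationship with the dependent variable and give priority to the independent variables that are likely to affect model performance.

Selecting the Top-K variables with the strongest relationship to the dependent variable brings the following benefits.

Reduced overfitting
Improved model performance
Shorter model training time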

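Scikit-learn provides selection methods based on univariate statistics, together with different scoring functions depending on whether the dependent variable calls for regression or classification.

Selection methods

SelectKBest: removes all but the K highest-scoring variables
SelectPercentile: removes all but the user-specified percentage of highest-scoring variables
SelectFpr: removes variables based on a false positive rate test
SelectFdr: removes variables based on the false discovery rate
SelectFwe: removes variables based on the family-wise error rate

Scoring functions by type of dependent variable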

Regression: f_regression, mutual_info_regression
Classification: chi2, f_classif, mutual_info_classif

Be careful: using a classification scoring function on a regression problem can produce misleading results.
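
As a rough sketch of how a selection method and a scoring function are combined, the snippet below pairs `SelectKBest` and `SelectPercentile` with `f_regression` on a continuous target. The synthetic data (six columns, of which only the first two drive `y`) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

# Hypothetical synthetic regression data: only columns 0 and 1 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Keep the 2 highest-scoring columns.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Keep the highest-scoring 30% of the columns.
percentile = SelectPercentile(score_func=f_regression, percentile=30).fit(X, y)

kbest_columns = kbest.get_support(indices=True)
percentile_columns = percentile.get_support(indices=True)

print(kbest.scores_)
print(kbest_columns, percentile_columns)
```

Both selectors share the same scores here and differ only in how the cut-off is expressed: a count for `SelectKBest`, a percentage for `SelectPercentile`. For a categorical target, the scoring function would be swapped for `f_classif`, `chi2`, or `mutual_info_classif`.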

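One more point to consider is whether to remove independent variables that are only weakly related to the dependent variable. Deep learning methods, which build complex models from many features, can benefit even from weakly related variables. In that setting, rather than keeping only the Top-K variables, consider removing just the variables that show no relationship at all, and treat that as the dimensionality reduction step.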
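
One possible way to remove only the clearly unrelated variables is to select on p-values instead of a fixed K. The sketch below compares `SelectKBest(k=1)` with `SelectFwe`, which keeps every variable whose p-value passes a family-wise error bound. The synthetic data, with one strong and one weak predictor, and `alpha=0.05` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFwe, SelectKBest, f_regression

# Hypothetical data: column 0 is a strong predictor, column 1 a weak one,
# and the remaining columns are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 6))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2000)

# Top-K selection with K=1 keeps only the strongest variable.
top1 = SelectKBest(score_func=f_regression, k=1).fit(X, y)

# SelectFwe keeps every variable with p-value < alpha / n_features.
fwe = SelectFwe(score_func=f_regression, alpha=0.05).fit(X, y)

top1_columns = top1.get_support(indices=True)
fwe_columns = fwe.get_support(indices=True)

print(top1_columns)
print(fwe_columns)
print(fwe.pvalues_)
```

The weak predictor in column 1 is discarded by the Top-1 selection but survives the p-value test, which is the behavior the paragraph above argues for.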

Regression#

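We run feature selection on one of the sample datasets shipped with scikit-learn that is suited to regression. First, check the column information in the help text of the dataset loading function. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below needs an older release such as the 0.24.2 version printed above.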

from sklearn import datasets

datasets.load_boston?

Signature: datasets.load_boston(*, return_X_y=False)
Docstring:
Load and return the boston house-prices dataset (regression).

==============   ==============
Samples total               506
Dimensionality               13
Features         real, positive
Targets           real 5. - 50.
==============   ==============

Read more in the :ref:`User Guide <boston_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

Returns
-------
data : :class:`~sklearn.utils.Bunch`
    Dictionary-like object, with the following attributes.

    data : ndarray of shape (506, 13)
        The data matrix.
    target : ndarray of shape (506, )
        The regression target.
    filename : str
        The physical location of boston csv dataset.

        .. versionadded:: 0.20

    DESCR : str
        The full description of the dataset.
    feature_names : ndarray
        The names of features

(data, target) : tuple if ``return_X_y`` is True

    .. versionadded:: 0.18

Notes
-----
    .. versionchanged:: 0.20
        Fixed a wrong data point at [445, 0].

Examples
--------
>>> from sklearn.datasets import load_boston
>>> X, y = load_boston(return_X_y=True)
>>> print(X.shape)
(506, 13)
File:      ~/.pyenv/versions/3.8.9/envs/skp-n4e-jupyter-sts/lib/python3.8/site-packages/sklearn/datasets/_base.py
Type:      function

Let's select 5 of the 13 variables.

from sklearn import datasets, feature_selection

xs, ys = datasets.load_boston(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

selector = feature_selection.SelectKBest(score_func = feature_selection.f_regression, k = 'all')
fit = selector.fit(xs, ys)

# Display score
columns = pd.DataFrame(np.arange(xs.shape[1]))
scores = pd.DataFrame(fit.scores_)
feature_scores = pd.concat([columns, scores], axis = 1)
feature_scores.columns = ['column', 'score']
display(feature_scores)

# Select Top-N score
display(feature_scores.nlargest(5, 'score'))

index =  feature_scores.nlargest(5, 'score')['column'].values
display(index)

xs_selected = xs[:, index]
print(xs_selected.shape)
display(xs_selected[:5])

shape: xs=(506, 13), ys=(506,)

	column	score
0	0	89.48611
1	1	75.25764
2	2	153.95488
3	3	15.97151
4	4	112.59148
5	5	471.84674
6	6	83.47746
7	7	33.57957
8	8	85.91428
9	9	141.76136
10	10	175.10554
11	11	63.05423
12	12	601.61787

	column	score
12	12	601.61787
5	5	471.84674
10	10	175.10554
2	2	153.95488
9	9	141.76136

array([12,  5, 10,  2,  9])

(506, 5)

array([[4.980, 6.575, 15.300, 2.310, 296.000],
       [9.140, 6.421, 17.800, 7.070, 242.000],
       [4.030, 7.185, 17.800, 7.070, 242.000],
       [2.940, 6.998, 18.700, 2.180, 222.000],
       [5.330, 7.147, 18.700, 2.180, 222.000]])

Classification#

We run feature selection on one of the sample datasets shipped with scikit-learn that is suited to classification. First, check the column information in the help text of the dataset loading function.

from sklearn import datasets

datasets.load_wine?

Signature: datasets.load_wine(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the wine dataset (classification).

.. versionadded:: 0.18

The wine dataset is a classic and very easy multi-class classification
dataset.

=================   ==============
Classes                          3
Samples per class        [59,71,48]
Samples total                  178
Dimensionality                  13
Features            real, positive
=================   ==============

Read more in the :ref:`User Guide <wine_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.

as_frame : bool, default=False
    If True, the data is a pandas DataFrame including columns with
    appropriate dtypes (numeric). The target is
    a pandas DataFrame or Series depending on the number of target columns.
    If `return_X_y` is True, then (`data`, `target`) will be pandas
    DataFrames or Series as described below.

    .. versionadded:: 0.23

Returns
-------
data : :class:`~sklearn.utils.Bunch`
    Dictionary-like object, with the following attributes.

    data : {ndarray, dataframe} of shape (178, 13)
        The data matrix. If `as_frame=True`, `data` will be a pandas
        DataFrame.
    target: {ndarray, Series} of shape (178,)
        The classification target. If `as_frame=True`, `target` will be
        a pandas Series.
    feature_names: list
        The names of the dataset columns.
    target_names: list
        The names of target classes.
    frame: DataFrame of shape (178, 14)
        Only present when `as_frame=True`. DataFrame with `data` and
        `target`.

        .. versionadded:: 0.23
    DESCR: str
        The full description of the dataset.

(data, target) : tuple if ``return_X_y`` is True

The copy of UCI ML Wine Data Set dataset is downloaded and modified to fit
standard format from:
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

Examples
--------
Let's say you are interested in the samples 10, 80, and 140, and want to
know their class name.

>>> from sklearn.datasets import load_wine
>>> data = load_wine()
>>> data.target[[10, 80, 140]]
array([0, 1, 2])
>>> list(data.target_names)
['class_0', 'class_1', 'class_2']
File:      ~/.pyenv/versions/3.8.9/envs/skp-n4e-jupyter-sts/lib/python3.8/site-packages/sklearn/datasets/_base.py
Type:      function

Let's select 5 of the 13 variables.

from sklearn import datasets, feature_selection

xs, ys = datasets.load_wine(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

selector = feature_selection.SelectKBest(score_func = feature_selection.f_classif, k = 'all')
fit = selector.fit(xs, ys)

# Display score
columns = pd.DataFrame(np.arange(xs.shape[1]))
scores = pd.DataFrame(fit.scores_)
feature_scores = pd.concat([columns, scores], axis = 1)
feature_scores.columns = ['column', 'score']
display(feature_scores)

# Select Top-N score
display(feature_scores.nlargest(5, 'score'))

index =  feature_scores.nlargest(5, 'score')['column'].values
display(index)

xs_selected = xs[:, index]
print(xs_selected.shape)
display(xs_selected[:5])

shape: xs=(178, 13), ys=(178,)

	column	score
0	0	135.07762
1	1	36.94342
2	2	13.31290
3	3	35.77164
4	4	12.42958
5	5	93.73301
6	6	233.92587
7	7	27.57542
8	8	30.27138
9	9	120.66402
10	10	101.31680
11	11	189.97232
12	12	207.92037

	column	score
6	6	233.92587
12	12	207.92037
11	11	189.97232
0	0	135.07762
9	9	120.66402

array([ 6, 12, 11,  0,  9])

(178, 5)

array([[3.060, 1065.000, 3.920, 14.230, 5.640],
       [2.760, 1050.000, 3.400, 13.200, 4.380],
       [3.240, 1185.000, 3.170, 13.160, 5.680],
       [3.490, 1480.000, 3.450, 14.370, 7.800],
       [2.690, 735.000, 2.930, 13.240, 4.320]])
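
The column indices above are easier to interpret once they are mapped back to feature names. The following sketch lets `SelectKBest` do the Top-5 selection directly with `k=5` and looks the names up in `load_wine().feature_names`; the variable names are illustrative.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_classif

wine = datasets.load_wine()

# Let the selector keep the 5 highest-scoring columns itself.
selector = SelectKBest(score_func=f_classif, k=5).fit(wine.data, wine.target)

# get_support returns indices in column order, not in score order.
selected_idx = selector.get_support(indices=True)
selected_names = [wine.feature_names[i] for i in selected_idx]

# transform returns the selected columns, also in their original order.
xs_top5 = selector.transform(wine.data)

print(selected_idx)
print(selected_names)
print(xs_top5.shape)
```

Unlike the `nlargest` approach above, `transform` keeps the selected columns in their original order, so the array columns line up with `selected_names`.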

Removing Variables Through Recursive Evaluation of Variable Subsets#

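This method builds subsets of the full set of independent variables and repeatedly removes the worst-performing variables.

For example, to select 3 out of 5 variables, it recursively narrows the candidate set: first find the best set of n-1 variables, then the best set of n-2 variables within that set, and so on.

The model used for the evaluation is passed in as a parameter, so any suitable estimator can be used. In scikit-learn's RFE, the variables are ranked by the fitted estimator's coef_ or feature_importances_, and the weakest are dropped at each step. Because the estimator is refit at every step, this method takes a long time on datasets with many variables and is not recommended for high-dimensional data.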
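
To make the elimination loop concrete, here is a minimal hand-written sketch of recursive elimination, compared against `feature_selection.RFE`. It assumes a linear model and ranks variables by the absolute value of `coef_`, removing one variable per step; the synthetic data is hypothetical, and this is only an approximation of the idea, not scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Hypothetical data: columns 0, 1 and 2 drive y; columns 3 to 5 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=300)

# Manual elimination: refit, drop the variable with the smallest |coef_|,
# and repeat until 3 variables remain.
remaining = list(range(X.shape[1]))
while len(remaining) > 3:
    model = LinearRegression().fit(X[:, remaining], y)
    weakest = int(np.argmin(np.abs(model.coef_)))
    del remaining[weakest]

# The same selection done by scikit-learn.
rfe = RFE(LinearRegression(), n_features_to_select=3, step=1).fit(X, y)
rfe_columns = rfe.get_support(indices=True)

print(remaining)
print(rfe_columns)
```

Each pass costs one full model fit, which is why the running time grows quickly with the number of variables. Raising `step` removes several variables per pass and trades some precision for speed.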

Regression#

from sklearn import datasets, feature_selection, svm

xs, ys = datasets.load_boston(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

estimator = svm.SVR(kernel="linear")
selector = feature_selection.RFE(estimator, n_features_to_select=5)  # keyword form; the positional form is rejected by newer scikit-learn
%time fit = selector.fit(xs, ys)

# Display score
columns = pd.DataFrame(np.arange(xs.shape[1]))
ranks = pd.DataFrame(fit.ranking_)
feature_ranks = pd.concat([columns, ranks], axis = 1)
feature_ranks.columns = ['column', 'rank']
display(feature_ranks)

# Select Top-N score
index = fit.support_
display(index)

xs_selected = xs[:, index]
print(xs_selected.shape)
display(xs_selected[:5])

shape: xs=(506, 13), ys=(506,)
CPU times: user 3.6 s, sys: 8.43 ms, total: 3.61 s
Wall time: 3.62 s

	column	rank
0	0	3
1	1	5
2	2	4
3	3	1
4	4	1
5	5	1
6	6	6
7	7	2
8	8	8
9	9	9
10	10	1
11	11	7
12	12	1

array([False, False, False,  True,  True,  True, False, False, False,
       False,  True, False,  True])

(506, 5)

array([[0.000, 0.538, 6.575, 15.300, 4.980],
       [0.000, 0.469, 6.421, 17.800, 9.140],
       [0.000, 0.469, 7.185, 17.800, 4.030],
       [0.000, 0.458, 6.998, 18.700, 2.940],
       [0.000, 0.458, 7.147, 18.700, 5.330]])

Classification#

from sklearn import datasets, feature_selection, svm

xs, ys = datasets.load_wine(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

estimator = svm.SVC(kernel="linear")
selector = feature_selection.RFE(estimator, n_features_to_select=5)  # keyword form; the positional form is rejected by newer scikit-learn
%time fit = selector.fit(xs, ys)

# Display score
columns = pd.DataFrame(np.arange(xs.shape[1]))
ranks = pd.DataFrame(fit.ranking_)
feature_ranks = pd.concat([columns, ranks], axis = 1)
feature_ranks.columns = ['column', 'rank']
display(feature_ranks)

# Select Top-N score
index = fit.support_
display(index)

xs_selected = xs[:, index]
print(xs_selected.shape)
display(xs_selected[:5])

shape: xs=(178, 13), ys=(178,)
CPU times: user 103 ms, sys: 1.82 ms, total: 105 ms
Wall time: 103 ms

	column	rank
0	0	1
1	1	4
2	2	1
3	3	5
4	4	8
5	5	3
6	6	1
7	7	1
8	8	6
9	9	2
10	10	7
11	11	1
12	12	9

array([ True, False,  True, False, False, False,  True,  True, False,
       False, False,  True, False])

(178, 5)

array([[14.230, 2.430, 3.060, 0.280, 3.920],
       [13.200, 2.140, 2.760, 0.260, 3.400],
       [13.160, 2.670, 3.240, 0.300, 3.170],
       [14.370, 2.500, 3.490, 0.240, 3.450],
       [13.240, 2.870, 2.690, 0.390, 2.930]])

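Feature Selection Through Modeling#

This method uses a model whose training yields a measure of variable importance, and selects features according to that importance. Its advantage is that the fitted model can double as a baseline model while the features are being selected.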

Regression#

LinearRegression could be used as the estimator, but for a better selection we use Lasso regression here. Lasso applies L1 regularization, which reduces model complexity and guards against overfitting. Variables are selected by the coefficients (coef_) of the trained model. For classification problems, other estimators that can be used include LogisticRegression (linear_model.LogisticRegression) and a linear SVM (svm.LinearSVC).
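
The link between the Lasso coefficients and the `threshold` argument of `SelectFromModel` can be sketched on synthetic data as follows. The data, `alpha=0.1`, and the threshold of 0.25 are assumptions for illustration: a variable is kept when the absolute value of its coefficient reaches the threshold.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Hypothetical data: column 0 strong, column 1 moderate, column 2 very weak,
# columns 3 to 5 pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = (3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * X[:, 2]
     + rng.normal(scale=0.5, size=500))

# Fit the Lasso inside the selector and keep variables with |coef_| >= 0.25.
selector = SelectFromModel(Lasso(alpha=0.1), threshold=0.25).fit(X, y)

# estimator_ is the fitted copy of the Lasso held by the selector.
coefficients = selector.estimator_.coef_
lasso_columns = selector.get_support(indices=True)

print(coefficients)
print(lasso_columns)
```

The L1 penalty shrinks the weak and noise coefficients toward zero, so they fall under the threshold. Because coefficients depend on the scale of each variable, standardizing the inputs first is usually advisable on real data.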

from sklearn import datasets, feature_selection, linear_model

xs, ys = datasets.load_boston(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

estimator = linear_model.LassoCV(cv=5) # cv: cross validation
selector = feature_selection.SelectFromModel(estimator, threshold=0.25)
%time fit = selector.fit(xs, ys)

xs_selected = fit.transform(xs)
# All 506 rows are kept; the column count depends on how many |coef_| values reach the threshold.
print(xs_selected.shape)

shape: xs=(506, 13), ys=(506,)
CPU times: user 63.8 ms, sys: 1.01 ms, total: 64.8 ms
Wall time: 65 ms

Classification#

Here we use RandomForest, a tree-based algorithm, as the estimator.

from sklearn import datasets, feature_selection, ensemble

xs, ys = datasets.load_wine(return_X_y=True)
print(f"shape: xs={xs.shape}, ys={ys.shape}")

estimator = ensemble.RandomForestClassifier(n_estimators=100)
selector = feature_selection.SelectFromModel(estimator, threshold='1.25*median')
%time fit = selector.fit(xs, ys)

xs_selected = fit.transform(xs)
print(xs_selected.shape)

shape: xs=(178, 13), ys=(178,)
CPU times: user 161 ms, sys: 1.85 ms, total: 163 ms
Wall time: 163 ms
(178, 5)
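
To see how the `'1.25*median'` threshold translates into a concrete cut-off, the sketch below inspects `feature_importances_` on the fitted forest and lists the names of the selected features. Fixing `random_state=0` is an added assumption for reproducibility; the original run did not set it, so the selected set may differ from run to run.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = datasets.load_wine()

# random_state is fixed here only to make the sketch reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
selector = SelectFromModel(forest, threshold="1.25*median")
selector.fit(wine.data, wine.target)

# The string threshold is resolved against the fitted importances.
importances = selector.estimator_.feature_importances_
cutoff = 1.25 * np.median(importances)

selected_names = [
    wine.feature_names[i] for i in selector.get_support(indices=True)
]

print(f"cutoff={cutoff:.4f}, threshold_={selector.threshold_:.4f}")
print(selected_names)
```

Expressing the threshold relative to the median keeps roughly the upper part of the importance ranking, regardless of the absolute scale of the importances.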
