Collaborative Filtering#
Collaborative filtering starts from the assumption that users with similar tastes will prefer similar items. It works from logged ratings that users have given to items, and comes in two families: Memory-based CF, which compares rating similarity across users or items, and Model-based CF, which fits a model to the rating data and predicts ratings from it.
Memory-based CF#
The name Memory-based CF comes from keeping the rating matrix in memory and computing similarities directly on that stored matrix to predict ratings. These methods are simple and perform well when enough rating data is available, but they cope poorly with new users or items that have no ratings yet.
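As a minimal illustration of the memory-based idea (a toy sketch, not the surprise implementation), an unknown rating can be predicted as a similarity-weighted average of other users' ratings for the same item. The matrix `R` and both helper functions below are hypothetical:

```python
import numpy as np

# Toy ratings matrix: rows = users, cols = items, 0 = unrated (hypothetical data).
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine_sim(a, b):
    """Cosine similarity computed over co-rated items only."""
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, u, i):
    """Predict user u's rating of item i as a similarity-weighted
    average of the other users' ratings of item i."""
    sims, ratings = [], []
    for v in range(R.shape[0]):
        if v != u and R[v, i] > 0:
            sims.append(cosine_sim(R[u], R[v]))
            ratings.append(R[v, i])
    sims, ratings = np.array(sims), np.array(ratings)
    if sims.sum() == 0:
        return 0.0
    return float(sims @ ratings / sims.sum())

print(predict(R, 0, 2))  # user 0's predicted rating for unrated item 2
```

The prediction is pulled toward 1 because the most similar user (user 1) rated item 2 low, which is exactly the neighborhood-averaging behavior KNNBasic implements at scale.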
For similarity computation, the surprise package provides the following measures:
Mean squared difference similarity ('msd')
Cosine similarity ('cosine')
Pearson similarity ('pearson')
Pearson-baseline similarity ('pearson_baseline')
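The first three measures can be sketched on a pair of toy rating vectors (hypothetical data; in surprise these are computed over co-rated items of the full rating matrix):

```python
import numpy as np

# Two hypothetical users' ratings over the same five items.
x = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
y = np.array([4.0, 5.0, 2.0, 3.0, 1.0])

# MSD similarity: inverse of (mean squared difference + 1),
# so identical vectors score 1.0.
msd_sim = 1.0 / (np.mean((x - y) ** 2) + 1.0)

# Cosine similarity: cosine of the angle between the rating vectors.
cos_sim = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Pearson similarity: cosine similarity of the mean-centered vectors,
# which removes each user's rating bias.
xc, yc = x - x.mean(), y - y.mean()
pearson_sim = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(msd_sim, cos_sim, pearson_sim)
```

The Pearson-baseline variant additionally centers on baseline estimates (global mean plus user and item biases) instead of the raw user mean.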
Item-based CF#
A method that finds items similar to a given item, based on the item rating vectors of the rating matrix, and uses those neighbors to estimate the target user's rating.
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import os, math, random
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing, model_selection, metrics
import surprise as sp
seed = 0
random.seed(seed)
np.random.seed(seed)
# Data
sp_data = sp.Dataset.load_builtin('ml-100k')
df_data = pd.DataFrame(sp_data.raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])
print("df_data.shape={}".format(df_data.shape))
print(df_data.dtypes)
print(df_data.head())
print(df_data.describe(include='all'))
df_train, df_test = model_selection.train_test_split(df_data, test_size=0.1)
print(df_data.shape, df_train.shape, df_test.shape)
# Preprocessing
# A reader is still needed but only the rating_scale param is required.
reader = sp.Reader(rating_scale=(1, 5))
sp_data = sp.Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)
# surprise model.test input shape => [(user_id, item_id, rating)]
sp_test = [(row['user_id'], row['item_id'], row['rating']) for i, row in df_test.iterrows()]
# Model
models = [
    sp.KNNBasic(sim_options={'name': 'msd'}),
    sp.KNNBasic(sim_options={'name': 'cosine'}),
    sp.KNNBasic(sim_options={'name': 'pearson'}),
    sp.KNNBasic(sim_options={'name': 'msd', 'user_based': False}),
    sp.KNNBasic(sim_options={'name': 'cosine', 'user_based': False}),
    sp.KNNBasic(sim_options={'name': 'pearson', 'user_based': False})
]
for model in models:
    # Train
    sp.model_selection.cross_validate(model, sp_data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
    # Evaluate
    sp_pred = model.test(sp_test)
    rmse = sp.accuracy.rmse(sp_pred, verbose=False)
    print("Test RMSE={}".format(rmse))
df_data.shape=(100000, 4)
user_id object
item_id object
rating float64
timestamp object
dtype: object
user_id item_id rating timestamp
0 196 242 3.0 881250949
1 186 302 3.0 891717742
2 22 377 1.0 878887116
3 244 51 2.0 880606923
4 166 346 1.0 886397596
user_id item_id rating timestamp
count 100000 100000 100000.000000 100000
unique 943 1682 NaN 49282
top 405 50 NaN 891033606
freq 737 583 NaN 12
mean NaN NaN 3.529860 NaN
std NaN NaN 1.125674 NaN
min NaN NaN 1.000000 NaN
25% NaN NaN 3.000000 NaN
50% NaN NaN 4.000000 NaN
75% NaN NaN 4.000000 NaN
max NaN NaN 5.000000 NaN
(100000, 4) (90000, 4) (10000, 4)
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.9920 0.9944 0.9936 0.9933 0.0010
MAE (testset) 0.7832 0.7878 0.7859 0.7856 0.0019
Fit time 0.18 0.21 0.20 0.20 0.01
Test time 5.75 5.76 5.82 5.78 0.03
Test RMSE=0.9919432379164125
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 1.0238 1.0268 1.0199 1.0235 0.0028
MAE (testset) 0.8117 0.8131 0.8078 0.8108 0.0023
Fit time 0.92 0.92 0.92 0.92 0.00
Test time 5.69 6.00 5.65 5.78 0.15
Test RMSE=1.0224081647696446
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 1.0219 1.0219 1.0286 1.0241 0.0032
MAE (testset) 0.8114 0.8102 0.8172 0.8129 0.0031
Fit time 1.42 1.45 1.42 1.43 0.01
Test time 5.74 5.62 5.96 5.77 0.14
Test RMSE=1.0245770389211275
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.9887 0.9978 0.9883 0.9916 0.0044
MAE (testset) 0.7847 0.7917 0.7826 0.7863 0.0039
Fit time 0.29 0.33 0.27 0.30 0.03
Test time 7.37 7.08 6.76 7.07 0.25
Test RMSE=0.9876095694171025
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 1.0435 1.0395 1.0430 1.0420 0.0018
MAE (testset) 0.8285 0.8282 0.8278 0.8282 0.0003
Fit time 1.69 1.68 1.65 1.67 0.02
Test time 6.92 6.76 7.29 6.99 0.22
Test RMSE=1.0401501772737274
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 1.0542 1.0525 1.0548 1.0539 0.0010
MAE (testset) 0.8447 0.8396 0.8426 0.8423 0.0021
Fit time 2.52 2.54 2.68 2.58 0.07
Test time 6.67 6.46 6.81 6.65 0.14
Test RMSE=1.051538692275093
The predict() function returns the predicted rating for a single data point, and get_neighbors() finds the Top-N most similar items or users.
# Evaluate: via predict
tests = []
preds = []
for row in sp_test:
    tests.append(row[2])
    pred = model.predict(row[0], row[1], row[2])
    preds.append(pred.est)
rmse = math.sqrt(metrics.mean_squared_error(tests, preds))
print("Test RMSE={}".format(rmse))
Test RMSE=1.051538692275093
print("Top-N Similar Users")
user_model = models[0] # sp.KNNBasic(sim_options={'name' : 'msd'})
raw_uid = '22'
inner_uid = user_model.trainset.to_inner_uid(raw_uid)
raw_uid = user_model.trainset.to_raw_uid(inner_uid)
print("raw_uid:{} == inner_uid:{}".format(raw_uid, inner_uid))
top_inner_uids = user_model.get_neighbors(inner_uid, k=5)
print("top-5 inner_uids: {}".format(top_inner_uids))
top_raw_uids = [user_model.trainset.to_raw_uid(top_inner_uid) for top_inner_uid in top_inner_uids]
print("top-5 raw_uids: {}".format(top_raw_uids))
print("\nTop-N Similar Items")
item_model = models[3] # sp.KNNBasic(sim_options={'name' : 'msd', 'user_based': False})
raw_iid = '377'
inner_iid = item_model.trainset.to_inner_iid(raw_iid)
raw_iid = item_model.trainset.to_raw_iid(inner_iid)
print("raw_iid:{} == inner_iid:{}".format(raw_iid, inner_iid))
top_inner_iids = item_model.get_neighbors(inner_iid, k=5)
print("top-5 inner_iids: {}".format(top_inner_iids))
top_raw_iids = [item_model.trainset.to_raw_iid(top_inner_iid) for top_inner_iid in top_inner_iids]
print("top-5 raw_iids: {}".format(top_raw_iids))
Top-N Similar Users
raw_uid:22 == inner_uid:493
top-5 inner_uids: [3, 32, 85, 100, 174]
top-5 raw_uids: ['803', '703', '494', '787', '802']
Top-N Similar Items
raw_iid:377 == inner_iid:610
top-5 inner_iids: [1, 15, 30, 47, 56]
top-5 raw_iids: ['597', '402', '356', '365', '406']
Model-based CF#
SVD(Singular Value Decomposition)#
There are many ways to build a model from rating data; among them, Matrix Factorization, which extracts latent feature vectors via matrix decomposition, is the most widely used. Several methods exist for solving the Matrix Factorization problem, and SVD (Singular Value Decomposition) is the common choice.
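The factorization idea can be sketched with a plain numpy SVD on a small dense matrix (a toy illustration; surprise's SVD actually learns biased latent factors by SGD over the observed ratings only). The matrix `R` below is hypothetical:

```python
import numpy as np

# Hypothetical dense ratings matrix. Real rating matrices are sparse,
# and the missing entries must not be fit directly.
R = np.array([
    [5, 4, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest latent factors; rows of U[:, :k] are user
# feature vectors, columns of Vt[:k, :] are item feature vectors.
k = 2
R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction approximates the original ratings.
print(np.round(R_k, 2))
```

Truncating to k factors compresses users and items into a shared latent space, which is what lets the model predict entries it has never seen.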
%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
import os, math
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing, model_selection, metrics
import surprise as sp
# Data
sp_data = sp.Dataset.load_builtin('ml-100k')
df_data = pd.DataFrame(sp_data.raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])
print("df_data.shape={}".format(df_data.shape))
print(df_data.dtypes)
print(df_data.head())
print(df_data.describe(include='all'))
df_train, df_test = model_selection.train_test_split(df_data, test_size=0.1)
print(df_data.shape, df_train.shape, df_test.shape)
# Preprocessing
# A reader is still needed but only the rating_scale param is required.
reader = sp.Reader(rating_scale=(1, 5))
sp_data = sp.Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)
# surprise model.test input shape => [(user_id, item_id, rating)]
sp_test = [(row['user_id'], row['item_id'], row['rating']) for i, row in df_test.iterrows()]
# Model
models = [
    sp.SVD(n_factors=10),
]
for model in models:
    # Train
    sp.model_selection.cross_validate(model, sp_data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
    # Evaluate
    sp_pred = model.test(sp_test)
    rmse = sp.accuracy.rmse(sp_pred, verbose=False)
    print("Test RMSE={}".format(rmse))
df_data.shape=(100000, 4)
user_id object
item_id object
rating float64
timestamp object
dtype: object
user_id item_id rating timestamp
0 196 242 3.0 881250949
1 186 302 3.0 891717742
2 22 377 1.0 878887116
3 244 51 2.0 880606923
4 166 346 1.0 886397596
user_id item_id rating timestamp
count 100000 100000 100000.000000 100000
unique 943 1682 NaN 49282
top 405 50 NaN 891033606
freq 737 583 NaN 12
mean NaN NaN 3.529860 NaN
std NaN NaN 1.125674 NaN
min NaN NaN 1.000000 NaN
25% NaN NaN 3.000000 NaN
50% NaN NaN 4.000000 NaN
75% NaN NaN 4.000000 NaN
max NaN NaN 5.000000 NaN
(100000, 4) (90000, 4) (10000, 4)
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.9493 0.9445 0.9438 0.9459 0.0024
MAE (testset) 0.7494 0.7454 0.7470 0.7473 0.0016
Fit time 1.63 1.68 1.67 1.66 0.02
Test time 0.35 0.35 0.35 0.35 0.00
Test RMSE=0.9463626341382563
The predict() function returns the predicted rating for a single data point.
# Evaluate: via predict
tests = []
preds = []
for row in sp_test:
    tests.append(row[2])
    pred = model.predict(row[0], row[1], row[2])
    preds.append(pred.est)
rmse = math.sqrt(metrics.mean_squared_error(tests, preds))
print("Test RMSE={}".format(rmse))
Test RMSE=0.9463626341382563