Collaborative Filtering

Collaborative filtering starts from the assumption that users with similar tastes will prefer similar items. It works on logs of the ratings users have assigned to items, and comes in two families: memory-based CF, which measures rating similarity across users or items, and model-based CF, which fits a model to the rating data and predicts ratings with it.

Memory-based CF

The name memory-based CF comes from keeping the rating matrix in memory and computing similarities directly against that stored matrix to predict ratings. These methods are simple and perform well when enough rating data is available, but they cope poorly with new users and items that have no ratings yet (the cold-start problem).


For similarity computation, the surprise package provides the following measures:

  • Mean squared difference similarity ('msd')

  • Cosine similarity ('cosine')

  • Pearson similarity ('pearson')

  • Pearson-baseline similarity ('pearson_baseline')
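As a rough illustration of what these measures compute (not surprise's internal implementation, which compares only co-rated entries and adds shrinkage for 'pearson_baseline'), here is a small numpy sketch on two toy rating vectors:

```python
import numpy as np

def msd_sim(u, v):
    # Mean squared difference similarity: 1 / (msd + 1)
    msd = np.mean((u - v) ** 2)
    return 1.0 / (msd + 1.0)

def cosine_sim(u, v):
    # Cosine of the angle between the raw rating vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def pearson_sim(u, v):
    # Pearson correlation = cosine similarity of the mean-centered vectors
    return cosine_sim(u - u.mean(), v - v.mean())

u = np.array([5.0, 3.0, 4.0, 4.0])
v = np.array([3.0, 1.0, 2.0, 3.0])
print(msd_sim(u, v), cosine_sim(u, v), pearson_sim(u, v))
```

Note how cosine similarity stays high even though v rates everything lower than u, while Pearson removes each user's mean first, which is why 'pearson' is often preferred when users use the rating scale differently.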

User-based CF

A method that finds users whose rating vectors in the rating matrix are similar to the target user's, and estimates the target user's rating for an item from those neighbors' ratings.

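The idea can be sketched in a few lines of numpy: among the users who actually rated the item, pick the k most similar to the target user, then take a similarity-weighted average of their ratings. The toy matrix and helper below are illustrative only (surprise's KNNBasic computes similarities over co-rated entries and handles edge cases this sketch ignores):

```python
import numpy as np

# Hypothetical user x item rating matrix; 0 means "unrated"
R = np.array([
    [5, 3, 4, 1],
    [4, 0, 3, 1],
    [1, 1, 0, 5],
    [1, 0, 2, 4],
], dtype=float)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_user_based(R, u, i, k=2):
    # Similarity of user u to every other user (full row vectors, for simplicity)
    sims = np.array([cosine_sim(R[u], R[v]) if v != u else -np.inf
                     for v in range(R.shape[0])])
    # Among users who actually rated item i, keep the k most similar
    rated = [v for v in np.argsort(sims)[::-1] if R[v, i] > 0][:k]
    weights = sims[rated]
    # Similarity-weighted average of the neighbors' ratings for item i
    return np.dot(weights, R[rated, i]) / weights.sum()

print(predict_user_based(R, u=1, i=1))  # user 1's estimated rating for item 1
```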

Item-based CF

A method that finds items whose rating vectors in the rating matrix are similar to the target item's, and estimates the target user's rating from their ratings of those neighboring items.

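Item-based prediction is the same computation with the matrix transposed: similarities are taken between item columns, and the neighbors are restricted to items the target user has already rated. A minimal sketch under the same toy assumptions (made-up data, cosine over full columns):

```python
import numpy as np

# Hypothetical user x item rating matrix; 0 means "unrated"
R = np.array([
    [5, 3, 4, 1],
    [4, 0, 3, 1],
    [1, 1, 0, 5],
    [1, 0, 2, 4],
], dtype=float)

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def predict_item_based(R, u, i, k=2):
    C = R.T  # item x user view: each row is one item's rating vector
    sims = np.array([cosine_sim(C[i], C[j]) if j != i else -np.inf
                     for j in range(C.shape[0])])
    # Among items user u has rated, keep the k most similar to item i
    rated = [j for j in np.argsort(sims)[::-1] if R[u, j] > 0][:k]
    weights = sims[rated]
    # Similarity-weighted average of user u's own ratings on neighbor items
    return np.dot(weights, R[u, rated]) / weights.sum()

print(predict_item_based(R, u=1, i=1))  # user 1's estimated rating for item 1
```

Item-based CF is often favored in practice because item-item similarities tend to be more stable over time than user-user similarities.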

%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

import os, math, random
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing, model_selection, metrics
import surprise as sp

seed = 0
random.seed(seed)
np.random.seed(seed)

# Data
sp_data = sp.Dataset.load_builtin('ml-100k')
df_data = pd.DataFrame(sp_data.raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])
print("df_data.shape={}".format(df_data.shape))
print(df_data.dtypes)
print(df_data.head())
print(df_data.describe(include='all'))

df_train, df_test = model_selection.train_test_split(df_data, test_size=0.1)
print(df_data.shape, df_train.shape, df_test.shape)

# Preprocessing
# A reader is still needed but only the rating_scale param is required.
reader = sp.Reader(rating_scale=(1, 5))
sp_data = sp.Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)
# surprise's model.test expects input shaped [(user_id, item_id, rating)]
sp_test = [(row['user_id'], row['item_id'], row['rating']) for i, row in df_test.iterrows()]

# Models
models = [
    sp.KNNBasic(sim_options={'name' : 'msd'}), 
    sp.KNNBasic(sim_options={'name' : 'cosine'}),
    sp.KNNBasic(sim_options={'name' : 'pearson'}),
    sp.KNNBasic(sim_options={'name' : 'msd', 'user_based': False}),
    sp.KNNBasic(sim_options={'name' : 'cosine', 'user_based': False}),
    sp.KNNBasic(sim_options={'name' : 'pearson', 'user_based': False})    
]

for model in models:
    # Training
    sp.model_selection.cross_validate(model, sp_data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
    
    # Evaluation
    sp_pred = model.test(sp_test)
    rmse = sp.accuracy.rmse(sp_pred, verbose=False)
    print("Test RMSE={}".format(rmse))
df_data.shape=(100000, 4)
user_id       object
item_id       object
rating       float64
timestamp     object
dtype: object
  user_id item_id  rating  timestamp
0     196     242     3.0  881250949
1     186     302     3.0  891717742
2      22     377     1.0  878887116
3     244      51     2.0  880606923
4     166     346     1.0  886397596
       user_id item_id         rating  timestamp
count   100000  100000  100000.000000     100000
unique     943    1682            NaN      49282
top        405      50            NaN  891033606
freq       737     583            NaN         12
mean       NaN     NaN       3.529860        NaN
std        NaN     NaN       1.125674        NaN
min        NaN     NaN       1.000000        NaN
25%        NaN     NaN       3.000000        NaN
50%        NaN     NaN       4.000000        NaN
75%        NaN     NaN       4.000000        NaN
max        NaN     NaN       5.000000        NaN
(100000, 4) (90000, 4) (10000, 4)
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9920  0.9944  0.9936  0.9933  0.0010  
MAE (testset)     0.7832  0.7878  0.7859  0.7856  0.0019  
Fit time          0.18    0.21    0.20    0.20    0.01    
Test time         5.75    5.76    5.82    5.78    0.03    
Test RMSE=0.9919432379164125
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0238  1.0268  1.0199  1.0235  0.0028  
MAE (testset)     0.8117  0.8131  0.8078  0.8108  0.0023  
Fit time          0.92    0.92    0.92    0.92    0.00    
Test time         5.69    6.00    5.65    5.78    0.15    
Test RMSE=1.0224081647696446
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0219  1.0219  1.0286  1.0241  0.0032  
MAE (testset)     0.8114  0.8102  0.8172  0.8129  0.0031  
Fit time          1.42    1.45    1.42    1.43    0.01    
Test time         5.74    5.62    5.96    5.77    0.14    
Test RMSE=1.0245770389211275
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9887  0.9978  0.9883  0.9916  0.0044  
MAE (testset)     0.7847  0.7917  0.7826  0.7863  0.0039  
Fit time          0.29    0.33    0.27    0.30    0.03    
Test time         7.37    7.08    6.76    7.07    0.25    
Test RMSE=0.9876095694171025
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0435  1.0395  1.0430  1.0420  0.0018  
MAE (testset)     0.8285  0.8282  0.8278  0.8282  0.0003  
Fit time          1.69    1.68    1.65    1.67    0.02    
Test time         6.92    6.76    7.29    6.99    0.22    
Test RMSE=1.0401501772737274
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    1.0542  1.0525  1.0548  1.0539  0.0010  
MAE (testset)     0.8447  0.8396  0.8426  0.8423  0.0021  
Fit time          2.52    2.54    2.68    2.58    0.07    
Test time         6.67    6.46    6.81    6.65    0.14    
Test RMSE=1.051538692275093

The predict() function returns a predicted rating for a single example, and the Top-N most similar items or users can be retrieved with get_neighbors().

# Evaluation via predict()
tests = []
preds = []
for row in sp_test:
    tests.append(row[2])
    pred = model.predict(row[0], row[1], row[2])
    preds.append(pred.est)

rmse = math.sqrt(metrics.mean_squared_error(tests, preds))
print("Test RMSE={}".format(rmse))
Test RMSE=1.051538692275093
print("Top-N Similar Users")
user_model = models[0] # sp.KNNBasic(sim_options={'name' : 'msd'})
raw_uid = '22'
inner_uid = user_model.trainset.to_inner_uid(raw_uid)
raw_uid = user_model.trainset.to_raw_uid(inner_uid)
print("raw_uid:{} == inner_uid:{}".format(raw_uid, inner_uid))

top_inner_uids = user_model.get_neighbors(inner_uid, k=5)
print("top-5 inner_uids: {}".format(top_inner_uids))
top_raw_uids = [user_model.trainset.to_raw_uid(top_inner_uid) for top_inner_uid in top_inner_uids]
print("top-5 raw_uids: {}".format(top_raw_uids))

print("\nTop-N Similar Items")
item_model = models[3] # sp.KNNBasic(sim_options={'name' : 'msd', 'user_based': False})
raw_iid = '377'
inner_iid = item_model.trainset.to_inner_iid(raw_iid)
raw_iid = item_model.trainset.to_raw_iid(inner_iid)
print("raw_iid:{} == inner_iid:{}".format(raw_iid, inner_iid))

top_inner_iids = item_model.get_neighbors(inner_iid, k=5)
print("top-5 inner_iids: {}".format(top_inner_iids))
top_raw_iids = [item_model.trainset.to_raw_iid(top_inner_iid) for top_inner_iid in top_inner_iids]
print("top-5 raw_iids: {}".format(top_raw_iids))
Top-N Similar Users
raw_uid:22 == inner_uid:493
top-5 inner_uids: [3, 32, 85, 100, 174]
top-5 raw_uids: ['803', '703', '494', '787', '802']

Top-N Similar Items
raw_iid:377 == inner_iid:610
top-5 inner_iids: [1, 15, 30, 47, 56]
top-5 raw_iids: ['597', '402', '356', '365', '406']

Model-based CF

SVD (Singular Value Decomposition)

There are many ways to build a model from rating data; the most common is matrix factorization, which uses matrix operations to extract latent feature vectors for users and items. Among the various ways to solve the matrix factorization problem, SVD (Singular Value Decomposition) is the usual choice.

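One caveat: surprise's SVD algorithm is the SGD-trained, biased matrix-factorization model popularized during the Netflix Prize rather than an exact algebraic SVD, since a real rating matrix is mostly missing and cannot be decomposed directly. The algebraic intuition still carries over: a rank-k truncated SVD is the best rank-k approximation of a matrix and places users and items in a k-dimensional latent space. A numpy sketch on a toy, fully observed matrix:

```python
import numpy as np

# Toy rating matrix, fully observed for illustration only
R = np.array([
    [5, 3, 4, 1],
    [4, 1, 3, 1],
    [1, 1, 2, 5],
    [1, 2, 2, 4],
], dtype=float)

# Exact SVD: R = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2  # keep the top-k latent factors (compare n_factors=10 below)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k product approximates R; each user/item is now a k-dim vector
print(np.round(R_hat, 2))
print("approximation error:", np.linalg.norm(R - R_hat))
```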

%matplotlib inline
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

import os, math
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing, model_selection, metrics
import surprise as sp

# Data
sp_data = sp.Dataset.load_builtin('ml-100k')
df_data = pd.DataFrame(sp_data.raw_ratings, columns=["user_id", "item_id", "rating", "timestamp"])
print("df_data.shape={}".format(df_data.shape))
print(df_data.dtypes)
print(df_data.head())
print(df_data.describe(include='all'))

df_train, df_test = model_selection.train_test_split(df_data, test_size=0.1)
print(df_data.shape, df_train.shape, df_test.shape)

# Preprocessing
# A reader is still needed but only the rating_scale param is required.
reader = sp.Reader(rating_scale=(1, 5))
sp_data = sp.Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)
# surprise's model.test expects input shaped [(user_id, item_id, rating)]
sp_test = [(row['user_id'], row['item_id'], row['rating']) for i, row in df_test.iterrows()]

# Model
models = [
    sp.SVD(n_factors=10),    
]

for model in models:
    # Training
    sp.model_selection.cross_validate(model, sp_data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
    
    # Evaluation
    sp_pred = model.test(sp_test)
    rmse = sp.accuracy.rmse(sp_pred, verbose=False)
    print("Test RMSE={}".format(rmse))
df_data.shape=(100000, 4)
user_id       object
item_id       object
rating       float64
timestamp     object
dtype: object
  user_id item_id  rating  timestamp
0     196     242     3.0  881250949
1     186     302     3.0  891717742
2      22     377     1.0  878887116
3     244      51     2.0  880606923
4     166     346     1.0  886397596
       user_id item_id         rating  timestamp
count   100000  100000  100000.000000     100000
unique     943    1682            NaN      49282
top        405      50            NaN  891033606
freq       737     583            NaN         12
mean       NaN     NaN       3.529860        NaN
std        NaN     NaN       1.125674        NaN
min        NaN     NaN       1.000000        NaN
25%        NaN     NaN       3.000000        NaN
50%        NaN     NaN       4.000000        NaN
75%        NaN     NaN       4.000000        NaN
max        NaN     NaN       5.000000        NaN
(100000, 4) (90000, 4) (10000, 4)
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9493  0.9445  0.9438  0.9459  0.0024  
MAE (testset)     0.7494  0.7454  0.7470  0.7473  0.0016  
Fit time          1.63    1.68    1.67    1.66    0.02    
Test time         0.35    0.35    0.35    0.35    0.00    
Test RMSE=0.9463626341382563

The predict() function returns a predicted rating for a single example.

# Evaluation via predict()
tests = []
preds = []
for row in sp_test:
    tests.append(row[2])
    pred = model.predict(row[0], row[1], row[2])
    preds.append(pred.est)

rmse = math.sqrt(metrics.mean_squared_error(tests, preds))
print("Test RMSE={}".format(rmse))
Test RMSE=0.9463626341382563