How to make RMSE (root mean square error) small when using ALS in Spark?

I need some suggestions for building a good recommendation model using Spark's collaborative filtering. There is sample code on the official website, which I have pasted below:

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

# Load and parse the data
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(',')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
RMSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() ** 0.5
print("Root Mean Squared Error = " + str(RMSE))

A good model needs the RMSE to be as small as possible.

Is the RMSE large because I did not pass proper parameters to the ALS.train method, such as rank, numIterations and so on?

Or is it because my dataset is too small, which makes the RMSE large?

Could anyone help me figure out what causes the large RMSE and how to fix it?

Addition:

Just as @eliasah said, I need to add some detail to narrow the answer set. Let us consider this particular situation:

Suppose I want to build a recommendation system that recommends music to my clients. I have their rating history for tracks, albums, artists, and genres. These four classes form a hierarchy: tracks belong directly to albums, albums belong directly to artists, and artists may belong to several different genres. Finally, I want to use all of this information to choose some tracks to recommend to clients.

So, what is the best practice for building a good model in this situation while keeping the prediction RMSE as small as possible?


As you mentioned above, for a given dataset the RMSE decreases as rank and numIterations increase. However, as the dataset grows, the RMSE increases.

One practice that lowers RMSE and similar measures is to normalize the rating values. In my experience, this works really well when you know the minimum and maximum rating values in advance.
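
A minimal sketch of that normalization, assuming ratings on a known 1 to 5 scale. The bounds and the helper names `normalize_rating` and `denormalize_rating` are illustrative and not part of Spark. The helpers are plain Python, so they can be passed to `ratings.map(...)` before training and applied in reverse to the predictions:

```python
def normalize_rating(value, lo=1.0, hi=5.0):
    """Min-max scale a rating from [lo, hi] to [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (value - lo) / (hi - lo)


def denormalize_rating(scaled, lo=1.0, hi=5.0):
    """Map a value from [0, 1] back to the original [lo, hi] scale."""
    return scaled * (hi - lo) + lo


# Hypothetical use with the RDD from the question (requires a running SparkContext):
#   scaled = ratings.map(lambda r: Rating(r[0], r[1], normalize_rating(r[2])))
#   model = ALS.train(scaled, rank, numIterations)
```

Keep in mind that the same prediction errors measured on the scaled ratings give an RMSE that is smaller by a factor of `hi - lo`, so models should only be compared on one scale.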

You should also consider measures other than RMSE. When doing matrix factorization, I have found it useful to compute the Frobenius norm of (ratings - predictions) and divide it by the Frobenius norm of the ratings. This gives the relative error of your predictions with respect to the original ratings.
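
Written out, with $r_{ui}$ the observed rating of user $u$ for item $i$ and $\hat{r}_{ui}$ the model's prediction, the measure can be stated as follows. The notation is my own, and the sums are assumed to run only over the observed (user, item) pairs, since only those have a rating to compare against:

```latex
\text{relative error}
  = \frac{\lVert R - \hat{R} \rVert_F}{\lVert R \rVert_F}
  = \frac{\sqrt{\sum_{(u,i)} \left(r_{ui} - \hat{r}_{ui}\right)^2}}
         {\sqrt{\sum_{(u,i)} r_{ui}^2}}
```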

Here's the code in Spark for this method:

from math import sqrt

# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))

ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)

# Frobenius norm of the error (ratings - predictions)
abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())

# Frobenius norm of the original ratings
frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())

# finally, the relative error
rel_error = abs_frobenius_error/frob_error_orig

print("Relative Error = " + str(rel_error))

In this error measure, the closer the error is to zero, the better your model is.

I hope this helps.


I did a little research on it; here is the conclusion:

When rank and the number of iterations grow, the RMSE decreases. However, when the size of the dataset grows, the RMSE increases. From my results, the rank changes the RMSE value more significantly.

I know this is not enough to get a good model. More ideas are welcome!


Use this in PySpark to compute the root mean squared error (RMSE):

from pyspark.mllib.recommendation import ALS
from math import sqrt
from operator import add


# rank is the number of latent factors in the model.
# iterations is the number of iterations to run.
# lambda specifies the regularization parameter in ALS
rank = 8
num_iterations = 8
lmbda = 0.1

# Train model with training data and configured rank and iterations
# (training is assumed to be an RDD of ratings prepared earlier)
model = ALS.train(training, rank, num_iterations, lmbda)


def compute_rmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error), or square root of the average value
        of (actual rating - predicted rating)^2
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    predictions_ratings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
        .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
        .values()
    return sqrt(predictions_ratings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

print("The model was trained with rank = %d, lambda = %.1f, and %d iterations.\n"
      % (rank, lmbda, num_iterations))
# Print RMSE of model on the validation set
# (validation and num_validation, its number of ratings, are assumed to be defined earlier)
validation_rmse = compute_rmse(model, validation, num_validation)
print("Its RMSE on the validation set is %f.\n" % validation_rmse)
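
Building on the validation RMSE above, one way to choose rank, the number of iterations and lambda is a small grid search that keeps the combination with the lowest validation error. The sketch below is only an illustration: `grid_search`, `train_fn` and `eval_fn` are hypothetical names, and the candidate values in the usage comment are arbitrary. The helper itself is plain Python, so the two callables can wrap `ALS.train` and `compute_rmse`:

```python
from itertools import product


def grid_search(train_fn, eval_fn, ranks, iteration_counts, lambdas):
    """Try every (rank, iterations, lambda) combination and return the
    (params, score) pair with the lowest score reported by eval_fn."""
    best_params, best_score = None, float("inf")
    for rank, iters, lmbda in product(ranks, iteration_counts, lambdas):
        model = train_fn(rank, iters, lmbda)
        score = eval_fn(model)
        if score < best_score:
            best_params, best_score = (rank, iters, lmbda), score
    return best_params, best_score


# Hypothetical use with the objects defined above (requires Spark):
#   best_params, best_rmse = grid_search(
#       lambda r, i, l: ALS.train(training, r, i, l),
#       lambda m: compute_rmse(m, validation, num_validation),
#       ranks=[8, 12], iteration_counts=[10, 20], lambdas=[0.01, 0.1, 1.0])
```

Scoring on a validation set rather than on the training data matters here, because a larger rank can keep lowering the training RMSE without improving predictions on unseen ratings.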