My fact table contains sparse data and has 3 columns: (user, movie, normalized_score). Example:
('u1', 'm3', 0.3) ('u1', 'm4', 0.1) ('u1', 'm7', 0.6) ('u2', 'm1', 0.33) ('u2', 'm3', 0.33) ('u2', 'm7', 0.33) ('u3', 'm2', 0.6) ('u3', 'm6', 0.4) ...
As you can see, sum(normalized_score) = 1 for each user (up to rounding).
I have two dimensions:
- User_info (user, Cat_Level1, Cat_Level2)
- Movie_info (movie, Genre_Level1, Genre_Level2)
I want the top movies by average score for any selected dimension, where the average is taken over all users associated with the selection, not only the users who scored that movie.
For example, at the lowest level, average('m3') above would be (0.3+0.33)/3.
Note that the denominator is 3, not 2: u3 still counts as a user even though u3 has no score for m3.
Basically, whichever dimension we select, the corresponding number of users becomes the denominator.
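To pin down the arithmetic I am after, here is a rough Python sketch on the sample rows above. It only spells out the calculation on in-memory data. The User_info values ('A', 'A1', and so on) and the function name top_movies are made up for illustration and are not from my real tables.

```python
from collections import defaultdict

# Fact rows from the example above: (user, movie, normalized_score)
facts = [
    ("u1", "m3", 0.3), ("u1", "m4", 0.1), ("u1", "m7", 0.6),
    ("u2", "m1", 0.33), ("u2", "m3", 0.33), ("u2", "m7", 0.33),
    ("u3", "m2", 0.6), ("u3", "m6", 0.4),
]

# Hypothetical User_info rows: (user, Cat_Level1, Cat_Level2).
# The category values are invented purely for this illustration.
user_info = [
    ("u1", "A", "A1"),
    ("u2", "A", "A2"),
    ("u3", "B", "B1"),
]


def top_movies(facts, selected_users, n=10):
    """Rank movies by sum(score) / number of selected users.

    The denominator is the size of the user selection, so users with
    no row for a movie effectively contribute a score of 0.
    """
    selected = set(selected_users)
    denominator = len(selected)
    if denominator == 0:
        return []
    totals = defaultdict(float)
    for user, movie, score in facts:
        if user in selected:
            totals[movie] += score
    # Sort by total descending, then movie id to keep ties deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(movie, total / denominator) for movie, total in ranked[:n]]


# Lowest level: every user is selected, so the denominator is 3.
all_users = [user for user, _, _ in user_info]
overall = top_movies(facts, all_users)

# One dimension member selected: only users with Cat_Level1 == "A",
# so the denominator drops to 2.
cat_a_users = [user for user, level1, _ in user_info if level1 == "A"]
cat_a = top_movies(facts, cat_a_users)
```

Here dict(overall)['m3'] is (0.3+0.33)/3, while dict(cat_a)['m3'] is (0.3+0.33)/2, because the denominator follows the number of users in the selection rather than the number of rows for the movie.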
I can't figure out how to do this. Please help!