Will Your App Be a Success in the Google Play Store? (Using Machine Learning to Predict Ratings)

Ashley Brooks
Published in The Startup · 8 min read · Nov 19, 2020


The Google Play store serves thousands of options to users daily. Whether they enjoy puzzle games, going on quests and defeating evil in order to save a kingdom, or simply want to check the weather, there’s an app for that! In fact, Google Play offers many apps for that. And in a world of apps, how can developers know if they will be successful? What makes an app great?

Feature Engineering and Data Wrangling

In order to predict whether or not an app will be successful, I engineered a feature called ‘Successful_Rating’. If an app received a rating of 4 or more, its value is ‘True’. If it received a rating of less than 4, its value is ‘False’. This turns the task into a Binary Classification problem.

This dataset needed a bit of cleaning before I could do much with it. There were a few outliers that needed to be taken care of; I doubt anyone is going to pay $400 for an app. There is also a stray value, ‘1.9’, that doesn’t belong in its column at all.
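In code, that cleanup could look something like the sketch below (assuming the public Kaggle version of the dataset; the column names and formats are my assumptions):

```python
import pandas as pd

# Load the raw data (file name assumes the Kaggle Google Play Store CSV).
df = pd.read_csv('googleplaystore.csv')

# In the Kaggle data, the stray '1.9' comes from one shifted, malformed
# row; drop it before touching the other columns.
df = df[df['Category'] != '1.9']

# Prices arrive as strings like '$4.99'; strip the '$', cast to float,
# and drop the implausible outliers (nobody is paying $400 for an app).
df['Price'] = df['Price'].str.replace('$', '', regex=False).astype(float)
df = df[df['Price'] < 100]

# Engineer the binary target: True if an app's rating is 4 or higher.
df['Successful_Rating'] = df['Rating'] >= 4
```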

Wrangled Dataset

After completing all of the data wrangling, this is what we are looking at when predicting if an app will receive a successful rating. It’s not a really large dataset, but it is good enough to work with.

There is one other issue, though…

Imbalanced Data

Since our goal is to be able to predict whether an app will be successful in the Google Play store, our target is ‘Successful_Rating’. However, after splitting the target vector from the feature matrix, we can see that the data in our target is very imbalanced.

79% of the target values are ‘True’ and 21% are ‘False’. If we continue using our data in this way, our model will be biased toward the majority class. In fact, I ended up with a baseline of around 0.78, and none of the models were able to beat it.

So, how do we fix this problem?

Import SMOTE

There are a couple of different ways to handle the issue of imbalanced data. One is a data augmentation technique called SMOTE, which stands for Synthetic Minority Oversampling Technique. SMOTE measures the Euclidean distance between a minority-class point and its nearest neighbors in feature space, then creates synthetic data points along the line segments between them.

To do this, we first need to do a time-series split to get our train, validation, and test sets. I trained on 60% of the data and split the remaining 40% evenly between the validation and test sets.
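A sketch of that split, assuming ‘Last Updated’ is the column that carries the time order:

```python
# Sort by date so the split respects time order.
df['Last Updated'] = pd.to_datetime(df['Last Updated'])
df = df.sort_values('Last Updated')

# 60% train, 20% validation, 20% test.
n = len(df)
train = df.iloc[:int(n * 0.6)]
val = df.iloc[int(n * 0.6):int(n * 0.8)]
test = df.iloc[int(n * 0.8):]

# Split the target vector from the feature matrix. 'Rating' is dropped
# too, since the target is derived directly from it.
target = 'Successful_Rating'
X_train, y_train = train.drop(columns=[target, 'Rating']), train[target]
X_val, y_val = val.drop(columns=[target, 'Rating']), val[target]
X_test, y_test = test.drop(columns=[target, 'Rating']), test[target]
```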

Because SMOTE does not play well with categorical variables, we are going to use an Ordinal Encoder. We will also use a Simple Imputer, since we have some NaN values as well. Once we have finished transforming, we resample our training, validation, and testing sets using SMOTE.
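Roughly, the transform-then-resample step looks like this (I’m assuming the category_encoders OrdinalEncoder, which only touches the categorical columns; resampling all three splits follows this post, though resampling only the training set is a common alternative):

```python
from category_encoders import OrdinalEncoder
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer

# Encode categorical columns as integers, then fill the NaNs.
encoder = OrdinalEncoder()
imputer = SimpleImputer(strategy='median')

X_train_encoded = encoder.fit_transform(X_train)
features = X_train_encoded.columns  # keep the column names for later

X_train = imputer.fit_transform(X_train_encoded)
X_val = imputer.transform(encoder.transform(X_val))
X_test = imputer.transform(encoder.transform(X_test))

# Resample each split so the classes are balanced.
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
X_val, y_val = smote.fit_resample(X_val, y_val)
X_test, y_test = smote.fit_resample(X_test, y_test)
```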

Shape of Target Vector ‘Successful_Rating’

Now that our data has been resampled, we can see the shape of our target vector. It is completely balanced and ready to work with. There is one other change that I made: SMOTE turns the data sets into NumPy arrays after resampling them, and not all of the functions and methods used going forward can work with the data in that form. So, I converted all of the target vectors into Pandas Series and the feature matrices into Pandas DataFrames.
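That conversion is just a few lines:

```python
import pandas as pd

# SMOTE hands back NumPy arrays; convert everything back to pandas
# ('features' holds the encoded column names saved earlier).
X_train = pd.DataFrame(X_train, columns=features)
X_val = pd.DataFrame(X_val, columns=features)
X_test = pd.DataFrame(X_test, columns=features)

y_train = pd.Series(y_train, name='Successful_Rating')
y_val = pd.Series(y_val, name='Successful_Rating')
y_test = pd.Series(y_test, name='Successful_Rating')
```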

Establish Baseline

Since our classes are now evenly distributed, we have a Baseline Accuracy of 0.5. Always predicting the majority class now does no better than a coin flip.
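A quick way to confirm the new baseline:

```python
# The majority-class baseline: with balanced classes it lands at 0.5.
baseline = y_train.value_counts(normalize=True).max()
print(f'Baseline accuracy: {baseline:.2f}')  # 0.50
```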

This leads us to our next step:

Build Models

Which model will better predict whether an app will receive a ‘Successful_Rating’ in the Google Play store?

Since we have already used an Ordinal Encoder and a Simple Imputer on our data, there will be no need to create pipelines for any of our models. Our goal is to beat our Baseline Accuracy of 0.5.

Logistic Regression

The first model that we are going to build is a Logistic Regression model. We are not going to perform any hyperparameter tuning; this model will run with the default parameters.
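With scikit-learn, that is a one-liner:

```python
from sklearn.linear_model import LogisticRegression

# Plain logistic regression, all defaults, no tuning.
model_lr = LogisticRegression()
```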

Random Forest Classifier

The next model we are going to build is a Random Forest Classification model. I used GridSearchCV to tune the hyperparameters for this model. Setting the number of estimators to 20 seemed to be the best fit, and decreasing the values for max_depth and max_samples also improved the performance of this Random Forest Classifier.
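A sketch of that search (the exact grid values here are illustrative assumptions that bracket the settings described above):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [20, 50, 100],
    'max_depth': [5, 10, 20],
    'max_samples': [0.5, 0.7, 0.9],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
model_rf = search.best_estimator_  # keep the best combination found
```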

Gradient Boosting Classifier

A Gradient Boosting Classifier is similar to a Random Forest Classifier, except that it builds its trees in sequence, with each new tree trained to correct the errors of the ones before it, instead of training independent trees on subsets of the data in parallel (bagging). Since this model trains in sequence, it will be slower than the Random Forest model. I did all of the tuning for this model by hand, finding the best value for each hyperparameter by adjusting them one at a time.
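For instance (these particular values are illustrative, not this post’s exact settings):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Hand-tuned, one hyperparameter at a time.
model_gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
)
```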

XGBoost Classifier

I know what you’re thinking: “Why are you using another boosting classifier?” Even though Gradient Boosting and XGBoost are basically the same, I wanted to see which one would perform better. XGBoost is the newer model and the more popular option for boosting classifiers because it trains faster. It may be fast, but I want to see how its performance compares to the Gradient Boosting Classifier.
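The XGBoost counterpart (again with illustrative hyperparameter values):

```python
from xgboost import XGBClassifier

model_xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    eval_metric='logloss',
    random_state=42,
)
```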

Fit Logistic Regression model to Training Sets
Fit Random Forest Classifier to Training Sets
Fit Gradient Boosting Classifier to Training Sets
Fit XGB Classifier to Training Sets
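Training all four comes down to one loop over the resampled training set:

```python
models = {
    'Logistic Regression': model_lr,
    'Random Forest': model_rf,
    'Gradient Boosting': model_gb,
    'XGBoost': model_xgb,
}

# Fit each model to the resampled training data.
for model in models.values():
    model.fit(X_train, y_train)
```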

Now that we have finished creating and training our models, we can see how well they perform!

Check Metrics

The Gradient Boosting Classifier and XGB Classifier are definitely the best performing models. XGBoost has the better training accuracy score, while the Gradient Boosting Classifier has the higher validation accuracy score. As you can see, neither of these models is ahead by much.

Even though the Random Forest Classifier and Logistic Regression models do not perform as well, it is important to note that their accuracy scores still beat the baseline!
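Those comparisons come from scoring each model on the training and validation sets:

```python
# Training vs. validation accuracy for each model.
for name, model in models.items():
    print(f'{name}: train {model.score(X_train, y_train):.3f}, '
          f'val {model.score(X_val, y_val):.3f}')
```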

Classification Report and Confusion Matrix

Let’s take a look at the Precision and Recall for our top performing models by creating a Classification Report and Confusion Matrix.
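Both come straight from scikit-learn (ConfusionMatrixDisplay.from_estimator needs scikit-learn 1.0 or newer):

```python
from sklearn.metrics import ConfusionMatrixDisplay, classification_report

# Report and confusion matrix for each boosting model on the validation set.
for name in ['Gradient Boosting', 'XGBoost']:
    model = models[name]
    print(name)
    print(classification_report(y_val, model.predict(X_val)))
    ConfusionMatrixDisplay.from_estimator(model, X_val, y_val)
```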

Part of the Classification Report for Gradient Boosting Classifier
Confusion Matrix for Gradient Boosting Classifier

We have great numbers for predicting True Positives and True Negatives! Looking at the Classification Report, our Recall is a little lower for predicting True Negatives. If we tune our hyperparameters a little more, maybe we could increase this.

Part of the Classification Report for XGBClassifier
Confusion Matrix for XGBClassifier

These scores are not much different from the Gradient Boosting Classifier’s! The Recall score for True Negatives is better! It is looking like this model may be better at predicting whether or not an app in the Google Play store will receive a rating of 4 or more.

Feature Importance

Horizontal Bar Chart Showing Feature Importance for model_xgb

I then decided to look at the Feature Importance for the XGBoost model, and I have to say I’m actually kind of surprised by these results. Earlier, I engineered the App feature so that each value equals the number of words in the app’s name. For example, ‘Angry Birds’ has two words in its name, so its value would be 2. I did not expect this feature to have such an impact on the model. Type and Content_Rating are less surprising in their importance; I’m sure the type of app and whether it’s rated for Everyone, Teens, etc. play a big role in predicting Successful_Rating.
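For reference, here is the word-count transform from the wrangling step, plus a sketch of the chart above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# The 'App' transform applied earlier during wrangling: each app name
# becomes its word count ('Angry Birds' -> 2).
#   df['App'] = df['App'].str.split().str.len()

# Horizontal bar chart of the fitted XGBoost model's feature importances.
importances = pd.Series(model_xgb.feature_importances_, index=features)
importances.sort_values().plot.barh(title='Feature Importance, model_xgb')
plt.tight_layout()
plt.show()
```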

Predictions

We have a winner! The XGBoost Classifier has a slightly higher test accuracy score than the Gradient Boosting Classifier.

Single Observation from Test Set
The value that our XGB Classifier predicted for this observation
Shapley

If we take a single observation from our test set and create a prediction using our best model, we can compute Shapley Values, which show the influence of each feature on that individual prediction. Our XGB Classifier predicted a ‘True’ value for this single observation, meaning that this app is predicted to be a success in the Google Play store with a rating of 4 or higher.
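A sketch of that, using the shap library’s TreeExplainer (row 0 is an arbitrary choice here):

```python
import shap

# Explain one observation from the test set.
row = X_test.iloc[[0]]
explainer = shap.TreeExplainer(model_xgb)
shap_values = explainer.shap_values(row)

# Force plot showing how each feature pushes the prediction up or down.
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values,
    features=row,
)
```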

Taking a look at our Shapley Values, we can see which features have a positive or negative influence on our model’s prediction. Category (5), Android_Ver (6), and Content_Rating (1) are all important features with a positive influence on our XGB model, while App (2) has a very negative influence on our prediction. It would be worth taking a closer look at this column, either removing it or engineering it differently.

Conclusion

In conclusion, the XGBoost Classifier is our best performing model, with a Test Accuracy of 0.706. We can also conclude from our Shapley Values that, according to the model, having only 2 words in an app’s name has a very negative influence on receiving a successful rating. However, we also found that Business apps that are rated for Everyone and are available for Android versions 4.1 and higher had a positive influence on the model’s prediction.
