How to Build a Tailored Project
Building a solution to a problem
Morning Boys,
It’s Bank holiday in the UK today, so I’m writing this live from the comfort of my living room. Fuelled by my Gail’s Latte (IYKYK).
There are 3 steps to a Tailored Project:
1) Planning (finding a relevant problem to work on)
2) Building – developing the solution to your problem
3) Production – putting your solution to use
I’ll talk about how you can build a great solution to a problem. Specifically, this applies if you’re solving a supervised machine learning problem. Other problems, for more data analyst related roles, might require different solutions.
So whilst it’s important to get this bit right, it means nothing if you get steps 1 and 3 wrong.
Anyway, let’s get into it.
1) Exploratory Data Analysis
Even if this work doesn’t answer any immediate questions, it’ll give you the knowledge and understanding to solve problems which pop up throughout your project.
Plot the distributions of your features and target variable.
Bucket the numerical features into categories to get a better understanding of the distribution.
Get numerical summaries to see the means, minimum and maximum values.
Look for correlations between variables.
Check for data errors.
Anything you’re curious about from your data, investigate it.
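Here’s a minimal sketch of what that EDA could look like in pandas. The dataset (and its column names) is made up purely for illustration — swap in your own data:

```python
import pandas as pd

# Hypothetical customer dataset -- replace with your own data
df = pd.DataFrame({
    "age": [25, 41, 33, 58, 29, 47],
    "spend": [120.0, 340.5, 210.0, 95.0, 410.0, 180.0],
    "churned": [0, 1, 0, 1, 0, 1],
})

# Numerical summaries: means, min/max, quartiles
summary = df.describe()

# Bucket a numerical feature to understand its distribution
age_buckets = pd.cut(df["age"], bins=[0, 30, 45, 100],
                     labels=["young", "mid", "older"])
bucket_counts = age_buckets.value_counts()

# Correlations between variables
correlations = df.corr(numeric_only=True)

# Check for data errors / missing values
missing = df.isna().sum()
```

Plotting the same things (`df["age"].hist()`, for instance) gives you the visual version of each of these checks.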
2) Clean Data
Part 1) will help with this step.
To build an ML model, you need to fill in (known as impute) missing values.
How you impute missing values depends on the problem, but using the mean (for numerical features) or the mode (for categorical ones) is a safe bet.
To train a model, each observation in the data needs to correspond to something unique. The ‘something’ depends on the problem. One observation could correspond to one person/date/business etc.
If you have duplicate observations, you need to remove them. How you do it depends on the problem.
You’ll also want to deal with outliers e.g., very high / low feature values. One way to do this is to clip the variables – set a maximum and minimum value for the feature. Again, how you do this depends on the business context.
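All three cleaning steps above are one-liners in pandas. This is a sketch on made-up data (the `customer_id`/`spend` columns and the clipping range are assumptions, not a recipe):

```python
import pandas as pd

# Toy data with a missing value, a duplicate id and an outlier
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "spend": [100.0, None, 250.0, 250.0, 9000.0],
})

# Impute missing numeric values with the mean (mode for categoricals)
df["spend"] = df["spend"].fillna(df["spend"].mean())

# One observation per customer: keep the first record for each id
df = df.drop_duplicates(subset="customer_id", keep="first")

# Clip outliers to a business-sensible range
df["spend"] = df["spend"].clip(lower=0, upper=1000)
```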
3) Apply Business Rules
It’s a broken record but again, this all depends on the problem.
An example of this may be observations after a certain date not being relevant so you remove them from the data.
For example, we know Covid had a big effect on the world, so observations around that period may be affected and have to be dealt with differently.
Or if you know some feature values are recorded differently but mean the same thing you may want to group them together.
For example, if you had football data and had data on a player’s position broken down into ‘left-winger’ and ‘right-winger’ you might want to group these two values into one – ‘winger’.
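Both kinds of rule — dropping a date range and grouping equivalent values — look something like this in pandas (the dates and the Covid window here are illustrative assumptions):

```python
import pandas as pd

# Hypothetical football data
matches = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-01", "2020-05-01", "2023-01-15"]),
    "position": ["left-winger", "right-winger", "striker"],
})

# Rule 1: drop observations from the Covid-affected period
covid_start, covid_end = pd.Timestamp("2020-03-01"), pd.Timestamp("2021-06-01")
matches = matches[~matches["date"].between(covid_start, covid_end)]

# Rule 2: group values that mean the same thing
matches["position"] = matches["position"].replace(
    {"left-winger": "winger", "right-winger": "winger"}
)
```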
4) Feature Engineering
Models are only as good as the features you train them with.
The data you have will supply most of the features, but you want to be creative and find ways to add more features (to make the model better).
This may involve finding similar datasets from the internet to join onto your data, or making new features from your existing data.
For example, if ‘total customer spend’ and ‘number of customers’ are features in your data, you could make an ‘average spend per customer’ feature.
If you have text data, you can extract features like counts of specific words and the sentiment of the text (e.g. is it a positive or negative review?)
If you have dates, you can extract the month, day, year etc.
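The three examples above — a ratio feature, a simple text feature, and date parts — can each be sketched in a line or two of pandas (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "total_customer_spend": [1000.0, 600.0],
    "number_of_customers": [20, 30],
    "review": ["great product great value", "poor quality"],
    "signup_date": pd.to_datetime(["2024-03-15", "2023-11-02"]),
})

# Ratio of two existing features
df["avg_spend_per_customer"] = (
    df["total_customer_spend"] / df["number_of_customers"]
)

# Simple text feature: count of a specific word
df["great_count"] = df["review"].str.count(r"\bgreat\b")

# Date parts
df["signup_month"] = df["signup_date"].dt.month
df["signup_year"] = df["signup_date"].dt.year
```

Sentiment scoring would need a third-party library, but word counts and date parts get you surprisingly far on their own.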
5) Split Data
Before you start training your model, you need to split your data into training and testing sets. The training set is used to train the model, while the testing set is used to evaluate its performance.
A common split ratio is 80/20, but this can vary depending on the size of your dataset.
It may be sensible to split your data based on date, as this approach can provide a more realistic estimate of your model’s performance in production. By evaluating on a test set that comes after the training period, you can better understand how the model performs on future, unseen data.
For example, if you’re building a model to predict football scores, you could train it on 2023 data and then evaluate it on 2024 data. This allows you to see how well the model extrapolates into the future. In football, team dynamics can change significantly from year to year due to factors like player transfers, injuries, and changes in coaching staff. By evaluating the model on 2024 data, you can assess how well it adapts to these changes in team performance.
If your data isn’t time-dependent, randomly shuffle the data before splitting to ensure the train and test sets are representative of the whole dataset.
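Both flavours of split can be sketched in a few lines of pandas (the `season` column and the 80/20 cut are assumptions for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "season": [2023] * 8 + [2024] * 2,
    "feature": range(10),
    "target": [0, 1] * 5,
})

# Time-based split: train on 2023, evaluate on 2024
train = df[df["season"] == 2023]
test = df[df["season"] == 2024]

# Random 80/20 split when the data isn't time-dependent
shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)
cut = int(len(shuffled) * 0.8)
train_rand, test_rand = shuffled.iloc[:cut], shuffled.iloc[cut:]
```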
6) Training/Tuning
Training the model involves feeding it the training data and allowing it to learn the patterns. To get the best performance, you’ll need to tune the hyperparameters. Cross-validation is a technique where the training data is split into smaller sets, and the model is trained and validated on these sets in a rotating fashion. This helps in getting a more reliable estimate of the model’s performance and avoids overfitting.
You’ll likely be using a tree-based model like XGBoost or CatBoost. The main hyperparameters to tune are the number of trees, tree depth and learning rate.
Once you’ve found the best hyperparameters with cross-validation, you can retrain the model on the whole training data with the best hyperparameters.
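Here’s what that workflow can look like with scikit-learn’s `GridSearchCV` (using its built-in gradient boosting model rather than XGBoost/CatBoost, so the sketch is self-contained; the synthetic data and the grid values are assumptions). Note that `GridSearchCV` refits the best model on all the training data by default, which is exactly the last step described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic training data standing in for your real features/target
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small grid over the three main hyperparameters
param_grid = {
    "n_estimators": [50, 100],    # number of trees
    "max_depth": [2, 3],          # tree depth
    "learning_rate": [0.05, 0.1],
}

# 3-fold cross-validation over the grid
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0), param_grid, cv=3
)
search.fit(X, y)

# Best hyperparameters, already refit on the whole training data
best_model = search.best_estimator_
```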
7) Test Set Evaluation (Only Do This Once!!!)
After you’ve trained the model, evaluate it on the unseen test set. This gives you an unbiased estimate of how the model will perform on new, unseen data.
It’s crucial to only do this evaluation once, as repeatedly tweaking the model based on test set performance can lead to overfitting.
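In code, the one-time evaluation is deliberately boring — fit on train, score on test, and stop there (synthetic data again, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

# Evaluate on the held-out test set ONCE -- then stop tweaking
test_accuracy = accuracy_score(y_test, model.predict(X_test))
```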
8) Results
Businesses don’t want you to train models for fun.
They want models to deliver business results, like making money or saving time.
So this is when you find out what your model’s all about.
You need to translate model metrics like accuracy and precision into results more relevant to the problem you’re solving.
For example, an accuracy of 90% on a football score prediction model would get translated to ‘the model predicts the correct score 90% of the time’.
You should also use SHAP values and partial dependence plots (PDPs) to see which features have the biggest effect on the prediction, and their relationship with the prediction.
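SHAP itself is a third-party library, but scikit-learn’s built-in permutation importance gives you a similar “which features matter most” view with no extra installs — a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Synthetic data with 2 genuinely informative features out of 4
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score drops
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
importances = result.importances_mean  # one score per feature
```

For PDPs, scikit-learn’s `PartialDependenceDisplay` does the same job of showing each feature’s relationship with the prediction.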
We’re over 1k words now so that’s enough for now.
I had some more secrets I wanted to share but I’ll save them for another time…
Whilst this can help you build a great model, it means nothing if you don’t implement the first and last step of Tailored Projects – Planning a relevant project and productionising it.
P.S. What else do you want me to send emails on?
Let me know here