In this blog post Ryan Jessop – Data Scientist @ Carbon by Clicksco – takes us through the importance of clickstream analysis, the challenges it brings, and the ultimate benefits.
Online user browsing generates vast quantities of typically unexploited data. Investigating this data can be of substantial value to online businesses, enabling applications such as targeted advertising and personalised web pages, and statistics can play a key role by providing the appropriate tools for exploration, modelling, and inference.
The three main problems we will explore are:
- Cleaning a time-based variable
- Classifying online behaviour
- Predicting an online purchase
Our challenge is to use statistical methods and machine learning to discover patterns in online browsing behaviour.
The data takes the form of an anonymous digital footprint, or profile, associated with each unique visitor to any of the web pages within the Carbon network, which collects, manages, and leverages this data.
It's important to emphasise that Carbon is a cloud-based Audience Management Platform (AMP), and a key goal is to develop statistical solutions that can be incorporated into its toolset.
Each online profile is constructed from a sequence of page visits (a clickstream) in which the key variables are: timestamps, web page structure, demographics and device information.
This comprises 59 variables for each of roughly 10M individual page visits, associated with 3M unique profiles, on a daily basis. At this scale, collecting, storing, and managing the data is a non-trivial task.
When scoping the research project area, we swiftly became aware that the size of the data would make approaching the problem challenging.
To build successful applied statistical models, we would need to expand our knowledge and expertise into new technologies.
As a data scientist working at Carbon, I find that collaborating with developers and utilising a range of cutting-edge technologies for big data processing and analysis is essential to finding and creating solutions. We aim to develop statistical models exploiting cluster computing with Apache Spark.
Data Exploration and Cleaning
Prior to our research, we expected the length of time spent on web pages to be a strong indicator of the intent to make an online purchase, i.e. a conversion. Hence, we explored the distribution of the variable session duration, measured in seconds.
Before we can begin any analysis, we require clean data, free from outliers and other anomalies. This improves the reliability and value of our data and makes our statistical models more robust.
We expected to see an exponential decay curve, but note the significant peaks in the data.
This led to the discovery of an artificially created heartbeat, which checks for user activity on the web page.
By labelling these events, we were able to distinguish between the spikes and the underlying distribution, controlling the structure and definition of the variable.
Labelling the heartbeat events and splitting them out of the original distribution allows us to model the variable more accurately, using a Weibull(1.0, 8.2) distribution.
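The labelling step above can be sketched as follows. This is a minimal illustration, assuming the heartbeat fires at a fixed interval (the 30-second interval and tolerance below are hypothetical values, not Carbon's actual configuration): durations that land near a multiple of the heartbeat interval are flagged as spikes, and the rest are kept as the organic distribution.

```python
# Hypothetical heartbeat interval (seconds); the real value would come from
# the page's activity-polling script and is an assumption here.
HEARTBEAT = 30
TOLERANCE = 1  # seconds either side of a heartbeat multiple counts as a spike


def label_heartbeat(durations, heartbeat=HEARTBEAT, tol=TOLERANCE):
    """Split session durations into heartbeat-driven spikes and organic values."""
    spikes, organic = [], []
    for d in durations:
        # Distance to the nearest non-zero heartbeat multiple
        nearest = round(d / heartbeat) * heartbeat
        if nearest > 0 and abs(d - nearest) <= tol:
            spikes.append(d)
        else:
            organic.append(d)
    return spikes, organic


durations = [3, 12, 29, 30, 31, 47, 60, 61, 90, 112]
spikes, organic = label_heartbeat(durations)
# spikes cluster at 30s multiples; organic values feed the Weibull fit
```

Once separated, the organic durations can be fitted against the decay curve without the spikes biasing the estimate.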
Our goal is to model the browsing journey and the probability of conversion using clickstream data. This informs targeted, personalised advertising, making the decision to display advertising content data-driven.
We use a mixture of continuous, categorical and discrete variables, aggregated to the session level. Some examples are:
- number of page visits
- time since previous session
- current/previous session duration
- device type
- time of day
- number of previous sessions/conversions
There are 23 terms in the model, and we used the Spark Machine Learning library to fit a logistic regression model on over 3 million observations.
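In production the model is fitted with Spark's machine learning library; as a small, self-contained illustration of the same idea, the sketch below fits a logistic regression by gradient descent on toy session features. The feature values and labels are invented for the example and are not Carbon data.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit logistic regression by gradient descent; returns [intercept, weights...]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        # Gradient of the negative log-likelihood, averaged over observations
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w


def predict_proba(X, w):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return sigmoid(Xb @ w)


# Toy sessions: [number of page visits, previous conversions]; label = converted
X = np.array([[1, 0], [2, 0], [8, 1], [10, 2], [3, 0], [12, 1]], dtype=float)
y = np.array([0, 0, 1, 1, 0, 1], dtype=float)
w = fit_logistic(X, y)
```

On this toy data the fitted weight on page visits comes out positive, mirroring the interpretation of the coefficient estimates discussed next.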
We model checkout purchases within the furniture e-commerce segment of Carbon data, and we display a sample of the coefficient estimates from the logistic regression model. The positive coefficient for the number of previous page visits suggests that, as this variable increases, the predicted probability of conversion rises.
Below, the confusion matrix measures the success of this particular model on a small sample of test data. Overall, the model under-predicts actual conversions and produces too many false positives. We are investigating outliers that may be affecting the model.
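The metrics we read off a confusion matrix can be computed directly from its four counts. The counts below are hypothetical, chosen only to illustrate the pattern described above: accuracy can look high even while recall on conversions is poor.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Illustrative counts, not the model's real figures: conversions are rare,
# so many are missed (fn) while overall accuracy still looks strong.
precision, recall, accuracy = confusion_metrics(tp=30, fp=20, fn=50, tn=900)
```

With these counts, recall is only 0.375 while accuracy is 0.93, which is why we track precision and recall rather than accuracy alone for a rare-event target like conversion.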
Predicting conversions can be further explored using a Markov chain model, identifying states and transitions to represent a customer's path through the website. With enough data we can estimate the probability of a user moving from one state to another, which then allows us to discover common sequences of states. In real time, predicting which profiles are likely to follow similar behaviour patterns to known conversions will have a critical impact on targeted advertising. The states correspond to page types such as home, article, search result, and subscription.
If the chain is currently in state i, then it moves to state j at the next step with a probability denoted by p(i, j) = P(X_{n+1} = j | X_n = i).
Below is a heat map of page type transition probabilities for a publisher, which we can use to find the probability of a subscription in real-time.
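The transition probabilities behind such a heat map can be estimated by simple counting: for each state, tally where users went next and normalise each row. The page types and journeys below are illustrative, not real Carbon data.

```python
from collections import defaultdict


def transition_matrix(journeys):
    """Estimate p(i, j) from observed state sequences by counting transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for journey in journeys:
        for a, b in zip(journey, journey[1:]):
            counts[a][b] += 1
    # Normalise each row so outgoing probabilities sum to 1
    probs = {}
    for a, row in counts.items():
        total = sum(row.values())
        probs[a] = {b: n / total for b, n in row.items()}
    return probs


# Illustrative clickstreams over page types
journeys = [
    ["home", "article", "competition", "subscription"],
    ["home", "article", "article"],
    ["home", "competition", "subscription"],
]
P_hat = transition_matrix(journeys)
```

Each row of `P_hat` is one row of the heat map: for example, both observed visits to a competition page were followed by a subscription, so that estimated transition probability is 1.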
Focusing on the subscription column suggests we should direct attention to Competition Article pages to incentivise traffic. It could also suggest further structural changes the publisher could make to increase the volume of traffic to the Subscription web page, such as providing more navigation links from Articles and Tier 1 or Tier 2 pages to Competition Article pages.
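Given an estimated transition matrix, the real-time question "how likely is this user to reach the subscription page within k steps?" reduces to a matrix power. The matrix below is hypothetical (the real probabilities would come from the heat map), with subscription treated as an absorbing state.

```python
import numpy as np

# Hypothetical transition matrix over page types; rows sum to 1.
# State order: 0 = home, 1 = article, 2 = competition article,
# 3 = subscription (absorbing: once reached, the chain stays there).
P = np.array([
    [0.10, 0.60, 0.25, 0.05],
    [0.20, 0.40, 0.30, 0.10],
    [0.05, 0.15, 0.40, 0.40],
    [0.00, 0.00, 0.00, 1.00],
])


def prob_subscribed_within(P, start, k, sub=3):
    """P(reach the absorbing subscription state within k steps from `start`)."""
    # Because subscription is absorbing, the (start, sub) entry of P^k is the
    # probability of having subscribed at some point within k steps.
    return np.linalg.matrix_power(P, k)[start, sub]
```

Scoring a live profile then amounts to looking up its current state and reading off this probability, which grows with k as more of the journey is allowed to unfold.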
In summary, big data, created by internet browsing, is noisy and needs careful cleaning and preprocessing. Once this has been done, statistical approaches can be used to identify consumer behaviours which lead to personalisation and higher quality targeted advertising.
Author: Ryan Jessop, Data Scientist @ Carbon by Clicksco