by Maarit Widmann and Rosaria Silipo

Machine learning and data visualization for clickstream analysis

how-to
Apr 24, 2019 · 15 mins
Analytics, Data Visualization, Machine Learning

How to use the KNIME Analytics Platform to understand the behavior of website visitors

Before analyzing customer data, we need to describe the customers. Descriptive features for customers usually revolve around three categories: revenues, demographics and behavior. While revenues and demographics are easy to quantify, customer behavior is harder to define and therefore harder to quantify.

Customer behavior depends heavily on the kind of business. Behavior in energy usage requires different metrics than behavior in newspaper reading. Loyalty to the business is measured differently in subscription-based businesses than in a traditional retail store. When it comes to online shopping, the number of metrics proposed to measure user behavior is just overwhelming.

How long does it take before we finally decide to buy a product? How long do we read the product description? How many times do we come back to the same page in search of a convincing reason to buy? How many other product pages do we explore to compare?

We are not all the same when it comes to buying. There are the impulsive buyers, the buyers who need deep reflection before buying, the buyers who need comparisons to be convinced, and so on. We all follow our own buying path—even more so when it comes to online shopping.

Clicks, visiting times, purchases, and related actions are recorded on all websites. If you are just a guest, your actions are recorded anonymously. If you are a known customer, your actions are recorded in connection with your user ID. Anonymously or not, all of us leave a trail as we click our way from page to page.

Clickstream analysis is the branch of data science associated with collecting, summarizing, and analyzing the mass of data from website visitors. With this knowledge, the online shop can optimize its service, including temporary advertisements, targeted product suggestions, better webpage layout, and improved navigation options.

Notice that recommendations and predictions based on clickstream data aggregations rely on the mass of shoppers and lose track of the behavior of the single shopper. The mass of users (customers) who behave similarly to the current user (customer) provides the information needed to optimize the current user’s experience.

In this article, we describe some classic clickstream metrics on a data set of web visits in an online shop by known and anonymous customers, and we draw some conclusions from such behavioral features.

The entire application behind our clickstream analysis was implemented using KNIME Analytics Platform. To get started with KNIME, see “How to use KNIME for data science.”

Figure 1: Features to quantify and describe website customers.

The clickstream data set

Typically, a website log file contains information about the visit date and time stamp, visited URLs, user IP address, user location, and optionally user ID. For registered users, the data are enriched by personal information such as age, gender, location, family status, and interests.

For this project, we used clickstream data provided by Hortonworks, containing samples of visits to the website of an online shop stored across three files.

  1. Web session data. This file shows user IDs, time stamps, visited webpages, and purchase information, as extracted from the original web log file.
  2. User data. This file contains birthdate and gender associated with the user IDs, where available.
  3. Web page categories. The third file is a map of webpages and their associated metadata, such as homepage, customer reviews, video reviews, celebrity recommendations, and product pages.
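
As a minimal sketch of how the three files relate to each other, the snippet below joins a few made-up records in plain Python. The field names (user_id, url, purchase, and so on) and the sample values are assumptions for illustration only; the actual column names in the Hortonworks files may differ, and the article's analysis itself was built with KNIME nodes rather than code.

```python
# Hypothetical sample rows; real column names in the source files may differ.
sessions = [
    {"user_id": "u1", "timestamp": "2019-03-04 09:15:00", "url": "/home", "purchase": False},
    {"user_id": "u1", "timestamp": "2019-03-04 09:17:30", "url": "/product/42", "purchase": True},
    {"user_id": None, "timestamp": "2019-03-04 21:05:00", "url": "/reviews/42", "purchase": False},
]
users = {"u1": {"birthdate": "1995-06-01", "gender": "F"}}
page_categories = {
    "/home": "Home page",
    "/product/42": "Product",
    "/reviews/42": "Customer reviews",
}


def enrich(click):
    """Attach user attributes and the page category to one click record."""
    user = users.get(click["user_id"], {})  # anonymous visitors have no user record
    return {
        **click,
        "birthdate": user.get("birthdate"),
        "gender": user.get("gender"),
        "category": page_categories.get(click["url"], "Unknown"),
    }


enriched = [enrich(click) for click in sessions]
for row in enriched:
    print(row["user_id"], row["category"], row["gender"])
```

Anonymous clicks keep an empty gender and birthdate, which mirrors the "where available" caveat on the user data.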

Descriptive features for customer behavior

Summarizing, in the raw data we have user IDs, age and gender, page metadata, time stamps, and subsequent clicks. How can we describe our visitors using these data?

The easiest metric is always a count: total number of users and total number of visits to the website. These total numbers are an easy-to-achieve measure of the website popularity. However, they do not tell us much about the customers per se.

In order to know more about the customer base, we could segment customers, for example, by gender and age. In general, how old are the visitors of our website? And in particular, how often are people in a certain age category visiting the website? Does the website attract more women than men? Are women returning more often than men?

A second commonly used group of metrics revolves around time. Are visits more frequent during some particular time of the day or day of the week? How long do we stare at a page before moving to the next one? Here, the average visit duration and visit frequency by time of day or day of week are two commonly used metrics.

Another group of metrics is about page content. Using the page metadata, and in particular their categorization, do some pages get more visits than others? Do visitors spend more time reading some pages than others? Are there dead pages that nobody ever visits? Let’s customize the previous time-based metrics for the single page categories. The average visit duration and the visit frequency by page category are other commonly used metrics. In addition, the number of clicks for a certain page category is a measure of the engagement generated by content.

Finally, the sequence of clicks leading to a purchase is what every owner of an online shop would like to know. Is there a special sequence of clicks that most purchasers follow? Is there a sequence of clicks that non-purchasers follow? How many paths can I walk down on my website to reach the purchase page?

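There are many other metrics that can be implemented to better understand how your online store works and how your customer base is approaching your business. In this article, we will only refer to the metrics covered in the sections below: the number of users and visits by age and gender, web page popularity by day of week, purchase frequency by time of day or day of week, and click patterns.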

Number of users and visits by age and gender

The first questions are always about demographics. How old are the people visiting the webpage? Does the webpage attract more women than men or vice versa?

By age

We defined four age bins: Generation Z (24 years old or less), Generation Y (between 25 and 39 years old), Generation X (from 40 to 54 years old), and finally Baby Boomers (55 years old or more). Let’s count the number of users and visits by age bin (Figure 2).

The number of users and the number of visits follow a similar pattern across the four age bins. The customer base is dominated by the Generation Z and Generation Y groups, which together make up roughly three quarters of all visitors and all visits. This reflects the general trend that the younger segment of the population is more prone to shopping on the internet.
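
Counting users and visits by age bin could be sketched as follows. This is an illustration under stated assumptions, not the KNIME workflow itself: the bins are taken as non-overlapping with cut-offs at 25, 40, and 55 years, the reference day for computing age is arbitrary, and the visit records are invented.

```python
from collections import Counter, defaultdict
from datetime import date

REFERENCE_DAY = date(2019, 4, 24)  # assumed day on which ages are computed


def age_on(birthdate, today):
    """Age in full years on the given day."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


def age_bin(age):
    """Map an age to one of four bins (assumed cut-offs: 25, 40, 55)."""
    if age < 25:
        return "Generation Z"
    if age < 40:
        return "Generation Y"
    if age < 55:
        return "Generation X"
    return "Baby Boomers"


# Hypothetical visits: one (user ID, birthdate) pair per visit.
visits = [
    ("u1", date(1998, 5, 1)),
    ("u1", date(1998, 5, 1)),
    ("u2", date(1985, 1, 10)),
    ("u3", date(1960, 12, 31)),
]

visits_per_bin = Counter()
users_in_bin = defaultdict(set)
for user_id, birthdate in visits:
    label = age_bin(age_on(birthdate, REFERENCE_DAY))
    visits_per_bin[label] += 1
    users_in_bin[label].add(user_id)  # a set counts each user only once

users_per_bin = {label: len(ids) for label, ids in users_in_bin.items()}
print(dict(visits_per_bin))
print(users_per_bin)
```

Keeping users in a set per bin separates the two metrics: a returning visitor raises the visit count but not the user count.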

By gender

Let’s now check the same metrics by gender (Figure 2). The results are somewhat uninformative, since the website is visited by men and women in equal measure and both genders are equally active in terms of number of visits. These pie charts give no hint about possible marketing actions targeting women versus men.

Figure 2: Visualizing the number of visitors and number of visits to the website according to age group and gender.

Web page popularity by day of week

We know now that the majority of the users are younger than 40 years old, and men and women are equally active. Next, we want to explore the importance of the webpage content for our visitors. Let’s consider the categories of webpages as defined in the page category map:

  • Home page – the welcome page and hub for all other pages
  • Product – product description pages
  • Reviews – pages containing product reviews
    • By customers – product reviews by customers
    • Videos – live user feedback on products
  • Celebrity recommendations – webpages including recommendations by celebrities

The other parameter to consider, together with the webpage category, is the day of the week. Are visitors more active on specific pages on some days of the week rather than others? Are there more visits on recommendation and review pages on weekends rather than weekdays?

Visit duration

Let’s start with the plot on the left in Figure 3, which shows the average visit duration (in minutes) according to day of week and page category. There we see a slight increase over the weekend in time spent on the website, probably because people have more time during the weekend to gather information for their purchase. The differences between weekdays and weekend days are clearest for the customer review, video review, and celebrity recommendations pages.
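
One way to compute such an average is sketched below. The article does not state how the time spent on a page is derived, so the sketch assumes it is the gap between a click and the next click in the same session, and it skips the last click of each session because its duration is unknown. Session IDs, timestamps, and categories are invented.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical clicks: (session ID, timestamp, page category).
clicks = [
    ("s1", datetime(2019, 3, 4, 9, 0), "Home page"),
    ("s1", datetime(2019, 3, 4, 9, 2), "Product"),
    ("s1", datetime(2019, 3, 4, 9, 7), "Customer reviews"),
    ("s2", datetime(2019, 3, 9, 20, 0), "Product"),
    ("s2", datetime(2019, 3, 9, 20, 9), "Home page"),
]


def average_duration(clicks):
    """Average minutes per (day of week, page category)."""
    by_session = defaultdict(list)
    for session_id, ts, category in clicks:
        by_session[session_id].append((ts, category))

    minutes = defaultdict(list)
    for visit in by_session.values():
        visit.sort()  # chronological order within the session
        # Pair each click with the following one; the last click is skipped.
        for (ts, category), (next_ts, _) in zip(visit, visit[1:]):
            key = (DAYS[ts.weekday()], category)
            minutes[key].append((next_ts - ts).total_seconds() / 60)
    return {key: sum(values) / len(values) for key, values in minutes.items()}


avg = average_duration(clicks)
for (day, category), value in sorted(avg.items()):
    print(day, category, value)
```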

Number of clicks

The plot on the right in Figure 3 shows the number of clicks by weekday on pages belonging to a given category. The peak on Monday is clear for all page categories. It seems that visitors read throughout the week, mainly on weekends, and proceed with more exploration, or even a purchase, on Mondays.

Figure 3: Visualizing average visit duration in minutes and number of visits according to day of week and page category.

Similar to the average visit duration, the most popular pages here are the homepage and the various product pages, whereas the least popular category is the celebrity recommendations. Apparently, most people do not care what celebrities think when it comes to a purchase.

Purchase frequency by time of day or day of week

What we know so far is that some categories are more popular than others, and that some days have more traffic than others. The majority of the users are younger than 40 years old, and males and females are visiting the website in equal numbers. This information helps us even more if we combine it with purchase information. Does the number of visits correlate with the number of purchases? Or does visiting the website at unpopular times indicate shopping “with purpose”?

Let’s check the graphs in Figure 4 showing the absolute and normalized number of visits by time of day, day of week, and session purchase. Here, the purchase information defines the colors: blue for visits with a purchase and orange for visits with no purchase.

Figure 4: Visualizing the number of visits according to time of day, day of week, and purchase or no purchase.

Normalized purchase frequency

Here, the number of visits with and without a purchase on each day and at each time of day are normalized by the total number of visits for the same day or time of day. In the line plot on the left, we see that about 60 percent of all visits result in a purchase during weekdays versus 40 percent to 50 percent during weekends.

In the line plot on the right, we see that the percentage of visits with a purchase increases toward the afternoon and evening, when the highest percentage of purchases happens.
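
The normalization described above amounts to dividing the visits with a purchase by all visits in the same group. A small sketch, with invented visits and with time-of-day boundaries that are purely an assumption (the article does not define them):

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def day_of_week(ts):
    return DAYS[ts.weekday()]


def time_of_day(ts):
    """Assumed boundaries; the article does not specify them."""
    if 6 <= ts.hour < 12:
        return "morning"
    if 12 <= ts.hour < 18:
        return "afternoon"
    if 18 <= ts.hour < 23:
        return "evening"
    return "night"


def purchase_share(visits, key):
    """Share of visits with a purchase, per group defined by key(timestamp)."""
    total, bought = Counter(), Counter()
    for ts, purchased in visits:
        group = key(ts)
        total[group] += 1
        bought[group] += int(purchased)
    return {group: bought[group] / total[group] for group in total}


# Hypothetical visits: (start time, whether the visit ended in a purchase).
visits = [
    (datetime(2019, 3, 4, 9, 0), True),
    (datetime(2019, 3, 4, 14, 0), False),
    (datetime(2019, 3, 4, 20, 0), True),
    (datetime(2019, 3, 9, 15, 0), True),
    (datetime(2019, 3, 9, 23, 30), False),
]

by_day = purchase_share(visits, day_of_week)
by_time = purchase_share(visits, time_of_day)
print(by_day)
print(by_time)
```

Because each group is divided by its own total, a quiet period with few visits can still show a high purchase share.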

Absolute purchase numbers

We have chosen to represent the absolute purchase numbers through bar charts. Monday is by far the busiest day in number of visits, both visits ending with a purchase and visits that do not. This is similar to what we found with the number of clicks across all page categories in Figure 3.

Again, the most popular times to visit are afternoon and evening. Morning visitors are the least likely to purchase. Although few visitors come at night, 50 percent of them actually make a purchase.

Click patterns

Let’s move on to a more detailed level. From which page do the users start? Can we detect a common click sequence? How good is the homepage in forwarding visitors to the product pages? Here, we examine two features: the click sequence and the probability of going from one page to the next.

Click sequence

We decided to represent sequences of clicks occurring at least twice with a sunburst chart (Figure 5). Colors are associated with different page categories. The first clicks comprise the innermost ring. Further clicks comprise the external rings. Selecting an area within an external ring produces the Click Sequences graphic above the sunburst chart as shown in Figure 5.

The green and yellow sections make up almost 75 percent of the number of clicks in the innermost ring. This means that almost three of four visits start at either the homepage or a product page, similar to the number of visits in the line plot in Figure 3. Both the green and yellow sections are divided in two parts, one part with further clicks and one part without. About half of the visitors stop the click sequence at the homepage or at a product page.
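
The data behind such a chart can be approximated by counting identical click sequences and keeping those that occur at least twice. The sketch below uses invented sessions and only prepares the counts; it does not draw the sunburst chart.

```python
from collections import Counter

# Hypothetical sessions: ordered page categories clicked in each visit.
sessions = {
    "s1": ["Home page", "Product"],
    "s2": ["Home page", "Product"],
    "s3": ["Product"],
    "s4": ["Product"],
    "s5": ["Customer reviews", "Product", "Home page"],
}

# Count identical sequences; tuples are hashable, lists are not.
sequence_counts = Counter(tuple(path) for path in sessions.values())
frequent = {seq: n for seq, n in sequence_counts.items() if n >= 2}

# Share of visits per first page category (the innermost ring).
first_clicks = Counter(path[0] for path in sessions.values())
first_share = {category: n / len(sessions) for category, n in first_clicks.items()}

print(frequent)
print(first_share)
```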

Transition probability

The heat map on the right in Figure 5 shows the transition probability between two page categories. On the y-axis, we have the page category for the first click, and on the x-axis, the page category for the next click. The color scheme transitions from purple (low likelihood) to orange (high likelihood).

The most probable next categories are homepage and product page for all page categories. Celebrity recommendations and video reviews represent the least probable next clicks for all categories, which is also in line with the line plot in Figure 3.
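
A transition probability of this kind can be estimated by counting pairs of consecutive clicks and normalizing each row by the total number of transitions leaving the first category. The paths below are invented, and the function name is only illustrative.

```python
from collections import Counter, defaultdict


def transition_probabilities(paths):
    """Estimate P(next category | current category) from click paths."""
    counts = defaultdict(Counter)
    for path in paths:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


# Hypothetical click paths, one list of page categories per session.
paths = [
    ["Home page", "Product", "Home page"],
    ["Home page", "Product", "Customer reviews"],
    ["Home page", "Video reviews"],
]

probabilities = transition_probabilities(paths)
for current, row in probabilities.items():
    print(current, row)
```

Each row corresponds to one value on the y-axis of the heat map and sums to 1. Categories that never appear as a first click of a pair have no row.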

Figure 5: Visualizing typical click sequences and transition probability between two page categories.

KNIME data processing workflow

KNIME Analytics Platform is open source software for data science, covering all of your data needs from data ingestion and data blending to data wrangling and data visualization, from machine learning algorithms to reporting to deployment, and more. KNIME provides a graphical user interface (GUI) for visual programming that makes it intuitive and easy to use, considerably reducing the learning time. KNIME Analytics Platform has also been designed to be open to different data formats, data types, data sources, data platforms, and external tools.

Computing units in the KNIME GUI are represented as small colorful blocks, named “nodes.” Assembling nodes in a pipeline, one after the other, implements a data processing application. The pipeline is referred to as the “workflow.” Figure 6 shows the complete clickstream analysis workflow, which is downloadable for free from the KNIME EXAMPLES server under 50_Applications/52_Clickstream_Analysis.

The clickstream analysis workflow has three main parts: data preprocessing, data preparation for visualization, and visualization. Starting from the left, the first part of the workflow (data preprocessing) provides data access, session identification, data cleaning, calculation of customer age, and purchase information.
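
The article does not say how sessions are identified in the workflow. A common heuristic, shown here only as an assumption, is to start a new session whenever the same user has been inactive for longer than a timeout; the 30-minute value, the session ID format, and the sample clicks are all hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def assign_sessions(clicks):
    """Label each (user ID, timestamp) click with a session ID.

    A new session starts at a user's first click, or when the gap to that
    user's previous click exceeds SESSION_GAP.
    """
    last_seen = {}
    session_number = {}
    labeled = []
    for user_id, ts in sorted(clicks):  # by user, then chronologically
        if user_id not in last_seen or ts - last_seen[user_id] > SESSION_GAP:
            session_number[user_id] = session_number.get(user_id, 0) + 1
        last_seen[user_id] = ts
        labeled.append((user_id, ts, f"{user_id}-{session_number[user_id]}"))
    return labeled


# Hypothetical clicks from two users.
clicks = [
    ("u1", datetime(2019, 3, 4, 9, 0)),
    ("u2", datetime(2019, 3, 4, 9, 5)),
    ("u1", datetime(2019, 3, 4, 9, 10)),
    ("u1", datetime(2019, 3, 4, 10, 0)),
]

labeled_clicks = assign_sessions(clicks)
for user_id, ts, session_id in labeled_clicks:
    print(user_id, ts.time(), session_id)
```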

The second part prepares the data for visualization. From the top, we calculate session metrics by user age and gender. Then we define the sequence of clicks (the clickstream) by session and its statistics. Lastly, we calculate the frequency of bounce pages by category.

The third part of the workflow on the right handles visualization—producing graphs and plots, visualizing the correlation between user data and session metrics, the sequence of clicks, and the page statistics.

Figure 6: KNIME Analytics Platform workflow for clickstream analysis. From the left: data access, feature engineering, data preparation for visualization, and visualizing clickstream data in interactive composite views.

So, who visits your webpages? When do they visit, and how do they get there? What distinguishes purchasers from lookie-loos? Applying clickstream analysis enables us to explore and detect patterns and relationships in the mass of data from the visitors of a website. We can identify customer trends and analyze the different paths customers take to product information and, most importantly, to purchases.

There are two important steps in clickstream analysis: 1) defining quantitative features to describe visitor behavior, and 2) visualizing these features in a way that allows us to discover meaningful patterns.

In this article, we have implemented and displayed a number of classic metrics to describe the behavior of website visitors—from the average duration of a visit to the click sequence and from the visitors’ demographics to the most popular times to visit, gather information, and purchase.

Maarit Widmann is a junior data scientist at KNIME. She started with quantitative sociology and holds her bachelor’s degree in social sciences. The University of Konstanz made her drop the “social” as a Master of Science. Her ambition is to communicate the concepts behind data science to others in videos and blog posts. Follow Maarit on LinkedIn.

Rosaria Silipo is principal data scientist at KNIME. She is the author of more than 50 technical publications, including her most recent book “Practicing Data Science: A Collection of Case Studies”. She holds a doctorate degree in bio-engineering and has spent 25 years working on data science projects for companies in a broad range of fields, including IoT, customer intelligence, the financial industry, and cybersecurity. Follow Rosaria on Twitter, LinkedIn, and the KNIME blog.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.