Multilevel Regression and Post-stratification (MRP) for Brand Tracking: PyConDE & PyData Berlin Recap

This post summarizes key ideas from a presentation at PyConDE & PyData Berlin 2019 by Korbinian Kuusisto, a data scientist at Dalia Research. The presentation covers why Dalia Research began using multilevel regression and post-stratification (MRP) to create thousands of audience segmentations with Latana, the company’s brand tracking solution. While Latana’s MVP was built using simple logistic regression, the team decided to switch to a more complex Bayesian framework and created a self-learning model that can build on past knowledge for niche audiences.

PyConDE & PyData Berlin 2019 was a 3-day conference that brought together professionals in the software industry to dive deep into their product learnings, side projects, and latest technical innovations. Talks covered topics such as developer versus enterprise approaches, data visualisation, machine learning, ethics, and hardware.

Dalia’s data scientist, Korbinian Kuusisto, shared our company’s learnings in product development for our survey and research platform and the benefits of applying complex statistical methods to give brand clients better quality data. 

Niche Audience Personas Help Product, but not Brand Managers

Korbinian Kuusisto Dalia Pycon

Photo Credit: Corrie Bartelheimer

Specific personas help focus product development, but they also present challenges for data-driven brand managers. The brand managers that Dalia Research’s Latana team works with usually know who they want to reach, but how do they reach enough of them to get meaningful data points? Brand managers also want to know whether the money they spent on campaigns was well spent. How do they find out if these campaigns actually reached their target audiences?

Latana niche audience profile

Presentation slides for PyConDE and PyData Berlin 2019

Imagine trying to reach the niche audience of young, female Twitter users. Even if your sample has 1,500 women, 1,000 Twitter users, and 600 young people, only about 20 respondents may have all three characteristics. If 7 of those 20 say they love your brand, can you really say that 35% of young, female Twitter users do? With a sample that small, the confidence interval is very wide: the true figure could be anywhere between roughly 15% and 55%, which is not a solid basis for decision making. Using a traditional quota method to reach niche audiences simply isn’t feasible for brands.
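
To see where a range like that comes from, here is a minimal sketch of two common 95% confidence intervals for a proportion. The function names are our own for illustration, and the choice of interval (normal approximation versus Wilson) is an assumption; the talk does not say which method produced the quoted bounds.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clip to [0, 1] because the approximation can overshoot.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval, usually better behaved for small samples."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    for name, interval in (("Wald", wald_interval), ("Wilson", wilson_interval)):
        low, high = interval(7, 20)
        print(f"{name}: {low:.0%} to {high:.0%}")
```

With 7 of 20, the normal approximation gives about 14% to 56%, in line with the range quoted above. The Wilson interval is shifted slightly upward but just as wide.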

Niche audiences have high confidence bounds

The Dalia online survey platform can help clients reach thousands of users a day, segmented by location, age, education level, and other factors. Though we could simply increase the quotas for our surveys to reach the minimum statistical confidence level, there are more efficient ways to use the data we were already gathering from respondents who only partially fit the target audience. Our solution was to apply multilevel regression and post-stratification (MRP), a technique already used in political science, and build a model that uses information from all the responses (from anyone who is female, and/or young, and/or a Twitter user).
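
For readers new to MRP, the post-stratification step is usually written as a population-weighted average of the model's per-cell estimates. The notation below is our own and is not taken from the slides:

```latex
\hat{\theta}_S \;=\; \frac{\sum_{j \in S} N_j \, \hat{\theta}_j}{\sum_{j \in S} N_j}
```

Here $j$ indexes demographic cells (for example, one combination of gender, age group, and Twitter use), $\hat{\theta}_j$ is the regression model's estimate for cell $j$, $N_j$ is the number of people in that cell in the target population (typically taken from census-style data), and $S$ is the set of cells that make up the audience of interest.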

Latana MRP for niche audiences

We used the open-source library Scikit-learn to build our first MRP solution with simple logistic regression (LR). Scikit-learn is a machine learning library implemented in Python that offers a wide range of models out of the box. We tested a range of those to create a brand tracker with a dashboard that allowed our clients to segment thousands of audiences. However, since survey data is usually ‘small data’, more complex non-linear models did not improve the estimates compared to a simple linear model. This is why we stuck with LR as the backbone of the first iteration of our brand tracker product, Latana.
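
The sketch below illustrates the general idea of regression followed by post-stratification using Scikit-learn's `LogisticRegression`. It is not Latana's actual code: the survey data is simulated, the "true" effects and the population counts per cell are made up, and a plain (non-multilevel) logistic regression stands in for the first-iteration model described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated survey: columns are female, young, twitter_user (0/1 flags).
# The shares are hypothetical and chosen so that the full niche is rare.
n = 3000
X = (rng.random((n, 3)) < [0.5, 0.2, 0.15]).astype(int)

# Hypothetical "true" effects, used only to simulate the answers.
logit = -1.0 + 0.3 * X[:, 0] + 0.5 * X[:, 1] + 0.4 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Regression step: every respondent informs the coefficients,
# including those who match only one or two of the traits.
model = LogisticRegression()
model.fit(X, y)

# One row per demographic cell (2 x 2 x 2 = 8 cells).
cells = np.array([[f, a, t] for f in (0, 1) for a in (0, 1) for t in (0, 1)])
cell_estimates = model.predict_proba(cells)[:, 1]

# Hypothetical population counts per cell (in thousands), same order as `cells`.
population = np.array([300, 60, 80, 30, 310, 55, 85, 28])


def poststratify(mask):
    """Population-weighted average of the cell estimates selected by `mask`."""
    weights = population[mask]
    return float(np.sum(weights * cell_estimates[mask]) / np.sum(weights))


niche = poststratify((cells == [1, 1, 1]).all(axis=1))  # young, female Twitter users
women = poststratify(cells[:, 0] == 1)                   # all women

if __name__ == "__main__":
    print(f"Young female Twitter users: {niche:.1%}")
    print(f"All women: {women:.1%}")
```

The niche estimate borrows strength from all 3,000 simulated respondents rather than only the few dozen who match all three traits. The price is the model's assumption that the effects combine additively on the logit scale.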

Quantifying Uncertainty: Logistic Regression to Bayesian Method

However, we did not stop at simple logistic regression. We wanted a model that could segment audiences, stay robust when very little data is available for certain characteristics, and quantify uncertainty better. By quantifying uncertainty, brand managers can have greater confidence in their data: for example, that a 5% to 8% increase in a brand KPI is not a fluke.
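
As a rough illustration of what "not a fluke" can mean in Bayesian terms, the sketch below estimates the probability that a KPI really went up between two survey waves. It assumes a simple Beta-Binomial model with a uniform prior, which is far simpler than the MRP model from the talk, and it reads the example as a KPI moving from 5% to 8%. The sample sizes and the function name are hypothetical.

```python
import random


def prob_increase(k1, n1, k2, n2, draws=20000, seed=1):
    """Monte Carlo estimate of P(rate in wave 2 > rate in wave 1).

    Each wave gets an independent Beta(1 + successes, 1 + failures)
    posterior, i.e. a uniform prior updated with the observed counts.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p1 = rng.betavariate(1 + k1, 1 + n1 - k1)
        p2 = rng.betavariate(1 + k2, 1 + n2 - k2)
        wins += p2 > p1
    return wins / draws


# 5% -> 8% with 400 respondents per wave (hypothetical).
small = prob_increase(20, 400, 32, 400)
# The same rates with 4,000 respondents per wave (hypothetical).
large = prob_increase(200, 4000, 320, 4000)

if __name__ == "__main__":
    print(f"P(increase), n=400 per wave:  {small:.3f}")
    print(f"P(increase), n=4000 per wave: {large:.3f}")
```

The same observed change is far more convincing with more data behind it, and a posterior probability makes that difference explicit in a way a single point estimate cannot.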

In addition, the Bayesian framework allows us to increase precision over time by incorporating past knowledge in our model. 
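
A toy example of how past knowledge can sharpen an estimate: the posterior from one survey wave becomes the prior for the next. This again assumes a conjugate Beta-Binomial model rather than the model used in Latana, and the per-wave counts are invented for illustration.

```python
import math


def update(prior, successes, failures):
    """Conjugate update of a Beta(a, b) prior with new binomial data."""
    a, b = prior
    return a + successes, b + failures


def mean_sd(a, b):
    """Mean and standard deviation of a Beta(a, b) distribution."""
    mean = a / (a + b)
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return mean, sd


# Hypothetical (positive, negative) answers from a niche audience per wave.
waves = [(7, 13), (9, 11), (6, 14)]

posterior = (1.0, 1.0)  # start from a uniform prior
history = []
for successes, failures in waves:
    posterior = update(posterior, successes, failures)
    history.append(mean_sd(*posterior))

if __name__ == "__main__":
    for i, (mean, sd) in enumerate(history, start=1):
        print(f"After wave {i}: mean={mean:.3f}, sd={sd:.3f}")
```

Each wave has only 20 respondents, yet the posterior standard deviation shrinks as waves accumulate. Carrying all past data forward at full weight assumes the underlying opinion is stable over time; a production model would likely need to discount older waves.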

Bayesian method with MRP built with Scikit-learn Python library

Because of Dalia’s tech stack, PyMC3 was a natural solution for us. Currently, Dalia’s platform runs in the cloud on a stack of PyMC3, Django, and AWS. At the same time, other machine learning libraries are available for teams working on other data science solutions.

Bayesian method quantifies uncertainty

Latana’s Bayesian MRP engine offers high-precision tracking of specific target groups, maintains a stable sample composition over time, and reduces fieldwork time and costs because it does not require quota cell-filling. By drastically reducing the noise in the data using MRP, brand managers can pick up small changes in brand awareness over time and zoom in on customer segments without losing the data precision they need for strategy planning.

We hope that our presentation on using PyMC3 for brand tracking encourages data scientists to consider Bayesian methods when building their own products. PyConDE 2019 covered a wide range of topics in Bayesian modelling, Gaussian processes, and uncertainty quantification, and we look forward to seeing more on the same topics at PyConDE 2020.

You can follow Korbinian on Twitter @kuusisto_k and check out his full presentation slides.

You can find out more about MRP here or try our brand tracker, Latana.