This post summarizes key ideas from a presentation at PyConDE & PyData Berlin 2019 by Korbinian Kuusisto, Dalia Research’s data scientist. The presentation covers why Dalia Research began using multilevel regression and post-stratification (MRP) to create thousands of audience segmentations with Latana, the company’s brand tracking solution. While Latana’s MVP was built using simple logistic regression, the team decided to switch to a more complex Bayesian framework and created a self-learning model that can build on past knowledge for niche audiences.
PyConDE & PyData Berlin 2019 was a three-day conference that brought together professionals in the software industry to dive deep into their product learnings, side projects and latest technical innovations. Presentations covered topics such as developer versus enterprise approaches, data visualisation, machine learning, ethics, and hardware.
Dalia’s data scientist, Korbinian Kuusisto, shared our company’s learnings in product development for our survey and research platform and the benefits of applying complex statistical methods to give brand clients better quality data.
Niche Audience Personas Help Product, but not Brand Managers
Specific personas help focus product development, but they also present challenges for data-driven brand managers. The brand managers that Dalia Research’s Latana team works with usually know who they want to reach; the question is how to reach enough of those people to get meaningful data points. Brand managers also want to know whether the money they spent on campaigns was well spent. How do they find out if these campaigns actually reached their target audiences?
Imagine finding the niche audience of young, female Twitter users. Even if your sample contains 1,500 women, 1,000 Twitter users, and 600 young people, only about 20 respondents may have all three characteristics. If 7 of those 20 say they love your brand, can you really say that 35% of young, female Twitter users do? Given the huge confidence bounds, the true value could be anywhere between roughly 15% and 55%, which is far too wide to support decision making. Using a traditional quota method to reach niche audiences simply isn’t feasible for brands.
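To make the width of that range concrete, here is a minimal sketch of two standard 95% intervals for 7 positive answers out of 20. The helper names are ours, not part of any Latana code. The normal-approximation (Wald) interval comes out at roughly 14% to 56%, in line with the range quoted above; the Wilson interval is included because it tends to behave better at small sample sizes.

```python
import math


def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval, usually more reliable for small samples."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# 7 of 20 young, female Twitter users say they love the brand.
wald_low, wald_high = wald_interval(7, 20)
wilson_low, wilson_high = wilson_interval(7, 20)
print(f"Wald:   {wald_low:.2f} to {wald_high:.2f}")
print(f"Wilson: {wilson_low:.2f} to {wilson_high:.2f}")
```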
The Dalia online survey platform can help clients reach thousands of users a day, segmented by location, age, education level, and other factors. Although we could simply increase the quotas for our surveys until we reached the minimum statistical confidence level, there are more efficient ways to use the data we were already gathering from partial fits, meaning respondents who match only some of the target characteristics. Our solution was to apply multilevel regression and post-stratification (MRP), which is already used in political science, and build a model that uses information from all the responses (female, and/or young, and/or a Twitter user).
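The post-stratification half of MRP can be sketched in a few lines: once a model has produced an estimate for every demographic cell, the estimate for any audience is the population-weighted average over the cells it contains. Everything below is an illustrative assumption (the cell definitions, the estimates, the population counts and the `poststratify` helper), not Latana's actual data or code.

```python
# Hypothetical model estimates of "loves the brand" per demographic cell,
# keyed by (gender, age_group). All numbers are invented for illustration.
cell_estimates = {
    ("female", "18-29"): 0.40,
    ("female", "30+"): 0.25,
    ("male", "18-29"): 0.30,
    ("male", "30+"): 0.20,
}

# Hypothetical population counts for the same cells (for example from census data).
population_counts = {
    ("female", "18-29"): 1_000,
    ("female", "30+"): 4_000,
    ("male", "18-29"): 1_200,
    ("male", "30+"): 3_800,
}


def poststratify(estimates, counts, keep=lambda cell: True):
    """Population-weighted average of the cell estimates over the selected cells."""
    cells = [cell for cell in estimates if keep(cell)]
    total = sum(counts[cell] for cell in cells)
    return sum(estimates[cell] * counts[cell] for cell in cells) / total


overall = poststratify(cell_estimates, population_counts)
women = poststratify(cell_estimates, population_counts, keep=lambda c: c[0] == "female")
print(f"overall: {overall:.3f}, women: {women:.3f}")
```

Because the weights come from the population rather than from the sample, the result does not depend on how many respondents happened to land in each cell, which is why quota cell-filling becomes less important.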
We used scikit-learn, an open-source machine learning library implemented in Python, to build our first MRP solution with simple logistic regression (LR). Scikit-learn offers a wide range of models out of the box, and we tested several of them while creating a brand tracker with a dashboard that allowed our clients to segment thousands of audiences. However, since survey data is usually ‘small data’, models that live in more complex non-linear spaces did not improve the approximations compared with a simple linear setting. This is why we stuck with LR as the backbone of the first iteration of our brand tracker product, Latana.
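The idea of learning from partial fits can be illustrated with a small additive logistic regression. To keep the sketch dependency-free it uses plain gradient ascent written by hand as a stand-in for scikit-learn's `LogisticRegression`, without regularisation; the survey counts are invented and the function names are ours. The model estimates one effect each for being female, young and a Twitter user from all respondents, then combines them to estimate the sparsely sampled niche cell.

```python
import math

# Invented survey counts per cell:
# (female, young, twitter) -> (respondents, respondents who love the brand)
cells = {
    (0, 0, 0): (400, 80),
    (0, 0, 1): (300, 75),
    (0, 1, 0): (150, 39),
    (0, 1, 1): (130, 42),
    (1, 0, 0): (500, 125),
    (1, 0, 1): (350, 108),
    (1, 1, 0): (100, 32),
    (1, 1, 1): (20, 7),  # the niche audience: only 20 respondents
}


def predict(weights, features):
    """Predicted probability; weights[0] is the intercept."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return 1.0 / (1.0 + math.exp(-z))


def fit(cells, steps=5000, learning_rate=1.0):
    """Maximise the binomial log-likelihood by gradient ascent."""
    weights = [0.0] * 4
    total = sum(n for n, _ in cells.values())
    for _ in range(steps):
        gradient = [0.0] * 4
        for features, (n, k) in cells.items():
            residual = k - n * predict(weights, features)
            gradient[0] += residual
            for j, x in enumerate(features):
                gradient[j + 1] += residual * x
        weights = [w + learning_rate * g / total for w, g in zip(weights, gradient)]
    return weights


weights = fit(cells)
niche = (1, 1, 1)
respondents, lovers = cells[niche]
raw_estimate = lovers / respondents
model_estimate = predict(weights, niche)
print(f"raw: {raw_estimate:.2f}, model: {model_estimate:.2f}")
```

The model estimate for the niche cell is informed by all 1,950 invented respondents rather than by the 20 who match every characteristic, at the cost of assuming the three effects add up on the logit scale.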
Quantifying Uncertainty: Logistic Regression to Bayesian Method
However, we did not stop at simple logistic regression. We wanted a model that could segment audiences, stay robust when very little data is available for certain characteristics, and quantify uncertainty better. When uncertainty is quantified, brand managers can have greater confidence in their data: for example, that a 5% to 8% increase in a brand KPI is not a fluke.
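As a rough illustration of what "not a fluke" can mean in Bayesian terms, the sketch below reads the example as a KPI moving from 5% in one survey wave to 8% in the next and estimates the probability that the underlying rate really went up. It assumes independent waves and a flat Beta(1, 1) prior, and it uses simple Monte Carlo sampling from the standard library; the function name and sample sizes are hypothetical, and this is a simplification rather than the Latana model.

```python
import random


def prob_increase(k1, n1, k2, n2, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate in wave 2 > rate in wave 1).

    Each wave gets an independent Beta(1 + k, 1 + n - k) posterior,
    which corresponds to a flat prior on the underlying rate.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p1 = rng.betavariate(1 + k1, 1 + n1 - k1)
        p2 = rng.betavariate(1 + k2, 1 + n2 - k2)
        wins += p2 > p1
    return wins / draws


# The same observed move from 5% to 8%, with 100 and with 1,000 respondents per wave.
small_sample = prob_increase(5, 100, 8, 100)
large_sample = prob_increase(50, 1000, 80, 1000)
print(f"n=100 per wave:  {small_sample:.2f}")
print(f"n=1000 per wave: {large_sample:.2f}")
```

With 100 respondents per wave the increase is plausible but far from certain, while with 1,000 per wave it is very likely real, which is the kind of distinction a point estimate alone cannot express.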
In addition, the Bayesian framework allows us to increase precision over time by incorporating past knowledge in our model.
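A minimal way to see how past knowledge sharpens an estimate is the conjugate Beta-Binomial update, where the posterior from one survey wave becomes the prior for the next. This is a deliberately simplified stand-in for a full PyMC3 model, it assumes the underlying rate does not change between waves, and the wave sizes and helper names are invented for illustration.

```python
import math


def update(prior, successes, n):
    """Beta-Binomial conjugate update: returns the posterior (alpha, beta)."""
    alpha, beta = prior
    return alpha + successes, beta + (n - successes)


def mean(params):
    alpha, beta = params
    return alpha / (alpha + beta)


def std(params):
    alpha, beta = params
    total = alpha + beta
    return math.sqrt(alpha * beta / (total**2 * (total + 1)))


flat_prior = (1.0, 1.0)

# Wave 1: 7 of 20 niche respondents love the brand.
wave_1 = update(flat_prior, 7, 20)

# Wave 2: 9 of 25, once starting from wave 1's posterior and once from scratch.
wave_2_with_history = update(wave_1, 9, 25)
wave_2_alone = update(flat_prior, 9, 25)

print(f"with history: {mean(wave_2_with_history):.3f} +/- {std(wave_2_with_history):.3f}")
print(f"alone:        {mean(wave_2_alone):.3f} +/- {std(wave_2_alone):.3f}")
```

Carrying the earlier wave forward gives a narrower posterior than analysing the second wave alone. A production model would typically also discount older waves so that genuine changes over time are not smoothed away.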
Because of Dalia’s tech stack, PyMC3 was a natural solution for us. Dalia’s platform currently runs in the cloud on a stack of PyMC3, Django, and AWS. Teams working on other data science solutions can still choose from the other machine learning libraries available in the Python ecosystem.
Latana’s Bayesian MRP engine offers high-precision tracking of specific target groups, maintains a stable sample composition over time, and reduces fieldwork time and costs because it does not require quota cell-filling. Because MRP drastically reduces the noise in the data, brand managers can pick up small changes in brand awareness over time and zoom in on customer segments for strategy planning without losing precision.
We hope that our presentation on using PyMC3 for brand tracking encourages data scientists to consider Bayesian methods when building their own products. PyConDE 2019 covered a wide range of topics in Bayesian modelling, Gaussian processes and uncertainty quantification, and we look forward to seeing more on the same topics at PyConDE 2020.