After a very successful first meetup last month, we decided to keep going! Our second session took place on Monday, July 16th. We began the evening with some snacks and introductions, followed by our first lecture, in which I explained how we started using machine learning algorithms to match the right surveys with the right users. We also had the pleasure of welcoming an external speaker, Vincent Maurin, Head of Engineering at Glispa Global Group, who taught us all about Kafka.
Using Machine Learning to Explore our Inventory in Real-Time: Process and Key Learnings — Irati R. Saez de Urabain, Data Scientist at Dalia Research
One of the big challenges we face at Dalia is matching the right user with the right survey. For this purpose, we use a number of algorithms, each applying different logic to assign a survey to a user. Some of these algorithms aim to maximize the number of conversions, and therefore send the majority of traffic to offers that are known to perform well. Others need to explore new offers that have the potential to perform well in our system.
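To make the split between exploiting known offers and exploring new ones concrete, here is a minimal sketch of one common way to combine the two, an epsilon-greedy rule. This is a generic illustration and not the logic we run at Dalia; the function name, the `epsilon` parameter, and the survey identifiers are all assumptions made for the example.

```python
import random


def choose_survey(conversion_rates, new_surveys, epsilon=0.1, rng=random):
    """Pick a survey for one user (illustrative epsilon-greedy rule).

    conversion_rates: dict of survey id -> observed conversion rate (known offers)
    new_surveys: list of survey ids without enough data yet
    epsilon: assumed share of traffic spent on exploration
    """
    # Explore: with probability epsilon, send the user to a new offer.
    if new_surveys and rng.random() < epsilon:
        return rng.choice(new_surveys)
    # Exploit: otherwise send the user to the best known offer.
    return max(conversion_rates, key=conversion_rates.get)


rates = {"survey_a": 0.12, "survey_b": 0.30}
print(choose_survey(rates, ["survey_new"], epsilon=0.1))
```

With a small `epsilon`, most traffic still goes to the offers known to convert, while a small share gathers data on new ones.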
During the talk, I explained how we started using machine learning algorithms to explore new offers from our inventory. The need to move towards more advanced ways of exploring came after a period of testing and validating our business ideas with very simple heuristics. At some point, we found ourselves with too many independent variables that we wanted to use simultaneously, so machine learning seemed like the way to go.
Implementing a Machine Learning Model
My presentation described the steps that we generally follow when implementing a machine learning model that will run in production:
- First, we run an initial analysis to validate our business ideas. This analysis is also helpful to identify the relationship between the independent variables and the dependent variable, which can be very useful when doing feature selection.
- During the modeling phase, we prepare the data to be modeled, select the right algorithm that we will use, and tune its input parameters. If everything goes well, by the end of it we have a machine learning model that we can put in production.
- In the iteration testing phase, we test several versions of the model in production using real requests. We test the versions iteratively, sending very little traffic to each of them. By the end of this phase, we select the best-performing model and move on to the A/B test.
- Finally, we perform an A/B test!
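Sending "very little traffic" to each model version can be sketched as follows. This is only one possible approach, assuming requests carry a string identifier; the variant names, the traffic shares, and the `assign_variant` helper are hypothetical and not taken from our production setup.

```python
import hashlib


def assign_variant(request_id, variants, default="current"):
    """Route a request to a model version based on a hash of its id.

    variants: dict of variant name -> share of traffic (fractions summing to <= 1)
    default: name returned for all traffic not claimed by a variant
    """
    # Hash the id so the same request id always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # value in [0, 1)

    upper = 0.0
    for name, share in variants.items():
        upper += share
        if bucket < upper:
            return name
    return default


# Assumed setup: two candidate models, each receiving 1% of requests.
shares = {"model_v1": 0.01, "model_v2": 0.01}
print(assign_variant("request-123", shares))
```

Hashing instead of drawing a random number keeps the assignment stable, which makes it easier to compare the versions afterwards.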
Our exploration model went through several iterations. The winner was the one we describe in the following slide. The formula calculates the probability of conversion for each survey. Once we have all the probabilities, we select the survey with the highest one.
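The selection step can be sketched in a few lines. The formula from the slide is not reproduced here, so the logistic score below is only a stand-in, and the feature names and weights are invented for the example.

```python
import math


def conversion_probability(features, weights, bias=0.0):
    """Hypothetical logistic score standing in for the formula on the slide."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def pick_survey(candidates, weights):
    """Score every candidate survey and return the one most likely to convert."""
    probabilities = {
        survey_id: conversion_probability(features, weights)
        for survey_id, features in candidates.items()
    }
    best = max(probabilities, key=probabilities.get)
    return best, probabilities


# Assumed features and weights, for illustration only.
weights = {"reward": 1.0, "length_minutes": -0.1}
candidates = {
    "s1": {"reward": 0.5, "length_minutes": 10},
    "s2": {"reward": 1.0, "length_minutes": 5},
}
best, probabilities = pick_survey(candidates, weights)
print(best, probabilities)
```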
To view the entire presentation, visit our GitHub.
Kafka as a Streaming Platform — Vincent Maurin, Head of Engineering at Glispa Global Group
During our second presentation, Vincent talked to us about Kafka, a streaming platform for building real-time data pipelines and streaming apps.
The first question to answer is: when should we think about streaming our data? As Vincent pointed out, streaming might help when we need to spread the data across multiple components, when we need fast results, and when performance is affected by hitting the disk. Vincent continued by giving us some great insights about how Kafka’s ecosystem works. More precisely, he talked about three main aspects of the Kafka Ecosystem: Kafka Cluster, Kafka Connect, and Kafka Streams.
1. A Kafka Cluster contains Brokers that are managed and coordinated by ZooKeeper.
- Brokers are the running units of Kafka; they manage the data in the cluster in the form of log files. A single broker instance can handle hundreds of thousands of reads and writes per second, which means it can hold terabytes of messages without a performance impact.
- Producers connect and push data to brokers.
- Consumers read the data and keep track of how many messages they have already consumed by using the partition offset.
2. Kafka Connect integrates Apache Kafka with other systems and makes it easy to add new systems to scalable and secure stream data pipelines. Kafka Connect allows you to connect sources of data as input and sinks as output. The presentation gave an example with a JDBC-compatible database as a source and Elasticsearch as a sink.
3. Kafka Streams allows stream processing in Kafka. A stream processing application is implemented as a plain Java application that doesn't require anything other than Kafka. Kafka Streams offers different ways to define the processing, as well as multiple types of transformations.
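The relationship between brokers, producers, and consumer offsets described above can be illustrated with a toy in-memory model. This sketch does not use Kafka or any Kafka client library; `PartitionLog` and `Consumer` are hypothetical classes that only mimic the idea of an append-only log read through an offset.

```python
class PartitionLog:
    """Toy stand-in for one partition: an append-only list of messages."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Producer side: add a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages=10):
        """Return up to max_messages starting at the given offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """Each consumer remembers its own position (offset) in the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)  # advance past what was just consumed
        return batch


log = PartitionLog()
for event in ["click", "view", "conversion"]:
    log.append(event)

consumer = Consumer(log)
print(consumer.poll(max_messages=2), consumer.offset)
```

Because the log itself is never modified by reads, a second consumer can start from offset 0 and read the same messages independently.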
You can access the rest of Vincent’s presentation here!
Our Upcoming Meetups
Our upcoming meetups will focus on further challenges in data science and engineering. Everyone is welcome to attend, but those with some existing knowledge of data science and/or engineering may find the talks easier to follow than absolute beginners.
How to join us at our next meetup
Due to the summer holidays, we still don’t have a date, but we will announce it soon. Stay tuned!