Julia Kroll, Applied Engineer at Deepgram, will be presenting “Minimizing Request Latency of Self-Hosted ML Models” at P99 CONF 2024.
Note: P99 CONF is a technical conference on performance and low-latency engineering. It’s virtual, free, and highly interactive. This year’s agenda spans Rust, Zig, Go, C++, compute/infrastructure, Linux, Kubernetes, databases, and more.
We hope you’ll join us live October 23-24 to hear the talk and chat with Julia. In the meantime, let’s get to know her a little!
How do you answer the dreaded “tell us about yourself” question?
I’m a multidisciplinary engineer with a love of language, efficiency, and innovation. My specialty is voice AI, and I’ve learned the contours of the field across a variety of roles and products. I’ve tackled challenges along many dimensions: sourcing and structuring training data, shaping knowledge and ontologies, training ML models, enabling speech understanding, creating synthetic voices, serving models, and driving down latency so that the whole conversational experience is fast enough to feel natural. I’m intrigued by how communication often feels so effortless to humans, and by working out how a computer can fit all those pieces together in its own way to hold an engaging, intelligent, productive conversation.
What’s the most interesting project that you’re working on right now – or hoping to start soon?
My company is tackling multilingual models, and I’ve enjoyed being involved in shipping our first model in that space, which can transcribe code-switched English and Spanish speech. It’s simplest to conceptualize languages as discrete entities, but that’s not the reality for many people. Some people grew up speaking multiple languages, others have taken the time to learn additional languages, and I find it very exciting that we’re capturing these perspectives and realities of language within ML models, to create products that are more effective and equitable in real-world circumstances.
What will you be talking about at P99 CONF?
I will be talking about how to minimize request latency of self-hosted ML models. One unique aspect of my company’s product is that we enable customers to self-host our models on their own hardware. This leads to interesting challenges, because the customer has both the control and the responsibility to manage their own ML inference. It’s kind of like giving our customers a specialty baking mix so they can bake their own cookies – we give them guidance on how long to bake them and at what temperature, but many other factors, like which flavorings they choose to mix in, determine whether they end up with their ideal results.
What other P99 CONF talks are you most looking forward to – and why?
I’m looking forward to many of the other talks. One is Will Crichton’s “The Performance Engineer’s Toolkit: A Case Study on Data Analytics with Rust.” My company does all of our ML research and experimentation in Python, and then productionizes our code in Rust, so I’m interested to hear Crichton’s analysis of which techniques he used to slash latency. Another is David Collier-Brown’s “Queues, Hockey Sticks and Performance,” to learn about queues at both the conceptual and practical level.
What do you like most about P99 CONF?
I’ll offer two points. First, it’s a gathering of smart, curious people who want to share their knowledge and learnings with others. Second, it upholds the value of optimizing for all cases. Oftentimes, we throw up our hands and call something “good enough.” And in plenty of cases, it is. But at P99 CONF, we can dig into the extreme cases and ask ourselves: how can we improve our handling of extremes? And in turn, what does examining and minimizing outliers teach us about the system as a whole?
Any performance-related resource recommendations for the P99 CONF community?
To learn more about outliers from a classical ML perspective, I recommend the blog post A Brief Overview of Outlier Detection Techniques, or the academic paper Outlier Detection: Applications And Techniques, depending on the format you prefer. This applies to cases where the p99 isn’t simply a long tail, but may differ fundamentally from the inlier data.
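As a taste of what those resources cover, here is a minimal sketch of one classical technique, z-score-based outlier detection, applied to latency samples. The function name, the sample data, and the threshold are illustrative choices, not from the recommended readings; note that with small sample sizes the maximum attainable z-score is bounded, so the threshold has to be chosen with care.

```python
# Illustrative sketch: flag samples whose z-score (distance from the mean,
# in standard deviations) exceeds a threshold. Classical texts cover many
# other methods (IQR, density-based, model-based); this is just one.
from statistics import mean, stdev

def zscore_outliers(samples, threshold=2.5):
    """Return the samples lying more than `threshold` stdevs from the mean."""
    mu = mean(samples)
    sigma = stdev(samples)
    return [x for x in samples if abs(x - mu) / sigma > threshold]

# Hypothetical latency samples in ms, with one extreme value.
latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 250]
print(zscore_outliers(latencies_ms))  # flags the 250 ms sample
```

A single extreme value inflates both the mean and the standard deviation, which is exactly why the classical literature also develops robust alternatives (e.g., median- and IQR-based rules) for data like this.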