In our last post, we shared a technique to make operational business metrics easier to analyze. As mentioned, plotting the expected value of a metric alongside its actual value can reveal anomalies in your data. In this post, we will continue to ask the question “Is everything healthy?” In our last post we looked at events that occurred millions of times per day. What if we wanted to look at events that occur only hundreds of times per day? In our case, this happens when we scope our analysis to a small segment of users. Answering these questions of health for small subsets of users is just as important, because those users drive value to our clients.
In our first attempt, we plotted the sparse data directly (Fig. A). We found these charts to be unusable for analysis. Fig. A shows a downward trend only if you know what you’re looking for (the patch of blue toward the end of the series). Fig. B plots the same data in a much clearer format: a daily cumulative view. In Fig. A, we draw a data point at the actual value for every ten-minute sample. In Fig. B, we draw the running total per day – so each drawn data point is the count of all events observed so far that day. The running total peaks ten minutes before midnight and resets to zero at midnight.
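To make the transformation concrete, here is a minimal sketch (plain Python, with hypothetical sample data) of turning ten-minute counts into the daily cumulative view:

```python
def daily_cumulative(samples):
    """Convert (day, count) ten-minute samples into running totals
    that reset to zero at the start of each new day."""
    out = []
    current_day, running = None, 0
    for day, count in samples:
        if day != current_day:        # midnight: reset the running total
            current_day, running = day, 0
        running += count
        out.append((day, running))
    return out

# Sparse sample: mostly zeros with occasional events (hypothetical data).
samples = [(1, 0), (1, 2), (1, 0), (1, 1), (2, 0), (2, 3)]
print(daily_cumulative(samples))
# → [(1, 0), (1, 2), (1, 2), (1, 3), (2, 0), (2, 3)]
```

Note that the last point of each day is that day’s total, which is exactly the “peak” property discussed below.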
A common reaction to noisy charts is smoothing: applying a transformation that hides the peaks and valleys. Fig. C is derived from the same data, transformed with a one-hour rolling average (each point is the average of the data within the hour). Smoothing works when the outliers are easy to distinguish – which isn’t true for this dataset. The trend occurs at the same point in time, but it is just as difficult to find amid the still-spiky line.
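For comparison, a one-hour rolling average over ten-minute samples (six samples per hour) can be sketched like this; the windowing details here are an assumption, since the post doesn’t specify how partial windows are handled:

```python
from collections import deque

def rolling_average(values, window=6):
    """One-hour rolling average over ten-minute samples: each point is
    the mean of the last `window` values seen so far (shorter at the
    start of the series, before the window fills)."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

print(rolling_average([0, 0, 6, 0, 0, 0]))
# → [0.0, 0.0, 2.0, 1.5, 1.2, 1.0]
```

A single spike gets spread across the window rather than removed, which is why smoothing fails to make a sparse trend easier to see.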
The daily cumulative view is clearer for assessing the health of a metric with sparse data. First, it makes the gap between the observed and expected values more prominent, so we can catch trends faster. Once a trend is found, we can estimate its size much faster. Another surprising but useful effect: the value at the peak is the total for that day! This is handy when you want to answer “how many clicks have we seen so far today?”
In analyzing operational business metrics, it’s imperative to answer questions of health as quickly as possible – if we can assess the health of our business in a quick glance, we can spend more of our time growing our accounts. It’s as important for us to understand data for small subsets of users as it is for large subsets. Enabling our Business Analysts to find trends and estimate the size of trends faster became simpler with the daily cumulative view.
Join the team turning terabytes of information into revenue! Check out our careers site.
Pratik Prasad is a software engineer at TellApart.
At TellApart, the first question we hope to answer with our dashboard is “are we performing as expected?” For instance, consider the chart below:
The middle of the week seems healthy in the left chart, but unhealthy in the right one, where a 10-week average ‘lookback’ is added in blue. Healthy and unhealthy are defined in reference to expectations. Explicit expectations make anomalies obvious.
The above data is representative of a week in retail – consumers shop and browse more often during the day and on certain days of the week. Retailers also have specific times for new product launches or promotions. To consistently find gaps like the one above, an Account Manager would need to recall retailer-specific patterns for several clients across several metrics.
We found that time of day and day of week reliably predicted the majority of normal behavior. The underlying data is sampled in ten-minute buckets (each data point represents the total number of events in the past ten minutes). The expectation in our case is built from the prior ten weeks – each data point is the average of the corresponding time of day and day of week, weighted by recency.
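A sketch of how such an expectation could be computed. The exponential-decay weighting here is our assumption; the post says only that the average is “weighted by recency”:

```python
def expected_value(history, decay=0.9):
    """Recency-weighted average of the same (day-of-week, time-of-day)
    slot from prior weeks. `history` is ordered oldest -> newest;
    more recent weeks get exponentially larger weights."""
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * v for w, v in zip(weights, history)) / sum(weights)

# The same ten-minute slot observed over the prior ten weeks (hypothetical).
prior_weeks = [100, 98, 105, 110, 102, 99, 108, 111, 107, 109]
print(round(expected_value(prior_weeks), 1))
```

Computing this once per (day-of-week, time-of-day) slot yields the blue “lookback” series to plot underneath the live data.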
When you present the expected data in a loud color underneath current data, anomalies become apparent at a glance. Likewise, when the expectation matches the observation, it fades into the background. Formalizing an expectation makes it simple for anyone to find anomalies. If the provided expectation is accurate in predicting normal behavior, issues hiding in plain sight are revealed.
In the next post in this series, we’ll consider the application of this technique to sparse data.
Pratik Prasad is a Software Engineer at TellApart.
Imagine you manage hundreds of heterogeneous servers, each requiring frequent independent deployments. In addition, for each new service that’s developed, resources need to be provisioned and deployment scripts written. All of this overhead adds up to significant developer time that could be better spent building new products and improving existing systems.
A few months ago, we identified a set of Apache open source projects, Aurora and Mesos, that would reduce and sometimes even eliminate this overhead. Since then, we’ve begun transforming our infrastructure from many individually managed services, each with their own deployment processes and scripts, to a single platform: Aurora. In doing so, we’ve contributed significantly to Aurora, and are excited to see a new major version released today that includes our contributions. We’ve been running a parallel version in production for several months without issue and hope that other organizations can benefit from it as well.
Aurora is a framework that runs on top of Mesos. It allows scheduling of jobs (web servers, backend services, etc.) onto a pool of servers. Along with Mesos, Aurora abstracts away conventional per-machine resources such as CPU, RAM, and disk, so that a job can simply request a certain amount of each; Aurora + Mesos will ensure it runs with the requested resources.
We run a heterogeneous environment (Java, Python, C++, etc.) and many of our applications have their own third party dependencies. In its initial design, Aurora was intended to primarily run self-contained applications with few to no system dependencies. Typically, the first step of a job would be to download the application and dependencies, then execute the application.
It was neither desirable nor feasible to install every application’s dependencies on each Aurora machine, nor was it practical to have each application download its dependencies when it ran. We realized that we needed to package and isolate our applications differently than we had before, and Aurora needed to support whatever system we chose.
We decided to use Docker to create containers for our applications. Doing so allows us to easily isolate our applications, as well as package all of their dependencies into one unit of deployment. Using Docker containers also helps us streamline and standardize our deployment process – developers can simply push a new image to our Docker repository instead of having to write their own deployment scripts.
The last challenge we had to solve was the most difficult one: Aurora had no first class support for running applications inside Docker containers. After talking with the Aurora community, we learned that this had been a long-requested feature, but no one had taken on building it yet. We decided to build it for ourselves, and then contribute it back to the project. During development, we collaborated frequently with the Aurora community on design decisions and code reviews. We found them to be consistently helpful, not to mention an overall great group of people to work with.
With our enhancement, Aurora now has first class support for Docker containers. Developers can now simply add a Docker image name to their Aurora job configuration and Aurora + Mesos will handle the rest: pulling the image, setting it up on the machine, and running it. The results have been met with great feedback, both internally from our developers and externally from the Aurora community. In many cases, our deployment times have gone from over an hour to minutes, with the added benefits of never having to write a deployment script again.
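For illustration, based on the Python-based DSL Aurora uses for job configurations, the addition looks roughly like this (cluster, role, and task names are placeholders, and `hello_world_task` stands in for a Task defined elsewhere in the `.aurora` file):

```python
# Sketch of an Aurora job configuration using a Docker container.
jobs = [
  Service(
    cluster = 'devcluster',
    role = 'www-data',
    environment = 'prod',
    name = 'hello_docker',
    task = hello_world_task,
    # The relevant addition: name the Docker image to run the task in.
    container = Container(docker = Docker(image = 'hello-world-image'))
  )
]
```

Aurora + Mesos then pull the image, set it up on the machine, and run the task inside it.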
Support for Docker in Aurora was merged into the main Aurora branch about two weeks ago, and today, it’s included in their major 0.7.0 release. We’re all excited for this release, and hope to contribute more in the future.
Below you will find a video from a recent Bay Area Mesos User Group meeting at Twitter HQ explaining the use of Aurora and Mesos at TellApart.
Interested in helping us make future commits to this project? Check out our careers site.
Steve Niemitz is a Software Engineer at TellApart.
A database replica server just failed. No big deal. A replacement is launched automatically, gets setup, and the Elastic IP address is attached. Everything’s good to go. Except nothing’s connecting to it. The application servers are still complaining that the database is unreachable. That is, until the application is restarted; then everything starts working again. What’s going on?
It could be because of InetSocketAddress.
It’s a common pattern for Java applications to hold connection information in InetSocketAddress objects. They have convenient parsing methods, handle a variety of host string formats, and represent the canonical connection information used throughout the java.net packages. But, there’s one detail of their implementation that makes them unsuitable for holding that information over long periods of time: they perform DNS resolution at construction, and then never again.
This implies that if the DNS record of a host changes while the application is running, the change won’t be picked up, and the application will be stuck with incorrect IP addresses, even if it tries to reconnect.
In the Cloud
This is a subtle problem when using Elastic IPs in AWS. Elastic IPs are stable public IP addresses that you can move between instances. They’re really useful for things like databases and discovery systems that have to be in known locations.
AWS uses two types of IP addresses – public and private. Public addresses are visible to the outside world, whereas private addresses are only visible within an AWS datacenter. For traffic within a region, you want to use the private IP address: it has lower latency and is cheaper (no charges for incoming or outgoing data). However, an Elastic IP maps to the public address. Conversion from an Elastic IP hostname to the private IP address is done through DNS.
So if the connection information for the database is stored in an InetSocketAddress object, the application won’t see Elastic IP reassignments. Bummer.
The fix: hold the original connection information in memory, and only convert to an InetSocketAddress right before creating a socket. If you need to reconnect (say, because the server at the other end failed), use the original information to create a new InetSocketAddress. Doing so will cause Java to go through DNS resolution again and pick up any changes. (By default, the JVM caches DNS resolutions locally for 60 seconds, so this won’t cause a flood of such requests.)
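The pitfall is Java-specific, but the fix is language-agnostic. Here is an analogous sketch in Python (the class and names are ours, for illustration only): store the original host string, and resolve it freshly on each reconnect rather than caching a resolved address:

```python
import socket

class ReconnectingEndpoint:
    """Keep the original host:port, not a resolved address, so every
    reconnect goes through DNS again and picks up record changes."""

    def __init__(self, host, port):
        self.host = host      # original connection info, held in memory
        self.port = port

    def resolve(self):
        # Fresh DNS resolution at connect time -- the equivalent of
        # constructing a new InetSocketAddress in Java.
        infos = socket.getaddrinfo(self.host, self.port,
                                   type=socket.SOCK_STREAM)
        return infos[0][4]    # sockaddr tuple, e.g. (ip, port)

endpoint = ReconnectingEndpoint('localhost', 5432)
# Each (re)connect resolves anew:
#   sock = socket.create_connection(endpoint.resolve())
```

The key point is what is *not* stored: the resolved IP address never outlives a single connection attempt.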
As part of hardening our systems at TellApart, we’ve had to patch this problem in many places, both in our code and in open-source projects we use (for example, a memcached library). Hopefully, better awareness will mean that you won’t have to do so as well.
Kevin Ballard is a Software Engineer at TellApart.
Fast Data is required to drive real-time personalization.
There is no arguing that Big Data tools like the Hadoop ecosystem have radically changed the technology landscape – but the term “Big Data” is losing useful meaning.
The next frontier in data science is working with Big Data in real-time, so that we can access massive amounts of data instantaneously. Imagine building data products on a platform where time scale is irrelevant, and it’s as easy to retrieve data from years ago as it is from a second ago. An innovation this powerful needs a new name: we call it Fast Data.
As part of the SF Data Mining Meetup series, we hosted a TechTalk where we dove into TellApart’s Predictive Marketing Platform, which is powered by Fast Data. Three hundred Silicon Valley data scientists attended the meetup to hear us describe how we leverage Fast Data to build a diverse set of personalization products on a common infrastructure, and the unique advantages that it gives TellApart in the marketplace.
TellApart’s platform is in part inspired by Nathan Marz’s work describing the Lambda Architecture pattern. Lambda Architecture is a name for a robust, distributed platform that can serve a variety of workloads, including low-latency high-reliability queries. One key insight is that by combining a precomputed batch view of data with a real-time streaming view, you can serve queries in real-time with similar latency to that of a database lookup.
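A toy illustration of that insight (our own sketch, not TellApart’s implementation, with made-up counts): a query is served in constant time by combining a precomputed batch view with a real-time view accumulated since the last batch run:

```python
# Precomputed batch view: event counts as of the last batch job.
batch_view = {'clicks': 1_000_000, 'views': 5_000_000}

# Real-time view: counts streamed in since that batch job finished.
realtime_view = {'clicks': 1_234, 'views': 9_876, 'shares': 42}

def query(metric):
    """Merge both views: a lookup, not a scan over raw events."""
    return batch_view.get(metric, 0) + realtime_view.get(metric, 0)

print(query('clicks'))   # → 1001234
```

The batch job periodically folds the streamed counts into the batch view and resets the real-time view, so neither side grows without bound.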
Our focus on building Fast Data systems allows our personalization products to be hyper-responsive to recency. TellApart continues to build a suite of personalization products that can respond to a shopper’s preferences in milliseconds, giving them instantaneous personalization throughout their shopping experience — both onsite as well as offsite. And that personalized shopping experience can include recommendations based on a user profile built over years of shopping behavior, as well as the product that you looked at just a second ago.
In case you missed our event, here’s the video of the TechTalk, where you can hear more about how we designed and built our Fast Data infrastructure, and where we walk through applying the architecture to drive complex decisions an order of magnitude faster than the blink of an eye.
Just under two years ago we open-sourced Taba — our distributed instrumentation and event aggregation service. Since then we’ve scaled up our deployment from handling 10M events/minute to 10M events/second (at lower latency, for good measure). Getting there, though, required rethinking some of the underlying architectural decisions, which necessitated a rewrite of the core Taba Server.
The rewrite was well worth the effort though. Besides the order of magnitude higher throughput, we added several highly requested features:
> Cluster auto-discovery and auto-balancing
> Support for Redis Sentinel, including database discovery and automatic failover
> Regular expression queries on Tab Names and Client IDs
> Query caching, which improves response times for the most common large queries from multiple minutes to just a couple seconds
> Pluggable Type Handlers
> Support for “super-aggregates” that can arbitrarily aggregate any Tabs of the same type
> Standard Python deployment tools and availability on PyPi
We found all these improvements so useful that we wanted to share them. So today we’re announcing the release of Taba v0.3, which is the latest version TellApart is running internally. Also, we’re moving our entire internal deployment of Taba onto the public version, and all new development will occur in the public repository. This means that all future improvements will be automatically available to the broader community.
Additionally, we’re releasing a Java version of the Taba Client library, so you’re no longer limited to instrumenting just Python processes.
In the process of building Taba v0.3, we learned a lot about scaling a distributed service in Python. Some of the more interesting lessons were:
The way a problem is modeled informs what a solution is capable of, and subtle differences in the model can have dramatic effects on performance. In Taba, we made a small change in the way States are aggregated into query responses, which made aggressive caching possible. This in turn decreased response times by several orders of magnitude.
Maintaining consistent and durable state is one of the fundamental problems of computer science. Fortunately we have existing tools (like Redis) that we can lean on to solve this problem for us. Taba consists of a small central cluster of Redis instances that comprise a hardened sub-service for maintaining all state. The rest of the Taba Server workers are stateless, and can be added or removed from the cluster seamlessly. That makes Taba highly resistant to failures, and trivially easy to scale up or down.
A pattern that comes up again and again in Taba is the combination of generators and greenlets to create just-in-time processing pipelines. It’s a powerful abstraction that makes coordinating complex processing significantly simpler.
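Setting the greenlets aside (they require gevent), the generator half of that pattern can be sketched as a just-in-time pipeline in which each stage lazily pulls from the previous one; the stage names here are ours, for illustration:

```python
def parse(lines):
    # Stage 1: decode raw "name:value" lines into events, lazily.
    for line in lines:
        name, _, value = line.partition(':')
        yield name, int(value)

def aggregate(events):
    # Stage 2: emit a running total per name as each event arrives.
    totals = {}
    for name, value in events:
        totals[name] = totals.get(name, 0) + value
        yield name, totals[name]

raw = ['lat:3', 'lat:5', 'err:1']
pipeline = aggregate(parse(raw))   # nothing executes yet
print(list(pipeline))              # → [('lat', 3), ('lat', 8), ('err', 1)]
```

Because each stage only does work when the next stage asks for an item, memory stays bounded regardless of how much data flows through — which ties directly into the fragmentation point below.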
The CPython interpreter has trouble managing memory for long running processes that handle a high throughput of data, especially if that data is a combination of large and small objects. Taba uses several techniques to avoid memory fragmentation, including using generators to reduce in-flight objects, allocating critical blocks of memory manually to control placement and deallocation, and avoiding persistent objects that can cause heap ratcheting.
I recently had the opportunity to talk about these points in a bit more detail at PyCon 2014 in Montreal. The slides and a video of the talk are available online:
This past weekend Nick Gorski and Jeshua Bratman (two TellApart engineers) attended MHacks, a college hackathon organized by Tom Erdman and the Michigan Hackers student group. MHacks wasn’t just any college hackathon, though: it was the largest college hackathon ever held, with 1214 attendees. Not only that, but it was held in the luxury suites of the Big House, the largest stadium in the United States!
Word on the street was that the best and brightest hackers are attracted to the spotlight of the largest hackathon, and they certainly didn’t disappoint. There were over 100 universities represented at the hackathon, and we had the opportunity to meet many of the attendees.
3am: prime hacking time!
On the second day of MHacks, TellApart gave a well-received tech-talk, where we candidly discussed our tech stack, the challenges we solve, and our engineering culture. One point that clicked with the hackers was that at a larger company, it’s difficult to do full-stack software engineering simply because you’re boxed in by what your team owns within the company. At TellApart, all of our engineers are full-stack developers by necessity, as people scope work by project rather than specific component. For this audience of smart and talented hackers, the idea of learning to build sophisticated, distributed systems using open source stacks really resonated.
Bain Capital Venture’s StartUp Academy and TellApart co-sponsored a prize for the “Best Use of Data” in a hack. We were specifically looking for hacks that would qualify as using big-data when scaled beyond their proof of concept. Some noteworthy entries included: Profit Maker 3000, a website integrating social data with geo-locations that showed where products were being most purchased across the US; dwnhll, a Google Maps hack that found the best route between two points to skateboard; and ratio, a Klout-like website integrating many categories of data for brands.
Between mentoring the attendees with their hacks, recruiting the great talent at the hackathon, and judging the final results, we also had a chance to fit in a quick hack of our own:
Using the MHacks logo as inspiration, we designed a Facebook ad to be shown in both the News Feed and on the right-hand side. Creating the ad was simple enough, but we had another goal: we wanted this ad to be hyper-targeted to only hackers at MHacks, and no one else.
Well-targeted ads are nothing new to TellApart, of course, but this was a bit of a twist on our normal operating procedure. Our personalized retargeted ads are shown to our pixel pools, which are made up of users who visited a website and received a TellApart cookie. Since many of the hackers would not be visiting the MHacks website after the event started, getting our cookies onto the MHacks website wouldn’t give us the reach that we needed.
Instead, we targeted MHacks attendees (and no one else) with a Facebook Custom Audience ad. Not only that, though; we updated the text in the ad in real time as we moved around the site, so that attendees would always know where to find us. We estimate that we reached 90% of all MHacks attendees with the ad, and fewer than 5% of impressions were shown to people not on-site at MHacks. You might ask how we were able to create a custom audience comprised of only the MHacks attendees. Well, we’re not ready to give away all of our secrets yet.
After the event, we’ll continue to run our ad for a week or two. It has proven to be popular, with a high like-to-impression ratio compared to other News Feed ads shown on Facebook. In fact, we even received a photo of the ad from a super strong candidate who was excited to see the new text (showing that very targeted ads can continue to drive user engagement!).
When all was said and done, we had a great weekend and were very impressed with the MHacks organizing crew, the hackers, and the venue. Tom Erdman gave a short but touching speech at the close of the hackathon, where he praised the camaraderie and good naturedness that the hackers displayed. We want to echo those thoughts: although it was a competition with big prizes on the line, the hackers were all very positive and we saw groups helping each other throughout the event. The competition brought out the best of all involved, and we hope that everyone enjoyed it as much as we did.
We are looking forward to catching up on sleep, though, just like this hacker:
Why did you choose to intern at a startup, and one I’ve never heard of?
My answer would be that startups afford an intern the chance to make an impact on the company. With an intimate, though expanding, engineering team, TellApart could not afford to push me into a corner to work on an intern project that may or may not be useful. I did not have a single “intern project” at TellApart; I had real projects that were manageable given my lack of experience. In the process of building useful applications for TellApart, I gained invaluable engineering experience that I believe I could not have received at a larger company.
Impact through hands on experience
Before the end of my first week, I had pushed live code. Admittedly coding was slow going at first: I was unfamiliar with the code base, had never programmed in Python, and had little experience working with large scale web applications. However, I had the best help available at my fingertips. With a lean engineering team, consisting of many engineers who have been here for multiple years, I could talk face-to-face with the engineer that built the system I was working with. It is unlikely that an intern at a large company could say they knew, let alone regularly conversed with, the authors of a significant portion of the code base.
By pushing live code regularly, I gained experience unattainable either at school or as an intern at a larger company. There is something to be said for the Facebook sign that was put on my desk: “Move fast and break things”. Only through trial, failure, and ultimately success do I learn to be a better engineer. If I had never pushed code, I would not have had the opportunity to fix the many small issues that arise when your code runs on live data. There are some things that unit tests cannot catch, and I am grateful that the engineering team trusted me enough to own up to my mistakes, fix them, learn from them, and move on.
Massive benefits from ownership
With trust to push code regularly came a trust that I could be responsible for a section of the code base. When I would launch an experiment, my work was not done once the experiment went live. I would monitor the data daily, and contribute to the final analysis of the experiment’s results. If I had not been responsible for managing the experiment I designed, I would have lost insight into how the code I wrote impacts real users. I found it much more rewarding, engaging, and thought-provoking to manage and understand my code’s impact outside of the lines I contributed to the code base.
Cross department learning and impact
As the owner of a section of the code base, I fielded questions and requests related to the code I wrote. As the leader of an experiment, I interacted with many different people at TellApart whom engineers at other companies might not normally converse with directly. I became good friends with an Account Manager on the business side of TellApart as a result of working closely with him to analyze the results from an experiment. I am happy to say that I did not just leave TellApart as a better coder. I also learned a lot about product management and data analysis, two fields that interest me but are not necessarily part of a normal software engineering internship. I leave TellApart a better engineer not just because I learned how to implement a product; I also have the skills to analyze its performance, communicate the results, and iterate while being mindful of the product’s overarching goal.
As a rising junior at Stanford University, I was a relatively inexperienced engineer. However, lack of experience did not preclude me from leaving a mark on TellApart. Startups provide the opportunity to both have a direct impact, and learn a lot in a very short span of time. For these reasons, I would strongly consider, and suggest, working at a startup.
Jocelyn Neff is a Software Engineering Intern at TellApart.
In this post we’ll continue our quest for line-by-line Python optimizations.
We already covered several tips in part I.
Today we’ll learn:
At TellApart, we use Python to code our real-time bidding server. Real-time bidding means we have to respond in a few milliseconds to requests from ad exchanges, or else we’re, well, out of business. So it’s critical that we know how to use all of our tools effectively.
There are faster languages than Python, but if you’re writing Python you can still speed things up without making anything less readable. Even where performance differences are small it pays to know what you’re doing. The knowledge may come in handy someday when you discover that an apparently simple piece of code has a disproportionate impact on execution time, perhaps only because it’s in a block that gets executed zillions of times.
With that introduction, here’s a survey of some tips we’ve assembled to write performant Python code on a line-by-line basis.
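As a taste of the genre (this particular example is ours, not necessarily one from the original list): in a hot loop, hoisting an attribute lookup into a local variable saves a lookup on every iteration without hurting readability:

```python
import timeit

def slow(items):
    out = []
    for x in items:
        out.append(x * 2)   # `out.append` is looked up on every iteration
    return out

def fast(items):
    out = []
    append = out.append     # hoist the attribute lookup once
    for x in items:
        append(x * 2)
    return out

items = list(range(10_000))
assert slow(items) == fast(items)
# Measure for yourself; the hoisted version is typically a bit faster:
print(timeit.timeit(lambda: slow(items), number=200))
print(timeit.timeit(lambda: fast(items), number=200))
```

Trivial on its own — but in a block executed millions of times per request, exactly the kind of line-by-line win this series is about.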