It’s About Time (to talk SQL)
One of the reasons I came to Interana was because I missed analytics and statistics. In grad school, I spent more hours than I care to recall hacking S and S-Plus. Early in my career I also worked with relational databases and taught courses around SQL programming. But in the last 20 years I’ve only occasionally dabbled with SQL and R. I’m also relatively new to behavioral analytics. For both those reasons, I want to blog about SQL in behavioral analytics to refresh my knowledge and learn more about both.
Interana doesn’t use SQL in the event analysis data path or as an interface for our users. We provide users an intuitive Visual Query Builder which we translate into parallelized instructions optimized for time series data. We also let users graphically create flexibly defined behavioral constructs (e.g., cohorts, metrics, sessions, funnels, etc.) as reusable named expressions. Our users create and refine powerful queries without being technical or knowing how to code.
If that’s the case, why write about SQL? One reason is to share our perspective. Another is to explore alternative views. My goal is also to better understand the state of SQL for Behavioral Analytics: essentially, how SQL handles a time-ordered sequence of actions by actor. Whether you love or hate using SQL for analytics, we want to hear your perspective. How do you address some of the challenges? Is it all working just fine? Have you tried SQL but needed to hand off the task to an expert? If so, where did things get stuck and what alternatives did you consider? How long did it take you to ask follow-up questions? Please comment. I’m definitely hoping for an interactive discussion.
This blog post focuses on the concept of time and how SQL interacts with time series data. When our founder, Bobby Johnson, ran infrastructure at Facebook, he found that all of the important questions were about how things behaved over time. His experience made it clear that these are the questions that matter most to businesses. As more of the economy moves online and becomes subscription-based, understanding events over time is more important than ever. Looking at user behavior over time allows organizations to make better decisions and increase revenues. Time series data allows you to understand the actions people, products, and machines make over time. With that in mind, let’s dive in!
What Do We Mean by SQL?
SQL is often used as shorthand to refer to many different things, so it’s helpful to align our understanding. Those three little letters can refer to a standardized language definition, a particular product, an interpreter implementation, a database, etc. SQL was originally designed in the 1970’s for working with normalized relational databases. Over time, SQL as a language was standardized and that standard periodically updated to address evolving business needs. Different versions of the SQL standard support different capabilities and slightly different syntax. In addition, there are many implementations of SQL interpreters and query engines. Popular SQL implementations deviate from the standards in minor and major ways, so just because the SQL standard defines something doesn’t mean that it’s available in a particular solution or that the feature will work the same way between two different implementation.
When it comes to time, SQL is primarily good at answering point in time questions. Often this relates to the state of the database at the time of the query (e.g., tell me about this order and the customer). The latest standardized version is SQL:2011 which finally added temporal support to the SQL standard. It can potentially add time intervals to each record and essentially allows the database to be versioned by time. The new capability supports asking point of time questions on previous states of the database and on specific intervals of time. The SQL:2003 standard introduced Window Functions, which enable queries to perform certain types of ordered operations using the “OVER” clause. The clause specifies a “window” over which the calculation should be performed, and allowed certain cross-row calculations that are useful for Top-N queries and basic sessionization. Prior to that, SQL-92 added notions of date, time, timestamp, and interval — the point-in-time and interval-based concepts of interest to database experts at the time. Earlier versions of the SQL standard didn’t have any native notions of time data and primarily worked with times as integers or strings.
From the perspective of somebody developing applications with SQL, there are three types of time-related queries:
- Point-in-Time — SQL works well for these queries
- Time Interval — Questions related to earlier points in time, and questions related to (essentially) counting what happened during a particular time interval. Newer versions of SQL have support for these questions and are slowly being deployed in more production environments.
- Sequences of events over time — No native SQL support or concepts.
It’s the third kind of query that’s most important for answering behavioral questions.
SQL Isn’t Designed for Time
Recall that SQL is based on sets and sets are inherently unordered. SQL doesn’t assume or use any order in the sets of rows it scans for a query. Newer versions of SQL that work with time still have no native notion of time sequence. In contrast, behavioral analytics is focused on sequences of actions initiated by actors over time. The order of those actions is essential. So is being able to smoothly unravel all recorded actions into sequences by actor.
SQL isn’t optimal for exploring behavioral analytics questions over event data, which is time-ordered and needs to be examined as a sequence of actions attributed to actors. Event data for behavioral analytics is fundamentally better modeled as sequences instead of sets. Behavioral queries on sequences may be modeled as finite state machines that operate on (i.e., replay) per-actor sequences observed in the recorded events. The state machine operations can be very simple: counting various state entries and exits, accumulating observed values, subtracting timestamps to track duration, etc. The answer to complex behavioral queries then becomes a matter of efficiently processing the per-actor sequences in parallel, and then aggregating the results. This makes Interana a better fit for behavioral analytics on event data than a solution based only on sets and predicates. Especially when the data is stored organized by row and optimized for read-modify-write updates.
It also adds great flexibility, letting you dynamically change what you consider interesting behavior from one interactive query to the next. Let’s consider a couple of examples based on a music streaming service:
- Imagine we want to create a metric that tracks the number of unique songs played per user. The order of the songs doesn’t matter, but the process of natively examining the event sequences by actor (user) helps. We can keep track of songs we’ve seen per-user and deliver a count at all interesting time intervals. Figure 1 shows what that would look like with Interana.
- Now let’s say we want to create a funnel that tracks users who’ve invited a friend, then created a playlist, and then upgraded to the paid level. When we start, let’s say that all users are at State 0 and we record the timestamp when we first saw the user. As we examine the per-user event sequences, any users that add a song to a playlist move to State 1 and we record the timestamp of when they transitioned to State 1. Inviting a friend transitions the user to State 2, and upgrading transitions the user to State 3. We can now easily figure out how many users traversed the various steps in the funnel, calculate the drop-off between steps, and gather statistics on the time the users took between steps. All of this can be done with a single scan of a few columns in the database. Contrast this with what would be required to code all the subqueries and joins to do the same thing in SQL. Figure 2 shows what that would look like with Interana.
- Finally, let’s say that we wanted to build a cohort of female users from the New York City area who have listened to over 500 unique songs and upgraded after building a playlist and inviting a friend. We can define the cohort by combining some simple demographic filters (e.g., gender is Female and location is New York City area), with the metric from Example 1, and specifying that user should reach State 3 in the funnel from the second example. Interana will need to add a few more columns to the scan, but we can again compute this cohort with a single scan through the event data. Figure 3 shows how this cohort would be defined in Interana.
Importantly, Interana named expressions (e.g., metrics, funnels, cohorts, etc.) are defined but not evaluated until needed. Adjusting their definition is quick, and there is no need to discard and re-compute any pre-aggregated data sets or indexes. This makes working with the Interana query builder fast and painless, allowing our users to explore their data at the speed of thought.
Contrast the Interana approach to some example SQL in Figure 4 which aims to build a funnel analyzing emailed invitations. The subqueries for each step would need to be defined and combined all together in the right order. Calculating per-user unique songs would be another subquery and join operation. And filtering by demographics would be yet another subquery and join. Each of these subqueries continue to complicate the SQL code and make the process of asking questions from the data less and less approachable. Only the most intrepid and technical members of the organization would even attempt it. With Interana, anybody that understands the business and the collected data is able to quickly explore on their own.
Another mundane challenge is that SQL requires scanning the table and filtering by date when dealing with time periods. Without time order, finding sets of events that happened during a particular time interval calls for scanning the entire table. Plus working with time periods in SQL and relational databases often requires grouping and aggregating by smaller intervals (e.g., minutes, hours, days). This is more code to write, understand, and maintain. Partitioning the table by date and time can help to some extent, assuming the database supports partitioning and doesn’t charge too much for the feature. Yet even with partitions there’s a balance between performance and manageability. And partitioning can be orthogonal to distributed placement, requiring experience and expertise to deliver scalability.
After all that work, the SQL result is a table that is then often further processed somewhere else (spreadsheet, analytics software, visualization software, etc.). Interana is a full-stack solution that includes visualization, time-optimized query planning, and a back-end database designed for distributed, pipelined analysis and optimized for rapid scans. It’s a better choice than dealing with SQL and databases not designed to handle time as a first-order principle.
Set-Oriented SQL is Awkward and Slow for Behavioral Queries
SQL was designed for normalized relational databases. It’s a declarative language with the interpreter trying to figure out how to best execute the queries. Storage is also an important factor, with traditional database storage composed of partially filled blocks of rows. Finally, the interpreter and database engine are optimized to work on sets of rows, with columns coming along for the ride. This has some important implications:
First, set-oriented SQL queries for behavioral queries end up with many subqueries and nested JOIN operations required to fit the predicate logic. The code ends up hard to follow (e.g., see Figure 4), but the biggest problem is how the subqueries and multiple nested joins cause repeated table scans across billions of rows. SQL databases aren’t optimized to do table scans well. Instead, they rely on indexing and pre-aggregation to reduce the amount of real work at query time. But analytic workflows tend to be exploratory, with table scans common, optimal indexing unclear, and any indexing running far behind the real-time data stream. Given enough memory, caching or in-memory databases also help. But in the long run, leaning on memory is a losing proposition. The database engine needs to be optimized for rapidly scanning the data and using pipelined operations. Anything else isn’t scalable.
Second, scanning by rows pulls in more data than necessary for the query. Especially for traditional relational databases that store data organized by rows. All the work of reading unused data goes to waste. Columnar databases are much better at this, but efficiency is in the hands of the SQL interpreter which may not be optimized for efficient table scans. If you need behavioral analytics on event data, Interana provides a better solution than SQL and relational databases.
Finally, the contortions required for forming behavioral queries in SQL across multiple subqueries, joins, and aggregations make code hard to write, understand, and maintain. It forces analytics to be the domain of a few technical experts, who quickly become the bottleneck to the rest of the organization. Many potential ideas are lost and questions go unanswered, limiting the value of the collected data. Forcing product, business, or application experts to go through a small team of SQL experts results in fewer insights, poorer decision making, and a slower pace of innovation. Interana’s alternative provides an intuitive graphical interface for building complex behavioral queries without needing to code. We democratize access to the data, so that people throughout the organization can dive into the data themselves.
Conclusion
To wrap up, we think that SQL isn’t the right tool for behavioral analytics on event data at massive scale. Familiarity is the main reason there’s a proliferation of SQL interpreters layered on top of Big Data platforms. That’s not a great reason to stick with something that’s awkward to use and limits access to a chosen few experts. The alternative is intuitive and simple to learn. At Interana, we’ve built a better solution that allows anybody to create powerful and flexible queries that complete in seconds across trillions of events. No knowledge of SQL required.
What do you think? When it comes to SQL, is it love, hate, or indifference? Whatever it is, we’d love to get your perspective on our blog.
Thanks
This was a tough blog to write given the breadth of the material. I’d like to extend a big thanks to Paddy Ganti, Christina Noren, John Drahos, Neal Kaplan, and Boris Dimitrov for their technical reviews and discussions. Thanks as well to Matt Lubrano, Max Elisman, Eric Brune, Danny Sacks, Nicole O’Malley, and Mark Horton for their additional reviews and suggestions.