Choose your glue carefully

A growing imperative for media companies of all sizes is to directly manage first-party audience data and own the customer journey, which puts selecting the right providers for performance, value, and flexibility top of mind. The first choice organizations make is often the warehouse, and the second is a visualization tool that integrates well with that warehouse. But these tools need data from various applications, behavioral trackers, and enrichment services in order to work, and that requires an entirely separate flavor of software, which we can think of as glue. In fact, there’s a whole industry built around this kind of data transport and transformation, often called ETL (Extract, Transform, Load). Big tech companies often have whole teams dedicated to each step of the process; in most media companies, however, engineers and analysts wear multiple hats.
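
A minimal sketch of those three steps, assuming a hypothetical email-provider API and a Snowflake destination (the endpoint, credentials, and table names are placeholders, not a real service):

```python
import requests  # extract from a hypothetical source API
import snowflake.connector  # load into a warehouse (Snowflake here)

# Extract: pull raw records from a source system's API.
records = requests.get(
    "https://api.example-esp.com/v1/subscribers",
    headers={"Authorization": "Bearer TOKEN"},
    timeout=30,
).json()["subscribers"]

# Transform: normalize the fields the warehouse schema expects.
rows = [
    (r["email"].lower().strip(), r.get("signup_date"), r.get("source", "unknown"))
    for r in records
]

# Load: write the cleaned rows into a warehouse table.
conn = snowflake.connector.connect(account="...", user="...", password="...")
conn.cursor().executemany(
    "INSERT INTO raw.subscribers (email, signup_date, source) VALUES (%s, %s, %s)",
    rows,
)
```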

When choosing warehouse and visualization providers, an organization that’s already fully immersed in Google’s app ecosystem might choose BigQuery and Looker Studio. Data collected on Amazon’s infrastructure might logically end up in Redshift. I’m personally partial to Snowflake because I have a penchant for building my own dashboards. Snowflake is also an integration-agnostic and cloud-agnostic warehouse, making it an agile choice for organizations with tools and integrations spread across multiple brands and platforms. The assumption during this process tends to be that most sorts of data can find a way into whichever warehouse we choose, and while that’s broadly true, the reality of moving data from source to warehouse can be pretty rocky, depending on where the data is coming from.

Media companies are particularly susceptible to a duct-tape-and-glue approach to data transport, for a number of reasons:

  • Our sector as a whole is relatively late to the data warehousing game: few media companies maintain their own first-party data centralization and enrichment processes, largely because so much of our data has long been sequestered in third-party providers. The exceptions I’ve seen most often are subscriber-focused magazines with comprehensive databases about their audience, though these may not be as digitally connected as they would like.
  • Various ad-tech partners, CDPs, analytics tools, and ESPs operate data warehouses where they surface insights and enhanced data (particularly in the form of audience sets), but these tools don’t typically connect with other parts of the digital ecosystem.
  • We have lots of data sources! Behavioral analytics, customer data, email marketing data, and the content itself often come from different places. In some cases, there are also purchased datasets covering demographics, location, and households.

So what’s glue? Anything that transports data from an origin into an analytics platform or downstream marketing channel. The origin could be just about anything, and the destination is a data warehouse or another sort of data aggregation product, like those mentioned above.

Crucially, the data warehouse should serve as the single source of truth for a media organization’s core customer, performance, and revenue data – the baseline for accuracy when comparing against downstream data in third-party marketing tools, which may be duplicated or only partially accurate. To that end, data ingested into the warehouse should already be cleansed and deduplicated, ensuring the utmost integrity and reliability.
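
What “cleansed and deduplicated” means varies, but a common pattern is keeping only the newest version of each record before loading. A minimal sketch, with hypothetical field names:

```python
def dedupe(records: list[dict]) -> list[dict]:
    """Keep only the newest record per email address before loading.

    Assumes each record carries an 'email' key and an ISO-8601
    'updated_at' timestamp; both names are illustrative.
    """
    latest: dict[str, dict] = {}
    for rec in records:
        key = rec["email"].lower().strip()  # normalize the natural key
        seen = latest.get(key)
        if seen is None or rec["updated_at"] > seen["updated_at"]:
            latest[key] = rec
    return list(latest.values())

# Example: two versions of the same subscriber collapse to one row.
rows = dedupe([
    {"email": "Ann@example.com", "updated_at": "2024-01-01T00:00:00Z"},
    {"email": "ann@example.com ", "updated_at": "2024-03-01T00:00:00Z"},
])
assert len(rows) == 1
```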

This high level of reliability allows for unrestricted analysis of behavior and product usage. In practice, that means writing queries (often directly in SQL) and building dashboards, a familiar pattern for anyone with experience working in BigQuery and Looker Studio.
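
For instance, a product-usage question can often be answered with one query run through the standard BigQuery client; the dataset and column names below are made up for illustration:

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up credentials from the environment

# Weekly active readers per product area; table and columns are illustrative.
sql = """
    SELECT
      product_area,
      COUNT(DISTINCT user_id) AS weekly_active_readers
    FROM `analytics.events`
    WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY product_area
    ORDER BY weekly_active_readers DESC
"""

for row in client.query(sql).result():
    print(row.product_area, row.weekly_active_readers)
```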

Why is glue important? It’s important because it’s fragile, and it’s fragile because it typically relies on at least two different products, neither of which may be entirely in our control. Product changes that impact an API used to pull data might result in a broken connection and thus stale data in the warehouse.
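
A cheap defense is a scheduled freshness check that alerts when a feed goes quiet. This is a sketch rather than a real monitoring setup, and it assumes a Snowflake warehouse with a hypothetical raw.behavioral_events table whose loaded_at column is a timezone-aware timestamp:

```python
from datetime import datetime, timedelta, timezone
import snowflake.connector

MAX_LAG = timedelta(hours=6)  # tolerance before we call the data stale

conn = snowflake.connector.connect(account="...", user="...", password="...")
cur = conn.cursor()
cur.execute("SELECT MAX(loaded_at) FROM raw.behavioral_events")
latest = cur.fetchone()[0]  # tz-aware, assuming loaded_at is TIMESTAMP_TZ

if latest is None or datetime.now(timezone.utc) - latest > MAX_LAG:
    # In practice this would page someone or post to a Slack channel.
    raise RuntimeError(f"behavioral_events is stale; last load at {latest}")
```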

There are many products that specialize in gluing systems together – migrating data to and from warehouses, databases, and marketing platforms (often called integration connectors or automation engines). Zapier is a common tool for moving data around the internet, not just into data warehouses. Fivetran can copy entire databases into warehouses cleanly and efficiently; we use it to synchronize long-lived records from application databases to warehouses, where they’re transformed and denormalized (see the sketch below). With tools like Zapier and Fivetran, each connection involves three parties (the source, the destination, and the connector itself), but these tools are often a good choice because they’re well-proven. Another commonly used tool is dbt, which focuses on the transform step: it reshapes data once it’s already in the warehouse rather than moving it.
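
To make that transform-and-denormalize step concrete, here’s a rough sketch of the kind of SQL a script (or a dbt model) might run once Fivetran has landed the raw application tables; the schema and column names are invented:

```python
import snowflake.connector

# Denormalize Fivetran-synced application tables into one reporting table.
conn = snowflake.connector.connect(account="...", user="...", password="...")
conn.cursor().execute("""
    CREATE OR REPLACE TABLE analytics.subscriber_profile AS
    SELECT
      u.id         AS user_id,
      u.email,
      s.plan_name,
      s.started_at AS subscribed_at,
      s.status     AS subscription_status
    FROM raw_app.users u
    LEFT JOIN raw_app.subscriptions s
      ON s.user_id = u.id
""")
```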

Sometimes, specialized use cases and custom reporting needs dictate that we build our own glue. We can put glue into the application, which we do to synchronize behavioral events from an Amazon SNS topic, for example. We can also build it directly into the data warehouse: we have some glue in Snowpark stored procedures, which run on Snowflake. This introduces the complexity of custom code, but limits the surface area to the two systems that could break.
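
As a rough sketch of the warehouse-resident approach, a Snowpark Python stored procedure handler receives a Session and can clean and promote data without it ever leaving Snowflake; the table names here are illustrative, not an actual pipeline:

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

def flatten_events(session: Session) -> str:
    """Stored procedure handler: promote raw behavioral events into a
    cleaned table. Table and column names are illustrative."""
    raw = session.table("raw.behavioral_events")
    cleaned = (
        raw.filter(col("user_id").is_not_null())  # drop unattributable events
           .drop_duplicates("event_id")           # collapse redelivered events
    )
    cleaned.write.save_as_table("analytics.behavioral_events", mode="overwrite")
    return f"promoted {cleaned.count()} events"
```

Registered via session.sproc.register (or CREATE PROCEDURE), a handler like this can run on a schedule inside Snowflake itself.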

Less optimally, we could build a custom transport application that lives on neither the source nor the destination infrastructure. We do this in some marketing technology applications where large catalogs of products have to be synchronized to Meta and TikTok for catalog-driven ad campaigns. This kind of very custom glue has to be monitored as if it were a mission-critical part of the application, especially if it can’t recover from an outage on its own. Ideally, if data didn’t make it from the origin to the destination, it can be caught up later, as is the case in our catalog synchronization. But when connecting disparate third-party applications, this can be difficult. Large organizations have whole teams devoted to data transportation, commonly called data engineering; in our sector, these tasks often fall to data analysts or application developers.
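
The catch-up property usually comes down to a watermark: persist how far the last successful sync got, and start the next run from there. A simplified sketch, where fetch_changed_products and push_batch are hypothetical stand-ins for the catalog read and the ad platform’s batch API:

```python
import sqlite3  # stands in for wherever the sync state is persisted
from itertools import islice

def chunks(items, size):
    """Yield successive fixed-size batches from a list."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def fetch_changed_products(modified_since: str) -> list[dict]:
    """Hypothetical: read products updated since the watermark."""
    raise NotImplementedError

def push_batch(batch: list[dict]) -> None:
    """Hypothetical: call the ad platform's catalog batch endpoint with
    retries, raising on failure so the watermark never advances."""
    raise NotImplementedError

def sync_catalog(db: sqlite3.Connection) -> None:
    row = db.execute("SELECT last_synced_at FROM sync_state").fetchone()
    since = row[0] if row else "1970-01-01T00:00:00Z"
    products = fetch_changed_products(modified_since=since)
    if not products:
        return
    for batch in chunks(products, 500):
        push_batch(batch)
    # Advance the watermark only after every batch succeeded, so a
    # failed run retries from the same point instead of dropping data.
    newest = max(p["updated_at"] for p in products)
    db.execute("UPDATE sync_state SET last_synced_at = ?", (newest,))
    db.commit()
```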

I try to choose glue based on what’s least likely to break given the constraints of the source and destination. Choose glue carefully; it’s often the weakest link in the data chain.