Working at Google: Ads
Inside the Money Machine
This is another in a series about what I was doing at Google when I was not putting on movies or hosting author talks and live theater (this and this). After all, I was being paid to do Google work, not those things.
When I left off in Part 1, I had just transferred from Enterprise to Ads. Now we resume our story.
The Beating Heart of the Google Money Machine
You can see my first manager in Ads, Diane Tang, here in the video in Supplemental Material. We’ll be talking much more about Experiments later.
Many of the technical details of Google Ads have leaked out over the years, or even been published deliberately by Google. You won’t learn a whole lot here that isn’t already public. In many cases I will cite the public explanations, just to demonstrate that it’s public.
I was in Ads Quality, the mathematical brains of Google Ads, which, then as now, brought in the overwhelming share of the money. They used to have a T-shirt that said “We use math.” The group was called “Quality” and not “Revenue” or something more capitalistic because it really did attempt to keep ads relevant to the user, rather than just an annoyance. Of course, making more money was nice, too.
Stop snickering at the idealism. This was 2007. Back then, there were three ads on top of the search results page and seven on the right-hand side. The top ads were a different color from the “organic” search results and brought in vastly more money than the right-hand-side ones.
Before I joined, they had done an experiment where they changed the background color of the top ads, from blue to yellow (or maybe it was the other way around?). That alone brought in more money for Google than every other change Ads Quality had made in the entire year. A psychologist addressed the weekly meeting and explained why the new color was better.
Ads Quality had no responsibility for actually serving the ads or for displaying them. Those were handled by different groups. Our task was only to determine which ads got served. You might wonder how that was accomplished if it wasn’t by writing code. We’ll get to that.
Our favorite money-making query to use as a conversational example was “flowers.” The assumption, probably valid, is that people only query that if they want to send flowers, not to learn about flowers. Therefore, it’s the perfect query to sell ads on. Here’s what the results page looks like in 2022:
This is quite a bit different from 2007, isn’t it? Back then, you would have gotten three ads on top, seven on the right, and the rest of the page would be “organic” search results. All of the search results probably were “send someone flowers” links anyway. Maybe one link was a botanical explanation of flowers; I don’t know. Anyhow, you notice that ads are not in a different color anymore; they just have “Ad” in front of them.
Experiments
Disclaimer: every fact discussed here is at least 12 years old. It could be and probably is different nowadays, in many ways.
Note that I mentioned an “experiment” for changing the top ads’ background color. Experiments were (and probably still are) Google’s secret sauce, except it’s not a secret anymore. The popular press talks about Google’s “algorithm” for serving ads, as if it’s a code module that you could just look at and understand. There is no such algorithm, and there is not one person in Google who understands everything that goes into selecting an ad. Diane Tang probably came the closest when I was there, and I don’t know who it might be now.
The way you control the various levers that serve ads is via experiments. The experiments set certain variables embedded in the ads servers. You do not have to change any code for the vast majority of Ads changes, even if they’re permanent.
That link I shared (PDF here) explains it. There are so many people who need to do experiments, in fact, that “running multiple experiments simultaneously” is a big issue; after all, you don’t want your experimental results to be contaminated by someone else’s experiment.
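The published idea, roughly, is that traffic gets hashed into buckets, and experiments that could interfere with each other live in the same “layer,” so a given query falls into at most one of them per layer. Here is a minimal sketch of that diversion logic in Go; the layer names, bucket count, and experiments are all invented for illustration, not Google’s actual configuration.

package main

import (
	"fmt"
	"hash/fnv"
)

// An opaque ID (e.g. a cookie) is hashed into one of numBuckets buckets per layer.
// All names and numbers here are invented, purely for illustration.
const numBuckets = 1000

// Experiment claims a range of buckets within one layer. Experiments in the
// same layer are mutually exclusive; experiments in different layers can both
// apply to the same query.
type Experiment struct {
	Name  string
	Layer string
	LoBkt int // inclusive bucket range claimed by this experiment
	HiBkt int
}

// bucket hashes the ID together with the layer name, so the same user lands in
// independent buckets in different layers.
func bucket(id, layer string) int {
	h := fnv.New32a()
	h.Write([]byte(layer + ":" + id))
	return int(h.Sum32() % numBuckets)
}

// assign returns the experiments (at most one per layer) that apply to this ID.
func assign(id string, exps []Experiment) []Experiment {
	var out []Experiment
	taken := map[string]bool{}
	for _, e := range exps {
		if taken[e.Layer] {
			continue
		}
		if b := bucket(id, e.Layer); b >= e.LoBkt && b <= e.HiBkt {
			out = append(out, e)
			taken[e.Layer] = true
		}
	}
	return out
}

func main() {
	exps := []Experiment{
		{Name: "yellow_top_ads", Layer: "ui", LoBkt: 0, HiBkt: 9},      // ~1% of traffic
		{Name: "new_ctr_model", Layer: "ranking", LoBkt: 0, HiBkt: 49}, // ~5% of traffic
	}
	fmt.Println(assign("some-anonymous-cookie", exps))
}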
Diane Tang has a Ph.D. from Stanford, and most of the Ads Quality heavyweights have Ph.D.s, too. There is some serious brain power devoted to selecting ads at Google. That Stats 101 course you took in college doesn’t come remotely close to letting you understand it.
There are probably a thousand experiments eligible to run on every Google query. Your query might be subjected to many of them. So what happens to the results of all of those, and how does an experimenter evaluate them? That’s where I came in.
RASTA-mon Vibration
RASTA was the internal name for the system that collected and displayed experiment results. It stood for <something> Ads Stats Architecture. Each row was an experiment (one row was “no experiment”) and each column was a particular measurement. “RPM” (revenue per thousand queries) was the ultimate measure, but there were dozens of other ones, more granular and thus more interesting: % of impressions that got clicked, % on the top, % on the right, cost per click, % of clicks that were “good clicks” (the user went to the ad page and didn’t immediately leave) and on and on. When you ran an experiment, you went to RASTA to see how it did. There were hundreds of Googlers who used RASTA every day.
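To give a flavor of those columns, here is a hedged sketch of how a few of them could be computed from raw per-experiment counters. The struct fields, metric definitions, and numbers are my own illustration, not RASTA’s actual schema.

package main

import "fmt"

// Illustrative per-experiment counters; not RASTA's actual schema.
type ExperimentCounters struct {
	Queries     int64   // queries that fell into this experiment arm
	Impressions int64   // ads shown
	TopImpr     int64   // ads shown on top of the results
	Clicks      int64   // ad clicks
	GoodClicks  int64   // clicks where the user stayed on the ad's page
	Revenue     float64 // revenue attributed to this arm, in dollars
}

// A few of the "columns" an experimenter would look at.
func rpm(c ExperimentCounters) float64 {
	return c.Revenue / float64(c.Queries) * 1000 // revenue per thousand queries
}

func ctr(c ExperimentCounters) float64 {
	return float64(c.Clicks) / float64(c.Impressions) // share of impressions clicked
}

func topShare(c ExperimentCounters) float64 {
	return float64(c.TopImpr) / float64(c.Impressions) // share of impressions on top
}

func goodClickRate(c ExperimentCounters) float64 {
	return float64(c.GoodClicks) / float64(c.Clicks) // share of clicks that were "good"
}

func main() {
	// Invented numbers, just to show the shape of the report.
	arm := ExperimentCounters{Queries: 1_000_000, Impressions: 2_400_000, TopImpr: 600_000,
		Clicks: 48_000, GoodClicks: 30_000, Revenue: 52_000}
	fmt.Printf("RPM=%.2f CTR=%.3f top=%.2f good=%.2f\n",
		rpm(arm), ctr(arm), topShare(arm), goodClickRate(arm))
}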
But we still haven’t gotten to what I did! Before that, I need to explain four more systems: protocol buffers, logs, sessions, and Sawzall.
Logs
If you’ve done any sort of programming, you’re probably familiar with logs. They’re how you debug something that’s already happened. You can search “server logs” and find an example like this (from this page):
11.222.333.44 - - [11/Dec/2018:11:01:28 -0600] "GET /blog/page-address.htm HTTP/1.1" 200 182 "-" "Mozilla/5.0 Chrome/60.0.3112.113"
This log data is in text and you can (sort of) read it. Google logs are nothing like that. Google takes logging very seriously. A log entry is a protocol buffer.
“All your Google interactions are logged.” That’s what you read in the popular press. It’s true, too.
Of course, does that mean that every Google engineer can access anything? No, definitely not. Before I could get any logs access, I had to sign a confidentiality statement and take it personally to the person who administered those. No, I couldn’t send it via interoffice mail - I had to take it to her in person. Furthermore, certain logs required a VP’s signature before you could access them. Following any one user across more than one day was strictly forbidden, even though you didn’t know who it was. And lastly, we were warned that looking up any specific person’s data, even your own, would lead to your termination.
All of that probably doesn’t reassure you if you’re cynical. I can’t help that.
Protocol Buffers
Google’s made Protocol Buffers public. Protobufs are highly compact, and almost everything in Google uses a protocol buffer. Engineers who are being snarky say that their jobs are 50% just copying stuff from one protobuf to another. Before you can process any log file, you have to know which protobuf to use.
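Just to illustrate that snark, here is a small sketch in Go with plain structs standing in for the generated protobuf code; every type and field name below is made up.

package main

import "fmt"

// Plain structs standing in for generated protobuf messages; every name here
// is invented for illustration.
type RawAdEvent struct {
	QueryText  string
	AdID       int64
	Clicked    bool
	TimestampS int64
}

type AnalysisEvent struct {
	Query     string
	AdID      int64
	Clicked   bool
	EventTime int64
}

// The proverbial 50% of the job: copy fields from one message type into
// another, renaming and reshaping as you go.
func toAnalysisEvent(in RawAdEvent) AnalysisEvent {
	return AnalysisEvent{
		Query:     in.QueryText,
		AdID:      in.AdID,
		Clicked:   in.Clicked,
		EventTime: in.TimestampS,
	}
}

func main() {
	raw := RawAdEvent{QueryText: "flowers", AdID: 42, Clicked: true, TimestampS: 1234567890}
	fmt.Printf("%+v\n", toAnalysisEvent(raw))
}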
Sessions
You probably imagine that Google has a separate log for everything you do. You’re wrong, or at least back then you were. Rather, it has logs for everyone together, and they have to be separated into sessions, which are all the activity of one user in a day. That can be complicated, since a “user” can be on different devices during the day, and different people can use the same device. I won’t attempt to explain how all this is done.
In any case, if you want to know whether someone clicked an ad that you displayed, you don’t want to sort through several million other users’ actions to get the one you want; you want their clicks to be near their query in the log. That’s why the Sessions logs are the workhorses of ads analysis. Creating the Sessions logs is a daily task.
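Ignoring everything that was actually hard about identity, the general shape of sessionization looks something like this sketch: group events by some anonymous user key plus the day, then sort each group by time so a click ends up right next to the query that produced it. The types and the key choice here are mine, for illustration only.

package main

import (
	"fmt"
	"sort"
	"time"
)

// Illustrative log event; real entries were protocol buffers with far more fields.
type LogEvent struct {
	UserKey string // some anonymous identifier, e.g. a cookie
	What    string // "query", "ad_impression", "ad_click", ...
	When    time.Time
}

// sessionize groups events into per-user, per-day sessions, each sorted by time.
// It ignores the hard parts (one person on several devices, several people on
// one device), which is where the real complexity lived.
func sessionize(events []LogEvent) map[string][]LogEvent {
	sessions := make(map[string][]LogEvent)
	for _, e := range events {
		key := e.UserKey + "/" + e.When.Format("2006-01-02")
		sessions[key] = append(sessions[key], e)
	}
	for _, s := range sessions {
		sort.Slice(s, func(i, j int) bool { return s[i].When.Before(s[j].When) })
	}
	return sessions
}

func main() {
	now := time.Now()
	events := []LogEvent{
		{UserKey: "cookie-1", What: "ad_click", When: now.Add(5 * time.Second)},
		{UserKey: "cookie-1", What: "query", When: now},
	}
	for key, s := range sessionize(events) {
		fmt.Println(key, len(s), s[0].What) // the query now precedes the click
	}
}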
Sawzall
Sawzall — for cutting logs. Clever, huh?
That was a language that the famous computer scientist Rob Pike and others created. It was basically an easier way to do a Map-Reduce, a technique that’s become ubiquitous in the Data Science world. You’re writing a job that processes many log files in parallel, reducing them to a single output. As far as I know, Google now uses the Go language to process logs and not Sawzall.
Google engineers live by the Map-Reduce. To quote the original Map-Reduce paper:
As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.
So I had to learn Sawzall as well.
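To make the shape of that concrete, here is a purely sequential toy sketch in Go (not Sawzall) of a typical log job: count ad clicks per query. The real thing ran the two phases across thousands of machines; the types here are invented.

package main

import "fmt"

// A stand-in for one log record.
type Record struct {
	Query   string
	Clicked bool
}

// An intermediate key/value pair emitted by the map phase.
type KV struct {
	Key   string
	Value int
}

// mapPhase: emit an intermediate key/value pair for each record of interest.
func mapPhase(recs []Record) []KV {
	var out []KV
	for _, r := range recs {
		if r.Clicked {
			out = append(out, KV{Key: r.Query, Value: 1})
		}
	}
	return out
}

// reducePhase: combine all values that share a key.
func reducePhase(pairs []KV) map[string]int {
	counts := make(map[string]int)
	for _, kv := range pairs {
		counts[kv.Key] += kv.Value
	}
	return counts
}

func main() {
	recs := []Record{
		{Query: "flowers", Clicked: true},
		{Query: "flowers", Clicked: false},
		{Query: "flowers", Clicked: true},
		{Query: "protocol buffers", Clicked: false},
	}
	fmt.Println(reducePhase(mapPhase(recs))) // map[flowers:2]
}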
So What Did I Do?
Now we’re finally ready to answer that question.
The Sawzall code that processed the Session logs was a mess. No one wanted to touch it. So I refactored it into a fairly standard object-oriented system, which I called Irie, to go with the Jamaican word “Rasta.” Irie’s meaning in the Urban Dictionary is “to be at total peace with your current state of being. The way you feel when you have no worries.”
This was a long, tedious job that required understanding other people’s code, something most people hate. However, I’d done this before, and I was the New Guy, so I dove in.
Irie was a big success. Everyone liked it. Now what?
Advertiser Experiments
I was still trying to climb the career ladder at Google, and being a “project lead” seemed like the best way to do it, so I told Diane that was what I wanted. She had just the project for me: Advertiser Experiments!
Let’s say you’re a florist and you advertise on the “flowers” query. You have two different ad campaigns you want to test. You could do this yourself manually, but Google really understands experiments, doesn’t it? Maybe Google could do it for you!
This had been a moribund project in Ads for a long time by 2008 when it was proposed to me. It had one engineer and one product manager assigned to it. To do it “properly” required serious modifications to almost every part of the ads serving system. Nonetheless, like an idiot I said Yes.
Pro tip: if something has been available for quite a while, you should always ask “why hasn’t someone snapped this up already?”
On the other hand, there’s the joke about two economists walking down the street, and they see a $100 bill lying on the sidewalk. One economist says “If that really was a $100 bill, someone would have picked it up already” and walks on.
At any rate, after exploring this, I naturally wondered if there wasn’t some easier way to do it; not as statistically valid, maybe, but adequate for the advertiser who just wants to improve his performance. I won’t go into the details here, but let’s just say that everyone wanted a Super Deluxe version even if it did require changing every part of the Ads system. No one wanted something quick-and-dirty that just did the job. This was Google, after all; “quick and dirty” would not get you promoted or get your talk accepted at a conference. It did not make me popular to suggest this.
After two months, I told Diane I didn’t want to do this anymore, and I moved into a non-lead role on the Ground Truth project. Don’t ask me to remember too much about what that was. But see the next section.
Ironically, in 2017, 9 years later, I transferred back into Ads and Advertiser Experiments was still going. They were doing a limited rollout of it, since it still wasn’t ready for prime time. Apparently it is now since you can do your own experiments. So they finally got their Super Deluxe version, 15 years later.
Working the Levers for My Cousin’s Benefit
This is just an amusing vignette about life at Google and how I was able to get my cousin Missy paid.
“Ad raters” and “search raters” have been described in many places, including by Google and elsewhere.
“Machine learning” and “artificial intelligence” are terms that a lot of people are familiar with nowadays, although maybe they weren’t as much in 2008. Anyway, when you “train” a model, you have to give it some examples of things that are true and things that are false, so it can learn the difference.
Google’s system for deciding which ads are likely to be clicked on was called SmartASS, and it’s also been described publicly.
That’s called “supervised learning.” Before anyone gets all pedantic, there is also “unsupervised learning.”
But how do you know the best ad for a given query? There has to be some human being who decides that. That’s where the “raters” come in. These are contract workers, hired by an outside third party (this fact will become critical later on), who work on an hourly basis. The raters have to pass a sort of basic intelligence test, and they have no official connection with Google. They can’t come to campus and have lunch, for example. After a period of time (two years back then) they’re let go, since Google doesn’t want a crew of permanent raters; they want “real people.”
To train your model, you need lots of examples; millions, in fact. A rater gets a series of search queries and ads, and is asked to rate the suitability of that ad for that query, on a 1-5 scale. But wait, there’s more! There could be as many as five raters handed that exact problem, and they can discuss it with each other if they don’t agree. There was even some serious statistical talent devoted to determining how many raters you need: if the first three all agree, maybe you don’t need to give it to two more! It’s all real money, so they cared about this.
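As an illustration of that “how many raters do you need” question, here is a small sketch of an early-stopping rule. The specific rule (stop if the first three agree exactly, otherwise use five and take the median) is one I invented for this example, not the actual policy.

package main

import (
	"fmt"
	"sort"
)

// finalRating consumes 1-5 ratings as if asking raters in sequence. If the
// first three agree exactly, it stops there; otherwise it uses all the ratings
// it was given and returns the median. This rule is invented for illustration.
func finalRating(ratings []int) (score int, ratersUsed int) {
	if len(ratings) >= 3 && ratings[0] == ratings[1] && ratings[1] == ratings[2] {
		return ratings[0], 3
	}
	sorted := append([]int(nil), ratings...)
	sort.Ints(sorted)
	return sorted[len(sorted)/2], len(ratings)
}

func main() {
	fmt.Println(finalRating([]int{4, 4, 4}))       // unanimous: stop after three raters
	fmt.Println(finalRating([]int{5, 3, 4, 4, 2})) // disagreement: use five, take the median
}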
Missy, my cousin’s daughter, had three small children and couldn’t work outside the home, so this was an ideal gig for her. She lives in Tucson, but there was one time she came to the Bay Area and I brought her to Google for lunch. We went up to the area where the Ads staff worked, and even though some of them worked with raters’ data every day, none of them had ever met an actual rater.
Eventually her time as an ads rater came to an end, and I helped her get another job as a Search rater. This was run by a different third-party company with quite different rules. Then came the Incident.
You Took Too Long; No Payment for You
One of her first assignments was a timed test, basically a training exam. Since many thousands of people had taken it, they had a very good idea how long it should take. Missy took too long and they said they weren’t going to pay her. So I got an angry message from her about it.
At the time, the Google weekly meetings had microphones in the audience and anyone could ask the top managers a question. I sent a message to Laszlo Bock, at the time the head of HR, telling him that I would stand up at the mic on Friday and ask about this.
As it happened, he was waiting to go onstage at another company meeting that you could watch from your desktop, so I could actually see him reading the message. Within about 90 seconds, he had answered and said he’d look into it. He did. Someone from HR contacted me, and set up a meeting with me and the two 20-something Googlers in Search who ran the rater program.
This meeting was one of the highlights of my Google career. I said very little. A lady from HR (who was not a lawyer) listened politely as the Googlers explained that the third-party company had very detailed data on how long you should take on that test, and moreover, Google could not intervene in any one person’s case, because we had to have an arms-length relationship with the third party. They were just so cute, these 20-somethings.
The HR lady listened politely, and then explained (I’m paraphrasing here), “Wow, it’s great that you guys are so diligent on this! But you know, according to labor law, if you hire someone to do a job and you don’t like the way they do it, your remedy is to fire them. It’s not to refuse to pay.”
Missy got paid, and went on to be a good Search rater. I approached Laszlo before the company-wide meeting on Friday and told him I would not be asking a question at the mic, since he had handled it.
If you’re wondering, as the people on Hacker News did when I told this story, what someone does who doesn’t know a person on the inside: I don’t know. They probably get screwed.
Owned and Operated Properties
A “Google owned and operated property” (OOP) was anything from Google that was not Search: Maps, YouTube, Shopping, Blogger, and so on. These were the poor stepchildren of Google, and didn’t get nearly the level of expert attention that AdWords did. They were controlled by a Google project called Ads for Search (AFS). There were many outside AFS clients, like Target, who had a search page on their website and wanted to serve ads but preferred to let Google do it. AFS was a pretty big business, although not compared to AdWords (actually, nothing is big compared to AdWords).
Diane offered me OOP as a specialty. After that Advertiser Experiments fiasco, I was much more cautious about saying Yes to her suggestions; in fact, after two weeks she had to ask if I’d decided yet. I finally said yes.
This turned out to be the one year in my entire career where I could be unambiguously certain I’d earned my salary! Usually in Engineering, you work on a product with a bunch of other people, and some time later some revenue flows in. It’s impossible to say how much money you earned for the company. Unlike a salesman, you usually can’t quantify the value of what you did.
Not here. I could raise the RPM (revenue per thousand queries) of a Google property, multiply the increase in RPM by the thousands of queries handled by that property, and tell you exactly how much money I made for Google. So you’re wondering: what was my method?
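Before I answer, the arithmetic behind “exactly how much money” is worth spelling out. Here it is, with invented numbers that are not Google’s.

package main

import "fmt"

// Attributable revenue from an RPM change: (new RPM - old RPM) * queries / 1000.
// The numbers in main are invented, purely to show the shape of the calculation.
func attributableRevenue(oldRPM, newRPM float64, queries int64) float64 {
	return (newRPM - oldRPM) * float64(queries) / 1000
}

func main() {
	// e.g. a hypothetical $0.10 RPM lift on 50 million queries a day:
	fmt.Printf("$%.0f per day\n", attributableRevenue(41.20, 41.30, 50_000_000)) // $5000 per day
}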
Stealing Ideas from AdWords
I mentioned that there was some very serious mathematical talent devoted to AdWords, and it wasn’t just luck that they kept increasing the revenue. I also mentioned that the top ads brought in much more money than the right-hand-side ads, and consequently a great deal of analysis went into the best way to select which ads got promoted to the top.
One of my biggest wins was simply to take those top-ad settings and apply them to the OOPs. Et voilà, more money rained down from the sky.
Of course, it’s not a 5-minute operation. I first had to explain it all to their monetization team, then do a small (less than 1% of the traffic) experiment with the new settings. Then we’d review the RASTA results with the entire team, and if it passed, increase the percentage. If results were still positive, we might put it into production. But even that isn’t the end of the story: after you put something into production, you always have a “holdback” experiment which runs the property without your change, so you can still discover that you were wrong and maybe undo it.
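Here is a hedged sketch of that ramp-up and holdback logic, with made-up bucket numbers: before launch, only a small slice of traffic sees the new settings; after launch, everyone does except a small holdback slice that keeps the old behavior, so you can still measure the difference and back the change out if you were wrong.

package main

import (
	"fmt"
	"hash/fnv"
)

// Made-up constant: traffic is hashed into 1000 buckets.
const numBuckets = 1000

func bucket(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % numBuckets)
}

// useNewSettings decides whether a request sees the new ad settings.
// experimentBuckets grows as the experiment ramps up (10 = ~1%, 50 = ~5%, ...);
// after launch, everything except a small holdback slice gets the new behavior.
// The fractions are illustrative, not anyone's real rollout policy.
func useNewSettings(id string, launched bool, experimentBuckets, holdbackBuckets int) bool {
	b := bucket(id)
	if launched {
		return b >= holdbackBuckets // holdback keeps the old settings for comparison
	}
	return b < experimentBuckets
}

func main() {
	fmt.Println(useNewSettings("some-anonymous-cookie", false, 10, 10)) // ~1% experiment
	fmt.Println(useNewSettings("some-anonymous-cookie", true, 10, 10))  // launched, ~1% holdback
}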
YouTube
At the time, YouTube had very few ads (now, if I watch a 20 minute video by Rick Beato, it gets interrupted at least four times by ads). It was still run by Chad Hurley and Steve Chen, the founders, as far as I know. I never met them.
YouTube ran at a colossal loss then, and I found that there were TV stations in India broadcasting nothing but YouTube videos. It’s great when you can get your content for free.
YouTube was mostly staffed at their original site in San Bruno, not at the main Google site, so I went up there a couple of times, and really enjoyed it. I had a friend Taylor from the Cinema Club, and once she was introducing me to some colleagues as “my friend from Google” and I spread my arms out and said “We’re all from Google!”
Taylor said, “No, this is YouTube.”
(this series will be continued in the next post)
This brings back a ton of long-forgotten memories, of RASTA and Sawzall and AFS, and running monetization experiments on Maps. This of course is how you and I first became acquainted, around 2009. The ads experiments on Maps mostly revealed that our clever ideas for location-based ads were an order of magnitude less effective than search-based ads. Turns out it's very difficult to infer a user's intent by what they are looking at on the map, at least not without search history, which we weren't allowed to use at the time.
Regarding the shift from yellow to blue ad backgrounds (or vice versa), I think it was later observed that on many monitors the new background color just looked "white", so there was no longer a strong visual distinction between ads and organic results. And eventually it just became white.