CT No.57: Pretty data and folk theories of algorithms

Sep 10, 2020

Big hugs out to readers on the US west coast this week. If you are not panicking or furious, please tell me what drugs you are on to maintain your zen in the face of these preventable apocalyptics.

Today I’m also considering how, for 18 years, I considered 9/11 to be the defining trauma of my life and naively thought that I’d never be in close proximity to such a major disaster again, that the world would mostly right itself. As an 18-year-old freshman at NYU in 2001, I was in my dorm two miles away, and I remember applying eyeshadow for my 8am class when I heard the low-flying planes and all the sounds that followed.

Ultimately I was fine, and my friends were fine, and my memory from the time is just that I didn’t want more people to die needlessly. But they did, in decades-long wars that were supposedly for our protection. In the years afterward I danced to record labels and bands and songs called “death from above” as the PATRIOT Act eroded our rights, presumably in retaliation for this horrific crime that could only ever be committed by a foreign power.

Now nearly 200,000 are dead because the U.S. doesn’t actually believe in security and safety (except for the rich and white, in that order), and we measure the days in how many 9/11’s worth of people have died. Those low-flying aircraft showed up in my Midwest city neighborhood, this time flown by surveilling locals instead of foreign terrorists, but still hell-bent on control through destruction. Although it was triggering to my personal trauma — I am not calm when I hear low-flying aircraft — I am technically “fine” and I still do not want any more people to die needlessly, but the people I am most concerned about are those living in our city parks and those protesting our local, state and federal government, not in wars abroad. Never fucking forget.

Anyway, contemplation of lifetime trauma and the current global crisis aside, I’m working on original research that I’d like to share today. So, contents:

Light insights from an in-progress search data research project
A review of freemium data visualization software Flourish
Links of the week


How do U.S. searchers think that tech algorithms work?

Plenty of other newsletters cover the U.S. debate over Big Tech regulation, so I stay away from writing about those hearings and takes. (To be honest, I don’t really care who owns TikTok, but it would be hilarious if it were Oracle, which is like the Russell Crowe of enterprise software.) But I watch them closely because they’re a fantastic way to learn how the public conceives technology and content recommendation algorithms.

Compared with the general public’s knowledge of how journalism works — not great! — our collective understanding of algorithms is exponentially lower. Since Big Tech feels no need to educate users about their complex products in the name of intellectual property, most people instead believe what academics would consider a folk theory of algorithms. Technically, what I discuss in this newsletter are educated folk theories. The SEO industry comprises educated folk theories, since no SEO experts actually work at Google and all current and former Googlers who understand the algorithm are under NDAs.

One of these folk theories may be the centerpiece of a forthcoming antitrust suit against Google, at least according to the NYT: the widely believed notion that tech companies discriminate against conservative media and voices. According to Pew, a majority of Americans believe that social media companies censor certain points of view. The study doesn’t explore whether people believe this discrimination is carried out directly by humans or by algorithms.

Like all folk theories of tech, these theories of algorithmic censorship are not entirely wrong, but they’re often off-base. They’re on the outer part of the dartboard where the numbers are but technically corked in on the same board as the bull’s-eye. (The theories in this newsletter are outer bull’s-eye, according to me.)

To understand the reach of these and other folk algorithm theories, as well as to test out a few tools, I pulled some search data to explore the public’s conception of our algorithmic networks.

Why search query data?

Search query data records the exact words that a person types into a search bar. Google and other search engines record and aggregate these terms primarily for advertising, but SEO and content strategists use them as well. The data is completely anonymized: never publicly associated with who typed it. Some search queries are used tens of thousands of times per month, and others are completely unique.

It’s also wildly revealing, providing in-depth insights on what and how we ask Google about what we don’t know. More than any other type of digital data, search queries reveal what makes us curious at a massive scale. Even though Google’s search share is likely declining as more options become available, most of us still search on a daily basis, across all walks of life. Close-reading the most popular queries can provide an accurate assessment of how the public considers a certain topic.

For this dataset I pulled, cleaned, consolidated and organized about 6,000 queries representing 264k monthly U.S. searches using the following tools:

SEMRush
Answer the Public (review next week)
Tableau Prep (yep I’ll probably review this too)
Flourish (review below!)

I pulled queries around Google and all major social networks related to the terms “x algorithm,” “x bias,” and “how x works.” I also used the head term “shadowban” on its own to see which networks were discussed, because I was curious about who uses the word (which I learned from a Stephen Malkmus song).
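For readers who want to try something similar, here is a minimal Python sketch of how the seed terms described above could be generated. The network list and the function name seed_terms are illustrative assumptions, not the exact list used for this dataset, and the resulting terms would still need to be run through a keyword tool by hand.

```python
# Hypothetical list of networks; the real project may have used a different set.
NETWORKS = ["google", "facebook", "instagram", "youtube", "tiktok", "twitter", "pinterest"]

# The three query patterns named in the text, with {x} standing in for the network.
PATTERNS = ["{x} algorithm", "{x} bias", "how {x} works"]


def seed_terms(networks, patterns, extra=("shadowban",)):
    """Build the seed keyword list: every pattern for every network, plus head terms."""
    terms = [p.format(x=n) for n in networks for p in patterns]
    terms.extend(extra)
    return terms


if __name__ == "__main__":
    for term in seed_terms(NETWORKS, PATTERNS):
        print(term)
```

Keeping the patterns in one list makes it easy to add another framing later (for example, a pattern built around the word “censor”) without retyping every network name.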

I’m not going deep into my methodology here (it’s not an academic paper!), but if you’d like more detail on how to pull and organize keyword data, how long it takes, and how accurate it is, I’m happy to chat.

Additional caveats: This is part of a larger data project that I’m working on. So this is neither the full nor the only dataset. Mostly it started as a curiosity: how do users think that content recommendation algorithms work? How do they think algorithms are being controlled? How much do they use the search engine they don’t trust to discover whether the search engine is biased?

A few early insights from this work-in-progress dataset:

1. Different channels reveal different levels of sophistication around algorithmic knowledge.

Instagram users are most likely to talk directly about how the algorithm works, with TikTok users asking more general mechanics questions like “how do I get more likes on TikTok” that don’t directly mention the algorithm.
Google users no longer search for “how google works.” Users who want to learn more about Google algorithms generally structure those queries around the algorithm or other queries.
For the number of Facebook users, there’s surprisingly little search volume compared with its sister platform, Instagram.
No one asks about the Pinterest algorithm, to a comical degree.

2. Instagram users notice (and panic) when there is an algorithmic change.

Sudden drops in any metrics or visuals prompt queries like “new instagram rule” (6,600 searches/month) and “whats going on with Instagram” (1,000 searches/month). Both Insta and TikTok creator/creative audiences are invested in understanding how those algorithms work and why they are changing.

Compare this with YouTube’s massive creator/creative base, which doesn’t seem to search that much on Google. (FYI, Google’s and YouTube’s search engines are completely different algorithms and products, so it’s likely that there are more searches on YouTube itself about the same issues and many of these queries.)

3. If a majority of Americans believe that Facebook and Google/YouTube have bias or censor some opinions over others, they aren’t searching for more information about why or how that might be.

These search volumes are fairly low, compared to other verticals of search queries, and compared to the majority of Americans in Pew’s study. If every one of the 264k monthly searches in my study came from a unique person (not likely), only about 0.08% of Americans would be represented in my dataset.
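The share of Americans represented can be checked with a few lines of arithmetic. This Python sketch assumes a U.S. population of roughly 330 million (my assumption, not a figure from the dataset) and treats each of the 264k monthly searches as one unique person, which is the deliberately generous upper bound described above.

```python
# Figure from the text: about 264,000 monthly U.S. searches in the dataset.
MONTHLY_SEARCHES = 264_000

# Assumption (not from the dataset): U.S. population of roughly 330 million.
US_POPULATION = 330_000_000


def share_of_population(searches, population):
    """Upper-bound share of people represented if each search were one unique person."""
    return searches / population


if __name__ == "__main__":
    share = share_of_population(MONTHLY_SEARCHES, US_POPULATION)
    # Format as a percentage with two decimal places.
    print(f"{share:.2%}")
```

Swapping in the adult population instead of the total population would raise the share slightly, but it stays a small fraction of one percent either way.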

I trust Pew’s data — and I need to do some more explorations specifically around the word “censor” — so how do we turn concepts of algorithmic and tech bias from a broadcast to a conversation? How do we create trust in our existing social platforms or any other emerging media when we’re starting from a point of both misinformation and mistrust?

And finally, the data itself…

Below is the keyword data, visualized by Flourish, but I’d recommend that you click through to the interactive versions, with five different viz options (and one more artistic interpretation) of the same data. Circles are to scale, so bigger circles indicate more search interest around a specific keyword.

The data kind of makes sense as a static graphic. You can see the more-than/less-than. But in the interactive version, you can hover to understand the totals scale, click into each of the circles to see individual search queries, duplicate and manipulate for yourself, and even check out the raw data if you want to do your own analysis.

And if you’d like this kind of topic research and analysis for your business or content strategy — or if you’d like me to help your team develop a similar data-driven content research process — please reach out. I’m happy to chat.


Flourish: Interactive data visualization for when Excel is too limiting

Digital data is hella complex, even when you’re doing a simple count and close-reading. Data is also often displayed in a way that’s not easy to understand, and we’re still getting a handle on the best kind of data visualization for the general public to understand.

The whole success of the digital economy relies on data being three-dimensional, multifaceted and worthy of exploration, especially when you’re not trying to answer a yes or no question. Static PowerPoint slides or marginally deeper Excel sheets may do the trick occasionally, but for the most part static graphs cut out the complexity and drive “more than” or “less than” decision-making. Tableau powers through massive amounts of data and is very slick, but it’s an enterprise-level expense and requires a power user to drive.

I’ve been searching for a middle lane for ages and I think I may have found it: The freemium data visualization tool Flourish offers a solution for independent data visualizers and enterprise teams alike.

Flourish at a glance

Flourish’s functionality isn’t unlike PowerPoint or Excel charts. Choose your template, and then upload data to make a chart. Export a static image or HTML file, or embed the chart as needed.

I used the public version of Flourish for this trial, which was functional but very slow at processing and visualizing the data. I’m sure when you click through you’ll experience a similar slow speed; the tool is processing and visualizing a lot of data, and that takes a while. When I eventually subscribe, I’m hoping it’s a little faster.

Once the data is uploaded, you can select the columns you want to analyze in your preferred order. It’s far easier than selecting columns in Excel or Tableau.

I also recommend preparing the data ahead of time, rather than fixing errors in the tool. And Flourish is case sensitive (Microsoft products are not, by default), so make sure your groupings are labeled consistently.
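As a hedged illustration of that prep step, the Python sketch below normalizes the case and whitespace of a grouping column before upload, so that labels like “Instagram” and “instagram ” collapse into one group. The column names (keyword, group, volume) and the sample rows are made up for illustration; they are not the real dataset or the export format of any tool mentioned here.

```python
import csv
import io

# Hypothetical sample of an exported keyword sheet; real exports will differ.
RAW_CSV = """keyword,group,volume
instagram algorithm,Instagram,100
new instagram rule,instagram ,200
tiktok algorithm,TikTok,300
how tiktok works,TIKTOK,400
"""


def normalize_groups(csv_text, column="group"):
    """Return rows with the grouping column trimmed and consistently cased."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        # Strip stray spaces, lowercase everything, then capitalize the first letter.
        row[column] = row[column].strip().casefold().capitalize()
    return rows


def totals_by_group(rows):
    """Sum search volume per normalized group label."""
    totals = {}
    for row in rows:
        totals[row["group"]] = totals.get(row["group"], 0) + int(row["volume"])
    return totals


if __name__ == "__main__":
    print(totals_by_group(normalize_groups(RAW_CSV)))
```

Without the normalization step, a case-sensitive tool would treat these four rows as four separate groups instead of two.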

Flourish offers a business-level version that some blue-chip clients like Google use. In the enterprise version, developers can upload their own viz templates: a super cool feature. I imagine the enterprise version has more collaborative elements as well.

But I recommend Flourish highly, especially for semantic/text-based research projects like this one. I’ve been looking for ages for a better visualization of semantic and keyword research than a pivot table, and finally I’ve found my dance partner.

A minor correction: Last week’s review graphic stated that Authory was “free.” It’s not free; it’s $10/month.


Content tech links of the week

The chum bucket vendors Taboola and Outbrain are not merging, via TechCrunch.
Did you see the GPT-3 generated article in The Guardian?
If you did, you should read why it’s overhyped, via TheNextWeb.
Many of us are often rightfully pissed about inaccurate and deceptive phrases in reporting, like “racially motivated” or “officer-involved shooting.” Here’s a good guide to why those phrases should be retired and some suggestions to fix them.
And I don’t like to link to research I haven’t read, but if you haven’t seen the report on amending Section 230, it’s on my docket and I’m sure we’ll be discussing it here soon.
