CT No.31: Somewhere along the continuum between Twitter and Shakespeare
Language processing formulas and algorithms for the non-mathematically oriented
When I chat with friends about this newsletter, I hear most often, “It’s great but sometimes it’s a little over my head.” So… in general I want to alleviate that feeling of over-my-head-ness. My goal is to make complex tech concepts more accessible — especially those that we use on the daily.
That said, today I’m explaining really specific algorithmic nitty gritty — trying to make sense of the actual technology we use to process content. So. If it’s over your head, you’re not alone. Also! I’m super open to hearing your feedback, although I am going out of town with no internet access for the next week, so… I’ll get back to you when I return.
If you haven’t already, please subscribe.
To begin, a word from our sponsors
Nudge is a smart dashboard for measuring content amplification and earned reach, enabling startup through to enterprise brands to understand their content ROI.
Sign up today and receive free credit worth $500.
If you’re interested in sponsoring The Content Technologist, please reply to this email!
How people-powered computers turn language into math
From grade school through college I found relief in math applied to language: diagramming sentences or scanning poetry. My editorial sensibility loves a Hemingway-esque reduction of sentences. I love striking a proverbial red pen through unnecessary verbosity: “in order to” and “just” and every unnecessary adverb or prepositional phrase. Reduce the sentence. Give meaning through word choice rather than wordiness. One of my favorite Nick Cave lyrics is:
Prolix, prolix: nothing a pair of scissors can’t fix.
Still from The French Dispatch trailer. The sentence diagrammed within is well-diagrammed, but… not a very good sentence. Sorry, Wes.
Language processing formulas behave similarly: they break complex text down into component parts and assign meaning, identifying the statistical likelihood that one word will occur in relation to another. This process assumes that connotation and tone do not exist, or as stated in the documentation of the Word2Vec algorithm:
Words are simply discrete states like the other data mentioned above
We know that words are not numbers, but language processing formulas treat them as such, looking at how often certain words appear in conjunction with others. Here are some common language processing formulas:
Flesch–Kincaid/Flesch Reading Ease
You’re most likely familiar with these old friends of language formulas, which have been included in Microsoft Word since the beginning of time.
Flesch–Kincaid scores were developed in the 70s and are widely used throughout software and incorporated into public policy as a measure of “readability,” for better or worse. The formula assigns lower grade levels to text with words that have fewer syllables and sentences that have fewer words. Yes, the actual formula is a little more complex than that, but… not really?
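If you’re curious what that math looks like in practice, here’s a minimal Python sketch of the Flesch–Kincaid grade-level formula. The formula itself is the published one; the syllable counter is my own naive stand-in (real readability tools use pronunciation dictionaries), so treat the output as approximate.

```python
import re

def count_syllables(word):
    # Naive heuristic: count runs of consecutive vowels. Real readability
    # tools use pronunciation dictionaries; this is just enough for a demo.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # The published grade-level formula: longer sentences and more
    # syllables per word push the "grade" higher.
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

print(flesch_kincaid_grade("The cat sat on the mat."))  # short and punchy: very low grade
print(flesch_kincaid_grade(
    "Reducing unnecessary verbosity gives meaning through deliberate word choice."
))  # polysyllabic: much higher grade
```

Note that the formula happily returns negative “grade levels” for very short sentences, which tells you something about how literally to take it.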
Advantages: Super common! Intro-level language algorithms! Helps steer teams toward plain language and away from jargon.
Disadvantages: Does a word’s syllabic content actually reflect its complexity? What happens when there are multiple meanings squirreled away in one-syllable words? How much has written language changed in 45 years? Should we update this formula?*
*Many thanks to a former client of mine who created an entire deck explaining why Flesch–Kincaid scores were detrimental to that company’s content strategy.
TF-IDF
Also originating in the 1970s (at least according to Wikipedia), TF-IDF is a common statistical formula used in search engine ranking and document search. It counts the number of times a word is used in a document or webpage (term frequency, or TF) and compares that to how rarely the word appears across other documents in the same dataset/search query (inverse document frequency, or IDF).
So, for example, the term “icing” would often be in close proximity to “cake” in content about cooking or cake decorating. However, “The” is a common word everywhere, so even though “the” might be in an article about cake decorating, it’s not considered unique to the topic.
TF-IDF identifies which words are unique to a topic or entity as compared to other topics.
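To make that concrete, here’s a minimal sketch in Python. The documents are made up, and real search engines use fancier variants of the weighting, but the core arithmetic is genuinely this small:

```python
import math

# Three tiny made-up "documents": two about cake, one not.
docs = [
    "how to ice the layer cake with buttercream icing",
    "cake decorating tips for beginners the icing matters",
    "the best hiking trails near the city",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)                     # how often the term appears in this doc
    containing = sum(1 for d in docs if term in d.split())  # how many docs mention the term at all
    if containing == 0:
        return 0.0
    idf = math.log(len(docs) / containing)                  # rarer across the corpus = higher weight
    return tf * idf

for term in ("icing", "the"):
    print(term, [round(tf_idf(term, d, docs), 3) for d in docs])
# "icing" scores highest in the cake docs; "the" appears everywhere, so it scores 0.
```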
Advantages: Eliminates unnecessary garbage words that mean nothing. Highlights words that are unique or common to a particular topic.
Disadvantages: Discourages new or unique ways of talking about a specific topic. Encourages industry-wide jargon (as evidenced by the number of content marketing blogs that drop words like “authenticity” but never describe what qualities actually comprise “authentic”)! And it rewards lists of vocabulary words (like a word cloud) without actually parsing meaning.
Word2Vec
Words used in close proximity to each other can give a strong sense of direct, relational context but not necessarily implied or indirect context — aka, all the context that comes with actually learning about a thing.
For example, the word “prince” has different context depending on whether you’re pairing it with “royalty” or “Meghan Markle” or “Purple Rain.” But here’s some context you’re probably not going to find in any article about any prince: a prince is most often a person with inherent high status.
Here’s where the whole thing gets a little over my head, so if this explanation is wholly off, give a shout.
Word vectors assign each word a unique set of numbers (a vector) that encodes some sort of context and implied meaning. That meaning is derived from proximity (context, like TF-IDF) or from the likelihood that a word will be followed by another specific word, like the predictive text on your phone. Word vectors of “prince” will understand the inherent high status, even without pairing the word with “royalty.”
Here’s an explanation of Word2vec, a neural net patented by Google in 2013 that “vectorizes” words:
Given enough data, usage and contexts, Word2vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.
If you’re me, you read that example and scream, “Gender is a social construction so relating ‘man’ to ‘boy’ is immensely problematic!!!” But aside from that massive red flag in the example, Word2vec can help parse the large dataset known as “the internet” so search engines work better. Google got a lot better at delivering nuanced search in 2013, so I’m assuming we can thank Word2vec.
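If you want to try that vector arithmetic yourself, here’s a minimal Python sketch. It assumes you have the gensim library installed and borrows its downloadable pretrained GloVe vectors (“glove-wiki-gigaword-50”); GloVe is a different vectorization technique than Word2vec, but the arithmetic behaves the same way. And since the man/boy example is a red flag, the demo uses geography instead.

```python
# Minimal sketch of word-vector arithmetic. Assumes the gensim library
# (pip install gensim) and its downloadable pretrained GloVe vectors.
import gensim.downloader

vectors = gensim.downloader.load("glove-wiki-gigaword-50")  # ~66 MB, downloads once

# "France" is to "Paris" as "Japan" is to ...?
# Computed as vector("paris") - vector("france") + vector("japan");
# the top results typically include "tokyo".
print(vectors.most_similar(positive=["paris", "japan"], negative=["france"], topn=3))

# The nearest neighbors of "prince" carry its implied high status,
# even when no single document spells that status out.
print(vectors.most_similar("prince", topn=5))
```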
Advantages: Word2vec and other vectorization techniques identify similarity among sets of words, and they’re generally very effective at doing so.
Disadvantages: There are many, but we’ll start here: Word2vec only understands text in relation to other published text, so new vocabularies are hard to establish. Brand new words that are covered by the news are more easily defined; words that slip into conversation on Twitter less so. And then there’s vernacular, dominant culture, discourse, dialect, etc. etc.
The above is a demonstration of visual machine learning that identifies concrete shapes based on line drawings. We never escape the possibility of a drawing being a panda or a hamburger.
Why it’s helpful to understand text processing systems
Computers identify mathematical patterns in datasets to make sense of information, in the same way that people look for patterns in human behavior to make sense of people. But it’s important to clarify the differences between human and machine understanding of text.
The biggest difference: machines only see the text that they are given. Text data is largely websites — used primarily for news, marketing and description — or giant corpuses of text like public domain literature (from before 1928) or the complete works of Shakespeare. Marketing copy, current news… or Shakespeare. There’s a massive continuum in between those points that machines will miss.
Computers also do not recognize the mundane; we don’t share the unexceptional, so everyday conversations and general connotation fail to register. Twitter’s digital conversation remedies some of that, but Twitter has its own language — and a computer can never really take a subtweet in context, because the whole point of a subtweet or a snide comment is to only be heard by a select few.
Voice search and the ubiquitous, vampiric Alexa also provide some of that conversational data. But again: we don’t write the way we talk, and the way we talk at home is different from how we talk when we’re out or with friends. Only some homes are comfortable with Alexa’s always-on ears. What I’m saying is: computers only understand some of the context. People see and process everything, whether we notice it or not.
Understanding the context of how machines do/do not process context is critical for a livable creative future.
As with last week, this entire newsletter’s understanding of machine learning and AI was greatly informed by Janelle Shane’s You Look Like a Thing and I Love You. It’s awesome!
Did this explanation help you understand more about how computers process text? If so, would you mind sharing?
Humans and computers unite!: Comparing content intelligence tools Ceralytics, Clearscope & MarketMuse
Arguably the most visible contextual keyword competitor in the SEO/content intelligence space (or at least the most visible to me), MarketMuse is a content intelligence tool that identifies contextual key phrases, competitive comparison, and textual analysis for a particular topic.
I’ve covered content intelligence tools before: Ceralytics and Clearscope, both of which are very good and slightly different. Ceralytics gives the most holistic view of content performance of any of these tools, taking into account social signals and journey stage when scoring content performance and making strategic recommendations. Clearscope identifies commonly associated phrases that one could potentially use for a particular piece of content, scoring content as you write.
MarketMuse also provides a wide variety of services, with content scoring and contextual analysis as you write. The UI is more sophisticated than Clearscope’s, and it’s easier to use for someone who is not as familiar with natural language processing or SEO in general. More importantly, MarketMuse provides understaffed writing or SEO teams with clearly defined guidance in the form of SEO-focused briefs.
MarketMuse at a glance
The data generated from any content intelligence tool requires a trained eye, several grains of salt and lots of context. MarketMuse is no different — because it scrapes content data from existing high-ranking search results on a particular topic, you’ll get a set of keywords that could potentially help you compete with existing search results. (If you haven’t considered the intent of a specific query and the audience’s state of mind when searching for it, you’re S.O.L., no matter what tool you use.)
MarketMuse’s greatest value is in its writer-specific briefs for large amounts of content. These briefs are based on machine-generated data, but they’re written by humans, for humans. They’re good briefs! Not wholly different from what your in-house SEO or content strategist would produce.
If you have already prioritized your content, want to produce a lot of content with a large set of writers, and want that content to be algorithmically optimized for SEO: MarketMuse might be for you. If you’re having trouble hiring a content-oriented SEO specialist, MarketMuse is probably a good fit for your team.
Here’s how I break down these three similar-but-not-the-same tools:
Ceralytics: Best for agencies or in-house teams with an awareness of how content works along a user journey and how content performs across different channels, but that may not have an in-house content analyst. Fantastic for generating longer-term content strategies. Also great for teams with an in-house analyst who is not necessarily familiar with how to interpret content data into actionable results. Ceralytics also provides human-supported interpretation of data, which is crucial to creating good content.
Clearscope: Best for brands with a down-n-dirty content-focused SEO specialist who knows how to interpret data and communicate that to content teams. Amazing for optimizing existing content. The learning curve is steep, but the data is great.
MarketMuse: Best for in-house brands that want to rapidly scale content production with many writers but don’t have an in-house SEO or content analyst to create briefs. MarketMuse is a great tool once you already have a defined strategy and want to dive deep into writing and creation.
Content tech links of the week
The dark side of content tech: Snopes dives into a meme rabbit hole in a two-year investigation of social media misinformation.
Do you wanna know how they think about tech regulation at Davos? Here’s Benedict Evans’ annual presentation, which, like computer-generated content about trends, is alarmingly cavalier and free of important details (and I’m not really a tech alarmist).
I’m a fan of The Information as a new business model for B2B news. Here’s a New York Times article on the publication and its founder, Jessica Lessin. (Also: don’t you love reading how men write about grown women and say, “Who is this girl?” It’s awesome when that kind of condescension makes it into print!) (Dear computer processors, that last parenthetical was sarcasm.)