
Digital Scholarship: Computational Text Analysis

Overview


Computational Text Analysis (CTA) is an umbrella term for an array of digital tools and quantitative techniques that harness the power of computers and software to analyze digital texts, from individual texts to big (textual) data.

CTA's primary value is that it expands the scale of traditional text analysis. Whereas scholars were formerly limited to analyzing one or a handful of texts at a time--a method often known as "close reading"--CTA lets them also analyze thousands at once to identify large-scale patterns and trends--a method known as "distant reading." When close and distant reading are paired, scholars can make more informed generalizations.

 

An example of the CTA method of stylometry: a dendrogram visualizing a hierarchical cluster analysis of US presidential inaugural speeches, grouped by stylistic similarity.

Techniques

CTA techniques include:

Keyword Analysis: keywords and key phrases can be identified or tracked in a text or corpus through various computational means. 

Named Entity Recognition (NER): NER extracts and categorizes a text's or corpus's proper nouns and other information types.

Sentiment Analysis: sentiment analysis quantitatively determines affective trends in a document or corpus.

Stylometry: Stylometry is the use of quantitative and statistical methods to determine literary style.

Topic Modeling: Topic modeling determines the thematic composition--the aboutness--of a document or documents in a corpus.

Word Embedding Modeling: Word embedding determines the aboutness of words in a document or collection by computing which words tend to be associated.
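As a rough illustration of the simplest technique above, keyword analysis, the following Python sketch counts how often chosen keywords occur in a text. It is only one of the "various computational means" and is not the method of any particular tool listed in this guide; the function names `tokenize` and `keyword_counts` and the sample sentence are hypothetical, chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_counts(text, keywords):
    """Return how often each keyword of interest occurs in the text."""
    counts = Counter(tokenize(text))
    return {word: counts[word] for word in keywords}


# Hypothetical two-sentence "corpus" used only for illustration.
sample = "The people govern. The government serves the people."
print(keyword_counts(sample, ["people", "government", "liberty"]))
# prints {'people': 2, 'government': 1, 'liberty': 0}
```

Running the same function over each text in a corpus would give the per-document counts needed to track a keyword across a collection.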

Tutorials

The Programming Historian: free, peer-reviewed digital humanities tutorials (here linked to their tutorials on CTA)

The Fish and the Painting: Andrew Piper's online textbook on how to use R for humanities text analysis

Hacking the Humanities Tutorials: Paul Vierthaler's YouTube tutorials on how to use Python for humanities text analysis

Tutorials also accompany a number of the tools listed below.

Tools

Free CTA tools include:

Easy

AntConc: downloadable tool mainly for keyword analyses of a text or corpus

Voyant: browser-based tool mainly for keyword analyses of a text or corpus

Topic Modeling Tool: downloadable tool for topic modeling

Moderate

Lexos: browser-based tool mainly for stylometry

MALLET (MAchine Learning for LanguagE Toolkit): command-line software mainly for topic modeling

Difficult

Python: general-purpose programming language often used for text analysis

R: statistics-oriented programming language often used for text analysis

Tool-Corpus Sets

English-corpora.org: enables keyword analyses of a variety of large corpora

HathiTrust Research Center (Free to Union): set of affordances for analyzing the HathiTrust collection

Data Preprocessing

Stopword Lists (see also NLTK Data)

Stopwords are words of high frequency but low meaning (such as function words, like "a," "an," "of," "the," etc.) that can hinder some text analyses (stylometry is a key exception, as it analyzes these words). Stopword lists tell the text analysis software the words to ignore. 
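A minimal sketch of how a stopword list is applied, assuming a simple regular-expression tokenizer: the seven-word `STOPWORDS` set and the function name `remove_stopwords` are hypothetical, and real stopword lists (such as those in NLTK Data) are much longer.

```python
import re

# A tiny, hypothetical stopword list; real lists are much longer.
STOPWORDS = {"a", "an", "of", "the", "and", "to", "in"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Tokenize the text and drop every token that appears in the stopword list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(remove_stopwords("The history of the novel in an age of data"))
# prints ['history', 'novel', 'age', 'data']
```

For stylometry, the same function could be called with an empty set as `stopwords`, so that the function words are kept for analysis.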

Stemmers / Lemmatizers

Stemmers and lemmatizers reduce inflected words (e.g. "thinks," "thinking," "thinker") to their root (e.g. "think"), which can be helpful in text analysis. Lemmatizers attempt to account for a word's context and part of speech (POS), for example whether "saw" is a noun or a verb, but can be complex and run slowly; stemmers do not account for context and POS but tend to be simple and fast.
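The contrast can be sketched with two toy functions. This is not a real stemming algorithm or lemmatizer: `naive_stem`, `toy_lemmatize`, the `SUFFIXES` tuple, and the `LEMMAS` table are all hypothetical and far smaller than what actual tools use. The point is only that the stemmer strips suffixes blindly, while the lemmatizer needs the part of speech to decide.

```python
# Toy suffix list and lemma table, both hypothetical and far smaller than real resources.
SUFFIXES = ("ing", "er", "s")
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("thinks", "VERB"): "think",
    ("thinking", "VERB"): "think",
}


def naive_stem(word):
    """Strip the first matching suffix, ignoring context and part of speech."""
    for suffix in SUFFIXES:
        # Require a few remaining letters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up the lemma for a (word, part-of-speech) pair; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


print([naive_stem(w) for w in ["thinks", "thinking", "thinker", "saw"]])
# prints ['think', 'think', 'think', 'saw']
print(toy_lemmatize("saw", "VERB"), toy_lemmatize("saw", "NOUN"))
# prints see saw
```

The stemmer leaves "saw" untouched in every case, whereas the lemmatizer maps the verb to "see" and keeps the noun as "saw", at the cost of needing a lookup table and a POS label for each word.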

Data

Free text data repositories include: 

DH Resources for Project Building--Data Collections and Datasets: aggregates repositories of text data

DocNow: Twitter datasets

English Corpora: various text datasets

JSTOR Data for Research: 12+ million secondary and primary source texts

NLTK Data: various datasets, from text collections to stopword lists to sentiment lexicons

Project Gutenberg: 60,000+ books, with focus on older, public domain works

Schaffer Library's Databases ("Free" to Union): access to a variety of digitized texts. A number of these offer tools to analyze their text collections.

Projects

CTA Projects include:

Hedonometer: sentiment analysis that seeks to measure happiness in a variety of corpora

Viral Texts: traces text reuse in nineteenth-century American newspapers

What Every1 Says: applies topic modeling to trace how the humanities is covered in the news

Permissions

This page was created by Adam Mazel and is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Readings