Subject Research, Course Guides, Documentation: Digital Scholarship: Computational Text Analysis

Overview

Computational Text Analysis (CTA) is an umbrella term for an array of digital tools and quantitative techniques that harness the power of computers and software to analyze digital texts, from individual texts to big (textual) data.

CTA's primary value is that it enables the scale of traditional text analysis to expand. Whereas scholars were formerly limited to analyzing only one or a handful of texts at at time--a method often known as "close reading"--scholars via CTA can now also analyze thousands at once to identify large-scale patterns and trends--a method known as "distant reading." When close and distant reading are paired, scholars can make more informed generalizations.

(Left) An Example of the CTA method of Stylometry: Dendrogram Visualizing Hierarchical Cluster Analysis of US Presidential Inaugural Speeches (grouped by stylistic similarity)

Techniques

CTA techniques include:

Keyword Analysis: keywords and key phrases can be identified or tracked in a text or corpus through various computational means.

Named Entity Recognition (NER): NER extracts and categorizes a text's or corpus's proper nouns and other information types.

Sentiment Analysis: sentiment analysis quantitatively determines affective trends in a document or corpus.

Stylometry: Stylometry is the use of quantitative and statistical methods to determine literary style.

Topic Modeling: Topic modeling determines the thematic composition--the aboutness--of a document or documents in a corpus.

Word Embedding Modeling: Word embedding determines the aboutness of words in a document or collection by computing which words tend to be associated.

Tutorials

The Programming Historian: free, peer-reviewed digital humanities tutorials (here linked to their tutorials on CTA)

The Fish and the Painting: Andrew Piper's online textbook on how to use R for humanities text analysis

Hacking the Humanities Tutorials: Paul Vierthaler's YouTube tutorials on how to use Python for humanities text analysis

Tutorials also accompany a number of tools listed above.

Tools

Free CTA tools include:

Easy

AntConc: downloadable tool mainly for keyword analyses of a text or corpus

Voyant Tools Icon Voyant: browser-based tool mainly for keyword analyses of a text or corpus

undefined Topic Modeling Tool: downloadable tool for topic modeling

Moderate

Lexos: browser-based tool mainly for stylometry

undefined (MAchine Learning for LanguagE Toolkit): command-line software mainly for topic modeling

Difficult

general-purpose programming language often used for text analysis

statistics-oriented programing language often used for text analysis

Tool-Corpus Sets

English-corpora.org: enables keyword analyses of a variety of large corpora

undefined HathiTrust Research Center (Free to Union): set of affordances for analyzing the HathiTrust collection

Data Preprocessing

Stopword Lists (see also NLTK Data)

Stopwords are words of high frequency but low meaning (such as function words, like "a," "an," "of," "the," etc.) that can hinder some text analyses (stylometry is a key exception, as it analyzes these words). Stopword lists tell the text analysis software the words to ignore.

Matt Jockers' Expanded Stopwords List The list of stop words used in topic modeling a corpus of 3,346 works of 19th-century British, American, and Irish fiction. The list includes the usual high frequency words (“the,” “of,” “an,” etc) but also several thousand personal names.
SMART Stopword List This freely available stopword list was built by Gerard Salton and Chris Buckley for the experimental SMART information retrieval system at Cornell University. This stopword list is generally considered to be on the larger side and so when it is used, some implementations edit it so that it is better suited for a given domain and audience while others use this stopword list as it stands. This wordlist is 571 words in length.

Stemmers / Lemmatizers

Stemmers + Lemmatizers reduce inflected words (ex. "thinks," "thinking," "thinker," etc.) to their root (ex. "think"), which can be helpful in text analysis. Lemmatizers attempt to account for a word's context and part of speech (i.e. whether "saw" is a noun or verb) but can be complex and run slowly; Stemmers do not account for context and POS but tend to be simple and fast.

Snowball Stemmer Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.

Data

Free text data repositories include:

DH Resources for Project Building--Data Collections and Datasets: aggregates repositories of text data

DocNow: Twitter datasets

English Corpora: various text datasets

JSTOR Data for Research: 12+ million secondary and primary source texts

NLTK Data: Various datasets from text collections, to stopword lists, to sentiment lexicons, etc.

Project Gutenberg: 60,000+ books, with focus on older, public domain works

Schaffer Library's Databases ("Free" to Union): access to a variety of digitized texts. A number of these offer tools to analyze their text collections.

Projects

CTA Projects include:

Hendometer: sentiment analysis that seeks to measure happiness in a variety of corpora

Viral Texts: traces text reuse in 19C American newspapers

What Every1 Says: applies topic modeling to trace how the humanities is covered in the news

Permissions

Readings

Distant Horizons by Ted Underwood Just as a traveler crossing a continent won't sense the curvature of the earth, one lifetime of reading can't grasp the largest patterns organizing literary history. This is the guiding premise behind Distant Horizons, which uses the scope of data newly available to us through digital libraries to tackle previously elusive questions about literature. Ted Underwood shows how digital archives and statistical tools, rather than reducing words to numbers (as is often feared), can deepen our understanding of issues that have always been central to humanistic inquiry.  Without denying the usefulness of time-honored approaches like close reading, narratology, or genre studies, Underwood argues that we also need to read the larger arcs of literary change that have remained hidden from us by their sheer scale. Using both close and distant reading to trace the differentiation of genres, transformation of gender roles, and surprising persistence of aesthetic judgment, Underwood shows how digital methods can bring into focus the larger landscape of literary history and add to the beauty and complexity we value in literature.  
ISBN: 9780226612836

Publication Date: 2019-02-14
Distant Reading by Franco Moretti How does a literary historian end up thinking in terms of z-scores, principal component analysis, and clustering coefficients? The essays in Distant Reading led to a new and often contested paradigm of literary analysis. In presenting them here Franco Moretti reconstructs his intellectual trajectory, the theoretical influences over his work, and explores the polemics that have often developed around his positions. From the evolutionary model of Modern European Literature, through the geo-cultural insights of Conjectures of World Literature and Planet Hollywood, to the quantitative findings of Style, inc. and the abstract patterns of Network Theory, Plot Analysis, the book follows two decades of conceptual development, organizing them around the metaphor of distant reading, that has come to define well beyond the wildest expectations of its author a growing field of unorthodox literary studies.
ISBN: 9781781680841

Publication Date: 2013-06-04
Enumerations by Andrew Piper For well over a century, academic disciplines have studied human behavior using quantitative information. Until recently, however, the humanities have remained largely immune to the use of data--or vigorously resisted it. Thanks to new developments in computer science and natural language processing, literary scholars have embraced the quantitative study of literary works and have helped make Digital Humanities a rapidly growing field. But these developments raise a fundamental, and as yet unanswered question: what is the meaning of literary quantity?           In Enumerations, Andrew Piper answers that question across a variety of domains fundamental to the study of literature. He focuses on the elementary particles of literature, from the role of punctuation in poetry, the matter of plot in novels, the study of topoi, and the behavior of characters, to the nature of fictional language and the shape of a poet's career. How does quantity affect our understanding of these categories? What happens when we look at 3,388,230 punctuation marks, 1.4 billion words, or 650,000 fictional characters? Does this change how we think about poetry, the novel, fictionality, character, the commonplace, or the writer's career? In the course of answering such questions, Piper introduces readers to the analytical building blocks of computational text analysis and brings them to bear on fundamental concerns of literary scholarship. This book will be essential reading for anyone interested in Digital Humanities and the future of literary study.  
ISBN: 9780226568751

Publication Date: 2018-08-29
Macroanalysis by Matthew L. Jockers In this volume, Matthew L. Jockers introduces readers to large-scale literary computing and the revolutionary potential of macroanalysis--a new approach to the study of the literary record designed for probing the digital-textual world as it exists today, in digital form and in large quantities. Using computational analysis to retrieve key words, phrases, and linguistic patterns across thousands of texts in digital libraries, researchers can draw conclusions based on quantifiable evidence regarding how literary trends are employed over time, across periods, within regions, or within demographic groups, as well as how cultural, historical, and societal linkages may bind individual authors, texts, and genres into an aggregate literary culture. Moving beyond the limitations of literary interpretation based on the "close-reading" of individual works, Jockers describes how this new method of studying large collections of digital material can help us to better understand and contextualize the individual works within those collections.
ISBN: 9780252079078

Publication Date: 2013-04-01

Research at Schaffer Library

Digital Scholarship: Computational Text Analysis