Automation Can Lead to Confounds in Text Analysis: Back, Küfner, and Egloff (2010) and the Not-So-Angry Americans
Psychological Science (2011)
  • Cynthia L. S. Pury, Clemson University
Abstract

Automated text analysis facilitates research using large archival data sets but can be confounded by automatically generated repeating entries. Back, Küfner, and Egloff (2010) used Linguistic Inquiry and Word Count (LIWC; Pennebaker, Francis, & Booth, 2001) to analyze pager messages sent to more than 85,000 American pagers on September 11, 2001. They found that anger, as indexed by the words contained in those messages, rose steadily throughout the day.

The data contained many technical codes; thus, Back et al. counted only words recognized by LIWC. However, this procedure did not exclude automatically generated messages. Consequently, LIWC words in such messages were counted, even if the words lacked emotional meaning in context. Furthermore, computers can send messages with superhuman frequency, turning an otherwise minor measurement error into a serious confound. This confound can be detected by treating individual text messages as primary units, reading samples of each key word in context, and looking for repeating false positives.
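
As an illustration of that detection strategy, the following sketch (Python, with hypothetical message strings and a small stand-in word list; the actual LIWC anger dictionary is proprietary) reads each keyword hit in its surrounding context and tallies near-identical contexts. A context that recurs many times is a likely automatically generated message and therefore a candidate false positive.

```python
import re
from collections import Counter

# Hypothetical pager messages; each message is treated as a primary unit of analysis.
messages = [
    "Reboot NT machine SRV01 in cabinet A3 at DC-EAST:CRITICAL:09/11/01 11:42",
    "Reboot NT machine SRV02 in cabinet A3 at DC-EAST:CRITICAL:09/11/01 11:47",
    "so angry right now, call me when you can",
]

# Stand-in anger words (the real LIWC anger category is proprietary).
ANGER_WORDS = {"critical", "angry", "mad", "hate"}

def keyword_in_context(msgs, keywords, window=25):
    """Yield (keyword, surrounding text) for every hit, so hits can be read in context."""
    for msg in msgs:
        for word in keywords:
            for m in re.finditer(rf"\b{re.escape(word)}\b", msg, re.IGNORECASE):
                lo, hi = max(0, m.start() - window), m.end() + window
                yield word, msg[lo:hi]

def normalize(ctx):
    """Collapse variable fields (digits, case) so repeating templates surface."""
    return re.sub(r"\d+", "#", ctx.lower())

repeats = Counter(normalize(ctx) for _, ctx in keyword_in_context(messages, ANGER_WORDS))
for ctx, n in repeats.most_common():
    print(n, ctx)  # contexts with large counts are candidate repeating false positives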

Using a spreadsheet text-search function, I searched for each of the LIWC anger words in each alphanumeric pager message (these messages were identified by the code ALPHA at the beginning of each message; N = 280,074) sent during the time period Back et al. analyzed. I found 16,624 instances of anger words. Of those, 5,974 (35.9%) were in nearly identically worded messages sent to the same pager (Pager X). Each message said in its entirety, “Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time].” In this context, critical likely means “urgent” rather than “disparaging.”
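
The search itself can also be reproduced outside a spreadsheet. Below is a minimal Python sketch, not the spreadsheet workflow actually used, assuming one message per line with the ALPHA code at the start of the line; the file name, line layout, and word list are hypothetical stand-ins.

```python
import re

# Stand-in for the proprietary LIWC anger category.
ANGER_WORDS = ["critical", "angry", "mad", "hate", "kill"]
ANGER_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, ANGER_WORDS)) + r")\b",
                      re.IGNORECASE)

def count_anger_hits(path):
    """Count anger-word instances in alphanumeric (ALPHA) pager messages."""
    n_alpha, n_hits = 0, 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.startswith("ALPHA"):  # keep only alphanumeric messages
                continue
            n_alpha += 1
            n_hits += len(ANGER_RE.findall(line))
    return n_alpha, n_hits

# Usage with a hypothetical archive file:
# n_messages, n_instances = count_anger_hits("pager_archive_2001-09-11.txt")
```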

The frequency of this message increased rapidly after the first instance, at 11:42 a.m., and plateaued by 3:30 p.m. at a mean of 46 repetitions per 5-min block (SD = 2.9), or 552 times per hour. Total text-message volume declined beginning in the afternoon. Thus, Pager X messages accounted for an increasing proportion of the total text messages sent throughout the day, peaking at .167 in the final 5-min block (see Fig. 1a for these proportions as a function of the 30-min blocks shown in Back et al.’s figure). Removing Pager X messages from the data set significantly reduced the rise in anger reported by Back et al. (z = 10.71, p < .001; see Fig. 1b). The correlation between time and anger words per message for each block was .84 for the original data set over the entire time period (p < .001), but only .20 (p = .003) for the same data set with Pager X messages removed. After removal of Pager X messages, there were more angry words per message after the first attack at 8:45 a.m. (M = .039, SD = .026, n = 216) than before (M = .019, SD = .014, n = 24), t(214) = 3.20, p < .001; however, the linear relationship between anger and time after the first attack (r = .08, p = .30) was negligible.
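
The effect of removing the repeating messages can be checked with a simple re-analysis. The sketch below, under assumed data structures (per-message tuples of time, anger-word count, and pager ID; all values hypothetical), computes anger words per message within 5-min blocks and correlates that rate with time, before and after excluding the Pager X messages.

```python
import numpy as np

# Hypothetical per-message records: (minutes since midnight, anger-word count, pager ID).
records = [
    (525, 0, "pager_a"), (600, 0, "pager_b"), (702, 1, "pager_x"),
    (705, 1, "pager_x"), (710, 1, "pager_x"), (730, 1, "pager_c"),
    (800, 0, "pager_d"), (815, 1, "pager_x"), (900, 1, "pager_x"),
]

def anger_rate_by_block(recs, block_min=5):
    """Mean anger words per message within each 5-min block."""
    blocks = {}
    for minute, anger, _ in recs:
        blocks.setdefault(minute // block_min, []).append(anger)
    times = sorted(blocks)
    return np.array(times, dtype=float), np.array([np.mean(blocks[t]) for t in times])

# Correlation of block time with anger words per message, with and without Pager X.
t_all, rate_all = anger_rate_by_block(records)
t_cln, rate_cln = anger_rate_by_block([r for r in records if r[2] != "pager_x"])

print("with Pager X:    r =", np.corrcoef(t_all, rate_all)[0, 1])
print("without Pager X: r =", np.corrcoef(t_cln, rate_cln)[0, 1])
```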

Thus, it appears that much of the dramatic rise in anger reported by Back et al. was due to a repeated and emotionally neutral technical message associated with a single pager. Because today’s e-mail, social media, and text messages can include automatically generated messages, future researchers of linguistic archives should consider ways to prevent similar confounds.

Keywords
  • automated text analysis
  • methodology
  • data cleaning
Publication Date
June 7, 2011
Citation Information
Cynthia L. S. Pury. "Automation Can Lead to Confounds in Text Analysis: Back, Küfner, and Egloff (2010) and the Not-So-Angry Americans" Psychological Science Vol. 22 Iss. 6 (2011)
Available at: http://works.bepress.com/clspury/5/