Contribution to Book
Fast and Memory-Efficient TFIDF Calculation for Text Analysis of Large Datasets
School of Computer Science & Engineering Faculty Publications
  • Samah Senbel, Sacred Heart University
Document Type
Book Chapter
Publication Date
1-1-2021
Abstract

Term Frequency–Inverse Document Frequency (TFIDF) is a vital first step in text analytics for information retrieval and machine learning applications. It is a memory-intensive and complex task, since it requires creating and processing a large sparse matrix of term frequencies, with documents as rows and terms as columns, populated with the frequency of each word in each document.
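As a rough illustration of the computation the abstract describes, the sketch below builds term frequencies per document and applies one common smoothed-IDF variant; the chapter itself may use a different weighting, and the corpus and function names here are purely illustrative.

```python
import math
from collections import Counter

# A toy corpus; in practice each document would be a large tokenized text.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Term frequency: count of each term in each document (rows of the sparse matrix).
tf = [Counter(doc.split()) for doc in docs]

# Document frequency: number of documents that contain each term.
df = Counter()
for counts in tf:
    df.update(counts.keys())

n_docs = len(docs)

def tfidf(term, doc_index):
    # TF times a smoothed IDF -- one common variant, assumed here for illustration.
    return tf[doc_index][term] * math.log((1 + n_docs) / (1 + df[term]))

print(round(tfidf("cat", 0), 3))  # → 0.693
```

Because most terms do not occur in most documents, the full documents-by-terms matrix is overwhelmingly zero, which is why a sparse representation is needed at all.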

The standard method of storing the sparse matrix is the "Compressed Sparse Row" (CSR) format, which stores it as three one-dimensional arrays holding the row information, column ids, and term frequencies. We propose an alternative representation to CSR: a list of lists (LIL), in which each document is represented by its own list of tuples, each tuple storing a column id and a term frequency value. We implemented both techniques to compare their memory efficiency and speed. The new LIL representation increases memory capacity by 52% while being only 12% slower in processing time. This enables researchers with limited processing power to work on larger text analysis datasets.
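To make the two layouts concrete, the sketch below stores the same small term-frequency matrix both ways. Note that in the standard CSR layout the per-row information is a compressed row-pointer array rather than one row id per nonzero; the variable names and the 3×5 example matrix are assumptions for illustration, not taken from the chapter.

```python
# Dense view of a toy term-frequency matrix (rows = documents, columns = term ids):
#   [[2, 0, 1, 0, 0],
#    [0, 3, 0, 0, 1],
#    [0, 0, 0, 4, 0]]

# CSR: three parallel one-dimensional arrays.
indptr = [0, 2, 4, 5]      # row i's nonzeros live at positions indptr[i]:indptr[i+1]
indices = [0, 2, 1, 4, 3]  # column id of each nonzero
data = [2, 1, 3, 1, 4]     # term frequency of each nonzero

# LIL: one list per document, each entry a (column_id, frequency) tuple.
lil = [
    [(0, 2), (2, 1)],
    [(1, 3), (4, 1)],
    [(3, 4)],
]

def csr_row(i):
    """Return row i of the CSR matrix as (column_id, frequency) pairs."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

# Both representations yield identical rows.
for i in range(len(lil)):
    assert csr_row(i) == lil[i]
```

The LIL layout lets each document grow independently as it is tokenized, whereas appending to CSR's global arrays requires knowing (or re-sizing for) the total nonzero count, which is one intuition for the memory trade-off the abstract reports.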

Comments

Part of the Lecture Notes in Computer Science book series (LNCS, volume 12798)

Accepted version is posted.

DOI
10.1007/978-3-030-79457-6_47
Citation Information

Senbel, S. (2021). Fast and memory-efficient TFIDF calculation for text analysis of large datasets. In H. Fujita, A. Selamat, J. C.-W. Lin, & M. Ali (Eds.), Advances and trends in artificial intelligence: Artificial intelligence practices (pp. 557-563). Springer. doi:10.1007/978-3-030-79457-6_47