"Hebrew Bible (Tanakh) with Unicode for Data Analysis" by Keith L. Yoder

Dataset
Hebrew Bible (Tanakh) with Unicode for Data Analysis
(2019)
Keith L. Yoder
Description
This database contains the entire Tanakh, or Hebrew Bible (HB), in a form suitable for data analysis. It is a modified and abbreviated version of what I have developed for use in my own studies. I originally received the data in the form of individual text files from CCAT in 2004, which I first imported into Microsoft Access and later into Microsoft Excel. The CCAT text was a special 1993 edition of the Westminster Hebrew Morphological Database version 1.0 (WHMB) and, to my knowledge, is no longer available online.
 
I have made corrections too numerous to count. Most importantly, I disambiguated the morphological coding of the 11,872 instances of ˀēt in the Tanakh between its use as direct object marker and as preposition, a major defect in the CCAT edition. In version 2.0 of April 2020, I also corrected over 25,000 erroneous suffix divisions, thus enabling correct Unicode rendering in the morpheme and word fields. The only change I have made in the actual text is the removal of the extraneous yod in the WHMB 1.0 qere form of the first word of Psalm 18:51 (מַגְדִּל/ MaG:D.iL), which had actually found its way into the printed 1999 JPS Tanakh (מַגְדִּיל / MaG:D.iYL).
 
Defects still remain in this present database, but I offer it to the interested public simply because anyone doing data analysis on the Hebrew Bible needs to have a complete digital version of the lemmatized and morphologically coded text. No available commercial software allows common users and researchers to manipulate the HB information for the needs of such studies. It has been said in the literary computing field that getting the text in a form suitable for computerized study is often 90% of the work; this project is an attempt to take care of that 90% hurdle.
 
All Hebrew text is here encoded in a custom transliteration format, as well as UTF-8 Hebrew Unicode. The transliteration, UTF-8, and original WLC 1.0 beta character codes are displayed in the CharacterCodes worksheet/table tab in the Excel file. For my own work, I continue to use the original beta code, as it is completely ASCII based and does not take up computer power on secondary issues such as RTL encoding. The transliteration scheme is orthographic rather than phonetic, using the SBL transliteration system as a base, but with modifications to eliminate ambiguities and to enable direct conversion to Unicode, if desired. My only nod to phonetics is that the patah furtive has its own transliteration character (â) and is displayed before rather than after the consonant to which it is attached.
 
The three table columns containing Hebrew Unicode text are correctly displayed using the SBL Hebrew font. Other fonts may or may not do as good a job as the SBL font in properly displaying all the text features, such as the patah furtive. That font may be freely downloaded from https://www.sbl-site.org/educational/BiblicalFonts_SBLHebrew.aspx.
 
I have left all compound words un-lemmatized as in WHMB 1.0, whether proper names or common nouns, approximately 839 records. In the transliteration fields, the individual components of each compound are connected by a tilde (~) unless a maqaf is already present; in the Unicode fields, the tilde is replaced by a simple space, to match the customary printed appearance. The Lemma field for all such records contains the three-character shin-shin-shin placeholder (שׁשׁשׁ / ŠŠŠ), an acronym for a triple repetition of "name" (שֵׁם/ ŠēM). The tMorph and tWord fields, however, do contain the appropriate prefix and suffix morpheme dividers for these compounds.
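As a minimal sketch of how the compound conventions above might be handled in code, the following Python snippet flags un-lemmatized compounds by the ŠŠŠ placeholder in tLemma and splits a transliterated compound on the tilde. The dict-per-record layout, the function names, and the sample values are assumptions made for illustration; they are not rows or tools taken from the database. Maqaf-joined compounds are not split here, since the transliteration character for maqaf is listed only in the CharacterCodes worksheet.

```python
import unicodedata

# Placeholder that the description says fills the Lemma field of every
# un-lemmatized compound (shin-shin-shin in transliteration).
PLACEHOLDER = unicodedata.normalize("NFC", "ŠŠŠ")


def is_compound(record):
    """Return True if the record's tLemma holds the compound placeholder."""
    lemma = unicodedata.normalize("NFC", record.get("tLemma") or "")
    return lemma == PLACEHOLDER


def tilde_components(t_text):
    """Split a transliterated compound on the tilde separator."""
    return [part for part in t_text.split("~") if part]


# Hypothetical record: the values are invented for illustration only.
sample = {"tMorph": "AAA~BBB", "tLemma": "ŠŠŠ", "WCnt": 2}

if is_compound(sample):
    print(tilde_components(sample["tMorph"]))
```

Normalizing to NFC before comparing is a precaution in case the placeholder is stored with combining marks rather than precomposed characters.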
 
 
Version History:
·         1.0, 22 Apr 2019 - original version
·         1.1, 15 Nov 2019 - minor corrections made for word 1 in Isaiah 1:1 and words 8-9 of 2 Kings 8:14, and a new Codes worksheet was added with a transliteration code table.
·         1.2, 24 Mar 2020 - an extra "4DH" column was added to the Torah table containing tags for each morpheme per the Documentary Hypothesis assignments from Richard Friedman's Who Wrote the Bible (1989) and The Bible with Sources Revealed (2003) as assembled and digitized by Christopher V. Kimball at this URL: https://www.tanach.us/Pages/DH.html. Also, a new WHMB_1.0_ReadMe worksheet was added, which contains the complete user guide published with the original version 1.0 of the WHMB as modified in 1993 for the CCAT format.
·         2.0, April 2020 - column names have been modified and three new columns were added with UTF-8 encoded text for morpheme, lemma, and full word.
·         2.1, May 2020 - Character Code table has been modified to include a column for the original WLC 1.0 Beta Code values.
 
This edition of the Tanakh is designed especially, though not exclusively, for data analysis. These four Excel tables in the download file contain the books of the Tanakh in traditional HB order:
Torah              Genesis - Deuteronomy
Former Prophets    Joshua - 2 Kings
Latter Prophets    Isaiah - Malachi
Writings           Psalms - 2 Chronicles
 
Each of the four Tanakh tables contains the following fields (columns):
·         ID - Identification Number for each record 1 through 425885
·         Ref - biblical Reference identifier in the format 11_Bbb_22:33.44.0, where 11 is the numerical order of the book (01 for Genesis, 02 for Exodus, etc.), Bbb is the abbreviation of the (English) book name, 22 is the chapter number, 33 the verse number, 44 the word number, and 0 is the morpheme index. The only exception to this format is in the Psalter, where both chapter and verse numbers are necessarily expressed in three-digit format. The base "word" is always designated by morpheme index .5. Inseparable prefixes are designated in order with digits less than .5, and suffixes (in this edition, primarily the Aramaic definite article and Hebrew directional -He) with ordered digits greater than .5. For example, the primary fields for the first orthographic word of Genesis 1:2 are distributed on three consecutive lines, where the Ref field entries end with morpheme indices of .3, .4, and .5, and the tWord field only contains an entry on the line with morpheme index .5:
    Ref                 tMorph    Code    tLemma    tWord
    01_Gn 01:02.01.3    W:        Pc      W
    01_Gn 01:02.01.4    Hā        Pa      H
    01_Gn 01:02.01.5    ˀāReṢ     ncbs    ˀeReṢ     W:\Hā\ˀāReṢ
 
·         uMorph - morpheme rendered in UTF-8 Hebrew characters
·         tMorph - morpheme rendered in transliteration characters; see worksheet CharacterCodes for the coding of all consonants (upper case) and vowels (lower case); non-lexical paragraph markers are designated as either P or S. See uWord and tWord below for the forward-slash and back-slash morpheme dividers.
·         Code - for the explanation of this morphological coding, see the worksheet labelled WHMB_1.0_ReadMe, beginning at line 233; non-lexical paragraph dividers are designated with morph Code x.
·         uLemma - dictionary form of morpheme in UTF-8 Hebrew
·         tLemma - dictionary form of morpheme in transliteration
·         uWord - full orthographic word in UTF-8 Hebrew; inseparable prefix morphemes are separated from base words by a forward slash "/", and suffix morphemes are separated by a backslash "\". In un-lemmatized compound names, the Hebrew definite article prefix is separated by a double slash ("//"), and the Aramaic definite article suffix is separated by a double backslash ("\\"). 
·         tWord - full orthographic word in transliteration; morpheme slash dividers are the mirror opposite of the Unicode fields, using a backslash for prefixes ("\") and a forward slash for suffixes ("/").
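The Ref format and the tWord divider convention described above can be sketched in Python as follows. This is an illustrative sketch, not part of the database: the function names are hypothetical, and the pattern deliberately accepts either an underscore or a space between the book abbreviation and the chapter, because the format description shows one and the Genesis 1:2 example shows the other. The doubled dividers used in un-lemmatized compound names are not handled.

```python
import re

# Book number, book abbreviation, chapter:verse, word number, morpheme index.
REF_PATTERN = re.compile(
    r"^(?P<book_no>\d+)_(?P<book>[^_\s]+)[_ ]"
    r"(?P<chapter>\d+):(?P<verse>\d+)\.(?P<word>\d+)\.(?P<morph>\d+)$"
)


def parse_ref(ref):
    """Split a Ref string into its parts and classify the morpheme index."""
    match = REF_PATTERN.match(ref.strip())
    if match is None:
        raise ValueError(f"unrecognized Ref: {ref!r}")
    parts = {key: int(value) if value.isdigit() else value
             for key, value in match.groupdict().items()}
    # Index 5 marks the base word; lower digits are prefixes, higher suffixes.
    if parts["morph"] < 5:
        parts["role"] = "prefix"
    elif parts["morph"] == 5:
        parts["role"] = "base"
    else:
        parts["role"] = "suffix"
    return parts


def split_tword(t_word):
    """Split a tWord entry into (prefixes, base, suffixes).

    In tWord, prefixes end with a backslash and suffixes begin with a
    forward slash; uWord uses the opposite characters.
    """
    *prefixes, rest = t_word.split("\\")
    base, *suffixes = rest.split("/")
    return prefixes, base, suffixes


print(parse_ref("01_Gn 01:02.01.3"))
print(split_tword(r"W:\Hā\ˀāReṢ"))
```

For the uWord field the same splitting logic would apply with the two divider characters swapped.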

These secondary fields are filled as follows for all records:
·         K/Q - a K in this field designates a ketiv reading, and Q a qere reading. A ketiv without a corresponding qere is designated as K~Q, and a qere without corresponding ketiv as Q~K; there are only 16 of these abbreviated features in the Tanakh. 
·         H/A - the language is annotated as either Hebrew or Aramaic; this field is left blank for records containing the non-lexical paragraph markers
·         WCnt - word count; this is normally 1, but will exceed that in cases of compound names, which use a "~" separator where no maqaf is used. Note that compound names are NOT lemmatized, regardless of separator, but the WCnt will contain the proper count of 2, 3, etc.
·         CCnt - count of all consonants (upper case characters); for now, non-lexical paragraph separators also have WCnt of 1.
·         WLC - special WHMB 1.0 notes (]1, ]2, ]3, etc.) explaining differences from BHS; see the explanations at the end of the WHMB_1.0_ReadMe worksheet.
·         HFBS - designates records containing miscellaneous material Header, Footer, Blessing, and Selah, which are so tagged for purposes of excluding/including such "editorial" material in data analysis, if desired; these designations are provisional and may not be complete in all cases.
·         4DH - this field is populated only in the Torah table, and contains the 4‑Document-Hypothesis tags as designated by Richard E. Friedman, and as assembled and digitized by C.V. Kimball (available online at https://www.tanach.us/Pages/DH.html)
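As one possible use of the secondary fields above, the following Python sketch selects the records for a single reading tradition (ketiv or qere) and optionally drops editorial material and paragraph markers. The dict-per-record layout, the function name, and the sample rows are assumptions for illustration; in particular, the "S" value in the HFBS column is a guess at how a Selah record might be tagged, and any non-empty HFBS value is treated as editorial.

```python
def select_reading(records, tradition="Q", include_editorial=False):
    """Keep records belonging to one reading tradition.

    tradition is "Q" (qere) or "K" (ketiv). Records with an empty K/Q
    field belong to both. "K~Q" (ketiv without qere) counts as ketiv only,
    and "Q~K" (qere without ketiv) counts as qere only.
    """
    if tradition not in ("K", "Q"):
        raise ValueError("tradition must be 'K' or 'Q'")
    selected = []
    for rec in records:
        kq = (rec.get("K/Q") or "").strip()
        if kq and not kq.startswith(tradition):
            continue  # belongs to the other tradition
        if not include_editorial and (rec.get("HFBS") or "").strip():
            continue  # header, footer, blessing, or selah material
        if rec.get("Code") == "x":
            continue  # non-lexical paragraph marker
        selected.append(rec)
    return selected


# Hypothetical rows: only the fields used by the filter are filled in.
rows = [
    {"ID": 1, "K/Q": "", "HFBS": "", "Code": "ncbs"},
    {"ID": 2, "K/Q": "K", "HFBS": "", "Code": "ncbs"},
    {"ID": 3, "K/Q": "Q", "HFBS": "", "Code": "ncbs"},
    {"ID": 4, "K/Q": "K~Q", "HFBS": "", "Code": "ncbs"},
    {"ID": 5, "K/Q": "Q~K", "HFBS": "", "Code": "ncbs"},
    {"ID": 6, "K/Q": "", "HFBS": "S", "Code": "ncbs"},  # assumed HFBS value
    {"ID": 7, "K/Q": "", "HFBS": "", "Code": "x"},
]

print([rec["ID"] for rec in select_reading(rows, "Q")])
print([rec["ID"] for rec in select_reading(rows, "K")])
```

Matching on the first character of the K/Q field keeps the four documented values (K, Q, K~Q, Q~K) in a single rule rather than listing each case.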
 
Like WHMB 1.0, this database does not analyze pronoun suffixes as separate morphemes. I recognize that as a shortcoming, which may be addressed in a future revision. Another shortcoming, although not usually an issue for data analysis, is that this database contains no accent or cantillation markers. The CCAT version of WLC 1.0 only used a "^" character to designate the presumed accented syllable. I may (or may not) incorporate this in future versions as accent markers ole (U+05AB, above the syllable) or mahapak (U+05A4, below the syllable, when an ole would conflict with an existing holam).

Finally, the WHMB "Readme" file, admittedly, does not serve well as a user guide for this project. I will formulate my own custom user guide in the next version of this database, as time allows. In the meantime, please direct any comments or error notices to my listed UMass email address, or alternatively to "keith-underline-yoder-at-yahoo-dot-com".
Disciplines
Biblical Studies
Publication Date
Spring 2019
Citation Information
Keith L. Yoder. "Hebrew Bible (Tanakh) with Unicode for Data Analysis" (2019)
Available at: http://works.bepress.com/klyoder/49/