Google Books Dataset

Temporary Downtime!

The subset generator tool described below has been taken down temporarily for maintenance and an upgrade. The underlying files are still available via the browsing methods described below. Please contact Devin Higgins with any questions.

Instructions & Guidelines

The Google Books Dataset (GDS) is a collection of scanned books totaling approximately 3 million volumes of text, or 2.9 terabytes (2,970 gigabytes) of data. The books included in the dataset are public-domain works digitized by Google and made available by the HathiTrust Digital Library.

The instructions below may help answer some questions about how to access and use the dataset. If you have any questions, please don't hesitate to contact Devin Higgins at the MSU Libraries.

Guidelines for Use

The dataset is not meant to be used as a source of reading material, but rather as a linguistic corpus for text mining or other "non-consumptive" research, that is, research conducted by computational methods that does not reproduce significant portions of text for personal or public display. The terms of the contract with Google that make this corpus available strictly prohibit publishing the texts that comprise the dataset.

Important: If you plan to present work publicly that makes use of data gathered through MSU's Google Dataset, or if you intend to use this data in the classroom, please contact Devin Higgins to complete a required registration form before doing so.

The dataset is available for download, in full or in part, by on-campus users. Authorized MSU faculty and staff may also access the dataset while off campus by connecting to the campus VPN.

Accessing Text

There are several ways to access the Google Dataset.

If you wish to access the collection directly, rather than through the search function, the following information explains how.

You may notice, however, that the dataset directory structure is not well organized for browsing individual items. To make more efficient use of the dataset, it will be useful (necessary, even) to build a smaller corpus of texts based on the research questions being addressed. To make this possible, the GDS comes with bibliographic metadata for every book, compiled in XML and archived in a file called meta.tar.gz in the dataset's top-level directory.

Unzipping this file reveals several very large .xml files that contain the metadata for each volume in the dataset, in no particular order. Each file contains millions of lines of metadata in a format derived from each book's original MARC record. Because these metadata files are large and unordered, they will be most useful when building a corpus programmatically, that is, by means of a script that reads through the files and collects the unique identifiers for books that match certain criteria. For instance, you may wish to find all works of French fiction from the 1960s. These unique IDs provide the pathway to download the full text of every work in such a category.
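As a starting point, the archive can be unpacked with a few lines of Python. This is only a sketch; the "metadata" output directory is an arbitrary choice, and the script assumes meta.tar.gz is in the current working directory:

import tarfile

with tarfile.open("meta.tar.gz", "r:gz") as archive:
    # Report the .xml members so you can see what the archive contains before extracting.
    for member in archive.getmembers():
        if member.name.endswith(".xml"):
            print(member.name, member.size)
    archive.extractall("metadata")  # unpack everything into a local "metadata" directory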

The unique ID that appears in each metadata record looks like this:

<datafield tag='974' ind1=' ' ind2=' '>
     <subfield code='a'>mdp.39015005109353</subfield>
</datafield>
The code between the subfield tags, mdp.39015005109353, is the unique identifier for the text. Every ID listed in this format corresponds to a full-text volume contained in the Google Dataset. Each of these IDs can also be found at HathiTrust within the URL of the "full view" page for each book:
http://babel.hathitrust.org/cgi/pt?id=mdp.39015005109353;view=1up;seq=1
The value of the id parameter in that URL (mdp.39015005109353) is the unique ID. Please note that a unique ID found at HathiTrust can be used to locate downloadable files within the GDS only if it is a Google-digitized public-domain volume.
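As an illustration of the kind of script mentioned above, the following Python sketch streams through one metadata .xml file and collects every unique ID found in a 974 subfield "a", as in the record shown earlier. The file name meta_01.xml is a placeholder for whichever files you extracted from meta.tar.gz, and any filtering by language, date, or genre would need to be added by inspecting the other MARC fields in each record:

import xml.etree.ElementTree as ET

def collect_ids(xml_path):
    """Collect unique volume IDs (datafield 974, subfield 'a') from one metadata .xml file."""
    ids = []
    # iterparse streams the file, which keeps memory use manageable for very large files
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        # match the datafield shown above, ignoring any XML namespace prefix
        if elem.tag.endswith("datafield") and elem.get("tag") == "974":
            for subfield in elem:
                if subfield.tag.endswith("subfield") and subfield.get("code") == "a":
                    ids.append(subfield.text)
            elem.clear()  # discard the element once its ID has been read
    return ids

volume_ids = collect_ids("meta_01.xml")  # placeholder file name
print(len(volume_ids), "volume IDs found; first few:", volume_ids[:5])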

These IDs provide the pathway through the GDS file system to the appropriate files. The following section on "structure" will explain how.

Structure of the Google Dataset

The texts that comprise the Google Dataset are organized in a very particular way. Knowing the system will make working with it much easier.

Each text is assigned a unique ID by the institution that scans it. These IDs are used in turn to build the directory structure that houses the texts. For instance, a volume with the unique ID of uc1.b4101647 can be found in the following directory:

uc1/pairtree_root/b4/10/16/47/b4101647
The segment before the dot, "uc1" in the example above, will always indicate the institution that digitized the volume, in this case the University of California. This institutional code will be the top-level directory for all of the documents scanned by that institution. Each institutional folder will then have a sub-folder called pairtree_root, inside of which is a series of subfolders named for each pair of characters in the ID. In the example above, you can see that b4101647 is broken into four two-character pairs, which create a trail leading to the item itself. Since the structure is completely regular, it is always possible to move back and forth between the unique ID and the appropriate volume in the directory.
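Because the rule is regular, a volume's path can be computed from its ID. The following Python sketch assumes the dataset is mounted at some root directory (gds_root, a placeholder) and that IDs contain only letters, digits, and a single dot separating the institutional code from the local ID:

import os

def pairtree_path(volume_id, gds_root="."):
    """Build the directory path for a volume from its unique ID, e.g. 'uc1.b4101647'."""
    institution, local_id = volume_id.split(".", 1)
    # split the local ID into successive pairs of characters: b4101647 -> b4, 10, 16, 47
    pairs = [local_id[i:i + 2] for i in range(0, len(local_id), 2)]
    return os.path.join(gds_root, institution, "pairtree_root", *pairs, local_id)

print(pairtree_path("uc1.b4101647"))  # ./uc1/pairtree_root/b4/10/16/47/b4101647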

At the end of each directory path, there are two files: a zipped file containing one .txt file for each page in the scanned volume, and an .xml file containing technical metadata for that volume. (Some of these .xml files will also contain descriptive metadata about author, title, publisher, etc., but this feature is unfortunately inconsistent.)
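To work with a volume's text, the page-level .txt files can be read directly out of that zipped file. The sketch below makes no assumption about the zip file's exact name; it simply looks for any .zip file in the volume directory and joins its page files in name order:

import glob
import os
import zipfile

def read_volume_text(volume_dir):
    """Join the page-level .txt files inside a volume's zip into one string."""
    zips = glob.glob(os.path.join(volume_dir, "*.zip"))
    if not zips:
        raise FileNotFoundError("no zip file found in " + volume_dir)
    pages = []
    with zipfile.ZipFile(zips[0]) as archive:
        for name in sorted(archive.namelist()):  # page files joined in name order
            if name.endswith(".txt"):
                pages.append(archive.read(name).decode("utf-8", errors="replace"))
    return "\n".join(pages)

text = read_volume_text("uc1/pairtree_root/b4/10/16/47/b4101647")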

Descriptive metadata, including information about the volume's author, title, and publisher, is found in the root directory within an archived file called meta.tar.gz. Unzipping this metadata file reveals a number of .xml files, each of which contains bibliographic metadata taken from the original library MARC record.

Analyzing Text

Once the corpus is in a workable format, the text is ready to be analyzed. There are myriad tools for this purpose, many of which are catalogued at Bamboo DiRT.
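As one very small example of non-consumptive analysis using only the Python standard library, the sketch below counts word frequencies in a volume's text. The file volume.txt is a placeholder for a plain-text file assembled from a volume's pages:

import re
from collections import Counter

with open("volume.txt", encoding="utf-8") as f:  # placeholder path
    text = f.read()

words = re.findall(r"[a-z]+", text.lower())  # crude tokenization into lowercase words
print(Counter(words).most_common(20))        # the twenty most frequent words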