
HathiTrust and You: 2020

Data-driven Research using Text Mining
URL: https://guides.lib.uci.edu/htrc

Winter Quarter

Date & Time 

Every other Tuesday, 3:00-4:30pm, starting Jan. 28

Schedule

Session 1 - Intro to Text Mining and HathiTrust: Jan. 28 

Session 2 - Gather Textual Data and Use Text Analysis Tools in HathiTrust: Feb. 11 

Session 3 - Gather, Process, and Analyze Textual Data on the Web: Feb. 25

Session 4 - HathiTrust Data Capsule Service, Visualize Textual Data: Mar. 10

Location

Multimedia Resources Center, Room 164 (First Floor, Science Library)

Fall Quarter

Date & Time 

Oct. 13-16, 2:00-3:30pm

Schedule

Session 1 - Intro to Text Mining and HathiTrust: Oct. 13

Session 2 - Gather Textual Data and Use Text Analysis Tools in HathiTrust: Oct. 14

Session 3 - Gather, Process, and Analyze Textual Data on the Web: Oct. 15

Session 4 - HathiTrust Data Capsule Service, Visualize Textual Data: Oct. 16

Location

Online via Zoom

Lessons & Resources

Session 1

This session addresses the following questions:

  • What is text mining? What can it do?
  • What are common text mining concepts and terminology?
  • What are common text mining methods?
  • What is the HathiTrust Digital Library?
  • What is the HathiTrust Research Center?
  • What is "non-consumptive" research?

Participants will start framing their text mining projects in relation to material and tools available in HathiTrust.

Lesson (Winter Quarter): Intro to Text Mining and HathiTrust

Lesson (Fall Quarter): Intro to Text Mining and HathiTrust

Handout: Text Mining Common Methods and Definitions

Handout: HathiTrust Info Sheet for Faculty and Researchers

Handout: HathiTrust Digital Library Access and Search Tips

Resource: U.S. Federal Documents in HathiTrust: A Collection Profile

Session 2

This session addresses the following questions:

  • How do I conduct research within HathiTrust?
  • What is a workset and how do I create it?
  • How do I create a tag cloud based on word count within HathiTrust?
  • What is named entity recognition (NER) and how do I do it within HathiTrust?
  • What is topic modeling and how do I do it within HathiTrust?

Participants will gain hands-on experience searching for material in the HathiTrust Digital Library and using web-based text analysis tools in the HathiTrust Research Center.

Lesson (Winter Quarter): Gather Textual Data and Use Text Analysis Tools in HathiTrust

Lesson (Fall Quarter): Gather Textual Data and Use Text Analysis Tools in HathiTrust

Session 3

This session addresses the following questions:

  • What are the ways to retrieve information from the Web in bulk?
  • What is "web scraping" and how do I do it?
  • What is an API and how do I use it?
  • What is "text as data" and what is involved in it?
  • What is Python and how can I use it in text mining?
  • What is PythonAnywhere and how do I use it?
  • What is the HTRC Extracted Features dataset and how can I use it?

Participants will gain hands-on experience using APIs and running commands from the command line for web scraping, text preprocessing, and text analysis.
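To give a feel for what "text as data" can look like in Python, here is a minimal sketch that is not taken from the workshop lessons. It pulls the visible text out of a small hard-coded HTML snippet, lowercases and tokenizes it, drops a few stopwords, and counts the remaining words. The sample HTML, the tiny stopword list, and the helper names (TextExtractor, html_to_text, word_counts) are all assumptions made for illustration; only the Python standard library is used, and a real project would likely fetch pages over the network and use a fuller stopword list.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def word_counts(text):
    """Lowercase, tokenize on letters/apostrophes, drop stopwords, count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


# Hypothetical sample page standing in for scraped content.
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Text Mining</h1><p>Text mining turns text into data.</p>"
    "<script>var x = 1;</script></body></html>"
)

text = html_to_text(SAMPLE_HTML)
counts = word_counts(text)
print(counts.most_common(3))  # [('text', 3), ('mining', 2), ('turns', 1)]
```

The same three steps (extract, preprocess, count) apply whether the text comes from a scraped page, an API response, or a downloaded file; only the first step changes.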

Lesson (Winter Quarter): Gather, Process, and Analyze Textual Data on the Web

Lesson (Fall Quarter): Gather, Process, and Analyze Textual Data on the Web

Handout: Work with Textual Data

Activity: Download activity_files.zip

Link: PythonAnywhere

Session 4

This session addresses the following questions:

  • What is the HathiTrust Data Capsule service and what are the benefits of using it?
  • How can I use the Data Capsule for "non-consumptive" research?
  • What are the different ways to visualize textual data?
  • What tools are available for textual data visualization?
  • What is HathiTrust Bookworm and how can I use it for my research?
  • What is Google Ngram Viewer and how can I use it for my research?

Participants will gain hands-on experience with the HathiTrust Data Capsule service and learn about textual data visualization in depth.

Lesson (Winter Quarter): HathiTrust Data Capsule Service, Visualize Textual Data

Lesson (Fall Quarter): HathiTrust Data Capsule Service, Visualize Textual Data