The dataset powering @poetrysearchbot: about 600,000 poems from 50,000 scanned books.

What is this place?

This is an online search engine for a large collection of poetry that was automatically extracted from digitally scanned books as part of my dissertation work. Since then, we've been working on curating and organizing this poetry collection.

Who is responsible for this dataset/site?

I'm John Foley, an Assistant Professor of Computer Science at Middlebury College. The Twitter bot that accesses this same data is a collaboration with my former student at Smith College, @SivanNachum.

Are searches logged?

Nope. You can search here in privacy. The one exception: if you label a document, the label is saved along with the query that led to it.

How did you find this poetry?

We have a machine-learned model for identifying pages of digitally scanned books that have poetry on them, and we're working on a model that can identify the specific lines that contain poetry. Our approach is described in my dissertation, as well as in my abstract and poster (#dh2020) submitted to the Digital Humanities Conference (DH2020).
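To give a feel for the page-level task: poetry pages tend to have short, variable-length lines that don't fill the page width, while prose pages have long, uniform lines. The sketch below is purely illustrative and is not our actual model (the feature set, thresholds, and function names here are invented for this example); the real classifier is linked below.

```python
import statistics

def poetry_features(page_text):
    """Extract simple layout features from a page of OCR text.
    These are toy features chosen for illustration: poetry pages
    typically have shorter lines with more length variation than prose."""
    lines = [line for line in page_text.splitlines() if line.strip()]
    if not lines:
        return {"avg_len": 0.0, "len_stdev": 0.0, "short_frac": 0.0}
    lengths = [len(line) for line in lines]
    return {
        "avg_len": statistics.mean(lengths),
        "len_stdev": statistics.pstdev(lengths),
        # fraction of lines shorter than a typical prose line
        "short_frac": sum(1 for n in lengths if n < 45) / len(lengths),
    }

def looks_like_poetry(page_text, max_avg_len=50, min_short_frac=0.6):
    """Crude threshold rule standing in for a trained classifier.
    A real model would learn weights over features like these."""
    f = poetry_features(page_text)
    return 0 < f["avg_len"] < max_avg_len and f["short_frac"] >= min_short_frac
```

In practice a trained model learns how to weigh many such signals (plus the words themselves) rather than relying on hand-picked thresholds.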

The page model is open-source and available on GitHub: Our Poetry Identification Model.

Can I download the data or code for this?

The raw poetry data is available online in JSONL (JSON Lines) format: the Poetry50k dataset.
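A JSON Lines file stores one JSON object per line, so you can stream the dataset record by record without loading it all into memory. A minimal sketch of reading it (the field names "id" and "text" here are hypothetical examples, not the dataset's actual schema):

```python
import io
import json

def read_jsonl(lines):
    """Yield one parsed record per non-empty line of a JSON Lines source."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Example input standing in for an open file; with the real dataset you
# would use: with open("poetry50k.jsonl") as f: ...
sample = io.StringIO(
    '{"id": 1, "text": "Shall I compare thee to a summer\'s day?"}\n'
    '{"id": 2, "text": "Two roads diverged in a yellow wood,"}\n'
)
records = list(read_jsonl(sample))
```

Because each record is its own line, tools like grep and split work on the file directly, and a parse error in one record doesn't break the rest.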

I have much more poetry data now. If you want to work with me to study a larger segment of it, or to help curate it, get in touch and we can figure out how to get you the data you need: johnf@middlebury.edu.