Project | Davide Romano

Voices of the Manhattan Project

Key Takeaways

Implementation of topic modeling and NER on a historical dataset containing 600+ interview transcripts of people who experienced the Manhattan Project.

Detection of key individuals and significant events recalled by interviewees, including some lesser-known figures from history.

Creation of a "gazette" linking each individual to their most relevant topic.

Analysis and interpretation of the results, taking into consideration historical and psychological aspects.

This project was completed for the course History and the Digital, taught by Professor Jerome Baudry.

Oral histories

"Communicative memory" is people's everyday tales and talk about an event. It is messy, partial, changeable, and shaped by psychology and society. It lasts as long as those who witnessed the event are alive, about 80-100 years. Since we are in the 2020s, we are nearing this limit for the events of the Manhattan Project, most of which took place in 1941-1945.

"Cultural memory" covers concrete aspects of a society's culture like texts, rituals, art, and monuments. These are made on purpose to remember and bring back important past events shared by all.

These two terms were coined by Assmann in his work on oral memory and history.

Research questions

Based on Assmann's theories, we defined our two main research questions as:

1 - What topics and individuals are retained in the communicative memories of the Atomic Heritage Foundation (AHF) oral histories?

2 - How are these topics and individuals related to each other within the context of the AHF oral histories?

Secondary literature

To better understand the context of our study, we read various sources about:

1 - Assmann’s work on communicative/cultural memory

2 - The concept of "conversational remembering" and the factors that shape the dynamics of recalling events

Data

Our research analyzes over 600 interview transcripts from the Voices of the Manhattan Project, managed by the AHF and the Los Alamos Historical Society.

Limitations in the data

Despite the richness of the data, the sample is small relative to the more than 150,000 people who worked on the project and may not fully represent their diversity. The reliance on memories from over 60 years ago also introduces potential biases. Additionally, the AHF's emphasis on personal stories could skew the cultural context of the interviews.

Methods

We scraped the data from the web, then applied a basic NLP pipeline for text pre-processing.
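
The exact pre-processing steps are not listed here, so the following is only a minimal sketch of what a basic pipeline could look like, assuming lowercasing, regex tokenization, and stop-word removal. The stop-word list, the `min_length` parameter, and the function name `preprocess` are illustrative assumptions, not the project's actual code.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "was", "we", "i", "it", "that"}

# Matches runs of lowercase letters, optionally with one internal apostrophe ("didn't").
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text, min_length=3):
    """Lowercase the text, tokenize it, and drop stop words and very short tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= min_length]


sample = "We worked at Los Alamos in 1943, and it was secret."
print(preprocess(sample))
```

This kind of normalization suits topic modeling. Named entity recognition is usually run on the original, cased text instead, since capitalization helps the model spot names.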

Topic modeling: We applied Latent Dirichlet Allocation (LDA) using the Gensim library.
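
As a rough illustration of this step, the sketch below fits an LDA model with Gensim on already tokenized documents. The hyperparameters (`passes`, `seed`) and the helper names `fit_lda`, `describe_topics`, and `dominant_topic` are assumptions made for this example; the default of 7 topics simply mirrors the number of topics reported in the results. Gensim is imported inside the function so the helpers can be read and reused without it.

```python
def fit_lda(tokenized_docs, num_topics=7, passes=10, seed=0):
    """Fit an LDA model on a list of token lists (requires the gensim package)."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,  # fixed seed so repeated runs are comparable
    )
    return lda, dictionary, corpus


def describe_topics(lda, topn=10):
    """Map each topic id to its top words, using LdaModel.show_topic."""
    return {
        topic_id: [word for word, _ in lda.show_topic(topic_id, topn=topn)]
        for topic_id in range(lda.num_topics)
    }


def dominant_topic(topic_weights):
    """Pick the topic id with the highest weight from (topic_id, weight) pairs,
    the format returned by LdaModel.get_document_topics."""
    return max(topic_weights, key=lambda pair: pair[1])[0]
```

With a fitted model, `dominant_topic(lda.get_document_topics(bow))` gives one topic label per interview, which is one simple way to summarize a transcript.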

Person entity extraction: We used spaCy for Named Entity Recognition (NER) to detect individuals mentioned in the interviews.
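
A minimal sketch of person extraction with spaCy follows. It assumes an English pipeline such as `en_core_web_sm` is installed; the model choice and the helper names `extract_people` and `count_people` are assumptions, not details taken from the project.

```python
from collections import Counter


def extract_people(doc):
    """Return the text of every entity labeled PERSON in a spaCy Doc."""
    return [ent.text.strip() for ent in doc.ents if ent.label_ == "PERSON"]


def count_people(texts, nlp):
    """Count PERSON mentions across many texts using a loaded spaCy pipeline."""
    counts = Counter()
    for doc in nlp.pipe(texts):  # nlp.pipe processes texts in batches
        counts.update(extract_people(doc))
    return counts


# Example usage (needs spaCy and a downloaded model, so it is left commented out):
# import spacy
# nlp = spacy.load("en_core_web_sm")
# print(count_people(["Enrico Fermi met Leslie Groves in Chicago."], nlp))
```

Raw counts produced this way still contain spelling variants and partial names, which is why a disambiguation step is needed afterwards.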

We tested various techniques for entity disambiguation, including Soundex and string matching, followed by the name linking pipeline described here:
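
The linked pipeline itself is not reproduced on this page. Separately, as a rough illustration of the Soundex and string matching ideas mentioned above, the sketch below implements standard American Soundex and combines it with a similarity ratio from Python's `difflib`. The 0.8 threshold and the helper name `likely_same_surname` are assumptions chosen for the example, not values from the project.

```python
from difflib import SequenceMatcher

# Standard American Soundex digit groups.
_SOUNDEX_CODES = {}
for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in group:
        _SOUNDEX_CODES[letter] = str(digit)


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if it has no ASCII letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (result + "000")[:4]


def likely_same_surname(a, b, threshold=0.8):
    """Heuristic: same Soundex code and a high character-level similarity."""
    if soundex(a) != soundex(b):
        return False
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= threshold


print(soundex("Oppenheimer"), soundex("Openheimer"))
print(likely_same_surname("Oppenheimer", "Openheimer"))
```

Soundex catches phonetic misspellings cheaply, while the similarity ratio guards against unrelated names that happen to share a code.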

Results

We identified 7 main topics that not only include scientific and technical aspects, but also valuable insights into the interviewees’ daily lives, familial experiences, and the political context of this event.

The most notable individuals were well-known figures such as Robert Oppenheimer, Leslie Groves, Enrico Fermi, and Edward Teller.

32% of all mentions belong to only 1% of the recognized names (50 names). One possible hypothesis is related to the constraints of conversational remembering. As time passes, famous individuals often receive more comments and questions from interviewees, which keeps them even more firmly in the collective memory.
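
A concentration figure of this kind can be computed from a table of mention counts. If 50 names make up 1% of the total, there are roughly 5,000 distinct recognized names. The sketch below shows the calculation on made-up toy counts; the function name `top_share` and the data are illustrative assumptions, not the project's numbers.

```python
import math
from collections import Counter


def top_share(mention_counts, top_fraction=0.01):
    """Share of all mentions that go to the most-mentioned `top_fraction` of names."""
    ranked = sorted(mention_counts.values(), reverse=True)
    k = max(1, math.ceil(len(ranked) * top_fraction))  # at least one name
    return sum(ranked[:k]) / sum(ranked)


# Toy data (made up): 4 names, so the top 25% is the single most-mentioned name.
toy_counts = Counter({"Person A": 6, "Person B": 2, "Person C": 1, "Person D": 1})
print(top_share(toy_counts, top_fraction=0.25))
```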

In the top 100 individuals mentioned in the interviews we also noted the presence of some interesting exceptions, such as military veterans, a dozen Project workers/staff, and a few women scientists.

Finally, we also created a "gazette" in Table II that associates each person with a topic and a list of words.
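
How the gazette was actually assembled is not detailed here. One plausible approach, sketched below under that assumption, is to give each person the topic that most often dominates the interviews mentioning them, and to attach that topic's top words. The function name `build_gazette`, the input layout, and the toy topic words are all hypothetical.

```python
from collections import Counter, defaultdict


def build_gazette(doc_people, doc_topics, topic_words):
    """Link each person to the most frequent dominant topic of the interviews
    that mention them.

    doc_people:  list of sets of names, one set per interview
    doc_topics:  list of dominant topic ids, one per interview
    topic_words: dict mapping topic id to its list of top words
    """
    votes = defaultdict(Counter)
    for people, topic in zip(doc_people, doc_topics):
        for person in people:
            votes[person][topic] += 1

    gazette = {}
    for person, counter in votes.items():
        topic, _ = counter.most_common(1)[0]  # ties go to the topic seen first
        gazette[person] = {"topic": topic, "words": topic_words[topic]}
    return gazette


# Toy inputs (made up for illustration only).
doc_people = [{"Enrico Fermi"}, {"Enrico Fermi", "Leslie Groves"}, {"Leslie Groves"}, {"Enrico Fermi"}]
doc_topics = [0, 1, 1, 0]
topic_words = {0: ["reactor", "physics"], 1: ["army", "security"]}
print(build_gazette(doc_people, doc_topics, topic_words))
```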

What I've learned & Challenges

(Historical) research is hard: I discovered how rigorous one has to be before concluding anything.

Let the dataset speak: It's tempting to focus on methods and algorithms, hoping they align perfectly, but in reality, this works out only 1% of the time.

As the project evolved, we had to discard code and results because they ignored crucial assumptions or tried to address questions that were unanswerable or historically obvious.

We had to ask ourselves: which questions are worth answering with this dataset?

Read, read, read: with my background in computer science, I entered the new realm of humanities and historical research. I found myself reading a lot more material, documents and literature.

Future improvements

Studying the dynamics of memories over time: Comparing interviews conducted during different periods can provide valuable insights into the evolution of memories over time.

Exploring the exceptions: investigating further the individuals who are less recognized in historical archives.

Exploring the possible bias in the questions: In order to gain a more comprehensive understanding of our results, it would be beneficial to examine the formulation and content of the questions.

Team: Davide Romano, Cindy Tang, Junzhe Tang

Author: Davide Romano