Space 002
AI-driven podcast app designed to remix and preserve original human voices
ROLE
AI Web Developer
TEAM
Yuanqing Xie (Harvard)
TIMELINE
5 weeks (Apr – May 2025)
TOOLS
Overview
Traditional podcasts are constrained by linear, siloed audio files, limiting the emergent connections between voices.
In contrast, SPACE002 reimagines the archive as a dynamic convergence space—one where AI-driven procedural archiving samples and remixes speaker audio segments across episodes to form seamless, organic conversational units. By transcending episode boundaries, our goal is to surface fresh, collective insights and foster a continually evolving dialogue that celebrates diversity, interconnection, and the richness of multiple perspectives.
Problem Finding
In collaboration with the Institute for Black Imagination’s Podcast Archive, we encountered a vast dataset of invaluable human experiences and narratives that nonetheless felt disconnected. The existing interface failed to amplify the archive’s core message—to make Black voices heard and honor diverse perspectives—because it treated each episode as an isolated container.
Our research identified this siloed structure as a barrier to discovery: listeners couldn’t easily explore thematic or conversational threads that wove through different episodes. To address this, we sought a way to surface connections between speakers, topics, and moments of shared insight, ensuring that each voice could resonate beyond its original context.
Audio Processing
Our pipeline begins by downloading individual podcast episodes in their entirety, then converting the audio into a complete transcript. Next, we split each episode into sentence-level audio clips, enabling fine-grained analysis.
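The sentence-splitting step might look like the sketch below. It assumes a Whisper-style transcript made of timestamped segments; the `Segment` shape and the `split_into_sentence_clips` helper are illustrative, not the project's actual code. The resulting (start, end) windows would then be used to cut the episode audio into clips.

```python
# Illustrative sentence-level segmentation over a timestamped transcript.
# Assumption: the transcriber emits segments with start/end times in seconds.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds into the episode
    end: float
    text: str

def split_into_sentence_clips(segments):
    """Group transcript segments into sentence-level clips.

    Accumulates segments until one ends with terminal punctuation,
    yielding (start, end, sentence) tuples that can later be used
    to slice the episode audio.
    """
    clips, buf, clip_start = [], [], None
    for seg in segments:
        if clip_start is None:
            clip_start = seg.start
        buf.append(seg.text.strip())
        if seg.text.rstrip().endswith((".", "?", "!")):
            clips.append((clip_start, seg.end, " ".join(buf)))
            buf, clip_start = [], None
    if buf:  # trailing fragment without terminal punctuation
        clips.append((clip_start, segments[-1].end, " ".join(buf)))
    return clips
```

Sentence boundaries here are detected by terminal punctuation alone; a production pipeline would likely use a proper sentence tokenizer.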
An LLM then assigns a semantic keyword to every sentence—capturing themes like “community,” “resilience,” or “innovation”—and stores these labeled clips in a SQL database.
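A minimal sketch of the tagging-and-storage step, using `sqlite3` as a stand-in for whatever SQL backend the project uses: the table schema, column names, and the `tag_sentence` placeholder for the real LLM call are all assumptions.

```python
# Illustrative: tag each sentence clip with one semantic keyword
# and store the labeled clip in SQL. sqlite3 stands in for the
# project's actual database; the schema is an assumption.
import sqlite3

def tag_sentence(sentence: str) -> str:
    # Placeholder for the real LLM call (e.g. a chat-completion
    # request asking for a single thematic keyword).
    raise NotImplementedError

def store_clips(db_path, clips, tagger=tag_sentence):
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS clips (
               id INTEGER PRIMARY KEY,
               speaker TEXT, episode TEXT,
               start_s REAL, end_s REAL,
               sentence TEXT, keyword TEXT)"""
    )
    for c in clips:  # each clip: dict with speaker, episode, times, sentence
        conn.execute(
            "INSERT INTO clips (speaker, episode, start_s, end_s, sentence, keyword) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (c["speaker"], c["episode"], c["start_s"], c["end_s"],
             c["sentence"], tagger(c["sentence"])),
        )
    conn.commit()
    return conn
```

Injecting the tagger as a parameter keeps the (slow, paid) LLM call swappable for a cached or mocked version during development.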
To make this data accessible, we built a Flask API that accepts GET requests based on keywords or other metadata. This architecture turns a monolithic archive into a richly indexed library of atomic audio segments, each tagged with meaning and primed for dynamic recombination.
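A retrieval endpoint of this kind might be sketched as follows. The route name, query parameters, and clips schema are assumptions; only the general pattern (a Flask GET handler filtering a SQL table by keyword or speaker) follows the description above.

```python
# Illustrative Flask API: GET /clips?keyword=...&speaker=... returns
# matching labeled clips as JSON. Route and schema are assumptions.
import sqlite3
from flask import Flask, g, jsonify, request

app = Flask(__name__)
DB_PATH = "clips.db"

def get_db():
    if "db" not in g:
        g.db = sqlite3.connect(DB_PATH)
        g.db.row_factory = sqlite3.Row  # rows convertible to dicts
    return g.db

@app.route("/clips")
def clips_by_keyword():
    keyword = request.args.get("keyword")
    speaker = request.args.get("speaker")
    query, params = "SELECT * FROM clips WHERE 1=1", []
    if keyword:
        query += " AND keyword = ?"
        params.append(keyword)
    if speaker:
        query += " AND speaker = ?"
        params.append(speaker)
    rows = get_db().execute(query, params).fetchall()
    return jsonify([dict(r) for r in rows])
```

Parameterized queries keep user-supplied keywords out of the SQL string itself.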
1st Prototype: Round Table
Our first user-facing prototype, “Round Table,” embraces a literal, graphic approach to listening. The interface is centered on a circular network of persona nodes, each representing an individual speaker.
When a user hovers over a persona node, they hear that person’s voice in isolation; hovering over a connection line between two nodes plays a back-and-forth exchange focused on a shared keyword, chosen at random from their overlapping semantic tags. By removing friction—no clicks, only hovers—the experience feels fluid: users “listen” by simply passing their cursor over visual elements. The circular layout and animated connections convey the metaphor of voices in conversation, visually highlighting who is talking to whom and how ideas bridge across different episodes and contexts.
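The keyword selection behind a connection line can be sketched as below: intersect the two speakers' tag sets, pick one shared keyword at random, then pull one clip from each speaker under that keyword. The data shapes and function name are illustrative.

```python
# Illustrative: choose the audio for a connection line between two
# speaker nodes. Data shapes are assumptions:
#   tags:  {speaker: set of semantic keywords}
#   clips: {(speaker, keyword): [clip_id, ...]}
import random

def shared_exchange(speaker_a, speaker_b, tags, clips, rng=random):
    overlap = sorted(tags[speaker_a] & tags[speaker_b])
    if not overlap:
        return None  # no shared themes: draw no connection line
    keyword = rng.choice(overlap)
    return (keyword,
            rng.choice(clips[(speaker_a, keyword)]),
            rng.choice(clips[(speaker_b, keyword)]))
```

Sorting the overlap before the random choice keeps the selection reproducible when a seeded RNG is passed in.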
2nd Prototype: RAG Podcast App
Building on the Round Table concept, our second prototype blends retrieved audio clips into a more cohesive, podcast-like listening experience. Rather than hard jump cuts between isolated sentences, we feed the unabridged transcripts and their associated metadata into a large language model (LLM), asking it to generate logical transitions that maintain conversational flow.
The result is an AI-mediated remix: rather than hearing a disjointed sequence of clips, listeners experience a smooth dialogue that still preserves each speaker’s original voice and intent. By combining keyword-driven retrieval with semantic understanding, we elevate remixing from a patchwork of sentences to a thoughtfully constructed narrative that feels like a live discussion.
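The transition step might be sketched like this: for each pair of adjacent retrieved clips, ask the LLM for one short bridging line, then interleave bridges with the untouched originals. The prompt wording and the `generate` placeholder are assumptions, not the project's actual prompt or client.

```python
# Illustrative: weave LLM-written transitions between retrieved clips.
# `generate` is a placeholder for whatever LLM client is used; the
# prompt text is an assumption. Original clips pass through unchanged,
# preserving each speaker's voice and intent.
def build_transition_prompt(prev_clip, next_clip):
    return (
        "You are editing a podcast remix. Write one short, neutral "
        "transition sentence connecting these two excerpts without "
        "changing either speaker's meaning.\n"
        f"Excerpt 1 ({prev_clip['speaker']}): {prev_clip['sentence']}\n"
        f"Excerpt 2 ({next_clip['speaker']}): {next_clip['sentence']}\n"
    )

def remix(clips, generate):
    """Interleave original clips with generated transitions."""
    sequence = [clips[0]]
    for prev, nxt in zip(clips, clips[1:]):
        transition = generate(build_transition_prompt(prev, nxt))
        sequence.append({"speaker": "narrator", "sentence": transition})
        sequence.append(nxt)
    return sequence
```

Only the transitions are synthetic; every spoken clip in the output sequence is an original recording.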
Design:
Conversation Samples:
Reflections
As AI entertainment and media evolve—particularly with advances in voice cloning and higher-fidelity speech synthesis—questions arise about the role and value of authentically human voices.
In a future where AI-generated voices can perfectly mimic a speaker’s timbre and inflection, original recordings may serve not only as training data but also as a higher form of content, valued for their spontaneity, emotion, and context.
SPACE002 invites us to consider this tension: How do we honor the soul of a human voice in an era of synthetic alternatives, and what new forms of creativity emerge when we can weave authentic speech into ever-shifting conversational tapestries?