Semantic Analysis Applied to "Sent From My Telephone" by Voice Actor
whisper, embeddings, python, Audio Analysis, semantic embeddings, music, Voice Actor, Experimental Music, Digital Humanities
Introduction
Spotify sometimes shuffles in the right direction when an album ends (that’s how I discovered Sent From My Telephone by Voice Actor). And sometimes I also sleep relatively well (that’s why I had time and energy to analyze it).
I noticed something in the monologues spread across its 100+ tracks: the album reads more like a log of emotions than a conventional record. Maybe there was even a linear story hidden inside it.
Since Kurzweil (the artist behind Voice Actor) released the album in alphabetical order, it was an ideal candidate to be reassembled. I couldn't resist, so I decided to buy it (so I could have every track in a separate file), treat it as a fragmented log system, and see what kind of signal I could recover.
I asked the oracle 🧞 why Voice Actor's Bandcamp page cites Martha C. Nussbaum's Upheavals of Thought. Since that book explores emotion as a form of intelligence and self-knowledge, the reference suggests that the album, too, treats feeling as structure: an attempt to think through emotion rather than perform it.
Perhaps taking an objective / systematic / procedural approach isn’t missing the point at all. It might be a sensible way to explore a four-hour dataset of dreams and emotions.
Collection and Preparation
- I bought the album on Bandcamp and downloaded the tracks as AAC (`.m4a`) files
- Renamed and normalized the file names with PowerShell
```powershell
# $src and $dst hold the source and destination folder paths
Get-ChildItem -Path $src -Filter *.m4a | ForEach-Object {
    # Replace anything that isn't a word character or hyphen with "_"
    $newName = $_.BaseName -replace '[^\w\-]', '_'
    $newPath = Join-Path $dst ($newName + $_.Extension)
    Move-Item $_.FullName $newPath
}
```
- Transcribed all 108 tracks with Whisper (medium) in Python

```python
result = model.transcribe(audio_file, word_timestamps=True, fp16=False, language="en")
```
- Each transcription was saved as a JSON file in a `speeches/` directory (a minimal sketch of the whole loop follows this list). You won't find these files in my GitHub repository anymore.
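
For reference, here is a minimal end-to-end sketch of that transcription loop. The `tracks` input folder name is my assumption; the `speeches/` output directory matches the one used in the next step.

```python
import json
from pathlib import Path

import whisper

src = Path(r"C:\temp\reconstructingMyTelephone\tracks")    # assumed input folder
dst = Path(r"C:\temp\reconstructingMyTelephone\speeches")
dst.mkdir(exist_ok=True)

model = whisper.load_model("medium")

for audio_file in sorted(src.glob("*.m4a")):
    result = model.transcribe(str(audio_file), word_timestamps=True, fp16=False, language="en")
    # Whisper's result dict includes the full "text" plus per-segment timestamps.
    with open(dst / (audio_file.stem + ".json"), "w", encoding="utf-8") as f:
        json.dump(result, f, ensure_ascii=False, indent=2)
```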
The transcripts are full of the small (and sometimes not-so-small) glitches typical of speech-to-text software, but they generally preserve the meaning and intent.
Semantic Extraction
- Processed each `text` field with sentence-transformers (all-MiniLM-L6-v2).
```python
from pathlib import Path
import json

import numpy as np
from sentence_transformers import SentenceTransformer

src = Path(r"C:\temp\reconstructingMyTelephone\speeches")
dst = Path(r"C:\temp\reconstructingMyTelephone\embeddings")
dst.mkdir(exist_ok=True)

model = SentenceTransformer("all-MiniLM-L6-v2")

for json_file in src.glob("*.json"):
    with open(json_file, "r", encoding="utf-8") as f:
        data = json.load(f)
    text = data.get("text", "").strip()
    if not text:
        continue  # skip empty transcripts
    # Normalized embeddings let a plain dot product act as cosine similarity later.
    emb = model.encode(text, normalize_embeddings=True)
    out_path = dst / (json_file.stem + ".npy")
    np.save(out_path, emb)
    print(f"Saved {out_path.name} ({emb.shape})")
```
This saved one .npy embedding per track.
Ordering by Similarity
- Computed a cosine similarity matrix (`np.dot(embeddings, embeddings.T)`; this equals cosine similarity because the embeddings were L2-normalized)
- Used a simple greedy heuristic: start from one track and always jump to the most similar unvisited one (see the sketch after this list)
- Saved the resulting order to `semantic_order.csv` and manually recreated it as a Spotify playlist
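
A minimal sketch of that greedy walk, assuming the `embeddings/` folder from the previous step; starting at index 0 is an arbitrary choice here.

```python
from pathlib import Path

import numpy as np

emb_dir = Path(r"C:\temp\reconstructingMyTelephone\embeddings")
files = sorted(emb_dir.glob("*.npy"))
embeddings = np.stack([np.load(f) for f in files])

# Embeddings are L2-normalized, so the dot product is cosine similarity.
sim = embeddings @ embeddings.T

# Greedy nearest-neighbor walk: always hop to the most similar unvisited track.
order = [0]
visited = {0}
while len(order) < len(files):
    row = sim[order[-1]].copy()
    row[list(visited)] = -np.inf  # mask already-visited tracks
    nxt = int(np.argmax(row))
    order.append(nxt)
    visited.add(nxt)

print([files[i].stem for i in order])
```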
Validation with a Self-Similarity Matrix (SSM)
Visualized the similarity matrix using Matplotlib.
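
Something like the following, reusing `sim` and `order` from the sketch above:

```python
import matplotlib.pyplot as plt
import numpy as np

reordered = sim[np.ix_(order, order)]  # rows/cols rearranged into playlist order

fig, ax = plt.subplots(figsize=(8, 8))
im = ax.imshow(reordered, cmap="magma")
ax.set_xlabel("track (reordered)")
ax.set_ylabel("track (reordered)")
fig.colorbar(im, label="cosine similarity")
plt.show()
```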
The diagonal blocks can be interpreted as “narrative chapters” (clusters of voices and moods).
Notes on Sorting the Embeddings
That first attempt to order the embeddings used a greedy nearest-neighbor approach: start from one track and always jump to the most similar one not yet visited. It is intuitive, almost simple, but I learned that this algorithm only makes local decisions: it optimizes the next step (or track, in this case) and nothing beyond it. The result looks fine at first but gradually breaks down: as the algorithm runs out of nearby points, the remaining ones become progressively unrelated, creating a high-entropy tail where similarity collapses. It is as if someone kept picking 'nice things' out of a box of assorted things, where 'niceness' is the proximity of the next embedding: the longer you pick, the farther apart the embeddings left in the box become.
Then I tried to remember something (anything) from my Signals and Systems classes. Of course I remembered almost nothing, but I firmly knew this was a known problem. Again I asked the oracle 🧞 what I could do in this situation, and the robot answered: recasting the problem as a Traveling Salesman Problem (TSP) would change the picture.
Instead of linking neighbors greedily, the solver minimizes the total distance across all tracks. It looks at the whole map at once, producing a globally smoother “semantic path” where similar ideas remain contiguous from start to finish.
In signal-processing terms, the greedy order behaves like a locally stable but globally unstable filter: it drifts. The TSP approach enforces global phase coherence, reducing that drift and flattening the noise floor of the sequence.
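
To make that concrete, here is a minimal sketch of one way to do it: treat the sequence as an open path and refine the greedy tour with a hand-rolled 2-opt pass. This is a stand-in for a proper TSP solver (a library like python-tsp or OR-Tools would do this better), reusing `sim` and `order` from earlier.

```python
import numpy as np

# Cosine distance; valid because the embeddings are normalized.
dist = 1.0 - sim

def path_length(path):
    return sum(dist[path[i], path[i + 1]] for i in range(len(path) - 1))

def two_opt(path):
    """Reverse sub-segments as long as doing so shortens the open path."""
    best = list(path)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(candidate) < path_length(best):
                    best, improved = candidate, True
    return best

tsp_order = two_opt(order)  # refine the greedy order from above
```

For 108 tracks this brute-force variant finishes quickly; for larger sets a dedicated solver is the better choice.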
Then our SSM using the new sorting looks like this:
Note: while the Δ-distance plot captures local continuity between consecutive tracks, the Self-Similarity Matrix provides a global representation of structural relationships, as it considers all pairwise similarities (all vs all). Chapter boundaries in the SSM therefore reflect more stable, cluster-level transitions.
And the "Sent From My Telephone - Semantic Order - TSP Algorithm" Spotify playlist is here.
Bonus: Acoustic Analysis
While I was at it, I used librosa (0.10.x) to extract tempo, key, RMS energy, and spectral centroid for each .m4a.
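
A sketch of roughly what that extraction looks like. The folder name, the CSV layout, and the crude chroma-argmax key estimate are my assumptions, not necessarily the exact method (note that decoding `.m4a` requires ffmpeg via audioread):

```python
import csv
from pathlib import Path

import librosa
import numpy as np

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
src = Path(r"C:\temp\reconstructingMyTelephone\tracks")  # assumed input folder

with open("acoustic_features.csv", "w", newline="") as f:  # assumed output name
    writer = csv.writer(f)
    writer.writerow(["track", "tempo_bpm", "key", "rms_mean", "centroid_mean_hz"])
    for audio in sorted(src.glob("*.m4a")):
        y, sr = librosa.load(audio, sr=None, mono=True)
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
        # Crude key estimate: the chroma bin with the highest average energy.
        chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
        key = PITCHES[int(np.argmax(chroma.mean(axis=1)))]
        rms = float(librosa.feature.rms(y=y).mean())
        centroid = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
        writer.writerow([audio.stem, float(tempo), key, rms, centroid])
```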
The resulting .csv file is here.
I made a playlist based on key and energy. Tracks are sorted by key (A, A#, B, C, C#, D, D#, E, F, F#, G, G#) and then, inside each group, by RMS energy.
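
Sketching that sort with pandas, assuming the `acoustic_features.csv` columns from the snippet above:

```python
import pandas as pd

KEY_ORDER = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

df = pd.read_csv("acoustic_features.csv")
# An ordered categorical makes sort_values respect the musical key order.
df["key"] = pd.Categorical(df["key"], categories=KEY_ORDER, ordered=True)
playlist = df.sort_values(["key", "rms_mean"]).reset_index(drop=True)
print(playlist[["track", "key", "rms_mean"]])
```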
Closing Thoughts
Fittingly, the TSP playlist closes the cycle with “Object Positioning”, a 33-second track that simply says, “What? I didn’t understand a word. I think I heard notes.”
True bottom line / Annex A:
Reordering the voice notes by semantic proximity produces a trajectory that isn’t present in the original order. Themes cluster naturally: quotations group with quotations, interpersonal loops group together, and low-context lists converge at the end. Below is an outline of what the algorithm surfaces, with short examples that represent each region of the embedding space.
1. External References and Performative Speech
This first cluster contains quotations, pop-culture fragments, compressed rants, and rhetorical monologues. Most entries keep distance from personal experience.
Examples:
- Biggie / Ten Crack Commandments dump: “I've been in this game for years, it made me an animal… Rule number uno…”
- Stream-of-consciousness conspiracy riff: “What’s the point? Is it an attempt to get a reaction from you? They mostly come out at night.”
- Surreal external monologue: “Because we're secretly crowd people sent to the surface world by the ghost of alien…”
Why this forms a cluster: These notes rely on quotation, stylistic performance, or external references rather than introspection. In embedding space, they form a loose group far from later dream-heavy and emotional sections.
2. Transitional / Ambiguous Personal Material
In this region the speaker enters the narrative, but the tone is uncertain, unresolved, or ambivalent. Dreams appear, but without strong emotional charge.
Examples:
- The Japan suitcase dream with no ending: “I didn't finish the dream. There was no resolution.”
- Social tension with the dad: “I snap at him… he hangs up… I feel awful.”
- Self-doubt about sharing: “Is it weird that I'm feeling weird about it or is it just me being weird?”
- Text-signoff etiquette: “No ex… XX… XXX+… people get offended if you provide no exes…”
Why this forms a cluster: These items tend to involve uncertainty, incomplete stories, and mild interpersonal tension. They sit semantically closer together than the quotations but apart from the heavier emotional loops that follow.
3. Emotionally Dense, Repetitive, Interpersonal Cluster
This is the densest region in the reordered sequence. Here we get interpersonal conflict, creative-partnership themes, recurring motifs, and dreams with stronger emotional charge. Several items repeat phrases or ideas with minor variations.
Examples:
- Music-as-children metaphor (highly self-referential): “When you give me tracks I feel like they're your children… I feel bad for your children sometimes… I can't be fair.”
- Repetitive internal loop: “I can't explain I can't explain I can't explain…” (14×)
- Direct confrontation: “Hey, why are you fucking with my head? What did I do to you?”
- High-intensity dream: “She slaps me in the face so hard… I can almost hear ringing…”
- Sexual queue dream: “A long queue of men… using her body… she was nearly passed out…”
Why this forms a cluster: These entries are high-density in the embedding space: heavy emotional content, repeated motifs, and strong interpersonal framing. The TSP algorithm loops tightly through these because the semantic distances between them are small.
4. Fragmentation and Low-Context Speech
At the end of the TSP-ordered sequence, structure drops off: entries become lists, fragments, or low-content utterances. These are highly similar to each other but distant from the earlier narrative material.
Examples:
- Pure lexical inventory: “see, hear, feel, be, set, walking over, there, here, more, for, about… compressed…”
- Language failure / acoustic perception: “What? I didn't understand a word. I think I heard notes.”
Why this forms a cluster: These items share extremely low contextual content and short, simple structures, so embeddings place them near each other. That's why they end up at the tail of the TSP path.
Overall Pattern
The semantic TSP path doesn’t reconstruct intentional storytelling. It highlights how:
- external references cluster together,
- personal but unresolved material forms mid-range groups,
- emotional or repetitive content becomes densely interconnected,
- and low-context fragments converge into a final block.
This progression is an artifact of semantic proximity, not a psychological claim or narrative imposed by the author.