Redact PII From Call Recordings — Beep or Silence Spoken Card Data & Names

Destroy spoken personal data inside a call recording

Upload a support or sales call, choose the categories to remove, and the seconds where a card number, security code, name or address is spoken are located from the transcript and destroyed on the waveform — replaced by a 1 kHz beep or by silence — before the file is shared with QA, analytics or an outside reviewer.

Medianonymizer Team · July 1, 2026 · 5 min read

Before a call recording leaves your archive, destroy the seconds where personal data is spoken. A support or sales call is an unstructured conversation, so the card number, the security code and the customer's name are not tucked into a labelled field — they are said out loud, mid-sentence, scattered across minutes of dialogue. You can redact a call recording now without an account: upload it, pick the categories to remove, and download a clean MP3.

What a call recording actually captures

Payment and support calls leak the exact categories regulators care about most, and they leak them as speech rather than as tidy database columns:

Card data in PCI-DSS scope — the primary account number read digit by digit, the expiry, and the three-digit security code.
Direct identifiers — full names, spelled-out surnames, dates of birth and email addresses.
Contact and location data — phone numbers, billing and home addresses.
Reference numbers — customer IDs, order numbers, IBANs and national ID numbers.

The difficulty is never what to remove. It is finding where each value appears across thousands of hours of audio, and removing it in a way you can prove afterwards. That is a pipeline problem, not a manual one.

How the redaction works: locate, then destroy

The pipeline keeps the probabilistic part and the exact part strictly apart.

First it locates. The recording is normalised to a clean 16 kHz mono track and transcribed with word-level timestamps, using a Whisper-class model in the cloud by default or a local aligner as a fallback. The model only points at where words are said; it never touches the audio. Over that transcript, detection runs two ways: entity recognition flags people and places, while checksum-validated matchers catch structured identifiers. A card number is only flagged when it passes the Luhn check, so a real account number is caught while most random sixteen-digit strings spoken in passing are left alone (about one random digit string in ten passes Luhn by chance). The same kind of checksum validation guards IBANs and national IDs. You can add your own deny-list of exact strings, such as an internal case code or a product name, to be removed in the same pass.
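The checksum step can be sketched in a few lines. This is a minimal illustration, not the product's code: the word tuples, the function names and the 13 to 19 digit length window are assumptions made for the example, and it only handles digits that the transcript already renders as numerals.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_ranges(words: List[Word]) -> List[Tuple[float, float]]:
    """Join consecutive digit tokens and keep runs that look like a valid card."""
    ranges: List[Tuple[float, float]] = []
    run: List[Word] = []
    for text, start, end in words + [("", 0.0, 0.0)]:  # sentinel flushes the last run
        token = text.strip(".,")
        if token.isascii() and token.isdigit():
            run.append((token, start, end))
            continue
        if run:
            digits = "".join(t for t, _, _ in run)
            # Assumed length window for card numbers: 13 to 19 digits.
            if 13 <= len(digits) <= 19 and luhn_valid(digits):
                ranges.append((run[0][1], run[-1][2]))
            run = []
    return ranges
```

The function returns time ranges only, which is all the destruction step needs; the digits themselves are never passed on.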

Then deterministic code destroys. Each detected word is mapped back to its start and end time, a small pad is added on each side, overlapping ranges are merged so nothing slips through a gap, and ffmpeg rewrites the samples in those ranges. Nothing about this step is generative: the same input produces the same output every run.
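The pad-and-merge step is plain interval arithmetic. The sketch below shows one way to do it, under stated assumptions: the 0.25 second pad is a placeholder (the text only says "a small pad"), the function names are hypothetical, and the ffmpeg expression uses the volume filter with timeline editing to silence each range, which is one possible approach rather than a description of the actual filter graph.

```python
from typing import List, Optional, Tuple

Range = Tuple[float, float]  # (start_seconds, end_seconds)


def pad_and_merge(ranges: List[Range], pad: float = 0.25,
                  duration: Optional[float] = None) -> List[Range]:
    """Pad each range on both sides, then merge any that overlap or touch."""
    padded = []
    for start, end in ranges:
        lo = max(0.0, start - pad)          # never before the start of the file
        hi = end + pad
        if duration is not None:
            hi = min(duration, hi)          # never past the end of the file
        padded.append((lo, hi))
    padded.sort()
    merged: List[Range] = []
    for lo, hi in padded:
        if merged and lo <= merged[-1][1]:
            # Overlaps the previous range: extend it instead of leaving a gap.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def mute_filter(ranges: List[Range]) -> str:
    """Build an ffmpeg -af expression that sets the volume to 0 in each range.

    Assumption: volume filter with timeline editing (enable='between(t,a,b)').
    """
    return ",".join(
        f"volume=enable='between(t,{lo:.3f},{hi:.3f})':volume=0"
        for lo, hi in ranges
    )
```

Because both functions are pure and the input ranges are sorted before merging, the same detections always yield the same filter string, which is what makes the output repeatable.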

Detection is best-effort — the destruction is exact

We are honest about the boundary. Finding spoken data depends on the transcript; if a word is not transcribed or cannot be time-aligned, that span is left untouched rather than guessed at, so a missed value stays in the audio. What is not probabilistic is the removal: once a range is marked, the original samples in it are physically destroyed. Pair the automatic pass with a human spot-check on your most sensitive calls.

Silence or beep — you choose the marker

Both options erase the audio underneath; they differ only in what a listener hears afterwards.

Hiding the moment

Lowering the volume or muffling leaves the speech recoverable
A separate bleep laid over the top can be removed to expose the original
Source metadata may still name the device, agent or session
Nothing proves a redaction was ever intentional

Destroying the samples

The waveform in that range is set to zero — the digits are gone
A 1 kHz beep or clean silence replaces them in the same file
The MP3 is re-encoded with every tag stripped
The audit list marks the range, never the value
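To make "destroying the samples" concrete, here is a small sketch that overwrites a range of PCM samples in memory. It is an illustration only: the function name, the amplitude and the list-of-integers representation are assumptions, and the real pipeline does this work through ffmpeg rather than in Python.

```python
import math
from typing import List, Tuple


def redact_samples(samples: List[int], rate: int,
                   ranges: List[Tuple[float, float]],
                   beep: bool = True, freq: float = 1000.0,
                   amplitude: int = 8000) -> List[int]:
    """Return a copy of the samples with each range overwritten.

    The original values inside a range are replaced outright, either by
    zeros (silence) or by a freshly generated sine tone (beep). Nothing is
    mixed with or layered over the speech that was there.
    """
    out = list(samples)
    for start, end in ranges:
        first = max(0, int(start * rate))
        last = min(len(out), math.ceil(end * rate))
        for n in range(first, last):
            if beep:
                out[n] = int(amplitude * math.sin(2 * math.pi * freq * n / rate))
            else:
                out[n] = 0
    return out
```

Either way the output holds no trace of the original values in the range, which is the difference from lowering the volume or overlaying a separate bleep.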

What we detect, and what we do not promise

We remove card numbers validated by Luhn, IBANs and bank data, national ID and passport numbers, names, email addresses, phone numbers and postal addresses — plus anything on your deny-list. A bare three-digit security code spoken in isolation is not a strong standalone signal, so treat the review step as part of the workflow and add specific values to the deny-list when you know them. This tool processes audio and returns audio: it does not hand you a transcript to keep, it does not detect faces in video, and it does not mark up a PDF — those are separate modalities with their own tools.
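A deny-list match can be mapped to time ranges with a simple token comparison. The sketch below assumes the transcript is a list of (text, start, end) tuples and that matching ignores case and punctuation; both are assumptions for the example, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def _norm(text: str) -> str:
    """Lower-case and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def find_deny_list_ranges(words: List[Word],
                          deny_list: List[str]) -> List[Tuple[float, float]]:
    """Return the time range of every deny-list phrase found in the transcript."""
    tokens = [_norm(w[0]) for w in words]
    ranges: List[Tuple[float, float]] = []
    for phrase in deny_list:
        target = [t for t in (_norm(p) for p in phrase.split()) if t]
        if not target:
            continue
        n = len(target)
        # Slide a window of n tokens over the transcript.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                ranges.append((words[i][1], words[i + n - 1][2]))
    return sorted(ranges)
```

This is the route for a value the detectors cannot recognise by structure alone, such as a known security code or an internal case reference.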

0 accounts needed to redact a call

1 kHz censor beep over each redacted range

MP3 clean output, all metadata stripped

Built for real contact-centre audio

Real call audio arrives in awkward shapes, and the pipeline is built for them. A stereo call leg keeps the agent on one channel and the caller on the other, while a mono export mixes both voices together. Recordings squeezed through telephony codecs, or resampled on their way through an IVR menu or an ACD queue, still normalise cleanly before transcription. Hold music, DTMF keypad tones, cross-talk and packet jitter do not derail the timing map, because the alignment is rebuilt from the spoken words themselves, so a caller who blurts a card number over the agent is still pinned to the second it was said. Whether the source is narrowband or wideband, the timing map still lands the redaction where the digits were spoken.
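As a rough sketch of the normalisation step, the helper below builds an ffmpeg command line that produces a 16 kHz mono WAV for transcription. The function name and the "mix", "left" and "right" choices are hypothetical; the ffmpeg options used (-ac, -ar, -vn, the pan filter) are standard ones, but whether the pipeline uses exactly these is an assumption.

```python
from typing import List

# Hypothetical channel choices for a stereo call leg.
_PAN = {"left": "pan=mono|c0=FL", "right": "pan=mono|c0=FR"}


def normalise_args(src: str, dst: str, channel: str = "mix",
                   rate: int = 16000) -> List[str]:
    """Build ffmpeg arguments for a mono, fixed-rate, 16-bit PCM track."""
    args = ["ffmpeg", "-y", "-i", src, "-vn"]  # -vn: ignore any video stream
    if channel == "mix":
        args += ["-ac", "1"]                   # downmix all channels to one
    elif channel in _PAN:
        args += ["-af", _PAN[channel]]         # keep a single stereo leg
    else:
        raise ValueError(f"unknown channel: {channel!r}")
    args += ["-ar", str(rate), "-c:a", "pcm_s16le", dst]
    return args
```

The resulting list can be passed to subprocess.run(args, check=True) on a machine with ffmpeg installed.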

Redact a call recording now

Upload the recording, choose whether spoken card data, names, IDs and addresses become a beep or silence, confirm the price, and download the clean MP3. The AI only finds the sensitive moments; deterministic code destroys them, so the result is irreversible and identical every run. No account, pay only for what you redact.

When you need this

It is a Tuesday afternoon in a contact centre and an agent is taking a card payment over the phone. The customer reads the sixteen digits of their card aloud, then the three-digit security code, then spells their surname and confirms the billing address. Every second of that is being recorded for quality scoring and dispute resolution, and it will sit in your archive for months. Multiply it by a few thousand calls a day and you are keeping a searchable library of live card data and identities. Upload that recording to Medianonymizer, choose the categories to remove, and the seconds where the card number, the code, the name and the address are spoken are located from a word-level transcript and destroyed on the waveform — replaced by a 1 kHz beep or by silence — before the file is ever handed to QA, an analytics vendor or an offshore reviewer.

The compliance angle

PCI-DSS v4.0 is explicit: sensitive authentication data such as the card verification code must not be retained after authorisation (Requirement 3.3.1), and the primary account number must be rendered unreadable wherever it is stored (Requirement 3.5.1). A recording that captures those digits being read aloud is storage. In parallel, the GDPR treats the caller's name, address and phone number as personal data you must minimise and protect. Destroying the spoken values in the recording removes those items from the file: there is no card number left to protect and no spoken name or address left to leak.

What you can verify

The result is checkable, not a promise. Open the returned MP3 and jump to the timestamp where the card number was read: you hear a 1 kHz tone or silence, not the digits — the original samples in that range are gone, not lowered and not hidden under a layer. Inspect the file's tags with any tool and there is no ID3 metadata carried over. The audit list records only the time ranges that were redacted — start and end seconds — never the numbers themselves, so even the log cannot leak what it removed.
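Two of these checks are easy to script. The sketch below looks for the standard ID3 markers in an MP3 (an ID3v2 header starts the file with "ID3"; an ID3v1 tag is the last 128 bytes and starts with "TAG") and confirms that an audit list holds only time ranges. The audit format, a list of dictionaries with "start" and "end" keys, is an assumption for the example, and other tag formats such as APEv2 are not covered.

```python
from typing import Dict, List


def id3_tags_present(data: bytes) -> Dict[str, bool]:
    """Report whether raw MP3 bytes carry an ID3v2 header or an ID3v1 trailer."""
    return {
        "id3v2": data[:3] == b"ID3",
        "id3v1": len(data) >= 128 and data[-128:-125] == b"TAG",
    }


def audit_has_only_ranges(entries: List[dict]) -> bool:
    """True if every audit entry holds exactly a start and an end, nothing else.

    Assumed entry shape: {"start": seconds, "end": seconds}.
    """
    return all(set(entry) == {"start", "end"} for entry in entries)
```

To check a download, read it with open(path, "rb").read() and pass the bytes to id3_tags_present; both values should be False for a file with no ID3 tags.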

Frequently asked questions

How does the tool find personal data inside spoken audio?

It transcribes the call to text with word-level timestamps using a Whisper-class speech model, then runs entity recognition and checksum-validated matchers over that transcript. Names and places come from entity recognition; card numbers, IBANs and national ID numbers are caught by structure and validated — a card number is only flagged when it passes the Luhn check. The model only points at where a value is spoken; it never edits the audio.

Is the redaction reversible, or is the audio really destroyed?

It is destroyed. Each detected range is rewritten on the waveform: the original samples are set to zero and replaced by a 1 kHz beep or clean silence in the same file. There is no separate bleep track to remove and no hidden layer to peel back. Once a range is redacted, the speech that was there cannot be reconstructed from the output.

Should I use a beep or silence for card data?

For PCI and legal contexts a beep is the safer default because it leaves an audible marker that something was intentionally removed — an auditor can hear the redaction. Silence is cleaner for analytics and QA datasets but can be mistaken for a recording dropout. Both erase the underlying samples, so both are irreversible.

Does the output still carry metadata from the original recording?

No. The call is re-encoded to a fresh MP3 with every tag stripped, so device identifiers, session data and timestamps from the source do not ride along. The file you download is the cleaned audio and nothing else.

What are the honest limits of automatic detection?

Finding spoken data depends on the transcript. If a word is not transcribed or cannot be time-aligned, that span is left untouched rather than guessed at. A bare three-digit security code spoken in isolation is also not a strong standalone signal. Pair the automatic pass with a human spot-check on sensitive calls, and add exact strings you know about to the deny-list. The destruction step is exact; the detection step is best-effort.
