Give a file to someone who must never see the personal data inside it — and make sure they can't. When you pull a slice of production into a CSV so a contractor can reproduce a bug, or route a batch of tickets to an outsourced analytics team, the names, emails and account numbers are not tucked into tidy columns. They sit inside log lines and ticket bodies as free text. You can clean a text file now with no account: upload it, tick the categories to strip, and download a plain .txt you can diff against the original.
Where the personal data actually hides
In a spreadsheet you know which column holds the email. In a log dump or a ticket export you do not: the customer's surname turns up mid-sentence, the phone number is wedged into a stack trace, the card number was pasted into a note by an agent in a hurry. That is what makes bulk text so leaky. Deleting the values by hand across thousands of rows is slow, and one miss is a breach — so the job has to be mechanical.
Upload the file and pick from the categories the pipeline knows how to find:
- People's names, including when they appear in the middle of a line.
- Emails and phone numbers, in their common international shapes.
- Payment card numbers (PAN), matched by structure and checked before removal.
- IBANs and bank data, validated by their built-in checksum.
- National ID numbers — DNI, NIF, NIE, CIF and the like — caught by their control character.
- Postal addresses that pin a record to a place.
Anything you already know about — an internal case code, a project name, a specific handle — goes on a deny-list and is removed in the same pass.
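A deny-list pass of this kind can be sketched in a few lines. This is a minimal illustration, not the tool's actual code: the function names, the case-insensitive matching and the whole-word boundaries are all assumptions made for the example.

```python
import re

def build_deny_pattern(terms):
    """Compile one case-insensitive pattern for all deny-listed terms.

    Returns None when the list is empty, so callers never match on nothing.
    """
    # Longest terms first, so a longer entry wins over a shorter prefix of it.
    ordered = sorted({t for t in terms if t}, key=len, reverse=True)
    if not ordered:
        return None
    alternation = "|".join(re.escape(t) for t in ordered)
    # The lookarounds stop "ACME" from matching inside "ACMEX".
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)

def find_deny_spans(text, terms):
    """Return (start, end) offsets of every deny-listed term found in text."""
    pattern = build_deny_pattern(terms)
    if pattern is None:
        return []
    return [m.span() for m in pattern.finditer(text)]

# Hypothetical sample line and deny-list entries.
sample = "Escalated CASE-4471 to @jlopez under Project Falcon."
spans = find_deny_spans(sample, ["CASE-4471", "@jlopez", "project falcon"])
found = [sample[s:e] for s, e in spans]
```

Returning offsets rather than the matched strings keeps the deny-list hits in the same shape as any other detected span, so a single rewrite step can handle both.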
Validated removal, not a lucky guess
The categories that would hurt most if a real value slipped through are exactly the ones the pipeline refuses to guess at. A run of sixteen digits is only treated as a card number when it satisfies the Luhn algorithm, so a genuine PAN is stripped while an order number of the same length that fails the check is left untouched. An IBAN is only rewritten when its mod-97 check holds, that is, when the rearranged number leaves a remainder of 1. Spanish DNI/NIF/NIE/CIF are keyed off their control character, and phone numbers off their national pattern. Because these tests look at arithmetic and structure rather than words, they work the same whatever language the surrounding text is in: a structured identifier does not care whether the sentence around it is German or Portuguese.
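Both checks are public, standard algorithms, so they are easy to try for yourself. The sketch below shows the Luhn checksum and the IBAN mod-97 test in their textbook form. The function names are hypothetical, and a production validator would presumably also check country-specific IBAN lengths, which this example leaves out.

```python
def luhn_valid(digits: str) -> bool:
    """True when a string of ASCII digits satisfies the Luhn checksum."""
    if not (digits.isascii() and digits.isdigit()):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

def iban_valid(iban: str) -> bool:
    """True when the IBAN's mod-97 remainder equals 1."""
    s = iban.replace(" ", "").upper()
    # IBANs run from 15 to 34 characters; per-country lengths are not checked here.
    if not (15 <= len(s) <= 34 and s.isascii() and s.isalnum()):
        return False
    # Move country code and check digits to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    as_digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(as_digits) % 97 == 1

card_ok = luhn_valid("4111111111111111")    # widely used test card number
order_ok = luhn_valid("1234567812345678")   # sixteen digits, fails the checksum
iban_ok = iban_valid("GB82 WEST 1234 5698 7654 32")  # standard example IBAN
```

Note that Luhn has a single check digit, so roughly one in ten arbitrary digit strings will pass it by chance. The checksum filters out most look-alikes but not all of them.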
Structured IDs travel across languages; names do not
The name detector leans on Spanish and English models, so people's names in German, French and Italian text are found only partially. Be honest with yourself about that: add the surnames you know to the deny-list and give those files a read-through. Cards, IBANs, emails, phones and check-digit IDs are language-agnostic and need no such caveat.
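The check-letter IDs show why no language model is needed for this class of value. Below is a minimal sketch of the publicly documented DNI/NIE control-letter rule (number modulo 23, looked up in a fixed letter table). The function name is hypothetical, and the CIF uses a different control scheme that this example does not cover.

```python
import re

# Fixed lookup table for the DNI/NIE control letter.
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"
# A NIE starts with X, Y or Z, which stands in for a leading 0, 1 or 2.
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}

def dni_nie_valid(value: str) -> bool:
    """True when a DNI or NIE carries the correct control letter."""
    v = value.strip().upper().replace("-", "")
    if re.fullmatch(r"[0-9]{8}[A-Z]", v):
        number = v[:8]
    elif re.fullmatch(r"[XYZ][0-9]{7}[A-Z]", v):
        number = NIE_PREFIX[v[0]] + v[1:8]
    else:
        return False
    return DNI_LETTERS[int(number) % 23] == v[-1]

dni_ok = dni_nie_valid("12345678Z")   # 12345678 % 23 == 14, letter Z
nie_ok = dni_nie_valid("X1234567L")   # 01234567 % 23 == 19, letter L
dni_bad = dni_nie_valid("12345678A")  # wrong control letter
```

Nothing in the check depends on the surrounding sentence, which is why the same test behaves identically in a Spanish ticket and a German log line.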
Editing by hand versus a deterministic pass
By hand:
- A find-and-replace misses the one row spelled differently
- Every reviewer removes a slightly different set
- No record of what was taken out, or where
- A single overlooked value is a leak
Deterministic pass:
- Every span in a chosen category is located in one pass
- The same input yields the same output, run after run
- Redact, mask with *, or replace with <ENTITY>: your call
- The audit list keeps character offsets, never the value
The output is a plain .txt, so you can diff it against the source and see for yourself that each detected span is gone: redacted to nothing, masked to a run of asterisks, or replaced by a typed marker. Whichever style you choose is applied the same way everywhere in the file. And the audit trail records only where a span sat — its start and end offsets — never the characters that were there, so the log itself can never leak what it removed.
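The three rewrite styles and the offsets-only audit trail can be illustrated with a short sketch. Everything here is an assumption made for the example: the `Span` type, the `rewrite` function, the style names as strings, and the choice to skip overlapping spans. The offsets in the audit list refer to positions in the original text.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int
    end: int
    category: str  # e.g. "PERSON" or "IBAN_CODE"

def rewrite(text: str, spans: list[Span], style: str) -> tuple[str, list[dict]]:
    """Apply one rewrite style to every span; return (clean_text, audit)."""
    if style not in ("redact", "mask", "replace"):
        raise ValueError(f"unknown style: {style}")
    pieces, audit, cursor = [], [], 0
    for span in sorted(spans):
        if span.start < cursor:
            continue  # this sketch simply skips overlapping spans
        pieces.append(text[cursor:span.start])
        if style == "mask":
            pieces.append("*" * (span.end - span.start))
        elif style == "replace":
            pieces.append(f"<{span.category}>")
        # "redact" appends nothing, so the span disappears entirely.
        # The audit entry holds offsets only, never the removed characters.
        audit.append({"start": span.start, "end": span.end})
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces), audit

# Hypothetical input line with two already-detected spans.
text = "Pay Ana Ruiz via GB82WEST12345698765432 today."
spans = [Span(4, 12, "PERSON"), Span(17, 39, "IBAN_CODE")]
clean, audit = rewrite(text, spans, "replace")
# clean == "Pay <PERSON> via <IBAN_CODE> today."
```

Because the rewrite is plain string slicing over sorted offsets, the same text and spans always produce the same output, which is the property the diff relies on.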
Data minimisation is the legal point
GDPR Article 5(1)(c) makes minimisation binding: personal data must be limited to what is necessary. A contractor debugging a query, or a staging box running a test suite, does not need real identities, so under that principle they should not receive them. Article 4(5) draws the line you actually care about: pseudonymised records can still be traced back to a person with additional information, so they remain personal data, while removing the identifiers outright pushes the file toward anonymisation. Handing a raw export to a third party, or copying live data into a non-production environment, is exactly where regulators look, and stripping the PII first is the cheapest way to stay on the right side of that line. Note the scope: this tool processes text and returns text. It does not mark up a PDF visually, blur a face or beep an audio file, which are separate jobs with their own tools.
Clean a file now
Upload the .txt, .docx or PDF, choose the categories and the rewrite style, confirm the price, and download the clean copy. Detection finds the spans; deterministic code rewrites them, so the result is identical every run. No account, pay only for what you clean.
When you need this
An engineer needs to hand a file to someone who should never see the personal data inside it. It might be a production database export pulled into a CSV so a contractor can reproduce a bug, a batch of support tickets going to an outsourced analytics team, or a slice of application logs becoming test fixtures for a staging environment. The file is free text, so the customer names, emails, phone numbers, national ID numbers, IBANs and the odd card number are not tucked into labelled columns — they are strewn through log lines and ticket bodies. Deleting them by hand across thousands of rows is error-prone, and one missed value is a leak. Upload the file, choose the categories to remove, and every span that is a name, an email, a phone, a card, an IBAN or an ID is located and rewritten deterministically, so the .txt you hand over carries the structure and none of the people.
The compliance angle
GDPR Article 5(1)(c) makes data minimisation a binding principle: personal data must be limited to what is necessary, and a contractor debugging a query or a staging box running tests does not need real identities. Article 4(5) draws the line you actually care about — pseudonymisation still lets a record be linked back, while removing the identifiers outright takes the file toward anonymisation and out of that risk. Sharing a raw export with a third party or copying live data into a non-production environment is precisely where regulators look; redacting the PII first is the cheapest way to stay on the right side of that line.
What you can verify
The output is a plain .txt you can diff against the source. Each detected span is gone: redacted to nothing, masked to a run of asterisks, or replaced by a typed marker like <PERSON> or <IBAN_CODE>. The choice is yours, and it is applied the same way every run. The risky categories are validated, not guessed: a sixteen-digit string is only removed when it passes the Luhn check, an IBAN only when its mod-97 checksum holds, so real card numbers go and an order number that fails the check stays. The audit list stores only character offsets, start and end, never the value that sat there.
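If you would rather script the comparison than eyeball it, Python's standard `difflib` module is enough. This is a minimal sketch: the function name and the file labels `original.txt` and `clean.txt` are placeholders chosen for the example.

```python
import difflib

def changed_lines(original: str, cleaned: str) -> list[str]:
    """Return a unified diff (no context lines) between the two texts."""
    diff = difflib.unified_diff(
        original.splitlines(),
        cleaned.splitlines(),
        fromfile="original.txt",
        tofile="clean.txt",
        lineterm="",
        n=0,  # show only the lines that changed
    )
    return list(diff)

# Hypothetical before/after content.
original = "job started\nuser Ana Ruiz failed login\njob finished"
cleaned = "job started\nuser <PERSON> failed login\njob finished"
report = changed_lines(original, cleaned)
```

Every line in the report that starts with a single `-` is text that left the file and every line starting with a single `+` is what replaced it, so unchanged log structure never appears in the output.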
Frequently asked questions
- What file types can I upload, and what do I get back?
- Upload a plain-text `.txt`, a Word `.docx` or a PDF. The text is extracted, the categories you chose are located and rewritten, and you get back a clean `.txt` you can diff against the source. It is a file-in, file-out job — one text file for the whole batch of rows, not a per-record form.
- What is the difference between redact, mask with asterisks, and replace with <ENTITY>?
- Redact deletes the span so nothing is left where the value sat. Mask keeps the length by writing a run of asterisks over it. Replace swaps in a typed marker such as `<PERSON>` or `<IBAN_CODE>`, which keeps a line readable and shows a reviewer what kind of value stood there. You pick one style and it is applied the same way to every detected span in the file.
- Will it reliably catch credit card numbers, IBANs and national IDs — or will it flag random digits?
- The risky categories are validated, not guessed. A sixteen-digit string is only treated as a card number when it passes the Luhn algorithm, so a real PAN goes and a same-length order number that fails the check stays. An IBAN is only removed when its mod-97 checksum holds, and Spanish DNI/NIF/NIE/CIF are keyed off their control character. Because these are arithmetic checks, they behave the same whatever the surrounding language is.
- Does it work on German, French or Italian text, or only English and Spanish?
- The structured identifiers — cards, IBANs, emails, phones, check-digit IDs — are language-agnostic and work everywhere. Personal names are different: the name model leans on Spanish and English, so names in German, French and Italian text are caught only partially. For those files add the surnames you know to the deny-list and give the output a read-through.
- Once a value is redacted, can the original text be recovered from the output file?
- No. The detected characters are rewritten out of the file itself — deleted, masked or replaced — and a fresh `.txt` is written. There is no hidden layer underneath and no metadata that stores the pre-redaction text. The audit list keeps only character offsets, the start and end of each span, never the value that was there, so nothing in the output can be reversed back to the original.