Regex in AsciiDoc: Hidden Characters & How to Find Them
Marvin Blome
Use this guide to decide when a character makes sense and how to find and replace it quickly in adoc Studio with regex and Unicode escapes.
You write in AsciiDoc. Typography matters. Hidden characters also matter. Some help readability. Others break parsing, search, or exports.
Instead of fixing these characters one by one, you can use regular expressions ("regex" for short). A regex matches text against criteria you define, so you can replace every match in one pass.
How to read the regex in this article
- We use literal characters when safe (e.g., —).
- We show Unicode escapes: \uXXXX or \x{XXXX}. Use whichever your regex engine accepts.
- If in doubt, paste the literal character into the Find box.
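To see that a literal character and its escape are interchangeable, here is a minimal sketch in Python. It assumes Python's `re` module, which accepts the `\uXXXX` form but not `\x{XXXX}`; the editor's own engine may differ. The sample sentence and variable names are made up for illustration.

```python
import re

# The same pattern written two ways: a literal em dash and its \uXXXX escape.
literal_pattern = re.compile("—")
escaped_pattern = re.compile(r"\u2014")

sample = "Typography matters — hidden characters too."

literal_hit = literal_pattern.search(sample)
escaped_hit = escaped_pattern.search(sample)

# Both searches should land on the same position in the sample.
same_position = literal_hit.span() == escaped_hit.span()
print(same_position)
```

If both forms find the same span, you can pick whichever is easier to read in your Find box.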
Spaces and joins
These control wrapping and word boundaries. They often slip in via copy-paste.
| Character | Looks like | Code | Use it when | Avoid because | Find (regex) | Replace with |
|---|---|---|---|---|---|---|
| Non-breaking space | | U+00A0 | Keep numbers + units or names together | Invisible; breaks expected wrap/search | \u00A0 | normal space " " |
| Narrow no-break space | | U+202F | French punctuation; tight number–unit spacing | Easy to miss; inconsistent in exports | \u202F | " " or regular NBSP |
| Thin space | | U+2009 | Fine typography (e.g., 5 %) | May collapse or vanish | \u2009 | " " |
| Zero-width space | | U+200B | Soft break in long URLs | Breaks search, identifiers, markup | \u200B | "" (remove) |
| Zero-width non-joiner | | U+200C | Needed in some scripts | Same issues in Latin text | \u200C | "" |
| Zero-width joiner | | U+200D | Emoji ligatures; complex scripts | Same issues | \u200D | "" |
| Soft hyphen | | U+00AD | Optional hyphenation | Can show as stray '-' and break copy | \u00AD | "" |
Batch cleanup (safe default for most docs):
Find: [\u200B\u200C\u200D\u00AD] → Replace: ""
Find: [\u00A0\u202F\u2009] → Replace: " " (or keep if you need non-breaking behavior)
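If you prefer to run the same cleanup outside the editor, the two Find/Replace steps above could be scripted roughly like this. This is a sketch in Python, not adoc Studio's own mechanism; the function name `clean_spacing` and the `keep_special_spaces` flag are hypothetical.

```python
import re

# Zero-width space, non-joiner, joiner, and soft hyphen.
INVISIBLE = re.compile(r"[\u200B\u200C\u200D\u00AD]")
# Non-breaking space, narrow no-break space, and thin space.
SPECIAL_SPACE = re.compile(r"[\u00A0\u202F\u2009]")


def clean_spacing(text: str, keep_special_spaces: bool = False) -> str:
    """Remove invisible characters; optionally flatten special spaces."""
    text = INVISIBLE.sub("", text)
    if not keep_special_spaces:
        # Skip this step when the layout relies on non-breaking behavior.
        text = SPECIAL_SPACE.sub(" ", text)
    return text


print(clean_spacing("10\u00A0kg zero\u200Bwidth"))  # 10 kg zerowidth
```

Passing `keep_special_spaces=True` mirrors the "or keep" option: invisible characters still go, but non-breaking spaces stay.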
Hyphens and dashes
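AsciiDoc treats -, --, and --- literally unless your converter does typographic substitutions. Asciidoctor, for one, turns -- in running text into an em dash by default, so test how your toolchain renders it. Be explicit.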
| Character | Looks like | Code | Use it when | Find | Replace with |
|---|---|---|---|---|---|
| Hyphen-minus | - | U+002D | Compound words; flags; code | - (literal) | keep |
| Non-breaking hyphen | - | U+2011 | Prevent break in compounds | \u2011 | - (or keep) |
| En dash | – | U+2013 | Ranges; relationships | \u2013 | -- (ASCII normalization) |
| Em dash | — | U+2014 | Breaks in thought | \u2014 | --- (ASCII normalization) |
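The ASCII normalization in the table can be sketched as a single pass with a lookup table. This is an illustrative Python version, assuming you want all three typographic dashes flattened; `normalize_dashes` is a hypothetical name.

```python
import re

# Replacement for each typographic dash, taken from the table above.
DASH_TO_ASCII = {
    "\u2011": "-",    # non-breaking hyphen
    "\u2013": "--",   # en dash
    "\u2014": "---",  # em dash
}
DASH_PATTERN = re.compile(r"[\u2011\u2013\u2014]")


def normalize_dashes(text: str) -> str:
    """Replace each typographic dash with its ASCII stand-in."""
    return DASH_PATTERN.sub(lambda match: DASH_TO_ASCII[match.group()], text)


print(normalize_dashes("pages 10\u201320 \u2014 see co\u2011op"))
# pages 10--20 --- see co-op
```

One pattern plus a dictionary keeps the mapping in a single place, so dropping a row (say, to keep non-breaking hyphens) is a one-line change.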
Quotes and apostrophes
Curly quotes look good in prose. They are bad in code, attributes, IDs, and markup.
| Character | Looks like | Code | Use it when | Find | Replace with |
|---|---|---|---|---|---|
| Straight double quote | " | U+0022 | Code, attributes, JSON, CSV | " | keep |
| Straight single quote / apostrophe | ' | U+0027 | Contractions, possessives, code | ' | keep |
| Left double quote | “ | U+201C | Prose quotes | [“] or \u201C | " |
| Right double quote | ” | U+201D | Prose quotes | [”] or \u201D | " |
| Left single quote | ‘ | U+2018 | Nested quotes | [‘] or \u2018 | ' |
| Right single quote / curly apostrophe | ’ | U+2019 | Prose apostrophes | [’] or \u2019 | ' |
| Backtick | ` | U+0060 | AsciiDoc monospace | ` | keep |
Bulk fixes:
Curly doubles → straight: [\u201C\u201D] → "
Curly singles → straight: [\u2018\u2019] → '
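Since curly quotes are fine in prose but harmful in code, you may want to straighten them only inside inline monospace. The Python sketch below assumes a simplified definition of a monospace span (text between two backticks on one line) and ignores listing blocks, attributes, and passthroughs; `straighten_quotes_in_code` is a hypothetical helper.

```python
import re

CURLY_TO_STRAIGHT = str.maketrans({
    "\u201C": '"', "\u201D": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
})
# Simplification: an inline monospace span is `...` on a single line.
MONOSPACE_SPAN = re.compile(r"`[^`\n]+`")


def straighten_quotes_in_code(text: str) -> str:
    """Straighten curly quotes inside backtick spans; leave prose untouched."""
    return MONOSPACE_SPAN.sub(
        lambda match: match.group().translate(CURLY_TO_STRAIGHT), text
    )


source = "She said \u201Chi\u201D and ran `echo \u201Chi\u201D`."
print(straighten_quotes_in_code(source))
# She said “hi” and ran `echo "hi"`.
```

The prose quotes survive while the command inside the backticks becomes safe to copy.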
Ellipsis and punctuation
| Character | Looks like | Code | Use it when | Find | Replace with |
|---|---|---|---|---|---|
| Ellipsis | … | U+2026 | Prose pause; UI truncation | \u2026 | ... |
| Middle dot | · | U+00B7 | Inline lists; math vectors | \u00B7 | • or - |
| Bullet | • | U+2022 | Rich text bullets | \u2022 | AsciiDoc list "* " |
| Figure/En/Em spaces | / / | U+2007 / U+2002 / U+2003 | Tabular alignment | [\u2007\u2002\u2003] | (space) or keep in tables |
| Non-ASCII punctuation | ¿ ¡ « » | U+00BF / U+00A1 / U+00AB / U+00BB | Language-specific typography | [¡¿«»] | ASCII equivalents if needed |
Math, units, and symbols
Prefer semantic ASCII in source and let the converter handle typography—unless you publish raw text.
| Character | Looks like | Code | Use it when | Find | Replace with |
|---|---|---|---|---|---|
| Multiplication sign | × | U+00D7 | Dimensions: 10×20 | \u00D7 | x or * |
| Minus sign | − | U+2212 | True math minus | \u2212 | - |
| Division sign | ÷ | U+00F7 | Simple math | \u00F7 | / |
| Degree | ° | U+00B0 | Temperatures, angles | \u00B0 | keep (ensure spacing) |
| Trademark, registered, copyright | ™ ® © | U+2122 / U+00AE / U+00A9 | Legal marks | [™®©] | keep or remove by style |
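For one-to-one swaps like the math signs above, a translation table is an alternative to regex. This Python sketch assumes you want ×, − and ÷ converted and the degree sign kept, as the table suggests; `asciify_math` is a hypothetical name.

```python
# One-to-one replacements from the table above; the degree sign is left alone.
MATH_TO_ASCII = str.maketrans({
    "\u00D7": "x",  # multiplication sign
    "\u2212": "-",  # minus sign
    "\u00F7": "/",  # division sign
})


def asciify_math(text: str) -> str:
    """Swap typographic math signs for their ASCII stand-ins."""
    return text.translate(MATH_TO_ASCII)


print(asciify_math("10\u00D720 \u2212 5 \u00F7 2 at 20\u00B0"))
# 10x20 - 5 / 2 at 20°
```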
Practical recipes
QA checklist before publishing
- Remove zero-width and soft hyphens.
- Decide: ASCII-only vs typographic output. Normalize accordingly.
- Ensure non-breaking spaces where needed (numbers + units, names).
- Check quotes in code and attributes are straight.
- Search for BOMs and stray figure/em spaces.
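Before replacing anything, it can help to list what is actually in the file. The following Python sketch reports each suspect character with its line, column, code point, and Unicode name. The character set is assembled from the tables in this article, and `find_hidden_characters` is a hypothetical helper, not an adoc Studio feature.

```python
import re
import unicodedata

SUSPECTS = re.compile(
    r"[\u200B\u200C\u200D\u00AD\uFEFF"  # zero-width characters, soft hyphen, BOM
    r"\u00A0\u202F\u2009"               # non-breaking and thin spaces
    r"\u2007\u2002\u2003]"              # figure, en, and em spaces
)


def find_hidden_characters(text: str) -> list:
    """Return (line, column, code point, Unicode name) for each suspect."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in SUSPECTS.finditer(line):
            char = match.group()
            hits.append((
                line_no,
                match.start() + 1,  # 1-based column
                f"U+{ord(char):04X}",
                unicodedata.name(char, "UNKNOWN"),
            ))
    return hits


for hit in find_hidden_characters("ok\n10\u00A0kg\u200B"):
    print(hit)
# (2, 3, 'U+00A0', 'NO-BREAK SPACE')
# (2, 6, 'U+200B', 'ZERO WIDTH SPACE')
```

A report like this lets you decide per occurrence whether a non-breaking space is intentional before you run a bulk replace.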
A. Purge hidden troublemakers
Find: [\u200B\u200C\u200D\u00AD\uFEFF] → Replace: ""
B. Replace non-breaking and thin spaces with normal spaces
Find: [\u00A0\u202F\u2009] → Replace: " "
(Skip this if you rely on non-breaking behavior for layout.)
C. Normalize quotes to ASCII
Doubles: [\u201C\u201D] → "
Singles: [\u2018\u2019] → '
D. Normalize dashes (if your style requires ASCII)
En dash: \u2013 → --
Em dash: \u2014 → ---