Writing a Multilingual Book in Markdown

This post is part of a collection on Code and Computers.

Since last summer I've been working on a book about Japanese Natural Language Processing. I'm writing it in Markdown, and publishing chapters simultaneously in Japanese and English. I was surprised not to find many tools to support this kind of workflow, so I ended up writing a preprocessing script to help me publish the book.

To give a little background, the book is about NLP, the field of machine learning concerned with processing language, in this case text documents. As such the book contains significant amounts of code. My co-author writes his chapters mainly in Jupyter notebooks. I considered doing the same, but many of my chapters are more prose than code, so I have stuck with my usual system for writing on the computer - Markdown in vim.

Our needs are already a little unusual - we have to write a document with significant amounts of code, in two languages, where the code will be shared between languages, and, oh, we need to turn it into a book too. We also want the texts to be parallel, with roughly paragraph-to-paragraph correspondence, so if possible we want to see both languages at the same time. That's a pretty specialized need, and I didn't immediately turn up any tools designed for this kind of usage.

So what I needed was something that would work equally well in plain text and in a Jupyter notebook, without introducing too much overhead. It should be able to modify the manuscript files that we edit directly to prepare them for processing by pandoc. The solution is almost laughably simple: special tokens that mark any line of text that's language-specific. Here's a sample document:

# English Title :en

# 日本語のタイトル :ja

This is a paragraph in English. For simple processing, it's kept on one line. :en

この段落は日本語で書かれています。処理を簡単にするため、改行を入れないようにしています。 :ja

I call the tokens like :en or :ja "sigils"; their position on the line doesn't matter, just their presence.

As part of the build process for the book, a script is run with a target language, and it filters the input in three simple ways:

- Unmarked lines are passed through
- Lines marked with the current language have the sigil (like :en) removed and are passed through
- Lines with other sigils are removed
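
To make the three rules concrete, here is a minimal sketch in Python. It is not the actual script: the function name `filter_lines`, the `SIGIL` pattern, and the assumption that a sigil is a colon plus a two-letter code standing alone as a token are all mine, and it only handles plain text, not the escaped lines inside a notebook's JSON.

```python
import re

# Assumed sigil shape: ":" plus a two-letter code, with whitespace or a
# line boundary on both sides. The real script may match differently.
SIGIL = re.compile(r"(?<!\S):([a-z]{2})(?!\S)")


def filter_lines(lines, target):
    """Yield the lines that belong in the output for the `target` language."""
    for line in lines:
        codes = SIGIL.findall(line)
        if not codes:
            # Rule 1: unmarked lines pass through untouched.
            yield line
        elif target in codes:
            # Rule 2: lines for the target language lose their sigil.
            yield SIGIL.sub("", line).strip()
        # Rule 3: lines carrying only other sigils are dropped.


sample = [
    "# English Title :en",
    "# 日本語のタイトル :ja",
    "print('shared code')",
]
for out in filter_lines(sample, "en"):
    print(out)
```

Because the pattern looks for the sigil anywhere on the line, this sketch keeps the property that position doesn't matter, only presence.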

It turns out this by itself is highly effective. For Jupyter Notebooks this means that all the important lines in the notebooks - which are basically large JSON objects - are passed through without modification. It also has the neat side effect that by using an imaginary language code, like :xx, it's possible to leave comments that won't show up in any final output files.

Technically there is one more feature - writing the language marker on every line of a list ended up being too tedious, so there's also block processing. This is slightly more complicated, and required some special rules to deal with Jupyter formatting, but it was still easy to implement. The rule is just that a sigil on a line by itself starts a block, and :end on a line by itself exits block processing.
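
The block rule can be sketched the same way. Again this is a hypothetical reconstruction rather than the real script: `filter_with_blocks`, `SIGIL`, and `BLOCK_START` are names I made up, sigils are assumed to be two-letter codes, and the special handling for Jupyter formatting is left out entirely.

```python
import re

# Assumed sigil shape: ":" plus a two-letter code standing alone as a token.
SIGIL = re.compile(r"(?<!\S):([a-z]{2})(?!\S)")
# A sigil with nothing else on the line opens a block.
BLOCK_START = re.compile(r"^:([a-z]{2})$")


def filter_with_blocks(lines, target):
    """Filter `lines` for `target`, honoring sigil-delimited blocks."""
    block_lang = None  # language of the block we are inside, if any
    for line in lines:
        token = line.strip()
        if block_lang is not None:
            if token == ":end":
                block_lang = None  # leave block mode; the marker is dropped
            elif block_lang == target:
                yield line  # inside a block for the target language
            continue
        start = BLOCK_START.match(token)
        if start:
            block_lang = start.group(1)  # the marker line itself is dropped
            continue
        # Outside any block, fall back to the per-line rules.
        codes = SIGIL.findall(line)
        if not codes:
            yield line
        elif target in codes:
            yield SIGIL.sub("", line).strip()


sample = [":en", "- apples", "- pears", ":end", ":ja", "- りんご", "- なし", ":end"]
for out in filter_with_blocks(sample, "ja"):
    print(out)
```

Tracking a single `block_lang` variable is enough here because blocks as described don't nest; a block for any other language, real or imaginary, is simply skipped until its :end.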

While this system was designed primarily to be easy to implement, it came with an added bonus. Because the sigils are plain text and highly flexible in their placement, they can be used in any existing text editor without modification. So when our source files are added to GitHub, all the usual Markdown preview machinery works as-is. The demo file shows up like a normal Markdown file with a little odd extra text.

I've gone ahead and thrown the script up on GitHub. If you find a neat use for it I'd love to hear about it. Ψ

2022-06-22T13:15:57+09:00