Résumé

Howdy! My name is Paul McCann.

As of April 2023, I am taking on consulting projects related to Natural Language Processing. I may be available for a full-time position if the details are right.

While my focus is specifically Japanese NLP, a lot of my professional experience has been in systems programming, and I've done a bit of everything over the years. If you think I can help you, feel free to shoot me a mail.

Since 2011 I've been living in Tokyo near Tokyo Tower. Before that I lived in Providence for five years while I got my undergraduate and Master's degrees at Brown.

Open Source and Personal Projects

My open source work and personal projects are focused on tools and resources for the Japanese language.

fugashi and mecab-python3. In 2019 I took over maintenance of mecab-python3, one of the top 1000 packages on PyPI by downloads and one of the most popular Japanese tokenizers in Python. That same year I wrote fugashi, a newer, easier-to-use MeCab wrapper that is widely used for large language models. For example, it's the standard Japanese tokenizer in HuggingFace Transformers.

Packaged Tokenizer Dictionaries. PyPI packages for UniDic and other Japanese tokenizer dictionaries enable simple, reproducible project builds with no particular effort on the part of developers, and are key to enabling widespread Japanese support in NLP libraries. I partnered with AWS to host the largest UniDic.

Kanji Club. A site to search kanji by parts. The site is static, and the main search works offline using under 1MB of (compressed) resources. The site ranks highly in Google for queries related to finding kanji (example: search for "石切 漢字").

Introduction to Japanese Natural Language Processing. A technical guide to modern NLP with a focus on Japanese-specific problems and resources, coauthored with Masato Hagiwara. In English and Japanese.

Others. cutlet romanizes arbitrary Japanese reasonably well. posuto makes the Infamous Japanese Postal CSV easy to use. My articles on Japanese Language Technology have regularly been featured on Hacker News.

Work Experience

Explosion: March 2021 to March 2023. Explosion are the developers of spaCy, one of the most popular and accessible NLP libraries. I worked extensively on core spaCy development, as well as support and consulting. Before becoming an employee, I had done consulting for Explosion, and as part of my open source work contributed PRs focused on Japanese support starting in 2017.

Consulting: 2019 to early 2021. I advised and worked on NLP- and ML-related projects. Selected projects:

  • Did some work on spaCy internals for Explosion AI, mainly working on language data efficiency.
  • Consulting and prototype implementation for a fashion (image) search app, miel, by Metaps One.
  • Made a tool for the Switch game Beat Talk! that analyzes stress patterns in English words to generate rhythm game stages.

Metaps One: 2014 to 2019. The main product of Metaps One is Become, a comparison shopping search engine with an unusual revenue model. (The company was actually known as Become until 2019.) In July 2019 they released miel, a fashion search app using image recognition, which I continued to work on as a consultant after leaving the company. When I joined the company I mostly worked in C++, though by the end I was mostly working in Python.

  • Rewrote the datafeed ETL, going from 100k lines of C++ to 1,500 lines of Python with a slight speed increase, handling 20M products each day on one machine.
  • Improved product classification (quality and efficiency).
  • Dealt with issues with Japanese tokenization affecting search quality.
  • Created an annotation tool that let anyone in the company create category training data.

M3: 2011 to 2014. M3's primary business is sending drug advertisements to doctors. I worked on the backend team, implementing tools for internal use. Most of my work here was in Ruby on Rails or Java, but my largest project, a query builder for non-technical salespeople, used a Python backend I wrote to run efficient queries on a large database of Japanese medical professionals.

Academic Publications

fugashi, a Tool for Tokenizing Japanese in Python. Written for the OSS NLP Workshop at EMNLP 2020.

Academic Background

Brown University: Master of Computer Science. 2011. Worked in BLLIP, the Brown NLP Lab, under Eugene Charniak.

  • Implemented proofs of concept for PhD students.
  • Explored some novel methods in document dating and authorship attribution.

Brown University: Bachelor of Arts in Computer Science. 2006-2010. Also fulfilled the requirements for a degree in English Literature (effectively double-majored).


If you have any questions, don't hesitate to mail me. Ψ