June 4, 2018

How to teach your AI to read Chinese

Josh Anish

Mandarin* is by far the most widely spoken native language on the planet. With more than a billion native speakers, the group of Chinese dialects is spoken by 1 out of every 7 earthlings every day.

Equally daunting is data suggesting that Mandarin is one of the most difficult of the world’s major languages to learn, taking foreigners on average 2,300 hours of study to reach basic fluency.

Hence the challenge: how do you teach your AI program to read Mandarin, when the feature is potentially so helpful to your business yet so difficult for people who don’t speak the language to design?

Here’s how AppZen did it.

Our platform uses the open-source Tesseract engine for optical character recognition (OCR). Even for English, the technology still needs to be trained for receipt reading: image resizing, background reduction, pixel enhancement and other improvements.
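To make the clean-up step concrete, here is a minimal sketch of two of the improvements mentioned above, resizing and background reduction. It is an illustration only, not AppZen's pipeline: the image is assumed to be a plain grid of grayscale values (0 is black, 255 is white), and the function names `upscale` and `binarize` and the threshold of 128 are hypothetical. A production system would do this with an imaging library before handing the result to Tesseract.

```python
# Hypothetical sketch: a grayscale image is a list of rows of 0-255 integers.

def upscale(pixels, factor):
    """Nearest-neighbour enlargement: repeat every pixel `factor` times
    horizontally and every row `factor` times vertically."""
    out = []
    for row in pixels:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def binarize(pixels, threshold=128):
    """Background reduction: anything lighter than the threshold becomes
    pure white (255), anything darker becomes pure black (0)."""
    return [[0 if value < threshold else 255 for value in row] for row in pixels]


# A 2x2 "receipt" with one dark (ink) pixel on a greyish background.
tiny = [[200, 40],
        [190, 210]]
cleaned = binarize(upscale(tiny, 2))
```

The order matters less here than in real life; the point is simply that OCR sees a larger, higher-contrast image than the one the employee photographed.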

But this is just text, words that mean nothing to code. Natural-language processing (NLP) programs must then come in and tag the text so the platform understands where to locate, for example, “this is the amount” on the receipt. The receipt amount, surfaced by OCR and then framed by NLP, is next cross-checked, along with the vendor name, against what was submitted on the expense report.
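As a rough illustration of what "tagging" means, the sketch below scans OCR output line by line for a label that signals the amount and then compares the number it finds with the figure on the expense report. The label list, the regular expressions and the function names are assumptions made for this example; a real NLP layer would be trained rather than hand-written.

```python
import re
from decimal import Decimal

# Hypothetical labels that mean "this is the amount" on an English receipt.
# The word boundaries keep "Subtotal" from matching "total".
LABEL = re.compile(r"\b(?:total|amount due|amount)\b", re.IGNORECASE)
# A money figure with two decimals, with or without thousands separators.
MONEY = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2}")


def tag_amount(ocr_lines):
    """Return the first money figure found on a line carrying an amount label."""
    for line in ocr_lines:
        if LABEL.search(line):
            match = MONEY.search(line)
            if match:
                return Decimal(match.group().replace(",", ""))
    return None


def matches_report(receipt_amount, claimed_amount):
    """Cross-check: does the receipt agree with the expense report?"""
    return receipt_amount is not None and receipt_amount == Decimal(claimed_amount)


receipt = ["SPEEDY CAB CO", "Fare 21.50", "Tip 4.00", "Total $25.50"]
amount = tag_amount(receipt)
```

Calling `matches_report(amount, "25.50")` on this receipt returns `True`, while a claim of `"35.50"` would not match.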

This cross-checking process also features a form of AI; AppZen scans dozens of databases like Yelp and TripAdvisor in seconds to validate the vendor listed on each receipt. For example, AppZen has learned that “K-Kel, Inc” is actually just a shell name that appears on credit card statements as a proxy for Spearmint Rhino gentlemen's club in Las Vegas.
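The shell-name example can be pictured as a lookup from the name printed on a statement to the business behind it. In the sketch below a small hand-built table stands in for the external databases; the table, `normalize` and `resolve_vendor` are hypothetical names, and the only mapping in it is the one described above.

```python
def normalize(name):
    """Lower-case a vendor name and drop punctuation and spaces so that
    'K-Kel, Inc' and 'K-KEL INC' compare as equal."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Stand-in for what the platform has learned from its data sources.
KNOWN_ALIASES = {
    normalize("K-Kel, Inc"): "Spearmint Rhino (Las Vegas)",
}


def resolve_vendor(statement_name):
    """Return the underlying business if the name is a known proxy,
    otherwise return the name unchanged."""
    return KNOWN_ALIASES.get(normalize(statement_name), statement_name)
```

With this in place, `resolve_vendor("K-KEL INC")` resolves to the club, while an unknown vendor passes through untouched.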

When it comes to reading Mandarin, the process AppZen’s AI follows is more complicated, but only slightly.

Presented with expense receipts in Mandarin, the platform again uses the OCR stage to enhance and frame the receipt image, serving up relatively clean text to the NLP.

Remember, NLP can’t necessarily “read” either; it’s just been programmed to find phrases like “this is the amount.” But of course the receipts don’t say “this is the amount” in China; they say 这是金额. No problem: we trained the NLP to look for these payment-related phrases in Chinese. Same process, different language. AI doesn’t inherently know how to read Mandarin, but it isn’t born reading English, either. That’s where human programmers come in.
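"Same process, different language" can be sketched as one function fed by a per-language table of labels. Everything here is an assumption for illustration: the label lists are short hand-picked examples (金额 "amount", 合计 "total", 总计 "grand total"), and plain substring matching is far cruder than a trained model.

```python
import re

# Hypothetical label table: what "this is the amount" looks like per language.
AMOUNT_LABELS = {
    "en": ("total", "amount"),
    "zh": ("金额", "合计", "总计"),
}
# A simple number with up to two decimal places.
NUMBER = re.compile(r"\d+(?:\.\d{1,2})?")


def find_amount(ocr_lines, lang):
    """Return the first number on a line that carries an amount label."""
    labels = AMOUNT_LABELS[lang]
    for line in ocr_lines:
        if any(label in line.lower() for label in labels):
            match = NUMBER.search(line)
            if match:
                return float(match.group())
    return None


en_receipt = ["Speedy Cab Co", "Total $25.50"]
zh_receipt = ["出租车发票", "金额: ¥58.00"]
```

The code path is identical for both receipts; only the lookup key changes, as in `find_amount(en_receipt, "en")` versus `find_amount(zh_receipt, "zh")`.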

The automated vendor validation process with Mandarin works the same way; our system is told by humans which characters to look for that might mean “gentlemen’s club” (or, in this case, 脱衣舞俱乐部) and then AI can find which expenses don’t match and learn what else could signal misconduct.
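The human-supplied half of that process might look like the sketch below: a short list of keywords per language, checked against each vendor name. The keywords (夜总会 "nightclub", 脱衣舞 "striptease") and the expense records are invented for illustration, and the learning step described above is deliberately left out.

```python
# Hypothetical, human-curated keywords that suggest a vendor needs a second look.
FLAGGED_KEYWORDS = {
    "zh": ("夜总会", "脱衣舞"),
    "en": ("gentlemen's club", "strip club"),
}


def flag_expenses(expenses, keywords):
    """Return (expense id, matched keywords) for every expense whose
    vendor name contains at least one flagged keyword."""
    flagged = []
    for expense in expenses:
        vendor = expense["vendor"].lower()
        hits = [keyword for keyword in keywords if keyword in vendor]
        if hits:
            flagged.append((expense["id"], hits))
    return flagged


# Made-up expense lines: a nightclub and a roast duck restaurant.
expenses = [
    {"id": 1, "vendor": "金沙夜总会"},
    {"id": 2, "vendor": "北京烤鸭店"},
]
flagged = flag_expenses(expenses, FLAGGED_KEYWORDS["zh"])
```

Only the first expense is flagged; the restaurant passes because none of the keywords appear in its name.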

Like any nation, China has its own accounting complexities. One example is the country’s fapiao (发票) system of official invoices, which is central to how Value-Added Tax (VAT) is collected there. In a previous post, we dug into different sorts of VAT misconduct; in the case of clients with employees in China, we’ve essentially had to teach our AI system to look for fapiao receipts and check them for known patterns of VAT malfeasance (such as employees splitting the surcharge with the vendor and expensing the full amount for reimbursement).
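A very reduced version of such a check is sketched below. It assumes the OCR text and the amounts are already extracted, looks for the characters 发票 to decide whether the document is a fapiao at all, and applies one made-up rule (the claim exceeds what the receipt shows). The rule, the tolerance and the function names are illustrative assumptions, not the patterns AppZen actually tests for.

```python
def is_fapiao(ocr_text):
    """A fapiao carries the characters 发票 somewhere in its printed text."""
    return "发票" in ocr_text


def review_claim(ocr_text, receipt_amount, claimed_amount, tolerance=0.01):
    """Return a list of human-readable notes; an empty list means nothing stood out."""
    notes = []
    if not is_fapiao(ocr_text):
        notes.append("no fapiao marker found")
    if claimed_amount - receipt_amount > tolerance:
        notes.append("claimed more than receipt shows")
    return notes


# 增值税普通发票 is a VAT fapiao heading; 收据 is an ordinary receipt.
clean = review_claim("增值税普通发票 金额 100.00", 100.0, 100.0)
suspect = review_claim("收据 金额 100.00", 100.0, 120.0)
```

The first claim produces no notes, while the second is noted twice: it is not a fapiao, and the employee asked for more than the receipt shows.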

Teaching AI to read Mandarin isn’t terribly difficult; the challenge is to help AI read in the first place, and that immense project is left to your data scientists and engineers. The good news is that once you succeed in that task, any language is possible.


(*) Mandarin is far from the only language spoken in China. More than 50 million people speak Cantonese, especially around Hong Kong. For the sake of reading flow, Mandarin is the focus of this article and is used interchangeably with the word Chinese, while Cantonese is not discussed (though AI could learn Cantonese via the same basic process).