pyaiml21.utils.text_preprocessors.normalize_cjk_user_input

pyaiml21.utils.text_preprocessors.normalize_cjk_user_input(s: str) → List[List[str]]

Perform CJK normalisation, split the input into sentences, and split each sentence into words.

CJK (Chinese, Japanese, Korean) normalisation is equivalent to applying <explode> to each word, so every character becomes a separate word. Unicode normalisation and uppercasing are also applied.

Parameters

s – user input to normalize

Returns

list of sentences, each sentence is a list of words

Example:
>>> text = u"こんにちは。この企画を気に入っていただけたでしょうか?"
>>> expected = [list("こんにちは"),
...             list("この企画を気に入っていただけたでしょうか")]
>>> normalize_cjk_user_input(text) == expected
True
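
The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the library's actual implementation: the name `sketch_normalize_cjk`, the choice of NFKC as the Unicode normalisation form, and the set of sentence-ending characters are all assumptions made for illustration.

```python
import re
import unicodedata
from typing import List

# Assumption: sentences end at these ASCII and full-width marks.
# The real library may use a different delimiter set.
_SENTENCE_END = re.compile(r"[.!?。！？]+")


def sketch_normalize_cjk(s: str) -> List[List[str]]:
    """Hypothetical approximation of normalize_cjk_user_input."""
    # Assumption: NFKC is used; the source only says "Unicode normalisation".
    s = unicodedata.normalize("NFKC", s).upper()

    sentences: List[List[str]] = []
    for raw in _SENTENCE_END.split(s):
        # "Explode" each whitespace-separated word into single characters.
        chars = [ch for word in raw.split() for ch in word]
        if chars:  # skip empty fragments, e.g. after a trailing delimiter
            sentences.append(chars)
    return sentences
```

Under these assumptions, the sketch yields the same result as the example above for the Japanese input, and Latin-script input is uppercased and exploded in the same way.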