Natural Language Processing: tokenization and numericalization
NLP processing techniques

NLP is currently one of the fastest-growing fields in deep learning, with a widening range of applications: detecting or generating articles and reviews, medical uses such as diagnosis recognition, and extensive business opportunities.

How does it work?

The first step is tokenization, which breaks raw text into smaller pieces called tokens. Tokens can be words, single characters, or subwords (character n-grams: contiguous sequences of n items from a given sample of text or speech).

The most common way to tokenize is to split on spaces. Taking the space as the delimiter, “Natural Language Processing” yields 3 tokens: “Natural”, “Language”, “Processing”.

Major techniques for tokenizing are:

Split(): split method is used to break the given...
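
To illustrate the token types mentioned earlier, here is a minimal Python sketch that produces word tokens by splitting on the space delimiter, character tokens, and character n-grams for the example phrase. The helper name char_ngrams is hypothetical and introduced only for illustration; it is not part of the original article or of any library.

```python
text = "Natural Language Processing"

# Word-level tokens: split on the space delimiter.
word_tokens = text.split(" ")

# Character-level tokens: every character is its own token (spaces included).
char_tokens = list(text)


def char_ngrams(word, n):
    """Hypothetical helper: contiguous character n-grams of one word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


trigrams = char_ngrams("Natural", 3)

print(word_tokens)  # ['Natural', 'Language', 'Processing']
print(trigrams)     # ['Nat', 'atu', 'tur', 'ura', 'ral']
```

Splitting on a single space is the simplest option, but it keeps punctuation attached to words and produces empty strings when spaces repeat; calling text.split() with no argument collapses runs of whitespace instead.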