Compiler Design Lecture | Introduction to Lexical Analyzer | Tokens, Patterns, Lexemes, computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens (strings with an identified "meaning"). A program that performs lexical analysis may be called a lexer, tokenizer, or scanner (though "scanner" is also used to refer to the first stage of a lexer). Such a lexer is generally combined with a parser, which together analyze the syntax of programming languages. A lexeme is a string of characters which forms a syntactic unit. just call this a token, using 'token' interchangeably to represent (a) the string being tokenized, and also (b) the token datastructure resulting from putting this string through the tokenization process. A token is a structure representing a lexeme that explicitly indicates its categorization for the purpose of parsing. A category of tokens is what in linguistics might be called a part-of-speech. Examples of token categories may include "identifier" and "integer literal", although the set of token categories differ in different programming languages. The process of forming tokens from an input stream of characters is called tokenization. The first stage, the scanner, is usually based on a finite-state machine (FSM). It has encoded within it information on the possible sequences of characters that can be contained within any of the tokens it handles (individual instances of these character sequences are known as lexemes). For instance, an integer token may contain any sequence of numerical digit characters. In many cases, the first non-whitespace character can be used to deduce the kind of token that follows and subsequent input characters are then processed one at a time until reaching a character that is not in the set of characters acceptable for that token (this is known as the maximal munch rule, or longest match rule). In some languages, the lexeme creation rules are more complicated and may involve backtracking over previously read characters. For example, in C, a single 'L' character is not enough to distinguish between an identifier that begins with 'L' and a wide-character string literal.