lex2

lex2 is a library for lexical analysis (also called tokenization). String analysis is performed using regular expressions (regex) specified in user-defined rules. Mechanisms such as a dynamic ruleset-stack provide a degree of flexibility at runtime.
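
A minimal end-to-end sketch of this workflow is shown below. It assumes the two-step make_lexer() call implied by the separate template-parameter and parameter lists documented further down, that string input is fed in through a load() method from the ITextIO interface, and that the EOF exception lives in lex2.excs:

    import lex2

    # User-defined rules: a token identifier paired with a regex pattern.
    ruleset = [
        lex2.Rule("WORD",   r"[a-zA-Z]+"),
        lex2.Rule("NUMBER", r"[0-9]+"),
    ]

    # make_lexer() takes the template parameters and returns a factory
    # that accepts the ruleset (and optionally a LexerOptions struct).
    lexer = lex2.make_lexer()(ruleset)
    lexer.load("Mary had 2 little lambs")  # load(): assumed ITextIO method

    while True:
        try:
            token = lexer.get_next_token()
        except lex2.excs.EOF:  # assumed location of the EOF exception
            break
        print(token.id, repr(token.data))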

Modules

lex2.excs

Components of exceptions.

lex2.lexer

Components of lexer implementations.

lex2.matcher

Components of matcher implementations.

lex2.predefs

Predefined rule objects and rule template classes.

lex2.textio

Components of textstreams.

lex2.util

Common library components and utilities.

Classes

DEFAULT_LEXER

alias of GenericLexer

DEFAULT_MATCHER

alias of ReMatcher

ILexer

Common interface to a lexer object instance.

IMatcher

Common interface to a rule matcher object instance.

LexerOptions

Struct to define processing options of a lexer.

Rule

Class representing a rule, used as a filter during lexical analysis.

RuleGroup

Abstract base class for creator classes that dynamically build up a group of rules.

Token

Represents a token that is output during lexical analysis.

Functions

make_lexer([MATCHER_T, LEXER_T])

Factory function for creating a lexer instance.

class lex2.Rule

Class representing a rule, used as a filter during lexical analysis.

__init__(id, regex, returns=True)

Rule object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • regex (str) – The regular expression used by a matcher to perform regex matching.

  • returns (bool, optional) –

    Specify whether tokens matched by this rule should be returned when scanning for tokens.

    By default True

Raises

ValueError – If the given regular expression is empty.

id: str

<readonly> Rule identifier string.

returns: bool

Whether tokens matched by this rule should be returned when scanning for tokens.

regex: str

<readonly> The regular expression used by a matcher to perform regex matching.

get_matcher()

Gets the IMatcher-compatible object instance.

The rule matcher object is used by a lexer object to identify tokens during lexical analysis.

Return type

PtrType[IMatcher]

set_matcher(matcher)

Sets the rule matcher object reference.

Parameters

matcher (IMatcher) –
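
As an illustration of the members above, a short sketch constructing rules, including a non-returning rule and the documented ValueError on an empty pattern:

    import lex2

    # A regular rule: matched text is returned as "WORD" tokens.
    word = lex2.Rule("WORD", r"[a-zA-Z]+")

    # A non-returning rule: comments are matched and consumed, but no
    # token is handed back when scanning (returns=False).
    comment = lex2.Rule("COMMENT", r"#[^\n]*", returns=False)

    print(word.id, word.regex)  # read-only attributes

    # An empty regular expression is rejected at construction time.
    try:
        lex2.Rule("EMPTY", "")
    except ValueError as exc:
        print("rejected:", exc)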

class lex2.RuleGroup

Bases: ABC

Abstract base class for creator classes that dynamically build up a group of rules.

To condense the rule group into a single Rule object, use the inherited .rule() method.

abstract __init__(id, returns=True, regex_prefix='', regex_suffix='')

RuleGroup object instance initializer.

Parameters
  • id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).

  • returns (bool, optional) –

Specify whether tokens matched by this rule group should be returned by default when scanning for tokens.

    By default True

  • regex_prefix (str, optional) –

Regular expression prefixed to every added regex pattern group.

    By default ""

  • regex_suffix (str, optional) –

Regular expression suffixed to every added regex pattern group.

    By default ""

rule(id=None, returns=None)

Compiles the rule group into a single Rule object.

Parameters
  • id (str, optional) –

Overrides the predefined identifying name of a resulting token.

    By default the id set by the parent class.

  • returns (bool, optional) –

Overrides whether tokens matched by this rule group should be returned by default when scanning for tokens.

    By default the returns set by the parent class.

Return type

Rule

Raises

ValueError – If the given regular expression is empty.
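
This reference does not name the method a subclass uses to append pattern groups, so the sketch below leaves that step as a placeholder and exercises only the documented __init__() and rule() members; with no patterns added, the compiled regex is empty and rule() raises the documented ValueError:

    import lex2

    class NumberGroup(lex2.RuleGroup):
        """Creator class that dynamically builds up a "NUMBER" rule."""

        def __init__(self):
            # Documented initializer: id, returns, regex_prefix, regex_suffix.
            super().__init__("NUMBER")
            # NOTE: hypothetical step -- pattern groups would be appended
            # here; this reference does not document the method name.

    try:
        # Condense the group into a single Rule; per-call overrides of
        # id and returns are documented: .rule(id=..., returns=...).
        rule = NumberGroup().rule()
    except ValueError:
        # No pattern groups were added, so the compiled regex is empty.
        print("empty rule group")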

class lex2.Token

Represents a token that is output during lexical analysis.

__init__(id='', data='', pos=TextPosition(), groups=())

Token object instance initializer.

Parameters
  • id (str, optional) –

    The identifying string of the resulting token’s type (e.g. “NUMBER”, “WORD”).

    By default ""

  • data (str, optional) –

    String data of the identified token.

    By default ""

  • pos (TextPosition, optional) –

    Position in the textstream where the token occurs.

    By default TextPosition()

  • groups (Iterable[str], optional) –

Result of the regex match, split by capture groups.

    By default ()

id: str

The identifier of a token’s type (e.g. “NUMBER”, “WORD”).

data: str

Result of regex match.

pos: TextPosition

Position in the textstream where a token occurs.

groups: Sequence[str]

Result of the regex match, split by capture groups.

is_rule(expected_rule)

Evaluates whether the token’s identifier matches that of a given rule.

Parameters

expected_rule (Rule) – Rule object instance.

Return type

bool

is_rule_oneof(expected_rules)

Evaluates whether the token’s identifier matches that of one of the given rules.

Parameters

expected_rules (List[Rule]) – List of Rule object instances.

Return type

bool

validate_rule(expected_rule)

Validates that the token’s identifier matches that of a given rule.

Parameters

expected_rule (Rule) – Rule object instance.

Raises

UnknownTokenError – When the token’s identifier does not match that of a given rule.

validate_rule_oneof(expected_rules)

Validates that the token’s identifier matches that of one of the given rules.

Parameters

expected_rules (List[Rule]) – List of Rule object instances.

Raises

UnknownTokenError – When the token’s identifier does not match that of any of the given rules.
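
A sketch of both the boolean checks and the throwing validators; UnknownTokenError is assumed to be exposed by lex2.excs:

    import lex2

    word   = lex2.Rule("WORD",   r"[a-zA-Z]+")
    number = lex2.Rule("NUMBER", r"[0-9]+")

    token = lex2.Token(id="WORD", data="lamb")

    # Boolean checks:
    print(token.is_rule(word))            # True
    print(token.is_rule_oneof([number]))  # False

    # Throwing validators:
    token.validate_rule(word)             # passes silently
    try:
        token.validate_rule_oneof([number])
    except lex2.excs.UnknownTokenError:   # assumed location of the exception
        print("token is not a NUMBER")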

class lex2.LexerOptions

Struct to define processing options of a lexer.

class SeparatorOptions

Struct that defines processing options of a separator token.

__init__()

ignored: bool

Flag to specify whether processing of tokens of this separator should be ignored. Defaults to False.

returns: bool

Flag to specify whether tokens of this separator should be returned. Defaults to False.

__init__()

space: SeparatorOptions

Options to specify how a SPACE separator should be handled.

tab: SeparatorOptions

Options to specify how a TAB separator should be handled.

newline: SeparatorOptions

Options to specify how a NEWLINE separator should be handled.

id_returns: Dict[str, bool]

Map of <str, bool> key-value pairs specifying whether tokens matched by a rule (keyed by Rule.id) should be returned to the user.
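
A short sketch configuring the struct's documented fields:

    import lex2

    options = lex2.LexerOptions()

    # Separator handling: emit NEWLINE tokens, drop spaces and tabs.
    options.newline.returns = True
    options.space.returns = False
    options.tab.returns = False

    # Per-rule override keyed by Rule.id: never return COMMENT tokens.
    options.id_returns["COMMENT"] = False

    # The struct is then passed when constructing the lexer, e.g.:
    # lexer = lex2.make_lexer()(ruleset, options)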

class lex2.ILexer

Bases: ITextIO, ABC

Common interface to a lexer object instance.

abstract push_ruleset(ruleset)

Pushes a ruleset to the ruleset-stack.

Parameters

ruleset (RulesetType) –

abstract pop_ruleset()

Pops a ruleset from the ruleset-stack.

abstract clear_rulesets()

Clears all rulesets from the ruleset-stack.

abstract get_next_token()

Finds the next token in the textstream using the currently active ruleset.

Return type

Token

Raises
  • UnknownTokenError – If an unknown token type has been encountered.

  • EOF – If the lexer has reached the end of input data from a textstream.

abstract get_options()

Gets the options that define the processing behavior of the lexer.

Return type

LexerOptions

abstract set_options(options)

Sets the options that define the processing behavior of the lexer.

Parameters

options (LexerOptions) –
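
The ruleset-stack allows switching token definitions mid-stream, for example on entering and leaving a comment context; a sketch using only the documented stack operations:

    import lex2

    main_rules = [
        lex2.Rule("WORD",          r"[a-zA-Z]+"),
        lex2.Rule("COMMENT_OPEN",  r"/\*"),
    ]
    comment_rules = [
        lex2.Rule("COMMENT_CLOSE", r"\*/"),
        lex2.Rule("COMMENT_TEXT",  r"[^*]+", returns=False),
    ]

    lexer = lex2.make_lexer()(main_rules)

    # On a COMMENT_OPEN token, make the comment ruleset the active one:
    lexer.push_ruleset(comment_rules)
    # ... get_next_token() now matches against comment_rules ...

    # On a COMMENT_CLOSE token, restore the previous ruleset:
    lexer.pop_ruleset()

    # Or drop all rulesets, e.g. before reusing the lexer instance:
    lexer.clear_rulesets()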

class lex2.IMatcher

Bases: ABC

Common interface to a rule matcher object instance.

abstract get_uid()

Gets the unique identifier (UID) of the matcher implementation.

Return type

str

abstract compile_pattern(regex)

Compiles a regex pattern into an implementation-specific regex matcher object.

Parameters

regex (str) – Regular expression to compile.

abstract match(ts, token)

Looks for a pattern match and sets it in the provided token object.

Parameters
  • ts (ITextstream) – Textstream object managed by the lexer object.

  • token (Token) – Token object in which the match data is set.

Returns

True in case of a match.

Return type

bool

lex2.DEFAULT_LEXER

alias of GenericLexer

lex2.DEFAULT_MATCHER

alias of ReMatcher

lex2.make_lexer(MATCHER_T=ReMatcher, LEXER_T=GenericLexer)

Factory function for creating a lexer instance.

If no values are provided for the template parameters, the implementations used for the matcher and lexer will default to the library constants DEFAULT_MATCHER and DEFAULT_LEXER respectively.

Template Parameters
  • MATCHER_T (Type[BaseMatcher], optional) –

    Template class type that implements the BaseMatcher base class.

    By default DEFAULT_MATCHER

  • LEXER_T (Type[BaseLexer], optional) –

    Template class type that implements the BaseLexer base class.

    By default DEFAULT_LEXER

Parameters
  • ruleset (RulesetType, optional) –

    Initial ruleset.

    By default []

  • options (LexerOptions, optional) –

    Struct specifying processing options of the lexer.

    By default LexerOptions()

Return type

ILexer
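
An explicit-parameters sketch; the public import paths for ReMatcher and GenericLexer are assumptions (the signature above shows their private module locations):

    import lex2
    from lex2.lexer import GenericLexer  # assumed public re-export
    from lex2.matcher import ReMatcher   # assumed public re-export

    ruleset = [lex2.Rule("WORD", r"[a-zA-Z]+")]
    options = lex2.LexerOptions()

    # Explicit template parameters; equivalent to lex2.make_lexer()(...)
    # since ReMatcher and GenericLexer are the DEFAULT_MATCHER and
    # DEFAULT_LEXER aliases documented above.
    lexer = lex2.make_lexer(MATCHER_T=ReMatcher, LEXER_T=GenericLexer)(ruleset, options)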