lex2
lex2 is a library for lexical analysis (also called tokenization). Input strings are analysed using regular expressions (regex) specified in user-defined rules. Mechanisms such as a dynamic ruleset-stack provide a degree of flexibility at runtime.
Modules
- Components of exceptions.
- Components of lexer implementations.
- Components of matcher implementations.
- Predefined rule objects and rule-template classes.
- Components of textstreams.
- Common library components and utilities.
Classes
- DEFAULT_LEXER – alias of GenericLexer
- DEFAULT_MATCHER – alias of ReMatcher
- ILexer – Common interface to a lexer object instance.
- IMatcher – Common interface to a rule matcher object instance.
- LexerOptions – Struct to define processing options of a lexer.
- Rule – Class representing a rule, used as a filter during lexical analysis.
- RuleGroup – Abstract base class for making a creator class to dynamically build up a group of rules.
- Token – Represents a token that is output during lexical analysis.
Functions
- make_lexer() – Factory function for creating a lexer instance.
- class lex2.Rule
Class representing a rule, used as a filter during lexical analysis.
- __init__(id, regex, returns=True)
Rule object instance initializer.
- Parameters
id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).
regex (str) – The regular expression used by a matcher to perform regex matching.
returns (bool, optional) –
Specify whether tokens matched by this rule should be returned when scanning for tokens.
By default
True
- Raises
ValueError – If the given regular expression is empty.
- id: str
<readonly>
Rule identifier string.
- returns: bool
Whether tokens matched by this rule should be returned when scanning for tokens.
- regex: str
<readonly>
The regular expression used by a matcher to perform regex matching.
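As a brief illustration of the constructor documented above, the following sketch defines a few rules; the identifiers and patterns are illustrative only.

import lex2

# Rules pair a token identifier with the regex used to match it.
rule_number = lex2.Rule("NUMBER", r"[0-9]+")
rule_word   = lex2.Rule("WORD",   r"[a-zA-Z]+")

# With returns=False, matches are still consumed during scanning, but the
# resulting tokens are not handed back to the caller.
rule_comment = lex2.Rule("COMMENT", r"#[^\n]*", returns=False)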
- class lex2.RuleGroup
Bases:
ABC
Abstract base class for making a creator class to dynamically build up a group of rules.
To condense the rule group into a single Rule object, use the inherited rule() method.
- abstract __init__(id, returns=True, regex_prefix='', regex_suffix='')
RuleGroup object instance initializer.
- Parameters
id (str) – The identifying name of a resulting token (e.g. “NUMBER”, “WORD”).
returns (bool, optional) –
Specify whether, by default, tokens matched by this rule group should be returned when scanning for tokens.
By default
True
regex_prefix (str, optional) –
Regular expression that is prefixed for every added regex pattern group.
By default
""
regex_suffix (str, optional) –
Regular expression that is suffixed for every added regex pattern group.
By default
""
- rule(id=None, returns=None)
Compiles the rule group to a rule object.
- Parameters
id (str, optional) –
Overwrites the predefined identifying name of a resulting token.
By default the id set by the parent class.
returns (bool, optional) –
Overwrites the default setting of whether tokens matched by this rule group should be returned when scanning for tokens.
By default the returns set by the parent class.
- Return type
Rule
- Raises
ValueError – If the given regular expression is empty.
- class lex2.Token
Represents a token that is output during lexical analysis.
- __init__(id='', data='', pos=TextPosition(), groups=())
Token object instance initializer.
- Parameters
id (str, optional) –
The identifying string of the resulting token’s type (e.g. “NUMBER”, “WORD”).
By default
""
data (str, optional) –
String data of the identified token.
By default
""
pos (TextPosition, optional) –
Position in the textstream where the token occurs.
By default
TextPosition()
groups (Iterable[str], optional) –
Result of regex match, split by encapsulated groups.
By default
()
- id: str
The identifier of a token’s type (e.g. “NUMBER”, “WORD”).
- data: str
Result of regex match.
- pos: TextPosition
Position in the textstream where a token occurs.
- groups: Sequence[str]
Result of regex match, split by encapsulated groups.
- is_rule(expected_rule)
Evaluates if the token’s identifier matches that of a given rule.
- Parameters
expected_rule (Rule) – Rule object instance.
- Return type
bool
- is_rule_oneof(expected_rules)
Evaluates whether the token’s identifier matches that of one of the rules in a given list.
- Parameters
expected_rules (List[Rule]) – List of Rule object instances.
- Return type
bool
- validate_rule(expected_rule)
Validates that the token’s identifier matches that of a given rule.
- Parameters
expected_rule (Rule) – Rule object instance.
- Raises
UnknownTokenError – When the token’s identifier does not match that of a given rule.
- validate_rule_oneof(expected_rules)
Validates that the token’s identifier matches that of one of the rules in a given list.
- Parameters
expected_rules (List[Rule]) – List of Rule object instances.
- Raises
UnknownTokenError – When the token’s identifier does not match that of any of the given rules.
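The checking methods above can be combined when consuming tokens. The sketch below is a hypothetical helper, assuming an existing ILexer instance and a NUMBER rule like the one defined in the Rule example earlier.

import lex2

def read_number(lexer: lex2.ILexer, rule_number: lex2.Rule) -> int:
    """Hypothetical helper: consume the next token and require it to be a NUMBER."""
    token = lexer.get_next_token()

    # Soft check: is_rule() returns a bool, useful for branching on token type.
    if not token.is_rule(rule_number):
        print(f"unexpected token at {token.pos}: {token.data!r}")

    # Hard check: validate_rule() raises UnknownTokenError on a mismatch.
    token.validate_rule(rule_number)
    return int(token.data)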
- class lex2.LexerOptions
Struct to define processing options of a lexer.
- class SeparatorOptions
Struct that defines processing options of a separator token.
- __init__()
- ignored: bool
Flag to specify whether processing of tokens of this separator should be ignored. Defaults to
False
- returns: bool
Flag to specify whether tokens of this separator should be returned. Defaults to
False
- __init__()
- space: SeparatorOptions
Options to specify how a SPACE separator should be handled.
- tab: SeparatorOptions
Options to specify how a TAB separator should be handled.
- newline: SeparatorOptions
Options to specify how a NEWLINE separator should be handled.
- id_returns: Dict[str, bool]
Map of <str, bool> key-value pairs specifying whether tokens matched by a rule (keyed by Rule.id) should be returned to the user.
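A minimal sketch of configuring these options, using only the attributes documented above; the COMMENT rule identifier is illustrative.

import lex2

options = lex2.LexerOptions()

# Separator handling via the SeparatorOptions structs.
options.space.returns   = False   # do not hand SPACE separator tokens to the caller
options.newline.returns = True    # do hand NEWLINE separator tokens to the caller
options.tab.ignored     = True    # skip processing of TAB separators entirely

# Per-rule return overrides, keyed by Rule.id.
options.id_returns["COMMENT"] = False

# Apply the struct through ILexer.set_options(), or pass it when constructing
# a lexer via make_lexer() (see below).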
- class lex2.ILexer
Bases:
ITextIO, ABC
Common interface to a lexer object instance.
- abstract push_ruleset(ruleset)
Pushes a ruleset to the ruleset-stack.
- Parameters
ruleset (RulesetType) –
- abstract pop_ruleset()
Pops a ruleset from the ruleset-stack.
- abstract clear_rulesets()
Clears all rulesets from the ruleset-stack.
- abstract get_next_token()
Finds the next token in the textstream using the currently active ruleset.
- Return type
Token
- Raises
UnknownTokenError – If an unknown token type has been encountered.
EOF – If the lexer has reached the end of input data from a textstream.
- abstract get_options()
Gets the lexer options to define processing options of the lexer.
- Return type
LexerOptions
- abstract set_options(options)
Sets the lexer options to define processing options of the lexer.
- Parameters
options (LexerOptions) –
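The ruleset-stack lets a lexer switch rules for nested contexts such as string literals. The sketch below assumes an existing ILexer instance; the rules and regexes are illustrative only.

import lex2

STRING_RULESET = [
    lex2.Rule("ESCAPE_SEQUENCE", r"\\."),
    lex2.Rule("STRING_END",      r'"'),
]

def tokenize_string_literal(lexer: lex2.ILexer) -> None:
    lexer.push_ruleset(STRING_RULESET)   # string rules become the active ruleset
    try:
        ...  # consume tokens via lexer.get_next_token() until STRING_END is matched
    finally:
        lexer.pop_ruleset()              # restore the previous (outer) ruleset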
- class lex2.IMatcher
Bases:
ABC
Common interface to a rule matcher object instance.
- abstract get_uid()
Gets the unique identifier (UID) of the matcher implementation.
- Return type
str
- abstract compile_pattern(regex)
Compiles a regex pattern to an implementation-specific regex matcher object.
- Parameters
regex (str) – Regular expression to compile.
- abstract match(ts, token)
Looks for a pattern match and sets it in the provided token object.
- Parameters
ts (ITextstream) – Textstream object managed by the lexer object.
token (Token) – Token object in which the match data is set.
- Returns
True in case of a match.
- Return type
bool
- lex2.DEFAULT_LEXER
alias of
GenericLexer
- lex2.make_lexer(MATCHER_T=ReMatcher, LEXER_T=GenericLexer)
Factory function for creating a lexer instance.
If no values are provided for the template parameters, the implementations used for the matcher and lexer will default to the library constants DEFAULT_MATCHER and DEFAULT_LEXER, respectively.
Template Parameters
- MATCHER_T (Type[BaseMatcher], optional) –
Template class type that implements the BaseMatcher base class.
By default
DEFAULT_MATCHER
- LEXER_T (Type[BaseLexer], optional) –
Template class type that implements the BaseLexer base class.
By default
DEFAULT_LEXER
- Parameters
ruleset (RulesetType, optional) –
Initial ruleset.
By default
[]
options (LexerOptions, optional) –
Struct specifying processing options of the lexer.
By default
LexerOptions()
- Return type
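To tie the pieces together, a hedged end-to-end sketch follows. The calling convention (make_lexer() yielding a constructor that is then called with the ruleset and options) is inferred from the split between template parameters and parameters above; the load() method for feeding string data and the lex2.excs.EOF exception path are assumptions here, so consult lex2's textio and exception modules for the exact names.

import lex2

ruleset = [
    lex2.Rule("NUMBER", r"[0-9]+"),
    lex2.Rule("WORD",   r"[a-zA-Z]+"),
]

options = lex2.LexerOptions()
options.newline.returns = True   # also emit NEWLINE separator tokens

# make_lexer() resolves the matcher and lexer implementations (DEFAULT_MATCHER
# and DEFAULT_LEXER unless overridden) and yields a constructor, which is then
# called with the initial ruleset and the options struct.
lexer: lex2.ILexer = lex2.make_lexer()(ruleset, options)

# Feed input through the inherited ITextIO interface; the load() method name
# is an assumption; see lex2.textio for the exact textstream API.
lexer.load("word1 1234 word2\n")

while True:
    try:
        token = lexer.get_next_token()
    except lex2.excs.EOF:   # raised at end of input; import path assumed
        break
    print(token.id, repr(token.data))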