pregex.core.classes
All classes within this module represent so-called RegEx character classes, each of which defines a set or “class” of characters that can be matched.
Class types
A character class can be either one of the following two types:
Regular class: This type of class represents the […] pattern, which can be translated as “match any one character defined within the brackets”. You can tell regular classes by their name, which follows the Any* pattern.
Negated class: This type of class represents the [^…] pattern, which can be translated as “match any one character except for those defined within the brackets”. You can tell negated classes by their name, which follows the AnyBut* pattern.
Here is an example containing a regular class as well as its negated counterpart.
from pregex.core.classes import AnyLetter, AnyButLetter
regular = AnyLetter()
negated = AnyButLetter()
regular.print_pattern() # This prints "[A-Za-z]"
negated.print_pattern() # This prints "[^A-Za-z]"
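To see what the two printed patterns mean in practice, the sketch below feeds them to Python's standard re module rather than to pregex itself. The pattern strings are copied from the comments above; the sample string and variable names are illustrative assumptions.

```python
import re

# Pattern strings copied from the print_pattern() comments above.
regular = re.compile(r"[A-Za-z]")
negated = re.compile(r"[^A-Za-z]")

sample = "ab1-Z"  # illustrative input, not from the pregex docs

# Every character is accepted by exactly one of the two classes.
letters = [ch for ch in sample if regular.fullmatch(ch)]
others = [ch for ch in sample if negated.fullmatch(ch)]

print(letters)  # ['a', 'b', 'Z']
print(others)   # ['1', '-']
```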
Class unions
Classes of the same type can be combined in order to get the union of the sets of characters they represent. This can be done through the use of the bitwise OR operator |, as depicted within the code snippet below:
from pregex.core.classes import AnyDigit, AnyLowercaseLetter
pre = AnyDigit() | AnyLowercaseLetter()
pre.print_pattern() # This prints "[\da-z]"
The same goes for negated classes as well:
from pregex.core.classes import AnyButDigit, AnyButLowercaseLetter
pre = AnyButDigit() | AnyButLowercaseLetter()
pre.print_pattern() # This prints "[^\da-z]"
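Judging by the printed pattern, the union of two negated classes excludes the characters of both operands: [^\da-z] rejects digits as well as lowercase letters. A quick check with the standard re module (not pregex), using the pattern string from the comment above and an illustrative sample:

```python
import re

# Pattern string copied from the comment above.
negated_union = re.compile(r"[^\da-z]")

sample = "a7B_ z"  # illustrative input

# Keep only the characters that the negated union accepts.
accepted = [ch for ch in sample if negated_union.fullmatch(ch)]

print(accepted)  # ['B', '_', ' ']
```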
However, attempting to get the union of a regular class and a negated class causes a CannotBeUnionedException to be thrown:
from pregex.core.classes import AnyDigit, AnyButLowercaseLetter
pre = AnyDigit() | AnyButLowercaseLetter() # This is not OK!
Lastly, it is also possible to union a regular class with a token, that is, any string of length one or any instance of a class that is part of the pregex.core.tokens module:
from pregex.core.classes import AnyDigit
from pregex.core.tokens import Newline
pre = AnyDigit() | "a" | Newline()
pre.print_pattern() # This prints "[\da\n]"
However, in the case of negated classes, one is forced to wrap any tokens within an AnyButFrom class instance in order to achieve the same result:
from pregex.core.classes import AnyButDigit, AnyButFrom
from pregex.core.tokens import Newline
pre = AnyButDigit() | AnyButFrom("a", Newline())
pre.print_pattern() # This prints "[^\da\n]"
Subtracting classes
Subtraction is another operation that is exclusive to classes, and it is made possible via the overloaded subtraction operator -. This feature comes in handy when one wishes to construct a class that would be tiresome to create otherwise. Consider for example the class of all word characters except for all characters in the set {C, c, G, g, 3}. Constructing said class via subtraction is extremely easy:
from pregex.core.classes import AnyWordChar, AnyFrom
pre = AnyWordChar() - AnyFrom('C', 'c', 'G', 'g', '3')
Below is this operation’s resulting pattern, from which it becomes evident that building said pattern through multiple class unions would be more time-consuming and, more importantly, prone to errors.
[A-BD-FH-Za-bd-fh-z0-24-9_]
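The pattern above can be checked against plain set subtraction using only the standard library. This is a sketch under the assumption that AnyWordChar() with default arguments covers the ASCII letters, the digits and the underscore, which is what the printed pattern suggests; the variable names are illustrative.

```python
import re
import string

# Pattern string copied from the line above.
subtracted = re.compile(r"[A-BD-FH-Za-bd-fh-z0-24-9_]")

# Assumed contents of the default word-character class.
word_chars = set(string.ascii_letters + string.digits + "_")
removed = set("CcGg3")

# Collect the word characters that the pattern accepts, one by one.
accepted = {ch for ch in word_chars if subtracted.fullmatch(ch)}

print(accepted == word_chars - removed)                  # True
print(any(subtracted.fullmatch(ch) for ch in removed))   # False
```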
It should be noted that, just like in the case of class unions, one is only allowed to subtract a regular class from a regular class or a negated class from a negated class, as any other attempt will cause a CannotBeSubtractedException to be thrown.
from pregex.core.classes import AnyWordChar, AnyButLowercaseLetter
pre = AnyWordChar() - AnyButLowercaseLetter() # This is not OK!
Furthermore, applying the subtraction operation between a class and a token is also possible, but just like in the case of class unions, this only works with regular classes:
from pregex.core.classes import AnyWhitespace
from pregex.core.tokens import Newline
pre = AnyWhitespace() - Newline()
pre.print_pattern() # This prints "[\t \x0b-\r]"
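The printed pattern can be read as "every whitespace character except the newline". The sketch below checks this with the standard re module, assuming that the whitespace class covers the six ASCII whitespace characters listed in string.whitespace; that assumption is not stated in this section.

```python
import re
import string

# Pattern string copied from the comment above.
no_newline = re.compile(r"[\t \x0b-\r]")

# string.whitespace holds space, \t, \n, \r, \x0b and \x0c.
accepted = sorted(ch for ch in string.whitespace if no_newline.fullmatch(ch))

print(accepted)          # ['\t', '\x0b', '\x0c', '\r', ' ']
print("\n" in accepted)  # False
```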
Negating classes
Finally, it is useful to know that every regular class can be negated through the use of the bitwise NOT operator ~:
from pregex.core.classes import AnyDigit
pre = ~ AnyDigit()
pre.print_pattern() # This prints "[^0-9]"
Negated classes can be negated as well; however, you should probably avoid this, as it does little to make the code easier to read.
from pregex.core.classes import AnyButDigit
pre = ~ AnyButDigit()
pre.print_pattern() # This prints "[0-9]"
Therefore, in order to create a negated class one can either negate a regular Any* class or use its AnyBut* negated class equivalent. The result is entirely the same and which one you’ll use is just a matter of choice.
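One way to picture why negating twice returns the original pattern is a toy model in which negation only flips a flag and leaves the bracket contents alone. ToyClass below is a hypothetical stand-in written for illustration; it is not how pregex is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyClass:
    """Hypothetical stand-in for a character class; not pregex's implementation."""

    body: str              # text between the brackets, e.g. "0-9"
    negated: bool = False  # True for the [^...] form

    def __invert__(self) -> "ToyClass":
        # Negation flips the flag; the bracket contents are untouched.
        return ToyClass(self.body, not self.negated)

    def pattern(self) -> str:
        prefix = "^" if self.negated else ""
        return f"[{prefix}{self.body}]"


digit = ToyClass("0-9")
print((~digit).pattern())   # [^0-9]
print((~~digit).pattern())  # [0-9]
```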
Classes & methods
Below are listed all classes within pregex.core.classes along with any possible methods they may possess.
- class pregex.core.classes.Any[source]
Matches any possible character, including the newline character.
- class pregex.core.classes.AnyBetween(start: str, end: str)[source]
Matches any character within the provided range.
- Parameters
start (str) – The first character of the range.
end (str) – The last character of the range.
- Raises
InvalidArgumentTypeException – At least one of the provided characters is neither a Token class instance nor a single-character string.
InvalidRangeException – A non-valid range is provided.
- Note
Any pair of characters start, end constitutes a valid range as long as the code point of character end is greater than the code point of character start, as defined by the Unicode Standard.
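The validity rule in the note comes down to a code point comparison, which Python exposes through the built-in ord(). The helper below is a hypothetical illustration of that rule and is not part of pregex.

```python
def is_valid_range(start: str, end: str) -> bool:
    """Hypothetical helper mirroring the note above; not part of pregex."""
    # ord() returns the Unicode code point of a single-character string.
    return ord(end) > ord(start)


print(is_valid_range("a", "z"))  # True  (97 < 122)
print(is_valid_range("z", "a"))  # False (122 > 97)
print(is_valid_range("A", "a"))  # True  (65 < 97)
```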
- class pregex.core.classes.AnyButBetween(start: str, end: str)[source]
Matches any character except for those within the provided range.
- Parameters
start (str) – The first character of the range.
end (str) – The last character of the range.
- Raises
InvalidArgumentTypeException – At least one of the provided characters is neither a Token class instance nor a single-character string.
InvalidRangeException – A non-valid range is provided.
- Note
Any pair of characters start, end constitutes a valid range as long as the code point of character end is greater than the code point of character start, as defined by the Unicode Standard.
- class pregex.core.classes.AnyButCJK[source]
Matches any character except for those defined within the CJK Unified Ideographs Unicode block.
- class pregex.core.classes.AnyButCyrillicLetter[source]
Matches any character except for characters in the Cyrillic alphabet.
- class pregex.core.classes.AnyButFrom(*chars: str)[source]
Matches any character except for the provided characters.
- Parameters
*chars (str | Pregex) – One or more characters to exclude from matching. Each character must be either a string of length one or an instance of a class defined within the pregex.core.tokens module.
- Raises
NotEnoughArgumentsException – No arguments are provided.
InvalidArgumentTypeException – At least one of the provided arguments is neither a string of length one nor an instance of a class defined within pregex.core.tokens.
- class pregex.core.classes.AnyButGermanLetter[source]
Matches any character except for characters in the German alphabet.
- class pregex.core.classes.AnyButGreekLetter[source]
Matches any character except for characters in the Greek alphabet.
- class pregex.core.classes.AnyButHebrewLetter[source]
Matches any character except for those defined within the Hebrew Unicode block.
- class pregex.core.classes.AnyButKoreanLetter[source]
Matches any character except for characters in the Korean alphabet.
- class pregex.core.classes.AnyButLetter[source]
Matches any character except for characters in the Latin alphabet.
- class pregex.core.classes.AnyButLowercaseLetter[source]
Matches any character except for lowercase characters in the Latin alphabet.
- class pregex.core.classes.AnyButPunctuation[source]
Matches any character except for punctuation characters as defined within the ASCII table.
- class pregex.core.classes.AnyButUppercaseLetter[source]
Matches any character except for uppercase characters in the Latin alphabet.
- class pregex.core.classes.AnyButWhitespace[source]
Matches any character except for whitespace characters.
- class pregex.core.classes.AnyButWordChar(is_global: bool = False)[source]
Matches any character except for alphanumeric characters and the underscore character “_”.
- Parameters
is_global (bool) – Indicates whether to include foreign alphabetic characters or not. Defaults to False.
- Raises
GlobalWordCharSubtractionException – There is an attempt to subtract a negated character class from an instance of this class for which parameter is_global has been set to True.
- class pregex.core.classes.AnyCJK[source]
Matches any character that is defined within the CJK Unified Ideographs Unicode block.
- class pregex.core.classes.AnyCyrillicLetter[source]
Matches any character from the Cyrillic alphabet.
- class pregex.core.classes.AnyFrom(*chars: str)[source]
Matches any one of the provided characters.
- Parameters
*chars (str | Pregex) – One or more characters to match from. Each character must be either a string of length one or an instance of a class defined within the pregex.core.tokens module.
- Raises
NotEnoughArgumentsException – No arguments are provided.
InvalidArgumentTypeException – At least one of the provided arguments is neither a string of length one nor an instance of a class defined within pregex.core.tokens.
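As a rough illustration of what a class built from explicit characters involves, the sketch below assembles a bracket expression with the standard re module. The function char_class and its negated parameter are hypothetical, and it raises built-in exceptions where pregex raises its own; it handles plain single-character strings only, not token instances.

```python
import re


def char_class(*chars: str, negated: bool = False) -> str:
    """Hypothetical helper: build a bracket expression from single characters."""
    if not chars:
        raise ValueError("at least one character is required")
    for ch in chars:
        if not isinstance(ch, str) or len(ch) != 1:
            raise TypeError("every argument must be a string of length one")
    # re.escape keeps characters such as "-" and "]" literal inside the brackets.
    body = "".join(re.escape(ch) for ch in chars)
    prefix = "^" if negated else ""
    return f"[{prefix}{body}]"


any_from = re.compile(char_class("a", "-", "]"))
any_but_from = re.compile(char_class("a", "-", "]", negated=True))

print(bool(any_from.fullmatch("-")))      # True
print(bool(any_from.fullmatch("b")))      # False
print(bool(any_but_from.fullmatch("b")))  # True
```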
- class pregex.core.classes.AnyHebrewLetter[source]
Matches any character that is defined within the Hebrew Unicode block.
- class pregex.core.classes.AnyLowercaseLetter[source]
Matches any lowercase character from the Latin alphabet.
- class pregex.core.classes.AnyPunctuation[source]
Matches any punctuation character as defined within the ASCII table.
- class pregex.core.classes.AnyUppercaseLetter[source]
Matches any uppercase character from the Latin alphabet.
- class pregex.core.classes.AnyWordChar(is_global: bool = False)[source]
Matches any alphanumeric character as well as the underscore character “_”.
- Parameters
is_global (bool) – Indicates whether to include foreign alphabetic characters or not. Defaults to False.
- Raises
GlobalWordCharSubtractionException – There is an attempt to subtract a regular character class from an instance of this class for which parameter is_global has been set to True.
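The effect of is_global can be approximated with the standard re module. The sketch assumes that is_global=False behaves like [A-Za-z0-9_] and that is_global=True behaves like Unicode-aware \w; neither pattern is stated in this section, so treat both as assumptions.

```python
import re

ascii_word = re.compile(r"[A-Za-z0-9_]")  # assumed equivalent of is_global=False
unicode_word = re.compile(r"\w")          # assumed equivalent of is_global=True

# For each sample character, record whether each pattern accepts it.
results = {
    ch: (bool(ascii_word.fullmatch(ch)), bool(unicode_word.fullmatch(ch)))
    for ch in ("a", "_", "é", "д", "-")
}

for ch, (is_ascii_word, is_unicode_word) in results.items():
    print(repr(ch), is_ascii_word, is_unicode_word)
```

Under these assumptions, only the non-ASCII letters "é" and "д" are treated differently by the two patterns.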