| Title: | Construct Parsers for Structured Text Files |
|---|---|
| Description: | Construct parser combinator functions, higher order functions that parse input. Construction of such parsers is transparent and easy. Their main application is the parsing of structured text files like those generated by laboratory instruments. Based on a paper by Hutton (1992) <doi:10.1017/S0956796800000411>. |
| Authors: | Douwe Molenaar [aut, cre, cph] (ORCID: <https://orcid.org/0000-0001-7108-4545>) |
| Maintainer: | Douwe Molenaar <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.6.1.9000 |
| Built: | 2026-05-18 09:14:39 UTC |
| Source: | https://github.com/systemsbioinformatics/parcr |
The %or% combinator (p1 %or% p2) returns the result of p1 if p1 is
successful or, if p1 fails that of p2 if p2 parses successfully,
otherwise it returns a fail.
p1 %or% p2p1 %or% p2
p1, p2
|
two parsers. |
A parser.
(p1 %or% p2)(x):
if p1(x)==[] then
if p2(x)==[] then fail()(x) else p2(x)
else p1(x)
where [] is the empty list.
(literal("A") %or% literal("a"))(LETTERS[1:5]) # success on first parser (literal("A") %or% literal("a"))(letters[1:5]) # success on second parser (literal("A") %or% literal("a"))(LETTERS[2:6]) # failure starts_with_a <- function(x) grepl("^a", x[1]) # success on both parsers, but returns result of p1 only (literal("a") %or% satisfy(starts_with_a))(letters[1:5])(literal("A") %or% literal("a"))(LETTERS[1:5]) # success on first parser (literal("A") %or% literal("a"))(letters[1:5]) # success on second parser (literal("A") %or% literal("a"))(LETTERS[2:6]) # failure starts_with_a <- function(x) grepl("^a", x[1]) # success on both parsers, but returns result of p1 only (literal("a") %or% satisfy(starts_with_a))(letters[1:5])
Sometimes we are not interested in the result from a parser, only that the
parser succeeds. It may be convenient to return some short representation
or nothing even rather than the string itself. The %ret% combinator is
useful in such cases. The parser (p %ret% c) has the same behavior as p,
except that it returns the value c if successful.
p %ret% cp %ret% c
p |
a parser. |
c |
string, i.e. a single-element character vector. |
A parser.
(p %xret% c)(x): if p(x)==[] then fail()(x) else succeed(c)(x[-1])
(literal("A") %ret% "We have an A!")(LETTERS[1:5]) (literal("A") %ret% NULL)(LETTERS[1:5])(literal("A") %ret% "We have an A!")(LETTERS[1:5]) (literal("A") %ret% NULL)(LETTERS[1:5])
(p1 %then% p2) recognizes anything that p1 and p2 would if applied in
succession.
p1 %then% p2p1 %then% p2
p1, p2
|
two parsers. |
A parser.
(p1 %then% p2)(x):
if p1(x)==[] or x==null then fail()(x)
else
if p2(x[-1])==[] then fail()(x)
else succeed([p1(x)$L, p2(x[-1])$L])(x[-2])
where null is the empty vector, x[-1] and x[-2] are the vector x
without the first element and without the first two elements, respectively.
The discarding versions %xthen% and %thenx%
starts_with_a <- function(x) grepl("^a", x[1]) starts_with_b <- function(x) grepl("^b", x[1]) (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("ab", "bc", "de")) # success (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("bb", "bc", "de")) # failure (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("ab", "ac", "de")) # failurestarts_with_a <- function(x) grepl("^a", x[1]) starts_with_b <- function(x) grepl("^b", x[1]) (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("ab", "bc", "de")) # success (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("bb", "bc", "de")) # failure (satisfy(starts_with_a) %then% satisfy(starts_with_b))(c("ab", "ac", "de")) # failure
The %using% combinator allows us to manipulate results from a parser. The
parser (p %using% f) has the same behavior as the parser p, except that
the function f is applied to its result value.
p %using% fp %using% f
p |
a parser. |
f |
a function to be applied to the result of a successful |
A parser.
(p %using% f)(x): if p1(x)==[] then fail()(x) else succeed(f(p1(x)$L))(x[-1])
(literal("ab") %using% toupper)(c("ab", "cdef")) # success (literal("ab") %using% toupper)(c("bb", "cdef")) # failure(literal("ab") %using% toupper)(c("ab", "cdef")) # success (literal("ab") %using% toupper)(c("bb", "cdef")) # failure
%then% sequenceTwo parsers composed in sequence produce a pair of results. Sometimes we are
only interested in one component of the pair. For example in the case of
reserved words such as 'begin' and 'end'. In such cases, two special
versions of the %then% combinator are useful, which keep either the
first or second result, as reflected by the position of the letter 'x' in
their names.
p1 %xthen% p2 p1 %thenx% p2p1 %xthen% p2 p1 %thenx% p2
p1, p2
|
two parsers. |
A parser.
(p1 %xthen% p2)(x):
if p1(x)==[] or x==null then fail()(x)
else
if p2(x[-1])==[] then fail()(x)
else succeed(p1(x)$L)(x[-2])
(p1 %thenx% p2)(x):
if p1(x)==[] or x==null then fail()(x)
else
if p2(x[-1])==[] then fail()(x)
else succeed(p2(x[-1])$L)(x[-2])
where null is the empty vector, x[-1] and x[-2] are the vector x
without the first element and without the first two elements, respectively.
is_number <- function(x) grepl("\\d+", x[1]) # Numbers are preceded by ">" symbols, but we only want the number (literal(">") %thenx% satisfy(is_number))(c(">", "12")) # Temperatures are followed by the unit 'C', but we only want the number (satisfy(is_number) %xthen% literal("C"))(c("21", "C"))is_number <- function(x) grepl("\\d+", x[1]) # Numbers are preceded by ">" symbols, but we only want the number (literal(">") %thenx% satisfy(is_number))(c(">", "12")) # Temperatures are followed by the unit 'C', but we only want the number (satisfy(is_number) %xthen% literal("C"))(c("21", "C"))
Splits a string by using a split pattern and then applies the parser p
to the resulting character vector. If finish = TRUE then the parser should
completely consume its input, otherwise the parser fails. If
finish = FALSE then any remaining part of the string is discarded.
by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)
p |
A parser. |
split |
a string (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. |
finish |
logical. Should the parser completely consume the string?
Defaults to |
fixed |
logical. If |
perl |
logical. Should Perl-compatible regexps be used? |
The function base::strsplit() is used to perform the splitting. The
parameters split, fixed and perl are passed on to that function.
A parser.
by_split((literal("a") %then% literal("b")),"\\t") ("a\tb") # success by_split((literal("a") %then% literal("b")),"\\t") ("a\tb\tc") # failure by_split((literal("a") %then% literal("b")),"\\t", finish=FALSE) ("a\tb\tc") # successby_split((literal("a") %then% literal("b")),"\\t") ("a\tb") # success by_split((literal("a") %then% literal("b")),"\\t") ("a\tb\tc") # failure by_split((literal("a") %then% literal("b")),"\\t", finish=FALSE) ("a\tb\tc") # success
Splits a string to individual symbols and then applies the parser p to the
resulting character vector, otherwise the parser fails. If finish = TRUE
then the parser should completely consume its input. If finish = FALSE
then any remaining part of the string is discarded.
This function is identical to by_split(p, "", finish).
by_symbol(p, finish = TRUE)by_symbol(p, finish = TRUE)
p |
A parser. |
finish |
logical. Should the parser completely consume the string?
Defaults to |
A parser.
by_symbol(exactly(3,literal("a"))) (c("aaa", "bb")) # success by_symbol(exactly(3,literal("a"))) (c("aaaa", "bb")) # failureby_symbol(exactly(3,literal("a"))) (c("aaa", "bb")) # success by_symbol(exactly(3,literal("a"))) (c("aaaa", "bb")) # failure
Removes all variables stored using store(). Use this to clean up
between independent parser runs or in test teardown.
clear_store()clear_store()
An empty line is a line that consists entirely of space-like characters.
EmptyLine is a parser that recognizes one empty line, Spacer recognizes
one or more empty lines and MaybeEmpty recognizes zero or more empty
lines. EmptyLine returns the empty line but Spacer and MaybeEmpty
discard these.
EmptyLine() Spacer() MaybeEmpty()EmptyLine() Spacer() MaybeEmpty()
A parser.
space_like_eraser(x):
d = x in which all "\\s+" are replaced by ""
if d=="" TRUE else FALSE
Emptyline: satisfy(space_like_eraser)
Spacer: one_or_more(EmptyLine()) %ret% null
MaybeEmpty: zero_or_more(EmptyLine()) %ret% null
where null is the empty vector.
EmptyLine()(" \t ") # success EmptyLine()(" .") # failure EmptyLine()("") # success Spacer()(c(" \t ", " ", "abc")) Spacer()(c(" ", " ", "Important text")) Spacer()(c("Important text")) # failure, missing empty line MaybeEmpty()(c(" ", " ", "Important text")) # success, just as Spacer() MaybeEmpty()(c("Important text")) # success, in contrast to Spacer()EmptyLine()(" \t ") # success EmptyLine()(" .") # failure EmptyLine()("") # success Spacer()(c(" \t ", " ", "abc")) Spacer()(c(" ", " ", "Important text")) Spacer()(c("Important text")) # failure, missing empty line MaybeEmpty()(c(" ", " ", "Important text")) # success, just as Spacer() MaybeEmpty()(c("Important text")) # success, in contrast to Spacer()
Tests whether the end of the input character vector has been reached,
which boils down to detection of character(0) in the R-element (see
succeed()). Since the intended application of this parser is parsing of
text files the function has been called after the end of file (EOF) signal.
To indicate that an end of file has been detected, the R-element side of
the parser output will be converted to an empty list.
eof()eof()
A parser.
eof()(x): if x==null then succeed(x)(list()) else fail()(x)
(literal("a") %then% eof())("a") # success # Notice the difference on the R-side with literal("a")("a") eof()(character(0)) # success eof()("a") # failure(literal("a") %then% eof())("a") # success # Notice the difference on the R-side with literal("a")("a") eof()(character(0)) # success eof()("a") # failure
Use this function to test whether your parser failed, for example in unit testing of your parsers when writing a package.
failed(o)failed(o)
o |
Output from a parser. |
A logical value.
d <- (literal("A") %then% literal("B"))(c("A", "A")) d failed(d)d <- (literal("A") %then% literal("B"))(c("A", "A")) d failed(d)
An example fasta-formatted file with a mixture of nucleotide and protein sequences. It is used in the vignette to demonstrate parsing with the tools from the package. It is not clear to me whether mixing of sequence types is allowed in a fasta file, but we demonstrate in the vignette that is is easy to parse them from a single file. The sequences used are truncated for the sake of the example.
fastafilefastafile
fastafileA character vector
Modified from https://bioinformatics.org/annhyb/examples/seq_fasta.html and https://en.wikipedia.org/wiki/FASTA_format
Checks if the parser result indicates that the entire input was consumed.
A parser has completely consumed its input when the input has satisfied
eof().
finished(o)finished(o)
o |
Output from a parser. |
A logical value.
finished((literal("A") %then% eof())("A")) # TRUE finished((literal("A"))("A")) # FALSE finished((literal("A") %then% eof())(c("A", "C"))) # FALSEfinished((literal("A") %then% eof())("A")) # TRUE finished((literal("A"))("A")) # FALSE finished((literal("A") %then% eof())(c("A", "C"))) # FALSE
Check if a variable is stored
has_stored(name)has_stored(name)
name |
Variable name to check |
Logical
Sometimes, a parser has found the end of the text that needs to be parsed and
all following lines can be ignored. Ignore will read and discard all
following lines until the end of the file, empty or not. So, this parser will
consume all elements until the end of the input vector.
Ignore()Ignore()
This parser is identical to
zero_or_more(satisfy(function(x) TRUE))
A parser.
starts_with_a <- function(x) grepl("^a", x) p <- function() { one_or_more(satisfy(starts_with_a)) %then% (literal("~End") %ret% NULL) %then% Ignore() %then% eof() } p()(c("ab", "abc", "~End", "boring stuff", "more stuff"))starts_with_a <- function(x) grepl("^a", x) p <- function() { one_or_more(satisfy(starts_with_a)) %then% (literal("~End") %ret% NULL) %then% Ignore() %then% eof() } p()(c("ab", "abc", "~End", "boring stuff", "more stuff"))
Returns the names of all variables currently stored in the parser environment.
list_stored()list_stored()
Character vector of variable names
literal tests whether a supplied string literally equals a desired value.
literal(string)literal(string)
string |
string, a single-element character vector, or an object that can be coerced to a character vector. |
A parser.
literal(a)(x): satisfy(F(y): y==a)(x)
where F is equivalent to the function declarator in R. So, we have an
anonymous function in the argument of satisfy.
literal("ab")(c("ab", "cdef")) # success literal("ab")(c("abc", "cdef")) # failureliteral("ab")(c("ab", "cdef")) # success literal("ab")(c("abc", "cdef")) # failure
match_s matches a string using a function and returns a desired object
type.
match_s(s)match_s(s)
s |
a string-parsing function. |
This parser short-cuts the pattern satisfy(b) %using% f. With match_s
you do not have to write separate predicate and processing functions b and
f when identification and parsing can be done with a single string
parsing function s.
The function s will be given a non-empty single-element character vector
as its argument, so you don't have to test for empty input, like
character(0). These two facts also often simplify further processing with
the string functions like grep, regmatches and those from the stringr
package. The function s can return any R-object when succeeding, but to
signal failure to the parser it must return the empty list(). Note that
list() output from s will be turned into a marker object, the internal
object to mark failure, by match_s(), see failed().
A parser.
match_s(s)(x): if x==null then fail()(x) else if s(x[1]) then succeed(s(x[1]))(x[-1]) else fail()(x)
expect_integers <- function(x) { m <- gregexpr("[[:digit:]]+", x) matches <- regmatches(x, m)[[1]] if (length(matches) == 0) { # this means failure to detect numbers where we expected them return(list()) } else { return(as.numeric(matches)) } } match_s(expect_integers)("12 15 16 # some comment") # success match_s(expect_integers)("some text") # failureexpect_integers <- function(x) { m <- gregexpr("[[:digit:]]+", x) matches <- regmatches(x, m)[[1]] if (length(matches) == 0) { # this means failure to detect numbers where we expected them return(list()) } else { return(as.numeric(matches)) } } match_s(expect_integers)("12 15 16 # some comment") # success match_s(expect_integers)("some text") # failure
named() wraps a parser and provides a meaningful name that will be shown
in error messages when the parser fails. This is particularly useful for
user-defined parsers to make error messages more informative.
named(p, name)named(p, name)
p |
a parser. |
name |
a character string describing what the parser expects. |
A parser.
# Define a parser with a semantic name Nucleotide <- function() { named( satisfy(function(x) grepl("^[GATC]+$", x)), "nucleotide sequence" ) } # When this parser fails, the error will say "Expected: nucleotide sequence" try(reporter(Nucleotide() %then% eof())(c("GATCxTC"))) # Combine named parsers with %or% to get helpful "Expected one of:" messages Nucleotide_or_Protein <- function() { named(satisfy(function(x) grepl("^[GATC]+$", x)), "nucleotide") %or% named(satisfy(function(x) grepl("^[ARNDCQEGHILKMFPSTWYV]+$", x)), "protein") } try(reporter(Nucleotide_or_Protein() %then% eof())(c("Some text")))# Define a parser with a semantic name Nucleotide <- function() { named( satisfy(function(x) grepl("^[GATC]+$", x)), "nucleotide sequence" ) } # When this parser fails, the error will say "Expected: nucleotide sequence" try(reporter(Nucleotide() %then% eof())(c("GATCxTC"))) # Combine named parsers with %or% to get helpful "Expected one of:" messages Nucleotide_or_Protein <- function() { named(satisfy(function(x) grepl("^[GATC]+$", x)), "nucleotide") %or% named(satisfy(function(x) grepl("^[ARNDCQEGHILKMFPSTWYV]+$", x)), "protein") } try(reporter(Nucleotide_or_Protein() %then% eof())(c("Some text")))
marker
An object of class marker is an empty list created by the function
fail(). To indicate that this object differs from simply list() its
print method prints [].
## S3 method for class 'marker' print(x, ...)## S3 method for class 'marker' print(x, ...)
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |
The marker class is used internally to mark the largest index number of
the element (i.e. line) of the input character vector at which the parser
failed. This number is stored in the attribute n of a marker and only
correctly corresponds to that index number if the parser is wrapped in a
reporter() call.
The printed marker object is returned invisibly.
d <- (literal("A") %then% literal("B"))(c("A", "A")) # prints the icon [] for failed parsing d # Reveal the modest content of the marker object unclass(d)d <- (literal("A") %then% literal("B"))(c("A", "A")) # prints the icon [] for failed parsing d # Reveal the modest content of the marker object unclass(d)
Turns a parser into an error reporting parser, and when the parser is
successful returns only the L-element of the parser output, the
successfully parsed part of the input (see succeed()).
reporter(p, context_size = 5)reporter(p, context_size = 5)
p |
a parser. |
context_size |
number of lines of context to show around the line where failure occurred. Default is 5. |
The error object that this function returns is a list containing the
elements linenr and linecontent, corresponding to the line in which the
parser failed and its content. The user of this package can catch this
object to create custom error messages instead of the message generated by
this function.
A warning is issued when the parser did not completely consume the input.
Complete consumption of input is only explicitly made when the parser ends
with eof(). Therefore, even though all elements were parsed, a zero-length
character vector will remain in the R element if the parser does not end
with eof().
The L-part of a successful parser result or an error message about
the line where the parser failed. A warning is thrown when the parser
did not completely consume the input.
at <- function() literal("a") %then% literal("t") atat <- rep(c("a", "t"), 2) # Yields an error message about parser failing on line 5 try( reporter(match_n(3, at()) %then% eof())(c(atat, "t", "t")) ) # No error, but parser result reporter(match_n(2, at()) %then% eof())(atat) # warning: the input is not completely consumed try( reporter(match_n(2, at()))(atat) )at <- function() literal("a") %then% literal("t") atat <- rep(c("a", "t"), 2) # Yields an error message about parser failing on line 5 try( reporter(match_n(3, at()) %then% eof())(c(atat, "t", "t")) ) # No error, but parser result reporter(match_n(2, at()) %then% eof())(atat) # warning: the input is not completely consumed try( reporter(match_n(2, at()))(atat) )
satisfy() turns a logical function into a parser that recognizes strings.
satisfy(b, expected = "matching input")satisfy(b, expected = "matching input")
b |
a boolean function to determine if the string is accepted. |
expected |
character string describing what is expected (for error messages). |
Notice (see pseudocode) that satisfy fails when presented with empty
input, so it is futile to write predicate functions that would recognize
such input.
A parser.
satisfy(b)(x): if x==null then fail()(x) else if b(x[1]) then succeed(x[1])(x[-1]) else fail()(x)
where x[1] is the first element of x, x[-1] all subsequent elements
(or null if it only has one element). null is the empty vector,
equivalent to character(0) in R.
# define a predicate function that tests whether the next element starts # with an 'a' starts_with_a <- function(x) grepl("^a", x) # Use it in the satisfy parser satisfy(starts_with_a)(c("abc", "def")) # success satisfy(starts_with_a)(c("bca", "def")) # failure # Using an anonymous function satisfy(function(x) { as.numeric(x) > 10 })("15") # success# define a predicate function that tests whether the next element starts # with an 'a' starts_with_a <- function(x) grepl("^a", x) # Use it in the satisfy parser satisfy(starts_with_a)(c("abc", "def")) # success satisfy(starts_with_a)(c("bca", "def")) # failure # Using an anonymous function satisfy(function(x) { as.numeric(x) > 10 })("15") # success
Sometimes you want to use a parsed object to modify a later parser operation,
as in the example below. The store() and retrieve() functions provide the
tools to create such a parser.
store(name, value) retrieve(name)store(name, value) retrieve(name)
name |
a string used as the name of the stored object. |
value |
object to be stored. |
Nothing for store() and the stored object for retrieve().
parse_nr <- function(line) { m <- stringr::str_extract(line, "number=(\\d+)", group = 1) if (is.na(m)) { list() } else { store("nr", as.numeric(m)) } } p <- function() { match_s(parse_nr) %then% exactly(retrieve("nr"), literal("A")) } p()(c("number=3", "A", "A", "A")) # success p()(c("number=2", "A", "A", "A")) # failureparse_nr <- function(line) { m <- stringr::str_extract(line, "number=(\\d+)", group = 1) if (is.na(m)) { list() } else { store("nr", as.numeric(m)) } } p <- function() { match_s(parse_nr) %then% exactly(retrieve("nr"), literal("A")) } p()(c("number=3", "A", "A", "A")) # success p()(c("number=2", "A", "A", "A")) # failure
Produce a string parser based on stringr::str_match(), to be used with
match_s().
stringparser(match_pattern, reshape = identity)stringparser(match_pattern, reshape = identity)
match_pattern |
A regular expression that matches the string. |
reshape |
A function that takes the character vector of captured strings
and modifies it to a desired output. By default this is the |
This function uses regexec() with parameter perl = TRUE to produce a
string parser for perl compatible regular expressions. Parsers
created with this constructor return the failure signal list() when a
string does not match the match_pattern. If the pattern contains captured
groups then these groups will be returned as a character vector upon matching.
If there is no capture group then the function will return silently upon
matching the pattern. You can provide a function to the reshape argument
to change the output upon matching.
You always have to wrap the parsers made with this constructor in
match_s() to create a parser combinator. I decided not to include this
standard pattern in the stringparser constructor itself because it
complicates testing of these parsers.
# single capture group parse_header <- stringparser("^>(\\w+)") parse_header(">correct_header") # returns "correct_header" parse_header("> incorrect_header") # returns list() # multiple capture groups parse_keyvalue <- stringparser("(\\w+):\\s?(\\w+)") parse_keyvalue("key1: value1") # returns c("key1", "value1") # modify output parse_keyvalue_df <- stringparser( "(\\w+):\\s?(\\w+)", function(x) data.frame(key = x[1], value = x[2]) ) parse_keyvalue_df("key1: value1") # returns a data frame# single capture group parse_header <- stringparser("^>(\\w+)") parse_header(">correct_header") # returns "correct_header" parse_header("> incorrect_header") # returns list() # multiple capture groups parse_keyvalue <- stringparser("(\\w+):\\s?(\\w+)") parse_keyvalue("key1: value1") # returns c("key1", "value1") # modify output parse_keyvalue_df <- stringparser( "(\\w+):\\s?(\\w+)", function(x) data.frame(key = x[1], value = x[2]) ) parse_keyvalue_df("key1: value1") # returns a data frame
These are the most basic constructors of a parser, but they are important
cogs of the parser machinery. The succeed parser always succeeds, without
consuming any input, whereas the fail parser always fails.
succeed(left) fail(lnr = LNR(), expected = NULL)succeed(left) fail(lnr = LNR(), expected = NULL)
left |
any R-object constructed from a parsed vector. |
lnr |
integer. The line number (element number) at which the fail occurs |
expected |
character vector. Optional descriptions of what was expected. |
The succeed parser constructs a list object with a 'left' or L-element
that contains the parser result of the consumed part of the input vector and
the 'right' or R-element that contains the unconsumed part of the vector.
Since the outcome of succeed does not depend on its input, its result value
must be pre-determined, so it is included as a parameter.
While succeed never fails, fail always does, regardless of the input
vector. It returns the empty list list() to signal this fact.
A list. succeed() returns a list with two elements named L and
R. fail() returns a marker object which is basically an empty list
with a line number n as attribute. It is printed as the icon [],
see print.marker(). Note that n will only correctly represent the line
number of failure when a parser is wrapped in the reporter() function.
succeed(y)(x): [L=[y],R=[x]] fail()(x): []
where [L=[y],R=[x]] is a named list with lists [y] and [x] as elements
and [] is an empty list.
It is very unlikely that you will ever have to use these functions when constructing parsers.
succeed("A")("abc") succeed(data.frame(title = "Keisri hull", author = "Jaan Kross"))(c("Unconsumed", "text")) fail()("abc")succeed("A")("abc") succeed(data.frame(title = "Keisri hull", author = "Jaan Kross"))(c("Unconsumed", "text")) fail()("abc")
Often, we want to assess whether a given structure can be successfully
parsed through repetitive application of a parser p. This could involve
testing the parser applied multiple times in succession or determining
its capability to be applied zero or more times.
The subsequent functions are designed to address and evaluate these scenarios.
zero_or_more(p) one_or_more(p) exactly(n, p) zero_or_one(p) match_n(n, p)zero_or_more(p) one_or_more(p) exactly(n, p) zero_or_one(p) match_n(n, p)
p |
a parser. |
n |
a positive integer, including 0. |
All these parsers with the exception of match_n exhibit greedy behavior
striving to apply p as many times as possible. If the resulting count
doesn't match the expected quantity, such as in the case of exactly(n,p)
where p successfully parses more than n times, then the parser fails.
In contrast, match_n(n,p) strictly applies p exactly n times,
preventing any further application of p even if p could potentially be
applied more often. Clearly, both functions will fail if p fails after
less than n repetitions.
A parser.
zero_or_more(p):
(p %then% zero_or_more(p)) %or% succeed(null)
one_or_more(p):
p %then% zero_or_more(p)
exactly(n,p):
count = 0
r = zero_or_more(p %using% F(x): count = count + 1; x)(x)
if count == n then
count = 0
r
else fail()(x)
zero_or_one:
exactly(1,p) %or% exactly(0,p)
match_n(n,p):
if n==0 then F(x): succeed(list())(x)
else
if n==1 then p else (p %then% match_n(n-1, p))
where null is the empty vector.
zero_or_more(literal("A"))(c("A", LETTERS[1:5])) zero_or_more(literal("A"))(LETTERS[2:5]) one_or_more(literal("A"))(c("A", LETTERS[1:5])) # success one_or_more(literal("A"))(LETTERS[2:5]) # failure exactly(2, literal("A"))(c("A", LETTERS[1:5])) # success exactly(2, literal("A"))(c(rep("A", 2), LETTERS[1:5])) # failure: too many "A" zero_or_one(literal("A"))(LETTERS[2:5]) # success zero_or_one(literal("A"))(LETTERS[1:5]) # success zero_or_one(literal("A"))(c("A", LETTERS[1:5])) # failure match_n(2, literal("A"))(c("A", LETTERS[1:5])) # success match_n(2, literal("A"))(c(rep("A", 2), LETTERS[1:5])) # successzero_or_more(literal("A"))(c("A", LETTERS[1:5])) zero_or_more(literal("A"))(LETTERS[2:5]) one_or_more(literal("A"))(c("A", LETTERS[1:5])) # success one_or_more(literal("A"))(LETTERS[2:5]) # failure exactly(2, literal("A"))(c("A", LETTERS[1:5])) # success exactly(2, literal("A"))(c(rep("A", 2), LETTERS[1:5])) # failure: too many "A" zero_or_one(literal("A"))(LETTERS[2:5]) # success zero_or_one(literal("A"))(LETTERS[1:5]) # success zero_or_one(literal("A"))(c("A", LETTERS[1:5])) # failure match_n(2, literal("A"))(c("A", LETTERS[1:5])) # success match_n(2, literal("A"))(c(rep("A", 2), LETTERS[1:5])) # success