Title: | Construct Parsers for Structured Text Files |
---|---|
Description: | Construct parser combinator functions, higher order functions that parse input. Construction of such parsers is transparent and easy. Their main application is the parsing of structured text files like those generated by laboratory instruments. Based on a paper by Hutton (1992) <doi:10.1017/S0956796800000411>. |
Authors: | Douwe Molenaar [aut, cre, cph]
|
Maintainer: | Douwe Molenaar <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.5.2.9000 |
Built: | 2025-02-24 05:03:31 UTC |
Source: | https://github.com/systemsbioinformatics/parcr |
The %or%
combinator (p1 %or% p2)
returns the result of p1
if p1
is
successful or, if p1
fails that of p2
if p2
parses successfully,
otherwise it returns a fail
.
p1 %or% p2
p1 %or% p2
p1 , p2
|
two parsers. |
A parser.
(p1 %or% p2)(x): if p1(x)==[] then if p2(x)==[] then fail()(x) else p2(x) else p1(x)
where []
is the empty list.
(literal("A") %or% literal("a"))(LETTERS[1:5]) # success on first parser (literal("A") %or% literal("a"))(letters[1:5]) # success on second parser (literal("A") %or% literal("a"))(LETTERS[2:6]) # failure starts_with_a <- function(x) grepl("^a",x[1]) # success on both parsers, but returns result of p1 only (literal("a") %or% satisfy(starts_with_a)) (letters[1:5])
(literal("A") %or% literal("a"))(LETTERS[1:5]) # success on first parser (literal("A") %or% literal("a"))(letters[1:5]) # success on second parser (literal("A") %or% literal("a"))(LETTERS[2:6]) # failure starts_with_a <- function(x) grepl("^a",x[1]) # success on both parsers, but returns result of p1 only (literal("a") %or% satisfy(starts_with_a)) (letters[1:5])
Sometimes we are not interested in the result from a parser, only that the
parser succeeds. It may be convenient to return some short representation
or nothing even rather than the string itself. The %ret%
combinator is
useful in such cases. The parser (p %ret% c)
has the same behavior as p
,
except that it returns the value c
if successful.
p %ret% c
p %ret% c
p |
a parser. |
c |
string, i.e. a single-element character vector. |
A parser.
(p %xret% c)(x): if p(x)==[] then fail()(x) else succeed(c)(x[-1])
(literal("A") %ret% "We have an A!") (LETTERS[1:5]) (literal("A") %ret% NULL) (LETTERS[1:5])
(literal("A") %ret% "We have an A!") (LETTERS[1:5]) (literal("A") %ret% NULL) (LETTERS[1:5])
(p1 %then% p2)
recognizes anything that p1
and p2
would if applied in
succession.
p1 %then% p2
p1 %then% p2
p1 , p2
|
two parsers. |
A parser.
(p1 %then% p2)(x): if p1(x)==[] or x==null then fail()(x) else if p2(x[-1])==[] then fail()(x) else succeed([p1(x)$L, p2(x[-1])$L])(x[-2])
where null
is the empty vector, x[-1]
and x[-2]
are the vector x
without the first element and without the first two elements, respectively.
The discarding versions %xthen% and %thenx%
starts_with_a <- function(x) grepl("^a",x[1]) starts_with_b <- function(x) grepl("^b",x[1]) (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("ab", "bc", "de")) # success (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("bb", "bc", "de")) # failure (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("ab", "ac", "de")) # failure
starts_with_a <- function(x) grepl("^a",x[1]) starts_with_b <- function(x) grepl("^b",x[1]) (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("ab", "bc", "de")) # success (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("bb", "bc", "de")) # failure (satisfy(starts_with_a) %then% satisfy(starts_with_b)) (c("ab", "ac", "de")) # failure
The %using%
combinator allows us to manipulate results from a parser. The
parser (p %using% f)
has the same behavior as the parser p
, except that
the function f
is applied to its result value.
p %using% f
p %using% f
p |
a parser. |
f |
a function to be applied to the result of a successful |
A parser.
(p %using% f)(x): if p1(x)==[] then fail()(x) else succeed(f(p1(x)$L))(x[-1])
(literal('ab') %using% toupper) (c("ab","cdef")) # success (literal('ab') %using% toupper) (c("bb","cdef")) # failure
(literal('ab') %using% toupper) (c("ab","cdef")) # success (literal('ab') %using% toupper) (c("bb","cdef")) # failure
%then%
sequenceTwo parsers composed in sequence produce a pair of results. Sometimes we are
only interested in one component of the pair. For example in the case of
reserved words such as 'begin' and 'end'. In such cases, two special
versions of the %then%
combinator are useful, which keep either the
first or second result, as reflected by the position of the letter 'x' in
their names.
p1 %xthen% p2 p1 %thenx% p2
p1 %xthen% p2 p1 %thenx% p2
p1 , p2
|
two parsers. |
A parser.
(p1 %xthen% p2)(x): if p1(x)==[] or x==null then fail()(x) else if p2(x[-1])==[] then fail()(x) else succeed(p1(x)$L)(x[-2]) (p1 %thenx% p2)(x): if p1(x)==[] or x==null then fail()(x) else if p2(x[-1])==[] then fail()(x) else succeed(p2(x[-1])$L)(x[-2])
where null
is the empty vector, x[-1]
and x[-2]
are the vector x
without the first element and without the first two elements, respectively.
is_number <- function(x) grepl("\\d+",x[1]) # Numbers are preceded by ">" symbols, but we only want the number (literal(">") %thenx% satisfy(is_number)) (c(">", "12")) # Temperatures are followed by the unit 'C', but we only want the number (satisfy(is_number) %xthen% literal("C")) (c("21", "C"))
is_number <- function(x) grepl("\\d+",x[1]) # Numbers are preceded by ">" symbols, but we only want the number (literal(">") %thenx% satisfy(is_number)) (c(">", "12")) # Temperatures are followed by the unit 'C', but we only want the number (satisfy(is_number) %xthen% literal("C")) (c("21", "C"))
Splits a string by using a split pattern and then applies the parser p
to the resulting character vector. If finish = TRUE
then the parser should
completely consume its input, otherwise the parser fails. If
finish = FALSE
then any remaining part of the string is discarded.
by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)
by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)
p |
A parser. |
split |
a string (or object which can be coerced to such) containing regular expression(s) (unless fixed = TRUE) to use for splitting. If empty matches occur, in particular if split has length 0, x is split into single characters. |
finish |
logical. Should the parser completely consume the string?
Defaults to |
fixed |
logical. If |
perl |
logical. Should Perl-compatible regexps be used? |
The function base::strsplit()
is used to perform the splitting. The
parameters split
, fixed
and perl
are passed on to that function.
A parser.
by_split((literal("a") %then% literal("b")),"\\t") ("a\tb") # success by_split((literal("a") %then% literal("b")),"\\t") ("a\tb\tc") # failure by_split((literal("a") %then% literal("b")),"\\t", finish=FALSE) ("a\tb\tc") # success
by_split((literal("a") %then% literal("b")),"\\t") ("a\tb") # success by_split((literal("a") %then% literal("b")),"\\t") ("a\tb\tc") # failure by_split((literal("a") %then% literal("b")),"\\t", finish=FALSE) ("a\tb\tc") # success
Splits a string to individual symbols and then applies the parser p
to the
resulting character vector, otherwise the parser fails. If finish = TRUE
then the parser should completely consume its input. If finish = FALSE
then any remaining part of the string is discarded.
This function is identical to by_split(p, "", finish)
.
by_symbol(p, finish = TRUE)
by_symbol(p, finish = TRUE)
p |
A parser. |
finish |
logical. Should the parser completely consume the string?
Defaults to |
A parser.
by_symbol(exactly(3,literal("a"))) (c("aaa", "bb")) # success by_symbol(exactly(3,literal("a"))) (c("aaaa", "bb")) # failure
by_symbol(exactly(3,literal("a"))) (c("aaa", "bb")) # success by_symbol(exactly(3,literal("a"))) (c("aaaa", "bb")) # failure
An empty line is a line that consists entirely of space-like characters.
EmptyLine
is a parser that recognizes one empty line, Spacer
recognizes
one or more empty lines and MaybeEmpty
recognizes zero or more empty
lines. EmptyLine
returns the empty line but Spacer
and MaybeEmpty
discard these.
EmptyLine() Spacer() MaybeEmpty()
EmptyLine() Spacer() MaybeEmpty()
A parser.
space_like_eraser(x): d = x in which all "\\s+" are replaced by "" if d=="" TRUE else FALSE Emptyline: satisfy(space_like_eraser) Spacer: one_or_more(EmptyLine()) %ret% null MaybeEmpty: zero_or_more(EmptyLine()) %ret% null
where null
is the empty vector.
EmptyLine() (" \t ") # success EmptyLine() (" .") # failure EmptyLine() ("") # success Spacer() (c(" \t ", " ", "abc")) Spacer() (c(" ", " ", "Important text")) Spacer() (c("Important text")) # failure, missing empty line MaybeEmpty() (c(" ", " ", "Important text")) # success, just as Spacer() MaybeEmpty() (c("Important text")) # success, in contrast to Spacer()
EmptyLine() (" \t ") # success EmptyLine() (" .") # failure EmptyLine() ("") # success Spacer() (c(" \t ", " ", "abc")) Spacer() (c(" ", " ", "Important text")) Spacer() (c("Important text")) # failure, missing empty line MaybeEmpty() (c(" ", " ", "Important text")) # success, just as Spacer() MaybeEmpty() (c("Important text")) # success, in contrast to Spacer()
Tests whether the end of the input character vector has been reached,
which boils down to detection of character(0)
in the R
-element (see
succeed()
). Since the intended application of this parser is parsing of
text files the function has been called after the end of file (EOF) signal.
To indicate that an end of file has been detected, the R
-element side of
the parser output will be converted to an empty list.
eof()
eof()
A parser.
eof()(x): if x==null then succeed(x)(list()) else fail()(x)
(literal("a") %then% eof())("a") # success # Notice the difference on the R-side with literal("a")("a") eof()(character(0)) # success eof()("a") # failure
(literal("a") %then% eof())("a") # success # Notice the difference on the R-side with literal("a")("a") eof()(character(0)) # success eof()("a") # failure
Use this function to test whether your parser failed, for example in unit testing of your parsers when writing a package.
failed(o)
failed(o)
o |
Output from a parser. |
A logical value.
d <- (literal("A") %then% literal("B"))(c("A","A")) d failed(d)
d <- (literal("A") %then% literal("B"))(c("A","A")) d failed(d)
An example fasta-formatted file with a mixture of nucleotide and protein sequences. It is used in the vignette to demonstrate parsing with the tools from the package. It is not clear to me whether mixing of sequence types is allowed in a fasta file, but we demonstrate in the vignette that is is easy to parse them from a single file. The sequences used are truncated for the sake of the example.
fastafile
fastafile
fastafile
A character vector
Modified from https://bioinformatics.org/annhyb/examples/seq_fasta.html and https://en.wikipedia.org/wiki/FASTA_format
A parser has completely consumed its input when the input has satisfied
eof()
.
finished(o)
finished(o)
o |
Output from a parser. |
A logical value.
finished((literal("A") %then% eof())("A")) # TRUE finished((literal("A"))("A")) # FALSE finished((literal("A") %then% eof())(c("A","C"))) # FALSE
finished((literal("A") %then% eof())("A")) # TRUE finished((literal("A"))("A")) # FALSE finished((literal("A") %then% eof())(c("A","C"))) # FALSE
Sometimes, a parser has found the end of the text that needs to be parsed and
all following lines can be ignored. Ignore
will read and discard all
following lines until the end of the file, empty or not. So, this parser will
consume all elements until the end of the input vector.
Ignore()
Ignore()
This parser is identical to
zero_or_more(satisfy(function(x) TRUE))
A parser.
starts_with_a <- function(x) grepl("^a",x) p <- function() { one_or_more(satisfy(starts_with_a)) %then% (literal("~End") %ret% NULL) %then% Ignore() %then% eof() } p()(c("ab","abc","~End","boring stuff","more stuff"))
starts_with_a <- function(x) grepl("^a",x) p <- function() { one_or_more(satisfy(starts_with_a)) %then% (literal("~End") %ret% NULL) %then% Ignore() %then% eof() } p()(c("ab","abc","~End","boring stuff","more stuff"))
literal
tests whether a supplied string literally equals a desired value.
literal(string)
literal(string)
string |
string, a single-element character vector, or an object that can be coerced to a character vector. |
A parser.
literal(a)(x): satisfy(F(y): y==a)(x)
where F
is equivalent to the function
declarator in R. So, we have an
anonymous function in the argument of satisfy
.
literal("ab") (c("ab", "cdef")) # success literal("ab") (c("abc", "cdef")) # failure
literal("ab") (c("ab", "cdef")) # success literal("ab") (c("abc", "cdef")) # failure
match_s
matches a string using a function and returns a desired object
type.
match_s(s)
match_s(s)
s |
a string-parsing function. |
This parser short-cuts the pattern satisfy(b) %using% f
. With match_s
you do not have to write separate predicate and processing functions b
and
f
when identification and parsing can be done with a single string
parsing function s
.
The function s
will be given a non-empty single-element character vector
as its argument, so you don't have to test for empty input, like
character(0)
. These two facts also often simplify further processing with
the string functions like grep
, regmatches
and those from the stringr
package. The function s
can return any R-object when succeeding, but to
signal failure to the parser it must return the empty list()
. Note that
list()
output from s
will be turned into a marker object, the internal
object to mark failure, by match_s()
, see failed()
.
A parser.
match_s(s)(x): if x==null then fail()(x) else if s(x[1]) then succeed(s(x[1]))(x[-1]) else fail()(x)
expect_integers <- function(x) { m <- gregexpr("[[:digit:]]+", x) matches <- regmatches(x,m)[[1]] if (length(matches)==0) { # this means failure to detect numbers where we expected them return(list()) } else { return(as.numeric(matches)) } } match_s(expect_integers) ("12 15 16 # some comment") # success match_s(expect_integers) ("some text") # failure
expect_integers <- function(x) { m <- gregexpr("[[:digit:]]+", x) matches <- regmatches(x,m)[[1]] if (length(matches)==0) { # this means failure to detect numbers where we expected them return(list()) } else { return(as.numeric(matches)) } } match_s(expect_integers) ("12 15 16 # some comment") # success match_s(expect_integers) ("some text") # failure
marker
An object of class marker
is an empty list created by the function
fail()
. To indicate that this object differs from simply list()
its
print method prints []
.
## S3 method for class 'marker' print(x, ...)
## S3 method for class 'marker' print(x, ...)
x |
an object used to select a method. |
... |
further arguments passed to or from other methods. |
The marker
class is used internally to mark the largest index number of
the element (i.e. line) of the input character vector at which the parser
failed. This number is stored in the attribute n
of a marker and only
correctly corresponds to that index number if the parser is wrapped in a
reporter()
call.
The printed marker
object is returned invisibly.
d <- (literal("A") %then% literal("B"))(c("A","A")) # prints the icon [] for failed parsing d # Reveal the modest content of the marker object unclass(d)
d <- (literal("A") %then% literal("B"))(c("A","A")) # prints the icon [] for failed parsing d # Reveal the modest content of the marker object unclass(d)
Turns a parser into an error reporting parser, and when the parser is
successful returns only the L
-element of the parser output, the
successfully parsed part of the input (see succeed()
).
reporter(p)
reporter(p)
p |
a parser. |
The error object that this function returns is a list containing the
elements linenr
and linecontent
, corresponding to the line in which the
parser failed and its content. The user of this package can catch this
object to create custom error messages instead of the message generated by
this function.
A warning is issued when the parser did not completely consume the input.
Complete consumption of input is only explicitly made when the parser ends
with eof()
. Therefore, even though all elements were parsed, a zero-length
character vector will remain in the R
element if the parser does not end
with eof()
.
The L
-part of a successful parser result or an error message about
the line where the parser failed. A warning is thrown when the parser
did not completely consume the input.
at <- function() literal("a") %then% literal("t") atat <- rep(c("a","t"),2) # Yields an error message about parser failing on line 5 try( reporter(match_n(3,at()) %then% eof())(c(atat,"t","t")) ) # No error, but parser result reporter(match_n(2,at()) %then% eof())(atat) # warning: the input is not completely consumed try( reporter(match_n(2,at()))(atat) )
at <- function() literal("a") %then% literal("t") atat <- rep(c("a","t"),2) # Yields an error message about parser failing on line 5 try( reporter(match_n(3,at()) %then% eof())(c(atat,"t","t")) ) # No error, but parser result reporter(match_n(2,at()) %then% eof())(atat) # warning: the input is not completely consumed try( reporter(match_n(2,at()))(atat) )
satisfy()
turns a logical function into a parser that recognizes strings.
satisfy(b)
satisfy(b)
b |
a boolean function to determine if the string is accepted. |
Notice (see pseudocode) that satisfy
fails when presented with empty
input, so it is futile to write predicate functions that would recognize
such input.
A parser.
satisfy(b)(x): if x==null then fail()(x) else if b(x[1]) then succeed(x[1])(x[-1]) else fail()(x)
where x[1]
is the first element of x
, x[-1]
all subsequent elements
(or null
if it only has one element). null
is the empty vector,
equivalent to character(0)
in R.
# define a predicate function that tests whether the next element starts # with an 'a' starts_with_a <- function(x) grepl("^a",x) # Use it in the satisfy parser satisfy(starts_with_a)(c("abc","def")) # success satisfy(starts_with_a)(c("bca","def")) # failure # Using an anonymous function satisfy(function(x) {as.numeric(x)>10})("15") # success
# define a predicate function that tests whether the next element starts # with an 'a' starts_with_a <- function(x) grepl("^a",x) # Use it in the satisfy parser satisfy(starts_with_a)(c("abc","def")) # success satisfy(starts_with_a)(c("bca","def")) # failure # Using an anonymous function satisfy(function(x) {as.numeric(x)>10})("15") # success
Sometimes you want to use a parsed object to modify a later parser operation,
as in the example below. The store()
and retrieve()
functions provide the
tools to create such a parser.
store(name, value) retrieve(name)
store(name, value) retrieve(name)
name |
a string used as the name of the stored object. |
value |
object to be stored. |
Nothing for store()
and the stored object for retrieve()
.
parse_nr <- function(line) { m <- stringr::str_extract(line, "number=(\\d+)", group=1) if (is.na(m)) list() else store("nr", as.numeric(m)) } p <- function() { match_s(parse_nr) %then% exactly(retrieve("nr"), literal("A")) } p()(c("number=3", "A", "A", "A")) # success p()(c("number=2", "A", "A", "A")) # failure
parse_nr <- function(line) { m <- stringr::str_extract(line, "number=(\\d+)", group=1) if (is.na(m)) list() else store("nr", as.numeric(m)) } p <- function() { match_s(parse_nr) %then% exactly(retrieve("nr"), literal("A")) } p()(c("number=3", "A", "A", "A")) # success p()(c("number=2", "A", "A", "A")) # failure
These are the most basic constructors of a parser, but they are important
cogs of the parser machinery. The succeed
parser always succeeds, without
consuming any input, whereas the fail
parser always fails.
succeed(left) fail(lnr = LNR())
succeed(left) fail(lnr = LNR())
left |
any R-object constructed from a parsed vector. |
lnr |
integer. The line number (element number) at which the fail occurs |
The succeed
parser constructs a list
object with a 'left' or L
-element
that contains the parser result of the consumed part of the input vector and
the 'right' or R
-element that contains the unconsumed part of the vector.
Since the outcome of succeed does not depend on its input, its result value
must be pre-determined, so it is included as a parameter.
While succeed
never fails, fail
always does, regardless of the input
vector. It returns the empty list list()
to signal this fact.
A list. succeed()
returns a list with two elements named L
and
R
. fail()
returns a marker
object which is basically an empty list
with a line number n
as attribute. It is printed as the icon []
,
see print.marker()
. Note that n
will only correctly represent the line
number of failure when a parser is wrapped in the reporter()
function.
succeed(y)(x): [L=[y],R=[x]] fail()(x): []
where [L=[y],R=[x]]
is a named list with lists [y]
and [x]
as elements
and []
is an empty list.
It is very unlikely that you will ever have to use these functions when constructing parsers.
succeed("A")("abc") succeed(data.frame(title="Keisri hull", author="Jaan Kross"))(c("Unconsumed","text")) fail()("abc")
succeed("A")("abc") succeed(data.frame(title="Keisri hull", author="Jaan Kross"))(c("Unconsumed","text")) fail()("abc")
Often, we want to assess whether a given structure can be successfully
parsed through repetitive application of a parser p
. This could involve
testing the parser applied multiple times in succession or determining
its capability to be applied zero or more times.
The subsequent functions are designed to address and evaluate these scenarios.
zero_or_more(p) one_or_more(p) exactly(n, p) zero_or_one(p) match_n(n, p)
zero_or_more(p) one_or_more(p) exactly(n, p) zero_or_one(p) match_n(n, p)
p |
a parser. |
n |
a positive integer, including 0. |
All these parsers with the excception of match_n
exhibit greedy behavior
striving to apply p
as many times as possible. If the resulting count
doesn't match the expected quantity, such as in the case of exactly(n,p)
where p
successfully parses more than n
times, then the parser fails.
In contrast, match_n(n,p)
strictly applies p
exactly n
times,
preventing any further application of p
even if p
could potentially be
applied more often. Clearly, both functions will fail if p
fails after
less than n
repetitions.
A parser.
zero_or_more(p): (p %then% zero_or_more(p)) %or% succeed(null) one_or_more(p): p %then% zero_or_more(p) exactly(n,p): count = 0 r = zero_or_more(p %using% F(x): count = count + 1; x)(x) if count == n then count = 0 r else fail()(x) zero_or_one: exactly(0,p) %or% exactly(1,p) match_n(n,p): if n==0 then F(x): succeed(list())(x) else if n==1 then p else (p %then% match_n(n-1, p))
where null
is the empty vector.
zero_or_more(literal("A")) (c("A",LETTERS[1:5])) zero_or_more(literal("A")) (LETTERS[2:5]) one_or_more(literal("A")) (c("A",LETTERS[1:5])) # success one_or_more(literal("A")) (LETTERS[2:5]) # failure exactly(2,literal("A")) (c("A", LETTERS[1:5])) # success exactly(2,literal("A")) (c(rep("A",2), LETTERS[1:5])) # failure: too many "A" zero_or_one(literal("A")) (LETTERS[2:5]) # success zero_or_one(literal("A")) (LETTERS[1:5]) # success zero_or_one(literal("A")) (c("A",LETTERS[1:5])) # failure match_n(2,literal("A")) (c("A", LETTERS[1:5])) # success match_n(2,literal("A")) (c(rep("A",2), LETTERS[1:5])) # success
zero_or_more(literal("A")) (c("A",LETTERS[1:5])) zero_or_more(literal("A")) (LETTERS[2:5]) one_or_more(literal("A")) (c("A",LETTERS[1:5])) # success one_or_more(literal("A")) (LETTERS[2:5]) # failure exactly(2,literal("A")) (c("A", LETTERS[1:5])) # success exactly(2,literal("A")) (c(rep("A",2), LETTERS[1:5])) # failure: too many "A" zero_or_one(literal("A")) (LETTERS[2:5]) # success zero_or_one(literal("A")) (LETTERS[1:5]) # success zero_or_one(literal("A")) (c("A",LETTERS[1:5])) # failure match_n(2,literal("A")) (c("A", LETTERS[1:5])) # success match_n(2,literal("A")) (c(rep("A",2), LETTERS[1:5])) # success