--- title: "Discussion of some design details" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Details} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: parcr.bib csl: molecular-microbiology.csl --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, echo=FALSE} library(parcr) ``` I want to describe some thoughts about parser construction, design decisions and error messaging, mostly for reference, but also to understand how these parsers work. ## Lazy evaluation by `%or%`, and constructing parsers The parser `p1 %or% p2` performs lazy evaluation, meaning that if `p1` succeeds then `p2` is never tested. This is in contrast to the parser that @Hutton1992 describes[^1]. The choice for lazy evaluation is motivated by the realization (laziness?) that a parser that evaluates all possible alternatives may quickly become extremely complex: it splits into two alternative paths with every `%or%`. I wondered whether this decision would cause practical problems, because, in principle if both `p1` and `p2` would be successful the result of this lazy evaluating parser will depend on the order in which the two alternatives are put. However, I do not believe that this should lead to problems in practical applications, for two reasons: 1. a document format that is interpretable in multiple ways is not well-designed: it is fundamentally undecidable which of the alternative interpretations should be followed. 1. a parser that can produce multiple interpretations of a well-designed, unambiguous document format should be re-designed. There is one situation in which you could get into problems, namely when the differentiation between two document formats depends on their full consumption by a parser. For example: **Format-1** ``` XXXX YYYY ZZZZ ``` and **Format-2** ``` XXXX YYYY QQQQ ``` can only be distinguished when the last line is read. If two alternative parsers do not read beyond the `YYYY`, then both will be successful, and we have an ambiguous interpretation. This is why we urge you to put an `eof()` parser at the end of any parser that you construct. It is also the reason why a warning is issued when a parser does not completely consume the input[^2]. [^1]: In @Hutton1992 the `%or%` operator is called `$alt`. [^2]: This only happens if the parser has been converted to an error-messaging parser by wrapping it in the `reporter()` function. ## Tracking where a parser error occurs When you just get a fail `[]` as the output when a parser fails on a long input then it can become challenging to find out what went wrong. It would help a lot if you would know the index number ("line" number) of the input on which the parser fails. This means that the parser has to track where it is and where it fails. Since the only function that shifts to a next index is the `%then%` combinator, and since any other function that shifts to the right, like the repeater functions (`zero_or_more()`, *etc.*) is based on this function, we just have to count the number of times the `%then%` combinator was called up to the failure to know where the parser failed. A complication is that we may be testing alternative parsers implemented by the `%or%` combinator, and the parser may fail on different lines in the alternative "arms". In that case we will say that the parser fails in the element with the highest index when both arms of an `%or%` combinator fail. The package returns a `marker` object (printed as `[]`) with that element index number as an attribute. The principle is illustrated in the figure below. ```{r,out.width="30%", fig.align='left', echo=FALSE, fig.cap="**Illustration of marker positioning when a parser fails on an input vector of length 6**. Each filled dot represents a successful parse. A `%then%` combinator shifts to the next element whereas an `%or%` combinator applies alternative parsers to the input. The parser fails at the red crosses and the red square. A marker (red square) will be put at the parser that fails at the input element with the highest index, which is element #5 here."} knitr::include_graphics("tree.png") ``` In the current implementation we chose to track the current element index number with a state variable. The principle of using state variables is described in [section 7.4](https://r-pkgs.org/data.html#sec-data-state) of [R Packages (2e)](https://r-pkgs.org/). Note that proper tracking and error messaging only takes place when the parser is wrapped in the `reporter()` function. Below I illustrate the example above by parsing the first 6 letters of the alphabet by a parser that fails on the indicated positions. ```{r} p <- function() { literal("A") %then% literal("B") %then% (arm1() %or% arm2()) } arm1 <- function() { literal("D") %then% literal("E") %then% literal("F") %then% literal("G") } arm2 <- function() { literal("C") %then% (arm21() %or% arm22()) } arm21 <- function() { literal("D") %then% literal("F") %then% literal("G") } arm22 <- function() { literal("E") %then% literal("F") %then% literal("G") } try(reporter((p() %then% eof()))(LETTERS[1:6])) ``` An now successfully parsing on arm22 ```{r} try(reporter((p() %then% eof()))(c("A","B","C","E","F","G"))) ``` ## The package `rex` After making the first public version of this package I discovered that there exists a package called [`rex`](https://github.com/r-lib/rex) ("Friendly regular expressions") that uses a very similar approach and in a number of cases even the same commands as this package. The examples in the package show that it is used to construct parsers for strings, _i.e._ single-element character vectors, not multi-element character vectors. I haven't tested it, but if it is fast enough I can imagine that it can be used in `parcr`'s function `match_s()` to create parsers for strings that have a complex structure. ## Literature