--- title: "Making parsers with higher order functions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Making parsers with higher order functions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: parcr.bib csl: molecular-microbiology.csl --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Goal of the package The goal of this package is to simplify the creation of transparent parsers for structured text files generated by machines like laboratory instruments. For example, we use the package to construct parsers for files generated by [plate readers](https://en.wikipedia.org/wiki/Plate_reader). The data generated by these instruments can usually be exported to text or spreadsheet files. Such files consist of lines of text organized in higher-order structures like headers with metadata and blocks of measured values, _etc._. It's often convenient to analyze the data in a program like R. To be able to do that you have to have a parser that processes these files and creates R-objects as output. The `parcr`package simplifies the task of creating such parsers. ## Higher order functions in R The parsers that are created with this package make extensive use of functional programming. If this topic is new to you then please read about [Functional Programming](https://adv-r.hadley.nz/fp.html), in particular the chapters *Function Factories* and *Function Operators*, in Hadley Wickham's book [**Advanced R**](https://adv-r.hadley.nz). ## Creating parser combinators in R to parse text files The `parcr` package contains a set of functions that allow you to create simple parsers with higher order functions, functions that can take functions as input and have functions as output. These are sometimes called combinators. The ideas behind the package are described in a paper by @Hutton1992. A number of the functions described in this paper are implemented with modifications in the current package. The package was also heavily inspired by the [`Ramble`](https://github.com/8bit-pixies/Ramble) package which is written in the same vein, but without the explicit parsing of structured text files in mind. ### The output of the parsers: a `list` The parsers constructed with the functions from this package generate a `list` as an output. The parsers read the input vector from left to right. When the parser fails the output will be the empty list `list()`. However, if the parser is successful then it produces a `list` with two elements. An element called `L` (the *left* part) contains the output generated by that part of the input vector which was successfully parsed, and the element called `R` (the *right* part) which contains the remainder that was not parsed. When an entire character vector is parsed the content of the `R` element equals `character(0)`. The content of the `L` part can be shaped to your desire. This is demonstrated in the example of the *fasta* file parser later in this document. ## A simple example of using parser combinators Please realize that every function described below is a higher order function: their output is a function. In its turn, this function can take a character vector as its input. For example, `literal("a")` yields a function. To use that function as a parser you have to provide it with a character vector, the object that needs to be parsed, as input: ``` literal("a")(c("a","att")) ``` This parser tests whether the next element in its input is literally the string "a". It will succeed in the example above, but will only consume the first element of its input and then stop. However, you can also use a higher order function like `literal("a")` as input to other higher order functions to create more complex parsers. For example, the function `then` takes two parsers `p1` and `p2` as arguments, `then(p1, p2)` and applies them in sequence to the input. In the `parcr` package the function is implemented in the infix form `%then%` which makes parser constructs better readable. The composite parser: ``` literal("a") %then% literal("att") ``` looks for an element with string "a" followed by an element with string "att". Its application to the same vector: ``` (literal("a") %then% literal("att"))(c("a","att")) ``` will completely consume the input. In this way, using a number of standard parsers defined in the package, you can quickly construct flexible parsers taking complex input. Furthermore, the functions also allow you to construct a desired R object as output while parsing. ## The functions in the `parcr` package We will now discuss all of the parser combinator functions present in the package. You should also study their help pages. In particular the **Pseudocode** listed for each of them should help you to understand their properties. ## The fundamental parsers The six fundamental parsers allow you to construct a parser that will completely consume input, or to fail when the input does not satisfy the specifications of the parser. ### Succeeding and failing ```{r setup, echo=FALSE} library(parcr) ``` - `succeed(o)`: where `o` is any kind of R-object - `fail()` The `succeed` and `fail` parsers are the nuts and bolts of a parser construction. The `succeed` parser always succeeds, without consuming any input, whereas the `fail` parser always fails. The `succeed` parser constructs a `list` object with a 'left' or `L`-element that contains the parsed result of the consumed part of the input vector and the 'right' or `R`-element that contains the unconsumed part of the vector. The `L`-element can contain any R-object that is constructed during parsing. While `succeed` never fails, `fail` always does, regardless of the input vector. To signal failure it returns a special form of the empty list `list()`, namely a `marker` object printed as the icon `[]`. **Important**: It is unlikely that you will ever use these two functions to construct parsers. **Examples:** ```{r} succeed("A")("abc") succeed(data.frame(title="Keisri hull", author="Jaan Kross"))(c("Unconsumed","text")) ``` ```{r} fail()("abc") ``` ### Parsers for the current element The basic functions for recognizing the content of the current element, the left-most element of the input vector. - `literal(c)`: tests whether the current element equals the string `c`. - `satisfy(b)`: tests whether the current element satisfies function `b()` where `b()` is a logical function: it takes a string as input and returns `TRUE` or `FALSE`. - `eof()` : tests whether the input is at its end. `eof()` is a special function that detects the end of a character vector, or if that character vector represents the lines of text file, the end of the file (EOF). In fact, it detects `character(0)` in the input vector and when successful it turns the `R`-side of the output into an empty list (`list()`) to signal that the end of the vector was detected. **Examples:** ```{r} literal('abc')(c('abc','def')) ``` ```{r} starts_with_a <- function(x) {grepl("^a", x)} satisfy(starts_with_a)(c('abc','def')) ``` And here is an example of an unsuccessful parser: ```{r} literal('a')(c('ab','a')) ``` An application of `eof()` to detect that we parsed the input completely. ```{r} (literal("a") %then% literal("att") %then% eof())(c("a","att")) ``` Notice how the `R`-element differs from just ```{r} (literal("a") %then% literal("att"))(c("a","att")) ``` ### The fundamental combinators - `p1 %or% p2`: applies alternative parsers `p1` and `p2` on the current element and returns the result of the first successful parser, or failure when both fail. - `p1 %then% p2`: applies parser `p1` on the current element and `p2` on the next element. The `%or%` combinator enables us to try alternative parsers on the current element, whereas the `%then%` combinator enables us to test sequences of elements in a character vector. Note that `%or%` uses lazy evaluation which means that the output of `%or%` depends on the order of `p1` and `p2`: if both would in principle succeed then only the result of `p1` is returned. We also have two variations of the `%then%` combinator, `%xthen%` and `%thenx%` which do test but then discard the result from the first or second argument: - `p1 %xthen% p2`: where `p1` and `p2` are parsers discards the result from `p2` - `p1 %thenx% p2`: where `p1` and `p2` are parsers discards the result from `p1` **Examples:** ```{r} (literal('A') %or% satisfy(starts_with_a))(c('abc','def')) ``` ```{r} (literal('A') %then% satisfy(starts_with_a))(c('A', 'abc')) ``` ```{r} (literal('>') %thenx% satisfy(starts_with_a))(c('>', 'abc')) ``` ## Modifying the output of a parser As said, the six fundamental parsers allow you to construct a parser that will completely consume input. However, when this parser succeeds its output will, apart from the fact that every element is put in a list, be equal to the input. In general, this is not very useful if you want to use the output in other code. Therefore, we have two functions that allow you to modify the output of successful parser. The basic functions for modifying output of a parser are: - `p %ret% c` : when parser `p` is successful it returns the object `c` (a string or `NULL`). - `p %using% f` : when parser `p` is successful function `f()` is applied to the input and its output is stored as the result. **Examples** ```{r} (literal('a') %ret% "We have an 'a'")(c('a','b')) ``` ```{r} (satisfy(starts_with_a) %using% toupper)(c('abc','d')) ``` ## Derived parsers Derived parsers are constructed from the six fundamental parsers. ### Iterators - `zero_or_one(p)`: where `p` is a parser. - `zero_or_more(p)`: where `p` is a parser. - `one_or_more(p)`: where `p` is a parser. - `exactly(n,p)`: where `n` is an integer and `p` is a parser. - `match_n(n,p)`: where `n` is an integer and `p` is a parser. `zero_or_one`, `zero_or_more` and `one_or_more` do exactly what their names suggest. You should realize that these are greedy parsers: they consume as many as possible strings that can be successfully parsed by `p`. Similarly, `exactly` is a greedy parser, and it fails when there are less or more than `n` consecutive strings that can be successfully parsed by `p`. On the other hand `match_n` is not greedy. It consumes `n` but no more strings that can be successfully parsed with `p`. **Examples:** This parser will fail on its input, too many strings starting with "a": ```{r} zero_or_one(satisfy(starts_with_a))(c('acc','aat','cgg')) ``` The following is a successful parse. Note that its result is not merely `[]` which would have indicated failure, but an `L,R`-list with an empty list in the `L`-element. ```{r} zero_or_more(satisfy(starts_with_a))(c('cat','gac','cct')) ``` ```{r} one_or_more(satisfy(starts_with_a))(c('att','aac','cct')) ``` ```{r} exactly(2, satisfy(starts_with_a))(c('att','aac','cct')) ``` ```{r} match_n(1, satisfy(starts_with_a))(c('att','aac','cct')) ``` ### Recognizing and processing strings with `match_s` When constructing a parser you will often need to recognize as well as process strings. For example, you want to recognize multiple integers in a line, extract these and then return them as a numeric vector. You're not interested in other elements like comments in these strings. This could be achieved by combining `satisfy()` and subsequently `%using%`, like: `satisfy(has_integers) %using% process_integers` where `has_integers` is a boolean function and `process_integers` is a function that both recognizes , extracts and rearranges numbers to a numeric vector. You will often find that `has_integers` and `process_integers` use the same regular expressions. Then it may be more efficient to combine these, which is what the `match_s()` function does. The `match_s()` parser takes a simple (not higher-order) function `s` to process the string from the current element and returns the result from that function. The function `s` has to be constructed in such a way that it returns the empty `list()` when the string does not satisfy the criteria that the user sets. **Example:** Here `numbers` is a function hat recognizes and returns numbers (in fact, positive integers) in a string: ```{r} numbers <- function(x) { m <- gregexpr("[[:digit:]]+", x) matches <- as.numeric(regmatches(x,m)[[1]]) if (length(matches)==0) { return(list()) # we signal parser failure when no numbers were found } else { return(matches) } } match_s(numbers)(" 101 12 187 # a comment on these numbers") ``` ### Functions that split a string and parse the substrings - `by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)`: where `p` is a parser - `by_symbol(p, finish = TRUE)`: where `p` is a parser Although you can use the string processing functions from the `base` or `stringr` packages to parse and process individual elements of a character vector it is also possible to parse substrings by first splitting a string. `by_split` uses a `split` pattern to first split the incoming string and then applies the parser `p` to it. `by_symbol` splits the incoming string to individual symbols and then applies the parser `p`. The `finish` boolean indicates whether the parser should completely consume the split string. Under the hood these functions use the function `strsplit()` and its `split`, `fixed` and `perl` arguments are passed on. **Examples:** ```{r} starts_with_a <- function(x) grepl("^a",x[1]) # don's forget to use satisfy(), it turns starts_with into a parser by_split(one_or_more(satisfy(starts_with_a)), ",", fixed = TRUE)("atggc,acggg,acttg") ``` ```{r} by_symbol(literal(">") %thenx% one_or_more(literal("b")), finish = FALSE)(">bb") ``` **Note**: Parsers become **slow** when using these two functions extensively. If that bothers you then you should use the `match_s` or `satisfy()`and `%using%` parsers together with string processing functions like `grepl` and `grep` or the ones from `stringr` to process strings. Those parsers will be much faster. ### Derived functions to recognize and modify empty lines - `EmptyLine()` - `Spacer()` - `MaybeEmpty()` The function `EmptyLine()` detects and returns empty line. Empty lines are either the string `""` or strings consisting entirely of space-like characters as identified by the regular expression `\\s`. `Spacer()` detects one or more consecutive empty lines and discards these whereas `MaybeEmpty()` detects zero or more empty lines and discards these. An additional function `Ignore()` ignores all lines, whether empty or not, until the end of the file. This is sometimes useful when the interesting part of a file has been parsed and all else can be ignored until the end of the file. Note that I write these functions with capital letters. I use this convention here and in the example below to indicate that these functions parse higher-order structures (higher than the individual strings) in the input. **Examples:** ```{r} EmptyLine()("") ``` ```{r} Spacer()(c(" ","\t\t\t", "atgcc")) ``` ```{r} MaybeEmpty()(c("ggacc","gatccg", "atgcc")) ``` ```{r} (literal("Interesting") %then% Ignore() %then% eof())(c("Interesting", LETTERS)) ``` ## Example application: a parser for *fasta* sequence files As an example of a somewhat realistic application let's try to write a parser for fasta-formatted files for mixed nucleotide and protein sequences. Such a fasta file could look like the example below ```{r, echo=FALSE, comment = NA} data("fastafile") cat(paste0(fastafile, collapse="\n")) ``` where the first two are nucleotide sequences and the last is a protein sequence[^1]. [^1]: It is not clear to me whether mixing of sequence types is allowed in the fasta format. I guess not, because a protein sequence consisting entirely of glutamate (G), alanine (A), threonine (T) and cysteine (C) would not be distinguishable from a nucleotide sequence. Such protein sequences would be extremely rare. Anyway I demonstrate here that apart from this ambiguous case it is easy to parse them from a single file. Since fasta files are text files we could read such a file using `readLines()`. Below we simulate the result of reading the file above by loading the `nuclfasta` and `protfasta` data sets present in the package. The consist of character vectors. ```{r,eval=FALSE} data("fastafile") ``` We can distinguish the following higher order components in a fasta file: - A **fasta** file: consists of one or more **sequence blocks** until the **end of the file**. - A **sequence block**: consist of a **header**[^2] and a **nucleotide sequence** or a **protein sequence**. A sequence block could be preceded by zero or more **empty lines**. - A **nucleotide sequence**: consists of one or more **nucleotide sequence strings**. - A **protein sequence**: consists of one or more **protein sequence strings**. - A **header** is a *string* that starts with a ">" immediately followed by a **title** without spaces. - A **nucleotide sequence string** is a *string* without spaces that consists *entirely* of symbols from the set `{G,A,T,C}`. - A **protein sequence string** is a *string* without spaces that consists *entirely* of symbols from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`. [^2]: Note that real fasta headers and sequences can have more complicated formats than I pretend here. It now becomes clear what I mean when I say that the package allows us to write transparent parsers: the description above of the structure of fasta files can be translated straight into code for a `Fasta()` parser: ```{r} Fasta <- function() { one_or_more(SequenceBlock()) %then% eof() } SequenceBlock <- function() { MaybeEmpty() %then% Header() %then% (NuclSequence() %or% ProtSequence()) } NuclSequence <- function() { one_or_more(NuclSequenceString()) } ProtSequence <- function() { one_or_more(ProtSequenceString()) } ``` Notice that these elements are functions taking no input, hence the empty argument brackets `()` behind their names. They can take input when needed, for example to change their behavior (like `match_n()`, or see the other example below). Now we need to define the string-parsers `Header()`, `NuclSequenceString()` and `ProtSequenceString()` that recognize and process these elements in the character vector `fastafile`. We use functions from `stringr` to do this in three helper functions, and we use `match_s()` to create parsers from these. ```{r} # returns the title after the ">" in the sequence header parse_header <- function(line) { # Study stringr::str_match() to understand what we do here m <- stringr::str_match(line, "^>(\\w+)") if (is.na(m[1])) { return(list()) # signal failure: no title found } else { return(m[2]) } } # returns a nucleotide sequence string parse_nucl_sequence_line <- function(line) { # The line must consist of GATC from the start (^) until the end ($) m <- stringr::str_match(line, "^([GATC]+)$") if (is.na(m[1])) { return(list()) # signal failure: not a valid nucleotide sequence string } else { return(m[2]) } } # returns a protein sequence string parse_prot_sequence_line <- function(line) { # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until the # end ($) m <- stringr::str_match(line, "^([ARNDBCEQZGHILKMFPSTWYV]+)$") if (is.na(m[1])) { return(list()) # signal failure: not a valid protein sequence string } else { return(m[2]) } } ``` Then we define the parsers. ```{r} Header <- function() { match_s(parse_header) } NuclSequenceString <- function() { match_s(parse_nucl_sequence_line) } ProtSequenceString <- function() { match_s(parse_prot_sequence_line) } ``` Now we have all the elements that we need to apply the `Fasta()` parser. ```{r} Fasta()(fastafile) ``` Apart from `match_s()`, we have used only the six fundamental parsers. Therefore, the output is almost the same as the parsed input. This is not very useful because it is difficult to extract the individual sequences and titles from it; we would have to write sort of parser again to process this output. To mend this, we have to modify the output of the parsers. The first thing that we will do is to let every sequence block be returned as an element of a list. To achieve this we extend the `SequenceBlock` parser by changing its output with the `%using%` operator: ```{r} SequenceBlock <- function() { MaybeEmpty() %then% Header() %then% (NuclSequence() %or% ProtSequence()) %using% function(x) list(x) } ``` Now the result is a list of three lists, one for each sequence block. ```{r} Fasta()(fastafile)[["L"]] ``` In principle, this output is easier to extract information from, but we can improve on it. First, we want the sequences to appear as one long string, not as separate character vectors corresponding to the lines in the sequence block. Therefore, we extend the `NuclSequence` and `ProtSequence` parsers collapsing their output: ```{r} NuclSequence <- function() { one_or_more(NuclSequenceString()) %using% function(x) paste0(x, collapse = "") } ProtSequence <- function() { one_or_more(ProtSequenceString()) %using% function(x) paste0(x, collapse="") } ``` Then we get ```{r} Fasta()(fastafile)[["L"]] ``` This looks much better: we know that the first element in each of these lists is the title and the second element is the complete sequence. Then why not just attach a name to these elements? This would make extracting the information even easier. Furthermore, we also report whether the sequence is a nucleotide or a protein sequence by adding a `type` tag. ```{r} Header <- function() { match_s(parse_header) %using% function(x) list(title = unlist(x)) } NuclSequence <- function() { one_or_more(NuclSequenceString()) %using% function(x) list(type = "Nucl", sequence = paste0(x, collapse="")) } ProtSequence <- function() { one_or_more(ProtSequenceString()) %using% function(x) list(type = "Prot", sequence = paste0(x, collapse="")) } ``` Finally, we have our desired output. ```{r} d <- Fasta()(fastafile)[["L"]] d ``` Let's present the result more concisely using the names of these elements: ```{r} invisible(lapply(d, function(x) {cat(x$type, x$title, x$sequence, "\n")})) ``` ## Example application: parsers with parameters In the examples above we showed how to create parsers without parameters. It is easy and useful to sometimes create parsers with parameters. The parameters are used to change the behavior of the parsers. For example, when writing online course material I use a simple structured question template that is converted to html when the syllabus is generated. It consists mostly of markdown content. Its parser makes use of parametrized parsers. The structure of such a question template document is as follows[^3]: [^3]: I simplified the template and code for this example. In fact the content is processed differently depending on the type of element, meaning that `Content()` is a function of `type`. Furthermore, questions are automatically numbered. ```{r, echo=FALSE} qtemp <- c( "#### INTRO", "## Title about a set of questions", "", "This is optional introductory text to a set of questions.", "Titles preceded by four hashes are not allowed in a question template.", "", "#### QUESTION", "This is the first question", "", "#### TIP", "This would be a tip. tips are optional, and multiple tips can be given. Tips are", "wrapped in hide-reveal style html elements.", "", "#### TIP", "This would be a second tip.", "", "#### ANSWER", "The answer to the question is optional and is wrapped in a hide-reveal html element.", "", "#### QUESTION", "This is the second question. No tips for this one", "", "#### ANSWER", "Answer to the second question" ) ``` ```{r, echo=FALSE, comment=NA} cat(paste0(c(qtemp,"",""), collapse="\n")) ``` I stored this example content in a vector `qtemp` to parse it later. You notice the recurring structure of a header with four hashes `####` and some text following it. These headers represent four types of elements: intro, question, tip and answer. Instead of writing separate parsers we could create a generic parser for such elements as: ```{r} HeaderAndContent <- function(type) { (Header(type) %then% Content()) %using% function(x) list(list(type=type, content=unlist(x))) } ``` Then we define each of the four parsers as: ```{r} Intro <- function() HeaderAndContent("intro") Question <- function() HeaderAndContent("question") Tip <- function() HeaderAndContent("tip") Answer <- function() HeaderAndContent("answer") ``` The function `Header(type)` is defined as ```{r} Header <- function(type) satisfy(header(type)) %ret% NULL # This must also be a generic function: a function that generates a function to # recognize a header of type 'type' header <- function(type) { function(x) grepl(paste0("^####\\s+", toupper(type), "\\s*"), x) } ``` The content consists of one or more lines not starting with `####`, which includes empty lines. We discard trailing empty lines. ```{r} Content <- function() { (one_or_more(match_s(content))) %using% function(x) stringr::str_trim(paste0(x,collapse="\n"), "right") } content <- function(x) { if (grepl("^####", x)) list() else x } ``` The complete template is defined as follows ```{r} Template <- function() { zero_or_more(Intro()) %then% one_or_more(QuestionBlock()) %then% eof() } ``` where `QuestionBlock()` is defined using the previously defined elements as ```{r} QuestionBlock <- function() { Question() %then% zero_or_more(Tip()) %then% zero_or_one(Answer()) %using% function(x) list(x) } ``` We can now parse the input. We wrap the `Template()` parser in the `reporter()` function to have proper error messaging and warnings, if applicable. Furthermore only the `L`-element, the parsed input, is returned. ```{r} reporter(Template())(qtemp) ``` ## Literature