Asterix language description
============================

Program structure
-----------------
Programs consist of an entry point definition followed by any number of
function definitions. Each definition starts with the function's name
and ends with a semicolon. The number sign (#) starts a line comment.
For example:

    # Start execution at the "main" function
    entry main;

    # Define "main" function
    main = ... ;

Outside of individual tokens (such as strings, names, and operators),
whitespace is not significant, and indentation does not affect the
meaning of the program.

Functions
---------
Functions can declare parameters, lexically scoped variables, and
dynamically scoped variables, all of which are optional.

    # Define function "f1" with the two parameters "a" and "b"
    f1(a, b) = ... ;

    # f2 has one argument and declares variables of both types
    f2(arg) : lexvar1, lexvar2 : dynvar = ... ;

    # f3 has no arguments, but dynamically scopes the "mode" variable
    f3 : : mode = ... ;

The "dynvar" variable is globally visible, but any changes made to it by
f2 or functions called by it are undone when f2 returns. Similarly with
"mode" in f3.

Lexically scoped variables will shadow dynamically scoped ones in cases
where they have the same name.
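
The restore-on-return behavior of dynamically scoped variables can be
modeled outside Asterix. The following Python sketch is only an analogy,
not how the compiler implements it; the names "variables", "dynamic",
"helper", and the value 10 are made up for illustration:

```python
from contextlib import contextmanager

# Hypothetical stand-in for Asterix's globally visible variables.
variables = {"mode": 0}

@contextmanager
def dynamic(name):
    """Save a global on entry and restore it on exit, like ": : name"."""
    saved = variables[name]
    try:
        yield
    finally:
        variables[name] = saved

def helper():
    # A callee sees (and may change) the caller's dynamically scoped value.
    variables["mode"] += 1
    return variables["mode"]

def f3():
    # Models "f3 : : mode = ... ;"
    with dynamic("mode"):
        variables["mode"] = 10
        return helper()

result = f3()
```

Here "result" is 11, but "mode" is back to 0 once f3 has returned.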

The equals sign (=) designates the start of a function's body.
A function body is an _alternation_ -- a list of alternatives separated
by vertical bar signs (|) -- and each alternative is a sequence of
_items_ such as conditions and/or operations:

    testsign(n) =
        { n < 0 } negative
      | { n > 0 } positive
      | zero ;

Here, the "testsign" function checks the sign of its argument "n", and
calls one of the functions "negative", "positive", or "zero" depending
on the sign of n. Conditions are one of several kinds of _statements_,
which are explained below. A name appearing by itself as an item is
a function call with no arguments. Function calls can also appear inside
of statements, as part of an expression. For example, the following
version of "testsign" is equivalent to the one above:

    testsign(n) =
        { n < 0 } { negative() }
      | { n > 0 } { positive() }
      | { zero() } ;

(Note: the function body structure presented here is a simplification;
there are actually 2 kinds of alternations, each corresponding to
a different failure mode. See the "Hard failure" section below.)

Statements
----------
Braces delimit statements, which can be assignments, comparisons, or
expressions. The last function call or expression evaluated by
a function determines the value it returns. So, for example, here's
a more conventional "sign" function:

    # Returns -1, 0, or 1 depending on the sign of n.
    sign(n) =
        { n < 0 } { -1 }
      | { n > 0 } {  1 }
      |           {  0 } ;

Assignments can modify variables or memory:

    f(addr, x) : var =
        { var = x + 1 }         # Set the variable "var" to x + 1
        { [addr] = 2 }          # Store 2 at the location pointed to by "addr"
        { [addr+8] = var } ;    # Store var (x + 1) 8 bytes after addr

Assignments also return the value that was assigned (or stored), so in
the last example, "f" would return the value of "var" that was stored to
addr+8 in the third assignment, which is x + 1 in this case.

Comparisons, however, do not evaluate to any (meaningful) value.
Instead, they may "succeed" and cause subsequent items to be evaluated,
or "fail" and cause the next alternative (if any) to be evaluated. If
all alternatives fail, the function fails.

    # Return the argument n, but only if it is positive
    pos(n) = { n > 0 } { n } ;

If n is zero or negative, the comparison { n > 0 } will fail, causing
"pos" itself to fail, since it has no alternative.

    absnonzero(n) =
        { pos(n) }
      | { n < 0 } { -n } ;

This returns the absolute value of n if n is either positive or
negative. If n is zero, both alternatives fail, so absnonzero fails too.

Values
------
Asterix is untyped, and all values are represented as numbers one way
or another. However, they can still be specified in several different
ways:

    Notation    Value
    --------    ------------------------------------------------
    'c'         ASCII value of the character 'c'
    '\n'        ASCII newline (10)
    '''         39
    '\''        39
    '\x2a'      42
    42          42
    0x2a        42
    -100        -100
    -0x64       -100
    "Line\n"    Memory address of a string "Line" with a newline
                and terminated by a null

Available escape characters are: \0 (0), \a (7), \b (8), \t (9),
\n (10), \v (11), \f (12), \r (13), \e (27), \" (34), \' (39), \\ (92),
and the \x<nn> notation shown above.

String constants are compiled inline and will be read-only. Although
they are null-terminated for the sake of sanity, nulls get no special
treatment and are usually not necessary.

Function addresses are stored in variables, and those variables can also
be accessed (and manipulated) at runtime. See the section on "Redefining
functions" below.

Operators
---------
The following operators are available and have the same meaning as
usual:

    Arithmetic: +, -, *, /, %
    Bitwise:    &, |, ^, ~
    Left shift: <<
    Comparison: <=, <, ==, !=, >=, >

Remember that, as mentioned above under "Statements", the comparison
operators do not return values, but instead "succeed" or "fail"
according to whether the condition is met. Failure behavior is one of
the most important parts of the language, and is described in more
detail under "Hard failure" and several sections following it.

Note: all comparison operators treat their operands as signed numbers.
There are currently no unsigned versions of them.

For bit shifting to the right, >> and >>> can be used. >> performs
a logical right shift (always shifting zeroes in from the left) whereas
>>> is an arithmetic (sign-preserving) right shift (the meaning of these
two is reversed compared to Java, while C only has the >> operator and
its behavior for negative operands may vary between implementations).
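
The difference between the two right shifts is easiest to see on
a negative operand. The following Python model is a sketch that assumes
64-bit machine words (the text above does not state the word size); the
function names are invented for this example:

```python
MASK64 = (1 << 64) - 1  # assumption: values are 64 bits wide

def to_signed(x):
    """Interpret the low 64 bits of x as a two's-complement number."""
    x &= MASK64
    return x - (1 << 64) if x & (1 << 63) else x

def shr_logical(x, n):
    """Model of Asterix ">>": zeroes are shifted in from the left."""
    return to_signed((x & MASK64) >> n)

def shr_arithmetic(x, n):
    """Model of Asterix ">>>": the sign bit is replicated."""
    return to_signed(to_signed(x) >> n)
```

Under that assumption, shr_logical(-8, 1) gives 0x7ffffffffffffffc,
whereas shr_arithmetic(-8, 1) gives -4. For non-negative operands both
functions agree.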

Unary & can be used to obtain the address of a variable, like in C.

$ is an operator similar to "sizeof", but not quite the same. See
"Memory allocation" below.

Values and operators are normally used within expressions in statements,
but there are a few other types of items, such as output sequences,
where they can also be used.

Output sequences
----------------
Angle brackets (<, >) enclose an _output sequence_, which causes data
to be written at the current output location in memory. That location
is then advanced to point past the data just written. Here's an
alternative to "testsign" that writes a string to the current output
buffer instead of calling a function:

    outsign(n) =
        { n < 0 } <"Negative!">
      | { n > 0 } <"Positive!">
      |           <"Zero!"> ;

Output sequences are actually a form of syntactic sugar, and the same
result could be achieved with function calls and statements instead.
Also, characters and strings are treated differently within output
sequences than in expressions. More on this below.

Memory access
-------------
As demonstrated earlier, memory references are surrounded by square
brackets ([, ]). They can also have a prefix to indicate the word size
of the memory being addressed:

    # Load consecutive values from memory and return their sum
    f1(addr) = {
        [q:addr]                # Load an 8-byte "quadword"
      + [d:addr + 8]            # Load a 4-byte "doubleword"
      + [w:addr + 8 + 4]        # Load a 2-byte "word"
      + [b:addr + 8 + 4 + 2]    # Load a byte
    } ;

If an "s" is prepended to the prefix in a load, the value will be
sign-extended:

    # Negative result if the byte at "addr" is between 0x80 and 0xff
    f2(addr) = { [sb:addr] } ;

The "s" prefix is not valid for stores or quadword loads.
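
To illustrate what the "s" prefix changes, here is a small Python model
of loads from a byte buffer. It is a sketch only: the "load" helper is
hypothetical, and little-endian byte order is an assumption that matters
for the multi-byte case but not for single bytes:

```python
def load(mem, addr, size, signed=False):
    """Read "size" bytes at "addr" (little-endian is an assumption)."""
    return int.from_bytes(mem[addr:addr + size], "little", signed=signed)

mem = bytes([0x80, 0xff, 0x7f])

plain_byte = load(mem, 0, 1)                 # like [b:addr]   -> 128
signed_byte = load(mem, 0, 1, signed=True)   # like [sb:addr]  -> -128
plain_word = load(mem, 0, 2)                 # like [w:addr]   -> 0xff80
signed_word = load(mem, 0, 2, signed=True)   # like [sw:addr]  -> -128
```

Bytes below 0x80 (such as the 0x7f at offset 2) load as the same value
with or without sign extension.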

Grouping
--------
Not every function is just a simple list of alternatives. Parentheses
can be used to group items (function calls, statements, etc.) together:

    writeflags(n) =
        ({ n & 1 == 1 } <"Bit 0 is set"> | <"Bit 0 is not set">)
        ({ n & 2 == 2 } <"Bit 1 is set"> | <"Bit 1 is not set">)
        # ...etc...

Note: in C, comparison operators like == have higher precedence than
bitwise operators like &. That is _not_ the case in Asterix. In the
example above, "{ n & 1 == 1 }" is the same as "{ (n & 1) == 1 }".
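
The practical effect of this precedence difference can be shown by
writing out both readings with explicit parentheses. This Python sketch
(the function names are invented) compares them for the "bit 1" test:

```python
def bit1_asterix(n):
    # Asterix reading of "n & 2 == 2": the bitwise AND binds tighter.
    return (n & 2) == 2

def bit1_c(n):
    # C reading of "n & 2 == 2": the comparison binds tighter, so the
    # expression becomes n & 1 and actually tests bit 0.
    return bool(n & (2 == 2))
```

For n = 1 the Asterix reading is false and the C reading is true; for
n = 2 it is the other way around.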

Loops
-----
Asterix does _not_ optimize tail calls. Iteration can be achieved by
prefixing an item with a _star_ or _asterisk_ (*), as in the following
example:

    # Divide n by 10 until it is less than 1000
    f(n) =
        *({ n >= 1000 } { n = n / 10 })
        { n } ;

The item prefixed by the star is repeated until it fails; then the loop
as a whole succeeds. Because of this, loops do not evaluate to any
meaningful value themselves, and instead always rely on side effects. In
the above example, the last statement evaluated before the loop exits
would be "{ n >= 1000 }"; hence the statement "{ n }" at the end to
cause the function to return the final value of n.

Repeated items need not all be grouped sequences. This "all" function
simply calls "next" until it fails:

    all = *next ;

Loops only succeed if the looped item (or one of the enclosed items, in
case of a group) fails normally. If a _hard failure_ occurs (see the
next section), the loop itself will also fail hard. On
the other hand, if the looped item(s) cannot fail, the loop is infinite.

Hard failure
------------
There are actually 2 different ways in which items can fail: they can
"fail normally" (or "soft-fail"), or they can "fail hard".

When a so-called hard failure occurs, regular alternatives will be
skipped in addition to the sequence that (hard-)failed. But there's
another kind of alternation, the _hard alternation_, that can provide
alternative evaluation sequences in case a hard failure occurs. These
_hard alternatives_ are delimited by a double vertical bar (||) instead
of a single one:

    f1 = (a | b) || c ;

If "a" fails normally (or "soft-fails"), "b" will be tried (evaluated)
next. But if it hard-fails, "b" will _not_ be tried and "c" will be
tried instead. If "b" fails, "c" will also be tried, regardless of
whether it soft-failed or hard-failed; hard alternatives catch both
types of failure. They have lower precedence than soft alternatives:

    # Same as the previous example
    f2 = a | b || c ;

    # Catches hard failure _only_ from b, not from a!
    f3 = a | (b || c) ;

Strictly speaking, every function body is a hard alternation (containing
at least one soft alternation, with at least one sequence, with at least
one item).
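
The interplay of the two kinds of alternation can be modeled with two
exception types. The Python sketch below is an analogy, not the actual
implementation; all names in it are invented, and the choice to let the
last alternative's failure propagate unchanged is an assumption that the
text above does not spell out:

```python
class SoftFail(Exception):
    """Normal ("soft") failure."""

class HardFail(Exception):
    """Hard failure."""

def soft_alt(*alternatives):
    """Model of a | b: only a soft failure tries the next alternative."""
    def run():
        for alt in alternatives:
            try:
                return alt()
            except SoftFail:
                continue
        raise SoftFail()
    return run

def hard_alt(*alternatives):
    """Model of a || b: either kind of failure tries the next one."""
    def run():
        failure = SoftFail()
        for alt in alternatives:
            try:
                return alt()
            except (SoftFail, HardFail) as exc:
                failure = exc
        # Assumption: the last alternative's failure propagates as-is.
        raise failure
    return run

def ok(name):
    """An item that succeeds and returns its own name."""
    return lambda: name

def fails(kind):
    """An item that always fails with the given kind of failure."""
    def run():
        raise kind()
    return run

def f1(a, b, c):
    # Models "f1 = (a | b) || c ;"
    return hard_alt(soft_alt(a, b), c)()
```

With this model, f1(fails(SoftFail), ok("b"), ok("c")) returns "b",
while f1(fails(HardFail), ok("b"), ok("c")) skips "b" and returns "c".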

Ignoring failures
-----------------
It is possible to ignore failure, forcing items to succeed, using the
question mark (?) and double question mark (??) prefixes to either
ignore soft failures only or ignore both types of failure, respectively:

    # f1 will fail hard if "a" fails hard, otherwise it will succeed
    f1 = ?a ??b ;

If "a" succeeds or fails normally, the "?a" item will succeed and "b"
will be evaluated. But if "a" fails hard, "?a" will also fail hard and
cause "f1" to fail hard. The "??b" item will always succeed, regardless
of whether "b" succeeds, fails normally, or fails hard.

Catching soft failures allows for conditional evaluation where no
alternative is needed, similar to "if-then" with no "else" block in many
other languages:

    # Always returns an odd number
    f2(n) =
        ?({ n % 2 == 0 } { n = n + 1 })
        { n } ;

Implicit failure conversion
---------------------------
So far, we have seen no more than one comparison in every sequence, and
it was always at the beginning. But when subsequent items compare values
or call functions that may fail, it becomes important to take into
account the _implicit failure conversion_ that occurs when an item fails
and that item is not the first item in a sequence.

The conversion happens by default, but can be suppressed by prepending
a dot (.) to the item. This is only valid for subsequent items, not the
first one in a sequence (where the dot would be redundant anyway).

    # Get character from string, but check the index first
    getchar(str, i) =
        { i >= 0 } . { i < 10 } { [b:str + i] } ;

If the dot were omitted, this would probably not do what you want:

    # Fails normally if i < 0, but fails hard if i >= 10!
    getchar(str, i) =
        { i >= 0 } { i < 10 } { [b:str + i] } ;

Based on the above example, failure conversion seems like a useless and
annoying mechanic, but it can be helpful when performing a sequence of
actions based on a condition that was met earlier, but where some of the
actions themselves may fail and require error handling:

    f =
        condition1 action1 action2 action3
      | condition2 action4 action5 action6 ;

If both "condition1" and "condition2" fail (normally), "f" will also
fail normally; but if any of the actions fail (normally or hard), "f"
will fail hard.

(This behavior is inspired by the META II language, wherein failure of
a subsequent item in a list causes the program to abort. This is done in
order to avoid having to backtrack when parsing multiple tokens of
input. See also "Parsing input" further below.)
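
The conversion rule can be modeled in a few lines. This Python sketch
mirrors the two "getchar" variants shown earlier in this section; the
"sequence" and "check" helpers and the exception classes are invented
for the model and are not part of Asterix:

```python
class SoftFail(Exception):
    """Normal ("soft") failure."""

class HardFail(Exception):
    """Hard failure."""

def check(condition):
    """Model of a comparison statement: soft-fail if it does not hold."""
    if not condition:
        raise SoftFail()

def sequence(*items):
    """Run (function, dotted) pairs in order and return the last result.

    A soft failure in any item but the first is converted into a hard
    failure, unless that item is marked as dotted.
    """
    result = None
    for index, (item, dotted) in enumerate(items):
        try:
            result = item()
        except SoftFail:
            if index == 0 or dotted:
                raise
            raise HardFail() from None
    return result

def getchar(s, i, dotted):
    # Models "{ i >= 0 } . { i < 10 } { [b:str + i] }" when dotted is
    # true, and the same sequence without the dot when it is false.
    return sequence(
        (lambda: check(i >= 0), False),
        (lambda: check(i < 10), dotted),
        (lambda: s[i], False),
    )

def outcome(call):
    """Describe how a call ended, for demonstration purposes."""
    try:
        return call()
    except SoftFail:
        return "soft"
    except HardFail:
        return "hard"
```

With s = b"0123456789", outcome(lambda: getchar(s, 10, True)) is "soft",
but outcome(lambda: getchar(s, 10, False)) is "hard". A negative index
gives "soft" in both variants, because the first item is never
converted.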

Reject items
------------
An exclamation mark (!) item can be used to explicitly trigger a (soft)
failure. A double exclamation mark (!!) triggers a hard failure. Beware
that alternatives can still catch such failures, and that implicit
failure conversion can still occur:

    # Hard-fails if "bad" either succeeds or hard-fails,
    # otherwise (if "bad" soft-fails) succeeds and returns n
    f1(n) = { bad(n) } . !! | { n } ;
    # Equivalent to f1 due to implicit failure conversion
    f2(n) = { bad(n) } ! | { n } ;

    # Always fails, since there is no alternative
    f3(n) = { bad(n) } . ! ;

    # Returns n unless "bad" hard-fails
    # (the exclamation mark does _not_ skip the alternative!)
    f4(n) = { bad(n) } . ! | { n } ;
    # Same thing
    f5(n) = ?{ bad(n) } { n } ;

    # Soft-fails if "bad" succeeds; succeeds and returns n
    # if "bad" soft-fails; hard-fails if "bad" hard-fails
    f6(n) : c =
        (
            { bad(n) } { c = 1 }
          | { c = 0 }
        )
        . { c == 0 } { n } ;

    # Equivalent to f6
    f7(n) : c =
        { c = 0 }
        ?({ bad(n) } { c = 1 })
        . { c == 0 } { n } ;

Reject items are not strictly necessary, but may let you avoid writing
things like "{ 0 == 1 }" to trigger a failure. As the above examples
demonstrate, they tend to make things messy and confusing and should be
avoided if possible. They may occasionally be useful in cases where you
need to act upon a failure (to backtrack, for example) but then
propagate it. Here's one more example to demonstrate this:

    trynext : tmpx =
        { tmpx = x }
        # "next" can modify x
        . next
      | { x = tmpx } . ! ;

If "next" fails, this will reset "x" back to its original value (before
"next" was called), then fail. If "trynext" had instead declared "x" as
a dynamically scoped variable, it would _always_ be reset even when
"next" succeeds.

Memory allocation
-----------------
The easiest way to allocate some unused memory and obtain its address is
to declare a function variable with a size. Adding a number between
square brackets ([, ]) after the variable name causes that many bytes of
memory to be reserved on the stack. The variable will be initialized
with the address of that memory. This also works for dynamically scoped
variables. Allocation sizes are rounded up if necessary to preserve
stack alignment.

    # Initialize "a" with a 16-byte buffer and "b" with a 64-byte buffer
    f1 : a[16] : b[64] = ... ;

Since memory is allocated on the stack, it is effectively freed when the
function returns.

There is a special operator, the dollar sign ($), that can return the
memory size for variables declared this way. It will return the declared
size, not the rounded up size:

    # Returns 42
    f2 : buf[42] = { $buf } ;

Beware that it does not track assignments, only declarations. Also, for
function arguments and variables not initialized to allocated memory,
the "size" returned will be zero:

    f3 : : buf[128] = { f4() + f5() } ;

    # Returns 128 if it was called by f3
    f4 = { $buf } ;

    # This will return zero, not 128
    f5 : var =
        { var = buf }
        { $var } ;

There's one other use of the $ operator. When it appears by itself as
a (sub-)expression, it evaluates to the number of characters in the last
string constant before it (within the same function):

    # I know it's a bit late for this, but you won't mind, will you?
    main = { write(1, "Hello world\n", $) } ;

Alloca
------
Memory allocation by variable declarations works only when the amount of
memory to be allocated is a constant.

    # Error!
    f1(size) : var[size] = ... ;

To allocate a variable amount of memory on the stack, an "alloca" item
needs to be used instead:

    f2(size) : var =
        alloca:{size} -> var (
            ...
        ) ;

Keep in mind that "$var" would not return size here, but zero; it only
works for declarations.

There are no heap management functions like "malloc" or "free". You'll
have to either roll your own, or... let the operating system do it for
you.

System calls
------------
Invoking kernel system calls is easy once you know the system call
number and what arguments it expects:

    write(fd, buf, len) = { syscall(1, fd, buf, len) } ;

    # mmap: anonymous private mapping, so fd is -1 and offset is 0
    malloc(n) = { syscall(9, 0, n+0xfff & ~0xfff, 3, 34, -1, 0) } ;

Note: although this looks like an ordinary function call, "syscall" is
a special form, not a function; you cannot use "syscall" as a variable
and system calls are not modified or otherwise wrapped like libc does.

Of course, one slight issue with doing "malloc" like this is that the
amount of memory being allocated must be rounded up to the page size.
Also...

    # munmap needs both the address and length
    free(addr, n) = { syscall(11, addr, n+0xfff & ~0xfff) == 0 } ;

There are various ways to deal with this, of course. Be creative...
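
For readers unfamiliar with the rounding idiom and the magic numbers in
the "malloc" example, the following Python sketch spells them out. It
assumes 4096-byte pages and the Linux x86-64 constant values, which
match the system call numbers used above; the helper name is invented:

```python
PAGE_MASK = 0xfff  # assumption: pages are 4096 bytes

def round_up_to_page(n):
    """Model of the Asterix expression "n+0xfff & ~0xfff"."""
    return (n + PAGE_MASK) & ~PAGE_MASK

# Linux x86-64 values behind the literal arguments 3 and 34.
PROT_READ, PROT_WRITE = 1, 2
MAP_PRIVATE, MAP_ANONYMOUS = 0x02, 0x20

prot = PROT_READ | PROT_WRITE         # 3
flags = MAP_PRIVATE | MAP_ANONYMOUS   # 34
```

So a request for 1 byte and a request for 4096 bytes both reserve one
full page, and a request for 4097 bytes reserves two.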

Parsing input
-------------
Asterix's main strength is actually the way in which it lets you parse
arbitrary data. Statements aren't the only way of doing comparisons; an
item consisting of just a character or string will actually attempt to
match that character or string at the current position of an input
buffer:

    expr(arg1, arg2) =
        '+' { arg1 + arg2 }
      | '-' { arg1 - arg2 }
      | '*' { arg1 * arg2 }
      | '/' { arg1 / arg2 } ;

    keywords = "if" | "while" | "for" ;

    nulls = *'\0' ;

Further below we'll explain how this works and where input comes
from. Besides characters and strings, you can also match a range of
characters and even binary data:

    digit1 = [0-9] ;

    elf = %b:0x7f "ELF" <"Found an ELF header\n"> ;

Don't forget to use sign-extension where necessary:

    # %d:-1 would zero-extend the loaded doubleword and never match
    minus1 = %sd:-1 <"Found dword -1\n"> ;

Note: character range unions (as seen in _regular expressions_ in some
other languages) are not valid in Asterix. Use alternation instead:

    # Error -- this will not work
    # hexdigit = [0-9A-Fa-f] ;

    # This works just fine
    hexdigit = [0-9] | [A-F] | [a-f] ;

Note 2: binary data is always read in native byte order. Use byte,
character, or string matching to read data independent of endianness.
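
To see why byte order matters, consider reading the four bytes of the
ELF magic as a single doubleword. This Python sketch only illustrates
the arithmetic; it says nothing about which byte order a particular
machine uses:

```python
header = b"\x7fELF"

# The same four bytes give two different numbers, depending on order.
as_little_endian = int.from_bytes(header, "little")   # 0x464c457f
as_big_endian = int.from_bytes(header, "big")         # 0x7f454c46

# Reading byte by byte (like %b:0x7f "ELF") does not depend on it.
first_byte = header[0]
rest = header[1:]
```

A doubleword match against one of these constants would only succeed on
machines with the corresponding byte order, whereas the byte-plus-string
form works on either.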

Binary ranges also use square brackets:

    nonascii1 = %b:[0-31] | %b:[128-255] ;

It is possible to use expressions in binary matches, but if they use
operators, in most cases they must be put inside parentheses:

    digit2 = %b:['0'-'9'] ;

    digit3 = %b:[(48+0) - (48+9)] ;

Binary matches can also be used to match variable data, and there's one
other special form that lets you match variable strings:

    # Match a specific digit n, where 0 <= n <= 9
    digitn(n) = %b:(n + '0') ;

    # Match a range of digits
    digitrange(m, n) = %b:[(m + '0')-(n + '0')] ;

    # Match a specific string constant
    strconst(str, len) = '"' %s:str,len '"' ;

Matching operators can be modified to invert the match range using
a caret (^), or to look ahead (preventing the input position from
advancing even if the match was successful) with the ampersand (&),
or both:

    nondigit = ^[0-9] ;

    nonascii2 = ^%b:[32-127] ;

    # Avoid matching left shift (<<)
    lessthan = &^"<<" . '<' ;

    # The opposite of [0-9] | [A-Z] | [a-z] | '_'
    endofname = &^[0-9] . &^[A-Z] . &^[a-z] . &^'_' ;

    # See also the "keywords" example above
    keyword = keywords endofname ;

    notminus1 = ^%sd:-1 ;

Note: unlike the star and question mark, the caret and ampersand do not
prefix arbitrary items; instead, they modify the behavior of matching
operators directly. So they can only be used on match items.

The endofname example acts like [^0-9A-Za-z_] would in regular
expressions (which Asterix does not have as such).

    # Skip any amount of spaces, tabs, and/or newlines
    ws = *(' ' | '\t' | '\n') ;

    # Parse expr2 + expr2 + ... + expr2
    expr1 = expr2 *(ws . '+' ws expr2) ;

Note how the loop exits (successfully) when it no longer finds a plus
sign after an expr2, but if it finds a plus without a valid expr2
following it, the loop (and therefore expr1) will hard-fail.
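
The control flow of "expr1" can be traced with a hand-written model.
The Python sketch below makes several assumptions that are not in the
text: "expr2" is taken to be a run of decimal digits, the results are
summed to give the function something to return, and the class and
exception names are invented:

```python
class SoftFail(Exception):
    """Normal ("soft") failure."""

class HardFail(Exception):
    """Hard failure."""

class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0  # plays the role of the input cursor "in"

    def char(self, c):
        """Match item 'c': soft-fails without consuming input."""
        if self.pos < len(self.text) and self.text[self.pos] == c:
            self.pos += 1
        else:
            raise SoftFail()

    def ws(self):
        # ws = *(' ' | '\t' | '\n') ;  (a loop, so it cannot fail)
        while self.pos < len(self.text) and self.text[self.pos] in " \t\n":
            self.pos += 1

    def expr2(self):
        # Hypothetical stand-in: one or more decimal digits.
        start = self.pos
        while (self.pos < len(self.text)
               and self.text[self.pos] in "0123456789"):
            self.pos += 1
        if self.pos == start:
            raise SoftFail()
        return int(self.text[start:self.pos])

    def expr1(self):
        # expr1 = expr2 *(ws . '+' ws expr2) ;
        total = self.expr2()
        while True:
            self.ws()              # first item of the group
            try:
                self.char("+")     # dotted: a soft failure ends the loop
            except SoftFail:
                break
            self.ws()
            try:
                total += self.expr2()  # not dotted: soft becomes hard
            except SoftFail:
                raise HardFail() from None
        return total
```

In this model Parser("1 + 2+3").expr1() returns 6, while
Parser("1 +").expr1() raises HardFail because the plus sign is not
followed by an expr2.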

Generating output
-----------------
Some basic examples of output sequences with strings have already been
shown. They can also write characters and binary data:

    outnewline = <'\n'> ;

    outdigit(n) = <%b:(n + '0')> ;

Note: it is not necessary to distinguish between signed and unsigned
values in binary output. Output and memory stores may truncate values,
but they never sign-extend them.

Actually, the reason for calling them "output sequences" is that they
can contain multiple _output items_:

    outdigit2(n) = <"Here's a " %b:(n + '0') '\n'> ;

    # Same thing, but not necessary
    outdigit3(n) = <"Here's a "> <%b:(n + '0')> <'\n'> ;

Bare, unqualified expressions are also allowed in an output sequence,
although like binary range matches, they must be parenthesized if they
use operators.

    something(x) = <"What is an " x "...?\n"> ;

The meaning of this form is explained in the "Parsing and output
details" section below.

Finally, output sequences can copy arbitrary strings, which are
specified as a pair of expressions (the string's address and its length)
separated by a comma. Again, expressions with operators must be
parenthesized.

    copyfrom(start, end) = <start,(end - start)> ;

Assigning and capturing items
-----------------------------
Character range and binary range matches wouldn't be very useful without
being able to see which character or value was actually matched. There
are two ways to do this. First, character and binary matching items
return the character or value they matched. So the result of the "digit"
functions in the examples above can be assigned to a variable or even
used directly in an output sequence:

    digit1 = [0-9] ;

    # "c" will be assigned the ASCII value of the digit
    showdigit1 : c =
        { c = digit1() }
        <"Found digit " %b:c '\n'> ;

There's also a shorter form available for use with arbitrary items:

    # Same as showdigit1
    showdigit2 : c =
        c=digit1
        <"Found digit " %b:c '\n'> ;

    # There's no need for functions like digit1, really
    showdigit3 : c =
        c=[0-9]
        <"Found digit " %b:c '\n'> ;

    # In case you're sure you'll find a digit...
    showdigit4 = <"Found digit " %b:digit1() '\n'> ;

Be careful with that last one. Output sequences are not "atomic", and
the string "Found digit " will be written to the output buffer before
digit1 is invoked to check for a digit.

    # Hard-fails if digit1 fails, unlike showdigit4
    showdigit5 = <"Found digit "> <%b:digit1()> <'\n'> ;

This isn't exactly the same as showdigit4, either. Output sequences do
count as single items, so if digit1 fails in showdigit4, the function
will fail normally. However, in showdigit5, the digit1 function is
invoked in a subsequent item, not the first. So if it doesn't find
a digit, showdigit5 will hard-fail instead.

Assigning items can assign the result of any other item, not just
functions and match items. However, string matches do not actually
return any meaningful value. Also, assigning from a group of items would
only give you the result of the last one. In such cases, _capturing
items_ may be more useful instead:

    # See also the "keyword" example above
    showkeyword : kw, len =
        kw,len=keyword <"Found keyword: " kw,len '\n'> ;

Capturing items (or _captures_) mark the current input buffer position
before and after an item is invoked, and assign the initial address and
the match length (that is, the displacement of the input position after
the captured item returns) to the chosen pair of variables. The example
also demonstrates how this works together with string copies in output
sequences.

Note: captures are not atomic either; if the item being invoked fails,
the first variable will get assigned, but the second will not.

Parsing and output details
--------------------------
Matching items and output sequences have been introduced and described
as though their functionality is built directly into the language. But
that's not truly the case. Although the syntactic forms are part of the
language, they actually work by invoking functions with predetermined
names, with the data to be matched (or written) passed as parameters to
those functions.

These functions must be part of the program, and are usually taken from
the file "compile.ax" to provide the functionality described above,
although it would be possible for a program to define its own versions
of them instead.

For example, the "matchchar" function is called to match a character or
character range:

    alpha1 = [a-z] ;

    # Same thing
    alpha2 = { matchchar('a', 'z', 0) } ;

(Note: the third argument to matchchar tells it which modifiers (caret
and/or ampersand) are to be applied. See the "Syntactic sugar overview"
below.)

One such function, called "outval", is left for the program itself to
define. This function gets called for unqualified expressions in output
sequences (see "Generating output" above), and takes the value of that
expression as its single argument:

    outval(x) = ... ;

    # Calls outval(x)
    display(x) = <"I know what an " x " is now!\n"> ;

Again, beware the consequences of outval failing (if that's
a possibility).

Besides these functions, the input and output buffers that they act upon
are also defined by variables with specific names. The most important
two of these are "in" and "out", which are the locations where input
will be matched and output will be written to, respectively. There is
also "inmax", which should point to the end of the input buffer (after
which no more input will be matched); "outmax", the end of the output
buffer; and "instart" and "outstart", the starting addresses of these
buffers.

These variables are referenced by the match and output functions, not
by the matching and output items themselves. The one exception is "in",
which capturing items read directly. The functions refer to the
variables globally (like dynamically scoped variables); they are not
passed as parameters. Declaring them as lexical variables would usually
just result in confusing behavior.

The next section shows the complete list of function names and variables
used by all the syntactic-sugary matching and output items and the
functions implementing them.

Syntactic sugar overview
------------------------
The following table shows the functions that are called by code that
uses match items and output items:

    Code                Equivalent item/statement           Notes
    -----------------   ------------------------------      -----------
    'c'                 { matchchar('c', 'c', 0) }
    ^'c'                { matchchar('c', 'c', 1) }
    &'c'                { matchchar('c', 'c', 2) }
    &^'c'               { matchchar('c', 'c', 3) }
    [0-9]               { matchchar('0', '9', 0) }
    %b:[x-y]            { matchb(x, y, 0) }
    %w:[x-y]            { matchw(x, y, 0) }
    %d:[x-y]            { matchd(x, y, 0) }
    %q:[x-y]            { matchq(x, y, 0) }
    %sb:[x-y]           { matchsb(x, y, 0) }
    %sw:[x-y]           { matchsw(x, y, 0) }
    %sd:[x-y]           { matchsd(x, y, 0) }
    %b                  inb                                 [1]
    &%b                 peekb                               [1]
    "str"               { matchstr("str", 3, 0) }           [2]
    &^"<<"              { matchstr("<<", 2, 3) }            [3]
    <'c'>               { outchar('c') }
    <"str">             { outstr("str", 3) }                [2]
    <%b:c>              { outb(c) }                         [1] [4]
    <str,len>           { outstr(str, len) }
    <x>                 { outval(x) }
    func                { func() }
    var1,var2=item      { var1 = in } item { var2 = in - var1 }  [5]

[1] Similar for %sb, %w, %sw, etc.
[2] The second parameter is the length of the string to match or write
[3] Flags (third parameter) are the same as for character/binary matches
[4] %sb etc. are accepted, but outsb, outsw etc. are not defined
[5] Implicit failure conversion applies to item only if it would apply
    to the var1,var2=item sequence as a whole

Variables referenced by these functions:

    Variable            Purpose
    ----------          -----------------------------------------
    in                  Input cursor (memory address)
    inmax               Input limit (no match at or beyond this address)
    instart             Start of input buffer [1]
    out                 Output cursor
    outmax              Output limit (no output at/beyond this address)
    outstart            Start of output buffer [1]

[1] These are only referenced by some error handling functions, but are
    useful to have so you know where your own input/output starts
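
The flag values passed as the third argument in the first table (0, 1
for the caret, 2 for the ampersand, 3 for both) look like a two-bit
mask. The Python model below is a guess at how a function like
"matchchar" could interpret them; it is not taken from "compile.ax".
Returning None to signal failure, the "Input" class, and the rule that
an inverted match also fails at the end of input are all assumptions:

```python
INVERT = 1      # caret (^)
LOOKAHEAD = 2   # ampersand (&)

class Input:
    def __init__(self, data):
        self.data = data
        self.pos = 0  # plays the role of "in"; len(data) acts as "inmax"

def matchchar(inp, lo, hi, flags):
    """Return the matched character code, or None on (soft) failure."""
    if inp.pos >= len(inp.data):
        return None                      # nothing left to match
    c = inp.data[inp.pos]
    in_range = lo <= c <= hi
    if in_range == bool(flags & INVERT):
        return None                      # wrong side of the range
    if not flags & LOOKAHEAD:
        inp.pos += 1                     # consume the character
    return c
```

With the input b"7a", a call with flags 2 (like &[0-9]) returns the code
for '7' but leaves the cursor at 0, whereas the same call with flags 0
advances it to 1.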

Redefining functions
--------------------
Since all functions are referenced through variables, it's possible to
redefine them, effectively changing the behavior of functions (such as
those implementing the matching and/or output items).

For example, to change the way characters or character ranges are
matched:

    mymatchchar(min, max, flags) = ... ;

    # Calls mymatchchar('x', 'x', 0)
    f : matchchar =
        { matchchar = mymatchchar }
        . 'x' ;

Usually, it's a good idea to scope such redefinitions by declaring the
redefined function as a lexical or dynamic variable. A lexically scoped
redefinition (such as in the above example) affects calls in the
redefining function only (by shadowing the global/dynamic function
variable), whereas a dynamically scoped redefinition would affect both
the redefining function _and_ all functions called by it.

Common traps and pitfalls
-------------------------
This section describes a few mistakes that are easily made when
programming in this language, and how to avoid them where possible.

* A function call with one or more arguments can only be made inside
  a statement; depending on how the code is structured and formatted,
  it can be very easy to forget the braces:

      { inlen = read(0, in, $in) }
      # ...
      ({ inlen >= 0 } | die(4, "Read error\n", $))
                            ^ --- Compile error!

      # Fixed version
      ({ inlen >= 0 } | { die(4, "Read error\n", $) })

  When the mistake results in a syntax error (as in the above example),
  it should be easily found and fixed and will not be too much of
  a problem. But it can become much harder if the wrongly written code
  is syntactically still valid:

      list(element) = element *(',' element) ;
      argument : n0, n1 = n0,n1=name { defarg(n0, n1) } ;
      # ...
      arguments = '(' list(argument) ')' ;

      # Same thing! The compiler does not check function arguments.
      arguments = '(' list argument ')' ;

      # What was intended
      arguments = '(' { list(argument) } ')' ;

* The function defined as the entry point (i.e. named by the "entry"
  line at the beginning) must never return. Use the exit(n) system call
  to terminate the program. In addition, it should catch any otherwise
  unhandled failures. Hence, "main" functions would usually end like
  this (6 is the Asterix compiler's exit code for unhandled errors;
  a more or less arbitrary choice):

      { exit(0) }

   || { die(6, "Unexpected error\n", $) } ;

* Asterix's BNF-like control structure can be useful, but also difficult
  and confusing. A failing item causes the remainder of its sequence to
  be skipped and the next alternative in the group to be tried. If there
  isn't one, the failure propagates upwards out of the group, possibly
  out of the function or even multiple functions, until it is caught by
  an alternative, where execution then proceeds.

  Because of this, failures that you didn't expect can be confusing --
  but so can items that _succeed_ when they shouldn't. For example, if
  a function accidentally parses too much input, a failure may happen
  later on in a different function or even at the end of input (EOF):

      string = '"' *strchar '"' ;

      strchar : c =
          # Handle some escaped characters specially
          '\\' (
              't' <%b:9>
            | 'n' <%b:10>
              # etc...
              # Other escaped characters are just copied as-is
            | c=%b <%b:c>
          )
          # Non-escaped characters are simply copied to the output
          # (In this version, unfortunately, that also includes the
          # string's terminating quote and all remaining input)
        | c=%b <%b:c> ;

  That string function, if it finds the initial quote, will consume all
  input until it finally hits EOF, after which it will hard-fail on its
  third item (the second quote that is supposed to mark the end of the
  string), leaving the input cursor at EOF as the only indication of
  what could have caused this hard failure.