Asterix language description
============================

Program structure
-----------------

Programs consist of an entry point definition followed by any number of
functions. Each function starts with the function's name and ends with a
semicolon. The number sign (#) starts a line comment. For example:

    # Start execution at the "main" function
    entry main;

    # Define "main" function
    main = ... ;

Outside of individual tokens (such as strings, names, and operators),
whitespace is not significant, and indentation does not affect the
meaning of the program.

Functions
---------

Functions can declare parameters, lexically scoped variables, and
dynamically scoped variables, all of which are optional.

    # Define function "f1" with the two parameters "a" and "b"
    f1(a, b) = ... ;

    # f2 has one argument and declares variables of both types
    f2(arg) : lexvar1, lexvar2 : dynvar = ... ;

    # f3 has no arguments, but dynamically scopes the "mode" variable
    f3 : : mode = ... ;

The "dynvar" variable is globally visible, but any changes made to it by
f2 or functions called by it are undone when f2 returns. Similarly with
"mode" in f3. Lexically scoped variables will shadow dynamically scoped
ones in cases where they have the same name.

The equals sign (=) designates the start of a function's body. A
function body is an _alternation_ -- a list of alternatives separated by
vertical bar signs (|) -- and each alternative is a sequence of _items_
such as conditions and/or operations:

    testsign(n) = { n < 0 } negative
                | { n > 0 } positive
                | zero ;

Here, the "testsign" function checks the sign of its argument "n", and
calls one of the functions "negative", "positive", or "zero" depending
on the sign of n.

Conditions are one of several kinds of _statements_, which are explained
below.

A name appearing by itself as an item is a function call with no
arguments. Function calls can also appear inside of statements, as part
of an expression.
For example, the following version of "testsign" is equivalent to the
one above:

    testsign(n) = { n < 0 } { negative() }
                | { n > 0 } { positive() }
                | { zero() } ;

(Note: the function body structure presented here is a simplification;
there are actually 2 kinds of alternations, each corresponding to a
different failure mode. See the "Hard failure" section below.)

Statements
----------

Braces delimit statements, which can be assignments, comparisons, or
expressions. The last function call or expression evaluated by a
function determines the value it returns. So, for example, here's a more
conventional "sign" function:

    # Returns -1, 0, or 1 depending on the sign of n.
    sign(n) = { n < 0 } { -1 }
            | { n > 0 } { 1 }
            | zero { 0 } ;

Assignments can modify variables or memory:

    f(addr, x) : var =
        { var = x + 1 }         # Set the variable "var" to x + 1
        { [addr] = 2 }          # Store 2 at the location pointed to by "addr"
        { [addr+8] = var } ;    # Store var (x + 1) 8 bytes after addr

Assignments also return the value that was assigned (or stored), so in
the last example, "f" would return the value of "var" that was stored to
addr+8 in the third assignment, which is x + 1 in this case.

Comparisons, however, do not evaluate to any (meaningful) value.
Instead, they may "succeed" and cause subsequent items to be evaluated,
or "fail" and cause the next alternative (if any) to be evaluated. If
all alternatives fail, the function fails.

    # Return the argument n, but only if it is positive
    pos(n) = { n > 0 } { n } ;

If n is zero or negative, the comparison { n > 0 } will fail, causing
"pos" itself to fail, since it has no alternative.

    absnonzero(n) = { pos(n) } | { n < 0 } { -n } ;

This would return the absolute value of n if it was either positive or
negative. But if n was zero, absnonzero would also fail.

Values
------

Asterix is untyped, and all values are represented as numbers one way or
the other.
However, they can still be specified in several different ways:

    Notation    Value
    --------    ------------------------------------------------
    'c'         ASCII value of the character 'c'
    '\n'        ASCII newline (10)
    '''         39
    '\''        39
    '\x2a'      42
    42          42
    0x2a        42
    -100        -100
    -0x64       -100
    "Line\n"    Memory address of a string "Line" with a newline
                and terminated by a null

Available escape characters are: \0 (0), \a (7), \b (8), \t (9),
\n (10), \v (11), \f (12), \r (13), \e (27), \" (34), \' (39), \\ (92),
and the \x notation shown above.

String constants are compiled inline and will be read-only. Although
they are null-terminated for the sake of sanity, nulls get no special
treatment and are usually not necessary.

Function addresses are stored in variables, and those variables can also
be accessed (and manipulated) at runtime. See the section on "Redefining
functions" below.

Operators
---------

The following operators are available and have the same meaning as
usual:

    Arithmetic: +, -, *, /, %
    Bitwise:    &, |, ^, ~
    Left shift: <<
    Comparison: <=, <, ==, !=, >=, >

Remember that, as mentioned above under "Statements", the comparison
operators do not return values, but instead "succeed" or "fail"
according to whether the condition is met. Failure behavior is one of
the most important parts of the language, and is described in more
detail under "Hard failure" and several sections following it.

Note: all comparison operators treat their operands as signed numbers.
There are currently no unsigned versions of them.

For bit shifting to the right, >> and >>> can be used. >> performs a
logical right shift (always shifting zeroes in from the left), whereas
>>> is an arithmetic (sign-preserving) right shift. (The meaning of
these two is reversed compared to Java, while C only has the >> operator
and its behavior for negative operands may vary between
implementations.)

Unary & can be used to obtain the address of a variable, like in C.

$ is an operator similar to "sizeof", but not quite the same.
See "Memory allocation" below.

Values and operators are normally used within expressions in
statements, but there are a few other types of items, such as output
sequences, where they can also be used.

Output sequences
----------------

Angle brackets (<, >) enclose an _output sequence_, which causes data to
be written at an output location somewhere in memory, which is then
incremented to point past the data just written. Here's an alternative
to "testsign" that writes a string to the current output buffer instead
of calling a function:

    outsign(n) = { n < 0 } <"Negative!">
               | { n > 0 } <"Positive!">
               | zero <"Zero!"> ;

Output sequences are actually a form of syntactic sugar, and the same
result could be achieved with function calls and statements instead.
Also, characters and strings are treated differently within output
sequences than in expressions. More on this below.

Memory access
-------------

As demonstrated earlier, memory references are surrounded by square
brackets ([, ]). They can also have a prefix to indicate the word size
of the memory being addressed:

    # Load consecutive values from memory and return their sum
    f1(addr) = {
        [q:addr]                    # Load an 8-byte "quadword"
        + [d:addr + 8]              # Load a 4-byte "doubleword"
        + [w:addr + 8 + 4]          # Load a 2-byte "word"
        + [b:addr + 8 + 4 + 2]      # Load a byte
    } ;

If an "s" is prepended to the prefix in a load, the value will be
sign-extended:

    # Negative result if the byte at "addr" is between 0x80 and 0xff
    f2(addr) = { [sb:addr] } ;

The "s" prefix is not valid for stores or quadword loads.

Grouping
--------

Not every function is just a simple list of alternatives. Parentheses
can be used to group items (function calls, statements, etc.) together:

    writeflags(n) =
        ({ n & 1 == 1 } <"Bit 0 is set"> | <"Bit 0 is not set">)
        ({ n & 2 == 2 } <"Bit 1 is set"> | <"Bit 1 is not set">)
        # ...etc...

Note: in C, comparison operators like == have higher precedence than
bitwise operators like &. That is _not_ the case in Asterix.
In the example above, "{ n & 1 == 1 }" is the same as
"{ (n & 1) == 1 }".

Loops
-----

Asterix is _not_ tail-recursive. Iteration can be achieved by prefixing
an item with a _star_ or _asterisk_ (*), as in the following example:

    # Divide n by 10 until it is less than 1000
    f(n) = *({ n >= 1000 } { n = n / 10 }) { n } ;

The item prefixed by the star is repeated until it fails; then the loop
as a whole succeeds. Because of this, loops do not evaluate to any
meaningful value themselves, and instead always rely on side effects. In
the above example, the last statement evaluated before the loop exits
would be "{ n >= 1000 }"; hence the statement "{ n }" at the end to
cause the function to return the final value of n.

Repeated items need not all be grouped sequences. This "all" function
simply calls "next" until it fails:

    all = *next ;

Loops only succeed if the looped item (or one of the enclosed items, in
case of a group) fails normally. If a _hard failure_ occurs (hard
failures are described in the next section), the loop itself will also
fail hard. On the other hand, if the looped item(s) cannot fail, the
loop is infinite.

Hard failure
------------

There are actually 2 different ways in which items can fail: they can
"fail normally" (or "soft-fail"), or they can "fail hard". When a hard
failure occurs, regular alternatives are skipped in addition to the
sequence that (hard-)failed. But there's another kind of alternation,
the _hard alternation_, that can provide alternative evaluation
sequences in case a hard failure occurs. These _hard alternatives_ are
delimited by a double vertical bar (||) instead of a single one:

    f1 = (a | b) || c ;

If "a" fails normally (or "soft-fails"), "b" will be tried (evaluated)
next. But if "a" hard-fails, "b" will _not_ be tried and "c" will be
tried instead. If "b" fails, "c" will also be tried, regardless of
whether "b" soft-failed or hard-failed; hard alternatives catch both
types of failure.
They have lower precedence than soft alternatives:

    # Same as the previous example
    f2 = a | b || c ;

    # Catches hard failure _only_ from b, not from a!
    f3 = a | (b || c) ;

Strictly speaking, every function body is a hard alternation (containing
at least one soft alternation, with at least one sequence, with at least
one item).

Ignoring failures
-----------------

It is possible to ignore failure, forcing items to succeed, using the
question mark (?) and double question mark (??) prefixes to either
ignore soft failures only or ignore both types of failure, respectively:

    # f1 will fail hard if "a" fails hard, otherwise it will succeed
    f1 = ?a ??b ;

If "a" succeeds or fails normally, the "?a" item will succeed and "b"
will be evaluated. But if "a" fails hard, "?a" will also fail hard and
cause "f1" to fail hard. The "??b" item will always succeed, regardless
of whether "b" succeeds, fails normally, or fails hard.

Catching soft failures allows for conditional evaluation where no
alternative is needed, similar to "if-then" with no "else" block in many
other languages:

    # Always returns an odd number
    f2(n) = ?({ n % 2 == 0 } { n = n + 1 }) { n } ;

Implicit failure conversion
---------------------------

So far, we have seen no more than one comparison in every sequence, and
it was always at the beginning. But when subsequent items compare values
or call functions that may fail, it becomes important to take into
account the _implicit failure conversion_ that occurs when an item fails
and that item is not the first item in a sequence: a soft failure of
such an item is converted into a hard failure. The conversion happens by
default, but can be suppressed by prepending a dot (.) to the item. This
is only valid for subsequent items, not the first one in a sequence
(where the dot would be redundant anyway).

    # Get character from string, but check the index first
    getchar(str, i) = { i >= 0 } . { i < 10 } { [b:str + i] } ;

If the dot were omitted, this would probably not do what you want:

    # Fails normally if i < 0, but fails hard if i >= 10!
    getchar(str, i) = { i >= 0 } { i < 10 } { [b:str + i] } ;

Based on the above example, failure conversion seems like a useless and
annoying mechanic, but it can be helpful when performing a sequence of
actions based on a condition that was met earlier, but where some of the
actions themselves may fail and require error handling:

    f = condition1 action1 action2 action3
      | condition2 action4 action5 action6 ;

If both "condition1" and "condition2" fail (normally), "f" will also
fail normally; but if any of the actions fail (normally or hard), "f"
will fail hard.

(This behavior is inspired by the META II language, wherein failure of a
subsequent item in a list causes the program to abort. This is done in
order to avoid having to backtrack when parsing multiple tokens of
input. See also "Parsing input" further below.)

Reject items
------------

An exclamation mark (!) item can be used to explicitly trigger a (soft)
failure. A double exclamation mark (!!) triggers a hard failure. Beware
that alternatives can still catch such failures, and that implicit
failure conversion can still occur:

    # Hard-fails if "bad" either succeeds or hard-fails,
    # otherwise (if "bad" soft-fails) succeeds and returns n
    f1(n) = { bad(n) } . !! | { n } ;

    # Equivalent to f1 due to implicit failure conversion
    f2(n) = { bad(n) } ! | { n } ;

    # Always fails, since there is no alternative
    f3(n) = { bad(n) } . ! ;

    # Returns n unless "bad" hard-fails
    # (the exclamation mark does _not_ skip the alternative!)
    f4(n) = { bad(n) } . ! | { n } ;

    # Same thing
    f5(n) = ?{ bad(n) } { n } ;

    # Soft-fails if "bad" succeeds; succeeds and returns n
    # if "bad" soft-fails; hard-fails if "bad" hard-fails
    f6(n) : c = ( { bad(n) } { c = 1 } | { c = 0 } ) . { c == 0 } { n } ;

    # Equivalent to f6
    f7(n) : c = { c = 0 } ?({ bad(n) } { c = 1 }) . { c == 0 } { n } ;

Reject items are not strictly necessary, but may let you avoid writing
things like "{ 0 == 1 }" to trigger a failure.
As the above examples demonstrate, they tend to make things messy and
confusing and should be avoided if possible. They may occasionally be
useful in cases where you need to act upon a failure (to backtrack, for
example) but then propagate it. Here's one more example to demonstrate
this:

    trynext : tmpx =
        { tmpx = x }    # "next" can modify x
        . next
      | { x = tmpx } . ! ;

If "next" fails, this will reset "x" back to its original value (before
"next" was called), then fail. If "trynext" had instead declared "x" as
a dynamically scoped variable, it would _always_ be reset even when
"next" succeeds.

Memory allocation
-----------------

The easiest way to allocate some unused memory and obtain its address is
when declaring function variables. Adding a number between square
brackets ([, ]) after the variable name causes the given amount of
memory, in bytes, to be reserved on the stack. The variable will be
initialized with the address of that memory. This also works for
dynamically scoped variables. Allocation sizes are rounded up if
necessary to preserve stack alignment.

    # Initialize "a" with a 16-byte buffer and "b" with a 64-byte buffer
    f1 : a[16] : b[64] = ... ;

Since memory is allocated on the stack, it is effectively freed when the
function returns.

There is a special operator, the dollar sign ($), that can return the
memory size for variables declared this way. It will return the declared
size, not the rounded up size:

    # Returns 42
    f2 : buf[42] = { $buf } ;

Beware that it does not track assignments, only declarations. Also, for
function arguments and variables not initialized to allocated memory,
the "size" returned will be zero:

    f3 : : buf[128] = { f4() + f5() } ;

    # Returns 128 if it was called by f3
    f4 = { $buf } ;

    # This will return zero, not 128
    f5 : var = { var = buf } { $var } ;

There's one other use of the $ operator.
When it appears by itself as a (sub-)expression, it evaluates to the
number of characters in the last string constant before it (within the
same function):

    # I know it's a bit late for this, but you won't mind, will you?
    main = { write(1, "Hello world\n", $) } ;

Alloca
------

Memory allocation by variable declarations works only when the amount of
memory to be allocated is a constant.

    # Error!
    f1(size) : var[size] = ... ;

To allocate a variable amount of memory on the stack, an "alloca" item
needs to be used instead:

    f2(size) : var = alloca:{size} -> var ( ... ) ;

Keep in mind that "$var" would not return size here, but zero; it only
works for declarations.

There are no heap management functions like "malloc" or "free". You'll
have to either roll your own, or... let the operating system do it for
you.

System calls
------------

Invoking kernel system calls is easy once you know the system call
number and what arguments it expects:

    write(fd, buf, len) = { syscall(1, fd, buf, len) } ;

    # mmap: prot 3 (read/write), flags 34 (private, anonymous),
    # fd -1, offset 0
    malloc(n) = { syscall(9, 0, n+0xfff & ~0xfff, 3, 34, -1, 0) } ;

Note: although this looks like an ordinary function call, "syscall" is a
special form, not a function; you cannot use "syscall" as a variable,
and system calls are not modified or otherwise wrapped the way libc
wraps them.

Of course, one slight issue with doing "malloc" like this is that the
amount of memory being allocated must be rounded up to the page size.
Also...

    # munmap needs both the address and length
    free(addr, n) = { syscall(11, addr, n+0xfff & ~0xfff) == 0 } ;

There are various ways to deal with this, of course. Be creative...

Parsing input
-------------

Asterix's main strength is actually the way in which it lets you parse
arbitrary data.
Statements aren't the only way of doing comparisons; an item consisting
of just a character or string will actually attempt to match that
character or string at the current position of an input buffer:

    expr(arg1, arg2) = '+' { arg1 + arg2 }
                     | '-' { arg1 - arg2 }
                     | '*' { arg1 * arg2 }
                     | '/' { arg1 / arg2 } ;

    keywords = "if" | "while" | "for" ;

    nulls = *'\0' ;

Further down below we'll explain how this works and where input comes
from. Besides characters and strings, you can also match a range of
characters and even binary data:

    digit1 = [0-9] ;

    elf = %b:0x7f "ELF" <"Found an ELF header\n"> ;

Don't forget to use sign-extension where necessary:

    # %d:-1 would zero-extend the loaded doubleword and never match
    minus1 = %sd:-1 <"Found dword -1\n"> ;

Note: character range unions (as seen in _regular expressions_ in some
other languages) are not valid in Asterix. Use alternation instead:

    # Error -- this will not work
    # hexdigit = [0-9A-Fa-f] ;

    # This works just fine
    hexdigit = [0-9] | [A-F] | [a-f] ;

Note 2: binary data is always read in native byte order. Use byte,
character, or string matching to read data independent of endianness.
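
The two points above (sign extension and native byte order) can be
illustrated outside of Asterix. The following is a small Python model,
not the compiler's implementation; the helper name "load" is made up for
this sketch, and it only mimics what a zero-extending (%d) versus
sign-extending (%sd) doubleword load would produce.

```python
import sys

def load(buf: bytes, pos: int, size: int, signed: bool = False) -> int:
    """Hypothetical helper: read `size` bytes at `pos` in native byte order."""
    return int.from_bytes(buf[pos:pos + size], sys.byteorder, signed=signed)

# The doubleword -1 as it would sit in memory (four 0xff bytes).
minus1 = (-1).to_bytes(4, sys.byteorder, signed=True)

zero_extended = load(minus1, 0, 4)               # models a %d load
sign_extended = load(minus1, 0, 4, signed=True)  # models a %sd load

print(zero_extended == -1)  # False: the value is 0xffffffff
print(sign_extended == -1)  # True

# The same four bytes give different numbers depending on byte order,
# which is why a multi-byte binary match is not portable.
magic = b"\x7fELF"
as_little = int.from_bytes(magic, "little")
as_big = int.from_bytes(magic, "big")
as_native = load(magic, 0, 4)
print(hex(as_little), hex(as_big))  # 0x464c457f 0x7f454c46
```

Matching the bytes one at a time, as the "elf" example does, sidesteps
the byte order question entirely.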

Binary ranges also use square brackets:

    nonascii1 = %b:[0-31] | %b:[128-255] ;

It is possible to use expressions in binary matches, but if they use
operators, in most cases they must be put inside parentheses:

    digit2 = %b:['0'-'9'] ;
    digit3 = %b:[(48+0) - (48+9)] ;

Binary matches can also be used to match variable data, and there's one
other special form that lets you match variable strings:

    # Match a specific digit n, where 0 <= n <= 9
    digitn(n) = %b:(n + '0') ;

    # Match a range of digits
    digitrange(m, n) = %b:[(m + '0')-(n + '0')] ;

    # Match a specific string constant
    strconst(str, len) = '"' %s:str,len '"' ;

Matching operators can be modified to invert the match range using a
caret (^), or to look ahead (preventing the input position from
advancing even if the match was successful) with the ampersand (&), or
both:

    nondigit = ^[0-9] ;
    nonascii2 = ^%b:[32-127] ;

    # Avoid matching left shift (<<)
    lessthan = &^"<<" . '<' ;

    # The opposite of [0-9] | [A-Z] | [a-z] | '_'
    endofname = &^[0-9] . &^[A-Z] . &^[a-z] . &^'_' ;

    # See also the "keywords" example above
    keyword = keywords endofname ;

    notminus1 = ^%sd:-1 ;

Note: unlike the star and question mark, the caret and ampersand do not
prefix arbitrary items; instead, they modify the behavior of matching
operators directly. So they can only be used on match items. The
endofname example acts like [^0-9A-Za-z_] would in regular expressions
(which Asterix does not have as such).

    # Skip any amount of spaces, tabs, and/or newlines
    ws = *(' ' | '\t' | '\n') ;

    # Parse expr2 + expr2 + ... + expr2
    expr1 = expr2 *(ws . '+' ws expr2) ;

Note how the loop exits (successfully) when it no longer finds a plus
sign after an expr2, but if it finds a plus without a valid expr2
following it, the loop (and therefore expr1) will hard-fail.

Generating output
-----------------

Some basic examples of output sequences with strings have already been
shown.
They can also write characters and binary data:

    outnewline = <'\n'> ;
    outdigit(n) = <%b:(n + '0')> ;

Note: it is not necessary to distinguish between signed and unsigned
values in binary output. Output and memory stores may truncate values,
but they never sign-extend them.

Actually, the reason for calling them "output sequences" is that they
can contain multiple _output items_:

    outdigit2(n) = <"Here's a " %b:(n + '0') '\n'> ;

    # Same thing, but not necessary
    outdigit3(n) = <"Here's a "> <%b:(n + '0')> <'\n'> ;

Bare, unqualified expressions are also allowed in an output sequence,
although like binary range matches, they must be parenthesized if they
use operators.

    something(x) = <"What is an " x "...?\n"> ;

The meaning of this form is explained in the "Parsing and output
details" section below.

Finally, output sequences can copy arbitrary strings, which are
specified as a pair of expressions (the string's address and its length)
separated by a comma. Again, expressions with operators must be
parenthesized.

    copyfrom(start, end) = <start,(end - start)> ;

Assigning and capturing items
-----------------------------

Character range and binary range matches wouldn't be very useful without
being able to see which character or value was actually matched. There
are two ways to do this. First, character and binary matching items
return the character or value they matched. So the result of the "digit"
functions in the examples above can be assigned to a variable or even
used directly in an output sequence:

    digit1 = [0-9] ;

    # "c" will be assigned the ASCII value of the digit
    showdigit1 : c = { c = digit1() } <"Found digit " %b:c '\n'> ;

There's also a shorter form available for use with arbitrary items:

    # Same as showdigit1
    showdigit2 : c = c=digit1 <"Found digit " %b:c '\n'> ;

    # There's no need for functions like digit1, really
    showdigit3 : c = c=[0-9] <"Found digit " %b:c '\n'> ;

    # In case you're sure you'll find a digit...
    showdigit4 = <"Found digit " %b:digit1() '\n'> ;

Be careful with that last one. Output sequences are not "atomic", and
the string "Found digit " will be written to the output buffer before
digit1 is invoked to check for a digit.

    # Hard-fails if digit1 fails, unlike showdigit4
    showdigit5 = <"Found digit "> <%b:digit1()> <'\n'> ;

This isn't exactly the same as showdigit4, either. Output sequences do
count as single items, so if digit1 fails in showdigit4, the function
will fail normally. However, in showdigit5, the digit1 function is
invoked in a subsequent item, not the first. So if it doesn't find a
digit, showdigit5 will hard-fail instead.

Assigning items can assign the result of any other item, not just
functions and match items. However, string matches do not actually
return any meaningful value. Also, assigning from a group of items would
only give you the result of the last one. In such cases, _capturing
items_ may be more useful instead:

    # See also the "keyword" example above
    showkeyword : kw, len = kw,len=keyword <"Found keyword: " kw,len '\n'> ;

Capturing items (or _captures_) mark the current input buffer position
before and after an item is invoked, and assign the initial address and
the match length (that is, the displacement of the input position after
the captured item returns) to the chosen pair of variables. The example
also demonstrates how this works together with string copies in output
sequences.

Note: captures are not atomic either; if the item being invoked fails,
the first variable will get assigned, but the second will not.

Parsing and output details
--------------------------

Matching items and output sequences have been introduced and described
as though their functionality is built directly into the language. But
that's not truly the case. Although the syntactic forms are part of the
language, they actually work by invoking functions with predetermined
names, with the data to be matched (or written) passed as parameters to
those functions.
These functions must be part of the program, and are usually taken from
the file "compile.ax" to provide the functionality described above,
although it would be possible for a program to define its own versions
of them instead.

For example, the "matchchar" function is called to match a character or
character range:

    alpha1 = [a-z] ;

    # Same thing
    alpha2 = { matchchar('a', 'z', 0) } ;

(Note: the third argument to matchchar tells it which modifiers (caret
and/or ampersand) are to be applied. See the "Syntactic sugar overview"
below.)

One such function, called "outval", is left for the program itself to
define. This function gets called for unqualified expressions in output
sequences (see "Generating output" above), and takes the value of that
expression as its single argument:

    outval(x) = ... ;

    # Calls outval(x)
    display(x) = <"I know what an " x " is now!\n"> ;

Again, beware the consequences of outval failing (if that's a
possibility).

Besides these functions, the input and output buffers that they act upon
are also defined by variables with specific names. The most important
two of these are "in" and "out", which are the locations where input
will be matched and output will be written to, respectively. There is
also "inmax", which should point to the end of the input buffer (after
which no more input will be matched); "outmax", the end of the output
buffer; and "instart" and "outstart", the starting addresses of these
buffers.

With the exception of "in", which capturing items read directly, these
variables are referenced only by the match and output functions, not by
the syntactic forms themselves. The functions refer to them globally
(like dynamically scoped variables); they are not passed as parameters.
Declaring them as lexical variables would usually just result in
confusing behavior.

The next section shows the complete list of function names and variables
used by all the syntactic-sugary matching and output items and the
functions implementing them.
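
To give a feel for what a function like "matchchar" has to do, here is a
rough Python model. It is only a sketch under stated assumptions, not the
code from "compile.ax": the class name "Input" and the constant names are
invented, "pos" stands in for the "in" variable, a soft failure is
modeled as returning None, and the flag values (1 for the caret, 2 for
the ampersand) follow the table in the syntactic sugar overview.

```python
from typing import Optional

INVERT = 1     # caret (^): match anything outside the range
LOOKAHEAD = 2  # ampersand (&): do not advance the input cursor

class Input:
    """Hypothetical stand-in for the instart / in / inmax variables."""
    def __init__(self, data: bytes):
        self.data = data
        self.instart = 0
        self.pos = 0              # plays the role of "in"
        self.inmax = len(data)

def matchchar(inp: Input, lo: int, hi: int, flags: int) -> Optional[int]:
    """Return the matched character, or None to model a soft failure."""
    if inp.pos >= inp.inmax:
        return None               # no match at or beyond inmax
    c = inp.data[inp.pos]
    hit = lo <= c <= hi
    if flags & INVERT:
        hit = not hit
    if not hit:
        return None
    if not flags & LOOKAHEAD:
        inp.pos += 1              # consume the character
    return c

inp = Input(b"a1")
print(matchchar(inp, ord("a"), ord("z"), 0))          # 97, cursor moves to 1
print(matchchar(inp, ord("0"), ord("9"), LOOKAHEAD))  # 49, cursor stays at 1
print(matchchar(inp, ord("0"), ord("9"), INVERT))     # None
```

Because the cursor lives in shared state rather than in a parameter, this
also shows why shadowing "in" with a lexical variable would confuse
callers: the match function and the caller would then look at different
cursors.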

Syntactic sugar overview
------------------------

The following table shows the functions that are called by code that
uses match items and output items:

    Code              Equivalent item/statement                  Notes
    ----------------- -----------------------------------------  -----------
    'c'               { matchchar('c', 'c', 0) }
    ^'c'              { matchchar('c', 'c', 1) }
    &'c'              { matchchar('c', 'c', 2) }
    &^'c'             { matchchar('c', 'c', 3) }
    [0-9]             { matchchar('0', '9', 0) }
    %b:[x-y]          { matchb(x, y, 0) }
    %w:[x-y]          { matchw(x, y, 0) }
    %d:[x-y]          { matchd(x, y, 0) }
    %q:[x-y]          { matchq(x, y, 0) }
    %sb:[x-y]         { matchsb(x, y, 0) }
    %sw:[x-y]         { matchsw(x, y, 0) }
    %sd:[x-y]         { matchsd(x, y, 0) }
    %b                inb                                        [1]
    &%b               peekb                                      [1]
    "str"             { matchstr("str", 3, 0) }                  [2]
    &^"<<"            { matchstr("<<", 2, 3) }                   [3]
    <'c'>             { outchar('c') }
    <"str">           { outstr("str", 3) }                       [2]
    <%b:c>            { outb(c) }                                [1] [4]
    <str,len>         { outstr(str, len) }
    <x>               { outval(x) }
    func              { func() }
    var1,var2=item    { var1 = in } item { var2 = in - var1 }    [5]

    [1] Similar for %sb, %w, %sw, etc.
    [2] The second parameter is the length of the string to match or write
    [3] Flags (third parameter) are the same as for character/binary matches
    [4] %sb etc. are accepted, but outsb, outsw etc. are not defined
    [5] Implicit failure conversion applies to item only if it would apply
        to the var1,var2=item sequence as a whole

Variables referenced by these functions:

    Variable    Purpose
    ----------  -----------------------------------------
    in          Input cursor (memory address)
    inmax       Input limit (no match at or beyond this address)
    instart     Start of input buffer [1]
    out         Output cursor
    outmax      Output limit (no output at/beyond this address)
    outstart    Start of output buffer [1]

    [1] These are only referenced by some error handling functions, but
        are useful to have so you know where your own input/output starts

Redefining functions
--------------------

Since all functions are referenced through variables, it's possible to
redefine them, effectively changing the behavior of functions (such as
those implementing the matching and/or output items).
For example, to change the way characters or character ranges are
matched:

    mymatchchar(min, max, flags) = ... ;

    # Calls mymatchchar('x', 'x', 0)
    f : matchchar = { matchchar = mymatchchar } . 'x' ;

Usually, it's a good idea to scope such redefinitions by declaring the
redefined function as a lexical or dynamic variable. A lexically scoped
redefinition (such as in the above example) affects calls in the
redefining function only (by shadowing the global/dynamic function
variable), whereas a dynamically scoped redefinition would affect both
the redefining function _and_ all functions called by it.

Common traps and pitfalls
-------------------------

This section describes a few mistakes that are easily made when
programming in this language, and how to avoid them where possible.

* A function call with one or more arguments can only be made inside a
  statement; depending on how the code is structured and formatted, it
  can be very easy to forget the braces:

      { inlen = read(0, in, $in) }
      # ...
      ({ inlen >= 0 } | die(4, "Read error\n", $))
                           ^ --- Compile error!

      # Fixed version
      ({ inlen >= 0 } | { die(4, "Read error\n", $) })

  When the mistake results in a syntax error (as in the above example),
  it should be easily found and fixed and will not be too much of a
  problem. But it can become much harder if the wrongly written code is
  syntactically still valid:

      list(element) = element *(',' element) ;

      argument : n0, n1 = n0,n1=name { defarg(n0, n1) } ;

      # ...

      arguments = '(' list(argument) ')' ;

      # Same thing! The compiler does not check function arguments.
      arguments = '(' list argument ')' ;

      # What was intended
      arguments = '(' { list(argument) } ')' ;

* The function defined as the entry point (i.e. named by the "entry"
  line at the beginning) must never return. Use the exit(n) system call
  to terminate the program. In addition, it should catch any otherwise
  unhandled failures.
  Hence, "main" functions would usually end like this (6 is the Asterix
  compiler's exit code for unhandled errors; a more or less arbitrary
  choice):

      { exit(0) }
      || { die(6, "Unexpected error\n", $) } ;

* Asterix's BNF-like control structure can be useful, but also difficult
  and confusing; a failing item causes the remainder of that sequence to
  automatically be skipped and the next alternative in the group to be
  tried; if there isn't any, the failure will propagate upwards out of
  the group, possibly out of the function or even multiple functions,
  until it is caught by an alternative where execution will then
  proceed.

  Because of this, failures that you didn't expect can be confusing --
  but so can items that _succeed_ when they shouldn't. For example, if a
  function accidentally parses too much input, a failure may happen
  later on in a different function or even at the end of input (EOF):

      string = '"' *strchar '"' ;

      strchar : c =
          # Handle some escaped characters specially
          '\\' ( 't' <%b:9>
               | 'n' <%b:10>
               # etc...
               # Other escaped characters are just copied as-is
               | c=%b <%b:c>
               )
          # Non-escaped characters are simply copied to the output
          # (In this version, unfortunately, that also includes the
          # string's terminating quote and all remaining input)
        | c=%b <%b:c> ;

  That string function, if it finds the initial quote, will consume all
  input until it finally hits EOF, after which it will hard-fail on its
  third item (the second quote that is supposed to mark the end of the
  string), leaving the input cursor at EOF as the only indication of
  what could have caused this hard failure.
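
The runaway behavior of that last example can be reproduced in a small
Python model. This is an illustration only, not Asterix and not the
author's code: "parse_string", "HardFail", and the "stop_at_quote" switch
are invented names. A soft failure is modeled as returning None and a
hard failure as an exception, and the switch models one possible repair,
namely a final alternative that refuses the quote character.

```python
class HardFail(Exception):
    """Stands in for a hard failure; carries the input offset."""

def parse_string(data: bytes, stop_at_quote: bool):
    """Model of: string = '"' *strchar '"' ;"""
    pos = 0
    out = bytearray()
    # First item: a mismatch here is only a soft failure.
    if pos >= len(data) or data[pos] != ord('"'):
        return None
    pos += 1
    # *strchar: repeat until strchar soft-fails.
    while pos < len(data):
        c = data[pos]
        if c == ord("\\"):
            # Items after the backslash are not first in their sequence,
            # so running out of input here is a hard failure.
            if pos + 1 >= len(data):
                raise HardFail(pos + 1)
            esc = data[pos + 1]
            out.append({ord("t"): 9, ord("n"): 10}.get(esc, esc))
            pos += 2
        elif stop_at_quote and c == ord('"'):
            break                 # strchar soft-fails, the loop succeeds
        else:
            out.append(c)         # the buggy version lands here for '"' too
            pos += 1
    # Third item: the closing quote. Not first in the sequence, so a
    # mismatch is converted into a hard failure.
    if pos >= len(data) or data[pos] != ord('"'):
        raise HardFail(pos)
    return bytes(out), pos + 1

data = b'"a\\tb" rest'
print(parse_string(data, stop_at_quote=True))   # (b'a\tb', 6)
try:
    parse_string(data, stop_at_quote=False)
except HardFail as e:
    print("hard failure at offset", e.args[0])  # 11, the end of input
```

In the buggy run the only clue is the offset, which equals the length of
the input, mirroring the "cursor at EOF" symptom described above.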