.HSZ and .HSX files are lumped inside HSP files. There is no difference between .HSX and .HSZ files, except 32-bit versions of HSpeak use .HSZ to hide the scripts from 16-bit scripting versions of Game which do not check the script format version (and silently fail on a 32-bit script). When loading script id n, load whichever of n.HSX or n.HSZ exists.
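The lookup rule above can be sketched as follows. This is a minimal illustration, not the engine's code: the function name `script_lump_name` and the idea of passing in a collection of upper-case lump names are assumptions made for the example.

```python
def script_lump_name(script_id, lump_names):
    """Return whichever of n.HSZ or n.HSX is present, or None if neither is.

    `lump_names` is a hypothetical collection of lump names found in the HSP
    file; names are assumed to be upper-case here.
    """
    for ext in (".HSZ", ".HSX"):
        name = f"{script_id}{ext}"
        if name in lump_names:
            return name
    return None
```

Only one of the two lumps is expected to exist for a given id, so the order in which the extensions are tried should not matter.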
- 1 Structure
- 2 Header
- 3 Command Data
- 3.1 Kind 1: Integer
- 3.2 Kind 2: Flow Structure
- 3.3 Kind 3: Global Variable
- 3.4 Kind 4: Local Variable
- 3.5 Kind 5: Math/Builtin Function
- 3.6 Kind 6: Builtin Function
- 3.7 Kind 7: Script
- 3.8 Kind 8: Nonlocal Variable
- 4 Variable IDs
- 5 String Table
- 6 Debug Information
- 7 See Also
Structure
HSZ files have a header followed by a serialised AST and, optionally, other sections, which appear in the order below.
(NOTE: The table of local variable names is not yet checked into SVN; it only exists in TMC's hspeak4 branch)
|Offset (bytes)||Length (bytes)||Content|
|0||4 - 28||Header|
|Given in header||Extends until string table or end of file||#Command Data|
|Given in header||Given in header, or until end of file||(optional) #String Table|
|Given in header||Until end of file||(optional) Table of #Local variable names (This incomplete feature is not in nightlies)|
Header
The header is of variable length, depending on the version of HSpeak used. If the header is too short, all fields after the first 2 have defaults (all old HSX files always have at least two fields). Optional features can be added by extending the header: if there are additional data fields not understood by the interpreter, they are ignored. A format version number bump is used to indicate mandatory features. In practice we have never bumped the version number when increasing the number of globals or adding new math functions or flow constructs. Addition of new built-in functions is indicated in the HSP header, and it would be a good idea to add fields to the HSP header for math and flow constructs too.
Script format meanings
|Version||Meaning|
|0||16-bit script. Script header must be at most 8 bytes. Hence has no string table.|
|1||32-bit script with 16-bit string table offset|
|2||32-bit script with 32-bit string table offset|
|3||Addition of nonlocal variables/subscript support. Parent ID, nest depth and number of non-local fields added to header|
Script format 2 or 3
|Offset (bytes)||Type||Meaning||Default if missing|
|0||INT||Offset in bytes to the command data, and also length of header in bytes||N/A|
|2||INT||Number of local variables the script has. Both regular local variables and script arguments count towards this, but non-locals do not. See #Variable IDs.||N/A|
|4||INT||Number of arguments a script takes||N/A|
|6||INT||Script format version. Current version is 3||N/A|
|8||LONG||Offset in bytes to the string table from the start of the file. Must be a multiple of 4 bytes after the end of the header. 0 if no string table is present||0|
|12||INT||Parent script id number, or 0 if none||0|
|14||INT||Nesting depth: scripts are 0, subscripts 1, sub-subscripts 2. Is at most 4||0|
|16||INT||Number of non-local variables (equal to parent's non-locals plus locals (including arguments))||0|
|18||LONG||Length of the string table in int32 words||String table, if any, extends until the end of the file|
|22||INT||Bitsets for optional features. Bit 0: nodes aside from kind 1 (integer) are appended with srcpos debugging information (This incomplete feature is not in nightlies)|
|24||LONG||Offset in 32-bit ints from the end of the header to the local variable name table, or 0 if not present. (This incomplete feature is not in nightlies)||Variables names not available|
|28||LONG||Script Position: the position of the start of the script in the source file (specifically, the start of the "script", "plotscript" or "subscript" token). See #srcpos for how to translate a position to a character in a source file. Also, the positions of all srcpos's in this HSZ file are relative to the script position. (This incomplete feature is not in nightlies)||N/A|
Script format 0 or 1
|Offset (bytes)||Type||Meaning||Default if missing|
|0||INT||same as above||N/A|
|2||INT||same as above||N/A|
|4||INT||same as above||"any amount" (bypass checks)|
|6||INT||same as above||0 (So the format version might not be specified)|
|8||INT||string table offset given as 16-bit INT instead||0|
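The two header layouts and their defaults can be read with a sketch like the one below. It assumes little-endian signed fields and treats a field as present only if it lies wholly inside the header length given at offset 0; the function name `parse_script_header` and the dictionary keys are invented for the example, and the optional debug fields at offsets 22 and above are left out.

```python
import struct


def parse_script_header(data):
    """Parse the leading header fields of an HSX/HSZ lump (a sketch).

    Assumes little-endian signed integers. Missing fields get the defaults
    listed in the tables above; None stands for "any amount" (argument count)
    or "until end of file" (string table length).
    """
    header_len = struct.unpack_from("<h", data, 0)[0]

    def field(offset, fmt, default=None):
        # A field counts as present only if it fits inside the header.
        if offset + struct.calcsize(fmt) <= header_len:
            return struct.unpack_from(fmt, data, offset)[0]
        return default

    hdr = {
        "code_offset": header_len,
        "num_locals": field(2, "<h"),
        "num_args": field(4, "<h"),
        "version": field(6, "<h", 0),
    }
    # Formats 0 and 1 store the string table offset as a 16-bit INT.
    offset_fmt = "<i" if hdr["version"] >= 2 else "<h"
    hdr["strtable_offset"] = field(8, offset_fmt, 0)
    hdr["parent"] = field(12, "<h", 0)
    hdr["nest_depth"] = field(14, "<h", 0)
    hdr["num_nonlocals"] = field(16, "<h", 0)
    hdr["strtable_len"] = field(18, "<i")
    return hdr
```

Because every field past the first two is looked up against the header length, the same function handles a 4-byte header from an old HSX file and a longer format 3 header.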
Command Data
The script is described as a serialised abstract syntax tree. Each node of the tree is a "command", and the command may have any number of arguments, each of which is a pointer to a child node. Here "command" refers to all objects in the language, including numbers, variables, functions, scripts, and flow control. Each command returns a value to its calling command, though some are garbage, and some are discarded by commands.
For the remainder of this article, we shall refer to WORDs of data. A WORD is a signed, 16-bit INT for .HSX lumps, and a signed, 32-bit LONG for .HSZ lumps.
There are two types of nodes with slightly different structure. Each command is a series of WORDs and begins with:
- The command "kind": which can be 1 - 8 (see below)
- A command "id", the meaning of which varies on the command kind
This is the total content of command kinds 1, 3, 4 and 8. Kinds 2, 5, 6 and 7 might have arguments, so have:
- Number of arguments
- One WORD per argument, which is an offset/pointer to that command in the command data. It is in WORDs from the beginning of the command data.
- If the relevant bit is set in the header, then a 32-bit LONG srcpos debug datum follows (nodes with arguments only - even when stepping through commands in the debugger, the debugger doesn't usually stop on leaf nodes)
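The node layout described above can be decoded with a short sketch. It assumes little-endian WORDs and ignores any trailing srcpos datum, since children are reached through their offsets rather than by reading nodes sequentially; the name `read_node` is invented for the example.

```python
import struct

# Kinds that carry an argument count and argument offsets.
KINDS_WITH_ARGS = {2, 5, 6, 7}


def read_node(code, offset, word_size=4):
    """Decode the node starting `offset` WORDs into the command data.

    `code` is the command data as bytes; `word_size` is 4 or 2 depending on
    the WORD width in use. Returns (kind, id, list of argument offsets).
    """
    fmt = "<i" if word_size == 4 else "<h"

    def word(i):
        return struct.unpack_from(fmt, code, (offset + i) * word_size)[0]

    kind, ident = word(0), word(1)
    args = []
    if kind in KINDS_WITH_ARGS:
        argc = word(2)
        args = [word(3 + i) for i in range(argc)]
    return kind, ident, args
```

Calling `read_node(code, 0)` yields the root node, and each returned argument offset can be passed back in to walk the tree.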
Execution begins at the root (at offset 0). The root command is always a do block containing the script's top-level commands. For each encountered command, all arguments to it are evaluated in order if there are any, and then all the return values are fed into the specified function which spits out a result, which is returned. This is not the case with flow control statements (kind 2), where evaluation of the arguments is selective. Not all arguments are necessarily executed, so processing must be done between the execution of each argument.
A summary of the different command kinds:
|Kind||Name||Meaning of ID||Arguments|
|1||Integer||The value of the number||None|
|2||Flow control||ID of the flow construct||Yes|
|3||Global variable||ID of the global to return||None|
|4||Local variable||ID of the variable to return (0 to # of variables - 1)||None|
|5||Math function- a special group of functions which are basic math functions||ID of the function||Yes|
|6||"Builtin" function- a plotscripting function such as show textbox||ID of the command||Yes|
|7||Script- load a script and execute it for a return value||ID number of the script||Yes|
|8||Non-local variable||256*frame number + ID||None|
Several commands take a (reference to a) local or global variable as an argument. When variables appear as command kinds 3 and 4 they are NOT lvalues. So variable references are given as constants, specifying a variable like this:
- If value < 0, local with ID abs(value + 1)
- If value >= 0, global with ID value
Kind 1: Integer
No argument, the id is the value of the number (a 32 bit int), and is placed on the list of return values.
Kind 2: Flow Structure
The Type of flow control depends on the command id, and the different structures vary.
Ids 1 and 2, begin and end, never occur in a compiled script, but are not rejected by the interpreter.
Id 0: do
A set of commands to execute in order. Each argument is a command. These may appear anywhere in a script (if a scripter is cheeky), but are normally the arguments to while, for and switch. The return value is 0.
Id 3: return
Sets the script's return value, does nothing else (doesn't actually stop the script).
Id 4: if
If always has these 3 arguments:
- a conditional expression
- a then
- an else
If a then or else were not specified in the script, then empty then/elses are created with no arguments.
Id 5: then
Exactly the same as do, but only called from if.
Id 6: else
Exactly the same as do, but only called from if. Likely to be empty.
Id 7: for
For has 5 arguments:
- ID of a variable to use as counter, see #Variable IDs
- counter start value
- counter end value
- counter step per loop
- a do block
The do block is called over and over until the counter > end if step is positive, or the counter < end if step is negative.
As soon as the start value is read, the variable is set to it before executing other arguments. For example:
for(var, start, var + 10, 1) do()
is equivalent to
for(var, start, start + 10, 1) do()
Id 10: while
While's 2 arguments are:
- a conditional expression
- a do block
The do is repeated until the conditional evaluates to false.
Id 11: break
Break will cause the script interpreter to abort do blocks, according to its one argument, the number of do blocks to exit.
The only blocks that "count" towards this limit are actual do blocks (ID 0), not then, else or other such blocks. It also breaks out of switch blocks (nothing special happens; the do and therefore the whole switch terminates). Likewise if the do is not attached to a for or while (it's a "floating" do), then the do is terminated as you would expect.
Id 12: continue
Continue will restart the nth do block higher than the current command, after evaluating conditionals, etc. n is continue's parameter. It defaults to 1, and is best explained with an example:
script, foo, begin
  variable(i)
  for(i,1,10,1) do, begin
    if(i == 5) then (continue)
    #do something with i
  end
end
In this script, i will have something done to it when it's 1, 2, 3, 4, 6, 7, 8, 9 and 10. 5 is skipped, due to the continue.
Note that only do blocks are restarted. Then, else and other such blocks don't count toward this.
If the do the continue refers to is an argument of a switch, then it gets special treatment, see below. Otherwise if it is not attached to a for or while (it's a "floating" do), then the do is restarted anyway, without checking any conditional!
Id 13: exitscript
Immediately stops the script. The current return value is returned to any calling script.
Id 14: exitreturning
As exitscript, but accepts a parameter to explicitly set the return value.
Id 15: switch
Note: the new switch syntax introduced in Beelzebufo did not change the binary format.
A switch flow command has a variable number of arguments. The first argument is the value to 'switch' on; it is an expression, which is evaluated once and the result stored.
The remaining arguments are either do commands or something else, in which case it is a 'case' expression. Normally these expressions are just integers, but it is possible for them to be something more complex. The final argument must be a do, and is the 'else' clause of the switch (even if there is none). Multiple expressions can occur in a row, but multiple dos should not, with the exception of the final 'else'.
After evaluating the first argument (call the result the key), each following argument is inspected. If it is a do, it is ignored. Otherwise it is evaluated and compared to the key. If they are equal, then the next do block is evaluated, then the switch command finishes. If there are no matches, the final do is evaluated. If a continue is evaluated inside one of these dos, then execution skips to the next do block after that one.
For example, the following code
switch (true) do (
  case (hero X (me) == 4, false)
  case (3) do (show value (3))
)
is compiled to
Flow ID switch
  Number ID 1
  Math ID equal
    Function ID hero X
      Number ID 1
    Number ID 4
  Number ID 0
  Number ID 3
  Flow ID do
    Function ID show value
      Number ID 3
  Flow ID do
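The matching rule for switch can be sketched as below. This is only an illustration of the description above, not the interpreter's code: arguments are modelled as hypothetical `("case", thunk)` and `("do", thunk)` pairs, the key is passed in already evaluated, and the fall-through behaviour of continue is left out.

```python
def run_switch(key, args):
    """Run a switch over `args`, the arguments that follow the key.

    Each entry is ("case", thunk) or ("do", thunk); the last entry must be
    the 'else' do. Case expressions are evaluated lazily, in order.
    """
    for i, (tag, thunk) in enumerate(args):
        if tag == "do":
            continue  # do blocks are skipped while looking for a match
        if thunk() == key:
            # Run the next do block after the matching expression.
            j = i + 1
            while args[j][0] != "do":
                j += 1
            args[j][1]()
            return
    # No case matched: run the final 'else' do.
    args[-1][1]()
```

With the example above, a key of 3 runs the show value block, while a key that matches none of the three expressions runs the empty final do.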
Id 16: case
Never appears in a compiled script
Kind 3: Global Variable
The ID is the number of the global variable (0 to whatever it currently is)
Kind 4: Local Variable
The ID is the number of the local variable, from 0 to number of locals - 1 (see #Variable IDs). This is actually a special case of Kind 8.
Kind 5: Math/Builtin Function
Much like the Flow Control kind, the Math Function Kind is indexed by ID.
With a few exceptions, operators work on 2 operands, which will be its two arguments. The Left Hand Side (LHS) will be the first argument, while the Right Hand Side (RHS) is the second. Some builtins use the LHS as an l-value.
In this list, math operations are listed using standard C-family operators/functions, as opposed to weird HamsterSpeak native syntax :-)
Id 0: random
- 0 - Returns a random integer between LHS and RHS, inclusive.
Ids 1 to 6: Basic arithmetic
- 1 - exponent (LHS ** RHS)
- 2 - modulus (LHS % RHS)
- 3 - division (LHS / RHS)
- 4 - multiplication (LHS * RHS)
- 5 - subtraction (LHS - RHS)
- 6 - addition (LHS + RHS)
Ids 7 to 9: Bitwise Operators
- 7 - XOR (LHS ^ RHS)
- 8 - OR (LHS | RHS)
- 9 - AND (LHS & RHS)
Ids 10 to 15: Logical Comparison
- 10 - equal (LHS == RHS)
- 11 - not equal (LHS != RHS)
- 12 - less than (LHS < RHS)
- 13 - greater than (LHS > RHS)
- 14 - less than or equal (LHS <= RHS)
- 15 - greater than or equal (LHS >= RHS)
Ids 16 to 18: Variables
Note: The LHS is an encoded variable number. See #Variable IDs
- 16 - set variable (LHS = RHS)
- 17 - increment variable (LHS += RHS)
- 18 - decrement variable (LHS -= RHS)
Ids 19 to 22: Logical Operators
- 19 - not (! LHS)
- 20 - logical and (LHS && RHS) -- NOTE: This operator short-circuits. I.e. if LHS = 0, then RHS is not evaluated, as the answer must be false.
- 21 - logical or (LHS || RHS) -- NOTE: This operator short-circuits. I.e. if LHS != 0, then RHS is not evaluated, as the answer must be true.
- 22 - logical xor (!LHS ^ !RHS)
Ids 23 to 25: More Math Functions
- 23 - absolute value (abs (LHS))
- 24 - sign ((LHS > 0) - (LHS < 0))
- 25 - squareroot, rounded to nearest integer ((int)(sqrt(LHS)+0.5))
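A few of the functions above, written out as a sketch. Two things are assumptions rather than statements from the list: that comparisons and logical operators return 1 for true and 0 for false, and the table and function names used here. The arithmetic operators are omitted because their rounding and overflow behaviour is not specified above.

```python
import math

# Math function ID -> implementation, for a subset of the IDs listed above.
MATH_OPS = {
    10: lambda lhs, rhs: int(lhs == rhs),
    11: lambda lhs, rhs: int(lhs != rhs),
    12: lambda lhs, rhs: int(lhs < rhs),
    13: lambda lhs, rhs: int(lhs > rhs),
    14: lambda lhs, rhs: int(lhs <= rhs),
    15: lambda lhs, rhs: int(lhs >= rhs),
    19: lambda lhs: int(not lhs),
    22: lambda lhs, rhs: int((not lhs) != (not rhs)),
    23: lambda lhs: abs(lhs),
    24: lambda lhs: (lhs > 0) - (lhs < 0),
    25: lambda lhs: int(math.sqrt(lhs) + 0.5),
}


def logical_and(lhs_thunk, rhs_thunk):
    """ID 20: the RHS thunk is only called when the LHS is non-zero."""
    return int(bool(lhs_thunk()) and bool(rhs_thunk()))


def logical_or(lhs_thunk, rhs_thunk):
    """ID 21: the RHS thunk is only called when the LHS is zero."""
    return int(bool(lhs_thunk()) or bool(rhs_thunk()))
```

The short-circuiting operators take callables instead of values because, unlike the other math functions, their second argument must not be evaluated up front.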
Kind 6: Builtin Function
The ID is the number of the built in function to run. A full list of these functions, and their IDs, may be found in Plotscr.hsd.
The arguments to this command are the parameters to the function.
Kind 7: Script
Runs a script. The script is loaded and run. When it is done, it returns a value.
Kind 8: Nonlocal Variable
The lowest 8 bits of the value are the number of a variable (which is always in the range 0-99) within a lexical scope (frame), and the higher bits store the frame number shifted up 8 bits. Frame 0 is the current script, frame 1 the parent, frame 2 its parent. The frame number is at most the nesting depth, stored in the header. See #Variable IDs. Note that "parent" here refers to the script with enclosing lexical scope rather than the script which called this script. For example 513 = 2*256+1 is the second local in the grandparent script.
If the frame is 0 then the variable is encoded using kind 4 (local variable) instead of kind 8.
Variable IDs
A script has access to local and non-local variables. The non-locals are all of the parent script's locals plus all of its parent's locals, etc. Note that "parent" here refers to the script with enclosing lexical scope rather than the script which called this script. Within each lexical scope (frame) the variables are numbered from zero, first the script arguments in order, and then the non-argument locals in order of declaration.
The setvariable, etc., and for commands use an extended numbering scheme which supports globals too. An ID >= 0 is a global variable, while ID < 0 is a local/non-local, with ABS(ID + 1) being the local/nonlocal variable number as used in kind 4/8.
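The extended numbering scheme can be decoded with a sketch like this one. The function name `decode_variable_ref` and the tuple shapes it returns are invented for the example; the arithmetic follows the rules above together with the kind 8 encoding (256 * frame number + ID).

```python
def decode_variable_ref(value):
    """Decode a variable reference as used by setvariable, increment,
    decrement and for.

    Returns ("global", id), ("local", id) or ("nonlocal", frame, id).
    """
    if value >= 0:
        return ("global", value)
    number = abs(value + 1)
    # The low 8 bits are the variable number, the rest is the frame number.
    frame, index = divmod(number, 256)
    if frame == 0:
        return ("local", index)
    return ("nonlocal", frame, index)
```

For instance, -258 decodes to variable 1 in frame 1, which is the second variable of the enclosing script.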
An example indicating bytecode results
global variable(10, global10)

script, foo, arg1, begin
  variable(outer1)
  subscript, nested, arg2, begin
    variable(var2)
    func(arg1)           # func(nonlocal(256))
    func(outer1)         # func(nonlocal(257))
    func(arg2)           # func(local(0))
    func(var2)           # func(local(1))
    func(global10)       # func(global(10))
    outer1 += 1          # increment(-258,1)
    for(arg2,0,1)do()    # for(-1,0,1,1,do())
    var2 := 42           # setvariable(-2,42)
    global10 += 1        # increment(10,1)
  end
end
String Table
This contains any string literals used by setstringfromtable, appendstringfromtable and tracevalueinternal function calls (IDs 251, 252, 466) within the script.
setstringfromtable and appendstringfromtable have 2 arguments: a string ID, and the offset of the string literal from the beginning of the string table in 4-byte words.
The string literals can be of any length, and are in a 4-byte length and 1-byte-per-character format. They are followed by up to 3 bytes of padding before the next string, or before the end of the table. The table must be a multiple of 4 bytes in size.
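Reading and laying out string table entries might look like the sketch below. It assumes a little-endian length field, one byte per character decoded as Latin-1, and zero bytes as padding; none of these details are stated above, and the names `read_table_string` and `build_string_table` are invented for the example.

```python
import struct


def read_table_string(table, word_offset):
    """Read the string literal starting `word_offset` 4-byte words into the table."""
    start = word_offset * 4
    (length,) = struct.unpack_from("<i", table, start)
    return table[start + 4 : start + 4 + length].decode("latin-1")


def build_string_table(strings):
    """Lay out strings as a table; returns (table bytes, word offset of each string)."""
    out = bytearray()
    offsets = []
    for text in strings:
        offsets.append(len(out) // 4)
        raw = text.encode("latin-1")
        out += struct.pack("<i", len(raw)) + raw
        out += b"\0" * (-len(out) % 4)  # pad to the next 4-byte boundary
    return bytes(out), offsets
```

The offsets returned by the builder are the values that would appear as the second argument of setstringfromtable or appendstringfromtable.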
Debug Information
These are optional features, see the header. In fact this feature has not been checked into SVN yet.
A "srcpos" is a 32-bit word that encodes the position of the token in a source file to which a command corresponds. The line number of the token is retrieved by counting newlines in the source file (see HSP#SOURCE.LUMPED) up to the token point.
A srcpos is either relative or absolute. srcpos's inside a HSZ file have positions which are relative to the start of the script. Currently there are no absolute ones.
|Bits||Meaning|
|0-7||Length of the token in characters, including whitespace (aside from trailing whitespace), capped at 255. Included to remove the need to parse the source to print the original token.|
|8||"Virtual" flag: this command was actually inserted by HSpeak, and does not exist in the source (eg. an empty else() added to an if). The srcpos indicates the parent construct (continuing the example, the if)|
|9-31||Position: The character number of the token in the file (counting from 1) + the "offset" of that file. The file and its offset are retrieved by scanning SRCFILES.TXT. The file's offset + 0 is reserved for referring to the whole file but not any position in it specifically. If this srcpos is relative, then add the Script Position in the header to the Position decoded from the srcpos.|
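Unpacking a srcpos could be done as in the following sketch, which assumes that bit 0 is the least significant bit of the 32-bit word. The function name `decode_srcpos` and its `script_position` parameter are invented for the example.

```python
def decode_srcpos(srcpos, script_position=0):
    """Split a srcpos into (token length, virtual flag, position).

    For a relative srcpos, pass the Script Position from the header as
    `script_position`; it is added to the decoded position.
    """
    srcpos &= 0xFFFFFFFF  # treat the value as an unsigned 32-bit word
    length = srcpos & 0xFF
    virtual = bool(srcpos & 0x100)
    position = (srcpos >> 9) + script_position
    return length, virtual, position
```

The resulting position still includes the source file's offset, so it has to be resolved against SRCFILES.TXT before it identifies a character in a particular file.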
Local variable names
If present, local variable (and argument) names are stored in a table of the same format (with alignment and padding requirements) as the string table. They are normally stored after the string table, but aren't needed at runtime; they are just for debugging. Skip over the 1st..(n-1)th names to reach the nth.
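Reaching the nth name by skipping over the earlier ones can be sketched as follows. The same assumptions as for the string table apply (little-endian 4-byte length, one byte per character, each entry padded to a 4-byte boundary), and in addition `n` is taken to count from zero to match the variable numbering; the name `nth_variable_name` is invented for the example.

```python
import struct


def nth_variable_name(table, n):
    """Return the n-th name (counting from 0) in a local variable name table."""
    pos = 0
    for _ in range(n):
        (length,) = struct.unpack_from("<i", table, pos)
        pos += 4 + length
        pos += -pos % 4  # skip padding up to the next 4-byte boundary
    (length,) = struct.unpack_from("<i", table, pos)
    return table[pos + 4 : pos + 4 + length].decode("latin-1")
```

Since the entries have variable length, there is no way to index the table directly, which is why the walk is linear in `n`.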