Talk:RELOAD: Difference between revisions
--[[User:Zzo38|Zzo38]] ([[User talk:Zzo38|talk]]) 16:34, 28 November 2014 (PST) |
Revision as of 17:34, 28 November 2014
Floating point format
Quick comment by Neo:
DOUBLE: Chances are you must specify the exact format SIGN:MANTISSA(N):EXPONENT(N) here (and check the CPU native format at compile time and possibly mangle native double-precision values when I/Oing). If FB does this for you, you are extremely lucky :)
Mike C.: I'm pretty sure that FB munges the format for me. It writes it out as an IEEE Double on my computer, anyway.
The Mad Cacti: I checked, FB copies straight from memory to file buffer. Is it actually a problem? How many CPUs actually use something other than IEEE? Is it endian sensitive?
NeoTA: A) not many B) Yes, IEEE doubles can be either little or big-endian (little-endian == intel of course; big-endian == Mac), matching the integer endianness for that platform. http://mindprod.com/jgloss/endian.html has a good explanation.
Mike C.: Then, I could just declare that little-endian IEEE format is the standard format, since all platforms that the OHR officially runs on use that architecture. I, in all honesty, could give two craps about other platforms, since: A) The OHR doesn't run on them, and B) If I were to change to another format to accommodate them, then we have the exact same problem in reverse, since now I have to convert the format.
In short, when the OHR runs on a SPARC or a PowerPC platform, then I'll care.
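To make the endianness point concrete, here is a minimal Python sketch (not OHRRPGCE code; the helper name `read_double_le` is made up for illustration). It shows that the same IEEE 754 double has two byte orders on disk, and that declaring little-endian as the file standard just means every reader unpacks with an explicit byte order instead of copying native memory.

```python
import struct

value = 25.7

little = struct.pack("<d", value)  # little-endian IEEE 754 double (8 bytes)
big = struct.pack(">d", value)     # big-endian IEEE 754 double (8 bytes)

# Same eight bytes, opposite order.
reversed_big = big[::-1]

# Reading little-endian bytes as if they were big-endian gives a wrong number.
misread = struct.unpack(">d", little)[0]


def read_double_le(raw: bytes) -> float:
    """Decode a double that the file format declares to be little-endian."""
    return struct.unpack("<d", raw)[0]


print(little == reversed_big)
print(read_double_le(little))
```

A reader written this way behaves the same on any host, so only a writer that dumps native memory directly would need changing on a big-endian port.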
Porting
The Mad Cacti: Soon hopefully ;) But I've never written endian-portable code, and am not sure how much work the rest of the engine will be to port.
Bob the Hamster: The top two most likely target platforms for porting would be the iPhone or some member of the GP2X family. Both of those devices use ARM processors, which is a bi-endian architecture. The only big endian platform I can imagine us making a port for is old pre-intel Macintoshes. Anyway, I would hope that any freebasic port to these platforms would insulate us from having to worry about most of this junk.
The Mad Cacti: Bi-endian? Cool!
FB port? Oh, I was talking about us creating our own C++ port. The FB devs haven't made progress on their gcc emitter for 5 months and I am confident that we can beat them to it. Simon expressed interest in the idea too. You might have heard me mention that I've been working on classes to wrap FB's strings and arrays; nearly finished.
Unfortunately, the FB runtime library (written in C) might need endian-proofing too.
Bob the Hamster: This is very interesting to me! Can you tell me more? or maybe we can discuss it on the mailing list?
The Mad Cacti: I wrote a FBstring class and a family of various FB array class templates. They wrap functions from the FB rtlib, and also wrap the weird ways in which strings and arrays are accessed, created, destroyed, and all that. I had to write an enormous number of different versions of the FBarray class template to cover all the totally different types of arrays that FB has but you don't notice (an array of strings is different to any other array type, seems to be a holdover from before types were implemented).
I aimed to mimic fbc-generated code closely so that C++ functions can call and be called from FB.
Using these classes, FB *should* then be translatable to C++ on a line-by-line basis, except for the issue of scope of undeclared variables, and a few problems like optional arguments, overloaded functions like put, and module-level implicit functions. I'll discuss starting on the FB to C++ translator on the mailing list later.
In the course of doing this, I've discovered several annoyances, such as the fact that there is no fundamental reason that you can't have dynamic arrays in types (I suspect that it just hasn't been implemented yet), that FB's type system isn't safe and doesn't distinguish static and dynamic arrays, that -lang deprecated is identical to -lang fb behind the scenes afaik (and all UDT's are actually C++-like classes, not plain-old-data), and bugs, including one very disturbing one where fbc secretly generates rubbish assembly.
Bob the Hamster: Wow! That is some nifty stuff! I look forward to seeing it. Have you reported any of the bugs you have found to counting_pine or yetifoot?
Mike C.: Well, to be fair, in C++, all structs are identical to classes, other than the default visibility of members. But, I assume you mean something like this:
class SimpleUDT {
private:
    int _a;
    FBString _b;
public:
    SimpleUDT() { _a = 0; _b = ""; }
    int get_a() { return _a; }
    //...
};
The Mad Cacti: I mean that FB's object orientation is C++ with BASIC syntax. UDT's have a default constructor, a copy constructor, a destructor, and an assignment operator method generated by the compiler (if needed for the current module), using full-blown C++ (g++) function name mangling, and everything else seems to be C++ binary compatible too. You should be able to share classes and objects across C++ and FB code, if the fact that FB initialises POD members as Mike mentioned but C++ doesn't is not a problem. If only -lang deprecated wouldn't have these dang artificial limitations.
Well, I filed one bug so far, rest still need proper investigation.
Mike C.: Er, I don't know how it does it. I've never looked beneath the hood to find out.
However, I do remember some bugfixes for C++ compatibility, so they're working towards sharing classes.
However, until they support polymorphism (any day now...), it won't work.
The Mad Cacti: Huh? How what does what? What's insane?
I can't remember, I think I only did some really minor tests with sharing classes.
Why not to load the whole file if it's not necessary
The Mad Cacti: Mike, you need to put the total size in bytes of each parent element back in the format. It's unnecessary only if you always load the whole file into memory. However, we shouldn't dictate that all utilities ever should work in that way.
Actually I'm uncomfortable with the idea of always loading the whole file into memory. It seems like we'd inevitably want functions to load only a certain branch. What if we end up using RELOAD for, say, a new textbox format? I assume all textboxes would be put into a single file (and with the text data in another lump for now). 3000 textboxes happens.
Damn, now I got carried away by over the top back of the envelope calculations!
Size of a node:
By sight (can't be bothered to check), a Node would be 13 ints. Adding 8 bytes minimum for malloc overhead gives 60 bytes. The Node str member should be a null pointer if it's never initialised (string valued Nodes only). However, every Node will contain a pointer to a FBSTRING containing the name, 12 bytes plus 8 byte malloc overhead, which points to the string array, size of which is a multiple of 32 bytes, so take that: 32 + 8 + 12 + 8 = 60 bytes used by the name string. That makes 120 bytes a Node.
Looking at SAY, I'd conservatively guess an average of about 7 elements per textbox: position, size, text colour, border colour, next box, and text would be nearly universal. Add a container Node for the textbox, 8 Nodes. I'd guess more than 96 bytes average per box (2 1/2 lines), so FB allocating 128 bytes on average. This extra FBSTRING would be 12 + 8 + 128 + 8 = 156. 156 + 8 * 120 ~ 1100 bytes a box.
3000 textboxes is thus conservatively around ~3.5MB, with ~24000 Nodes needing reading/creating. Kind of unimpressive, but I'm trying to argue that this would be slow to load.
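The back-of-the-envelope estimate above can be replayed as a short Python calculation. Every figure is taken from the comment itself (13 ints per Node, 8 bytes of malloc overhead, a 12-byte FBSTRING descriptor, 32-byte and 128-byte string allocations, 8 Nodes per textbox); none of them has been checked against the real reload.bas structures.

```python
INT_SIZE = 4             # bytes per int
MALLOC_OVERHEAD = 8      # assumed minimum per allocation
FBSTRING_DESCRIPTOR = 12 # bytes for the FBSTRING header

# The Node struct itself: 13 ints plus one allocation.
node_struct = 13 * INT_SIZE + MALLOC_OVERHEAD

# The name string: descriptor + its allocation + 32-byte buffer + its allocation.
name_string = FBSTRING_DESCRIPTOR + MALLOC_OVERHEAD + 32 + MALLOC_OVERHEAD

bytes_per_node = node_struct + name_string

# One textbox: 8 Nodes plus one 128-byte text string.
nodes_per_box = 8
text_string = FBSTRING_DESCRIPTOR + MALLOC_OVERHEAD + 128 + MALLOC_OVERHEAD
bytes_per_box = text_string + nodes_per_box * bytes_per_node

boxes = 3000
total_bytes = boxes * bytes_per_box
total_nodes = boxes * nodes_per_box

print(bytes_per_node, bytes_per_box, total_nodes, total_bytes)
```

With the unrounded per-box figure the total comes to a little over 3.3 million bytes, in line with the rough figure quoted above.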
Mike C.: Yes, you're right. I've re-added the size field. Eventually, I think I'm going to allow something like this:
path = "/saves/save[1]/party/gold"
node = RPathSingle("gamename.sav", path)
gold = GetInteger(node)
FreeNode(node)
This would only load the "gold" node, and any children (although, I suspect it won't have any in this case). You could also do this:
path = "/heroes/hero/picture"
nodelist = RPath("heroes.rld", path)
'do whatever with the list of pictures
FreeNodeList(nodelist)
Which, again, loads only the relevant nodes.
It should be noted, though, that in libxml2 you must load a full document before you can use XPath on it. The reason is that you can still navigate around with the query results. You can't do this unless the whole document is loaded.
Importing and exporting from XML
S'orlok Reaves: Mike, I know there was some initial discussion on the mailing list about XML/YML/JSON, before the RELOAD format was chosen. Do you have any plans to provide an "export" feature to XML (any 3rd-party developer could then convert this to the format of his choice)? If so, how will you define the tag names (implicitly, probably) in RELOAD?
Mike C.: Well, as it happens, I've already got the "convert from XML" feature done (via the xml2reload.exe utility), and I plan to also add export to it as well. In fact, the code for doing so is mostly done. It just needs cleaning up.
I'm not sure what you mean about the tag names. Each node already has a name, and is used both for navigation, and for [im/ex]porting from/to XML.
S'orlok Reaves: Oh, I just wondered what the node names were. For example....
<root>
  <version>1</version>
  <children>
    <integer bits="16">118</integer>
    <double>25.7</double>
  </children>
</root>
...is it something like this? What I mean is, if I were to write my own RELOAD to XML exporter, how would I know what to name each node? Is it listed somewhere? (Sorry, I missed the tail end of the RELOAD/XML debate because I was away.)
Mike C.: Well, sure, except it would look like this:
<myroot>
  <whatever>1</whatever>
  <somethingelse>25.6</somethingelse>
</myroot>
The node names are what you give them. The type is wholly independent of it. I'm considering allowing type hints, but it would be pointless. Why? Well, because the importer will pick the right type to represent the number. It's not infallible, but it does something like this:
- If the element is empty, type is null. Goto finish.
- If the element has kids, type is "children". Goto finish.
- Then, cast to integer. If the string version of the integer I just created matches the trimmed version of the source string, it matches, and type is integer. Go to finish.
- Else, cast to double. If (see condition above), then it matches. Type is float, goto finish.
- Otherwise, it's a string.
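The steps above can be sketched in Python as follows. This is an illustration of the round-trip idea only, not the xml2reload code: the function name `infer_type` and the returned type labels are made up, and Python's `int()` and `float()` accept different inputs from FreeBASIC's conversion routines, so edge cases will differ.

```python
def infer_type(text: str, has_children: bool = False) -> str:
    """Pick a type for an element using the round-trip rule described above."""
    trimmed = text.strip()

    if has_children:
        return "children"
    if not trimmed:
        return "null"

    # Integer: convert, convert back, and compare with the trimmed source.
    try:
        if str(int(trimmed)) == trimmed:
            return "integer"
    except ValueError:
        pass

    # Float: same round-trip test with a double.
    try:
        if repr(float(trimmed)) == trimmed:
            return "float"
    except ValueError:
        pass

    return "string"


for sample in ["", "118", "25.7", "007", "0xFF", "foo"]:
    print(repr(sample), "->", infer_type(sample))
```

The round-trip comparison is what keeps values such as "007" as strings: converting to an integer and back would lose the leading zeros, so the text no longer matches.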
Now, here's where it gets complicated. When it runs across a text node (which is not all spaces, tabs and new lines), it creates an anonymous string node as the child of the "real" parent. I.e, it does this:
<textnode> <>Contents!</> </textnode>
I am aware that empty tags like that are illegal, but let me finish. This is necessary because a RELOAD node cannot have more than one kind of contents. So, if it turns out that it's all text, then later the anonymous node is optimized away.
However, let's say you have this document:
<textnode> Contents <b>are</b> fun! </textnode>
As I mentioned before, a node cannot have more than one type of content. So, here we have a string, an element (with a string inside), and then another string. Well, it gets parsed to this:
<textnode>
  <>Contents</>
  <b>
    <>are</>
  </b>
  <>fun!</>
</textnode>
Which then gets optimized to:
<textnode>
  <>Contents</>
  <b>are</b>
  <>fun!</>
</textnode>
I guess all this is a long way of saying that you can take any XML document, run it through the converter, and get an equivalent RELOAD document, with typing. This document can then be converted back, ideally with no lost information (except, maybe some whitespace).
NOTE ON ATTRIBUTES: RELOAD has no concept of attributes. I plan to support them by having them as ordinary nodes, prefixed with "@". Otherwise, they would be normal nodes. Currently, however, the converter just ignores attributes because of an oversight in my code.
NOTE ON NAMESPACES: RELOAD is namespace ignorant. You should not use namespaces in documents intended to become RELOAD documents. It won't hurt anything, but it won't do what you expect.
<foo:mydocument xmlns:foo="http://whatever">
  <foo:bar />
</foo:mydocument>
Will turn into the equivalent of:
<mydocument>
  <@xmlns:foo>http://whatever</@xmlns:foo>
  <bar />
</mydocument>
So, I guess, caveat namespacor.
S'orlok Reaves: Whoah....
<textnode>
  <>Contents</>
  <b>
    <>are</>
  </b>
  <>fun!</>
</textnode>
...why don't you just use a CDATA node? Like:
<mytext> <![CDATA[Contents <b>are</b> fun!]]> </mytext>
I like the way you dynamically determine everything else, so it'd be a shame to have to require "textnode" on all text data.
Also, by the way, "root" is always "root", not "myroot" or anything else. But this is a minor point.
Mike C.: What I was saying is that you're missing the point. When I export to XML, I'm not going to reproduce all the headers and stuff in XML, as all the headers and stuff are reproductions of XML! XML only has a string data type. I'm nice enough to type it for the purposes of efficient storage, but that's all. When I export it back to XML, it's exported all in text again, just as XML likes it. I'm not going to add arbitrary elements for typing, since those are not part of XML, nor are they part of the original data. 0xFF is "255". "foo" is "foo". etc.
As for the CDATA sections, why would I do that? The original markup was whatever, not <b>whatever</b>
The key point to remember is that RELOAD ~= XML, in binary format.
EDIT: "it'd be a shame to have to require "textnode" on all text data." Oh, I get what you misunderstand. Let me rewrite that last example:
<foo> Contents <bar>are</bar> fun! </foo>
Remember: XML attaches no special meaning to any tag at all.
S'orlok Reaves: The key point to remember is that RELOAD ~= XML, in binary format.
Oh, ok, I think I get it now. So game utilities can deal with either XML or RELOAD, and converting back and forth multiple times will add no ambiguity or subtle differences to the data, right? While the XML stylesheets that define the LUMPS of the OHR will be generated independently, and have nothing whatsoever to do with RELOAD?
Seems a clean separation to me. Go for it.
Mike C.: That is correct, short of the caveats I mentioned... uh, about halfway down this page.
strings that look like ints?
Bob the Hamster: What happens if you load a string which looks like an int? Like "23523" as a string of digits. In a dynamically typed language, who cares if it comes across as an integer, but in a statically typed language like FreeBasic, couldn't that be a problem?
Mike C.: Easy. GetString returns a string, no matter how it has to make it up. If it's an int, it'll return STR(node->num). If it's a float, it becomes STR(node->flo). If it's a null... It returns "". If it's children... well, I'm not sure how to handle this. Maybe return "<" & node->name & ">"?
Conversely, if it's a string, and GetInteger is called, it'll return CInt(node->str). If the string doesn't parse right, you get a 0, which is what would happen in a dynamically typed language.
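A rough Python sketch of the coercion rules just described. The `Node` class, its field names, and the type labels are hypothetical stand-ins for the real reload.bas structures. The children case uses the tentative "<name>" idea from the comment, and the float, null and children cases of `get_integer` are assumptions, since the discussion does not settle them. Note also that Python's `int()` rejects trailing text, whereas FreeBASIC's conversion reads a leading number.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Node:
    name: str
    kind: str          # "null", "int", "float", "string" or "children"
    value: Any = None


def get_string(node: Node) -> str:
    """Always return a string, whatever the stored type is."""
    if node.kind in ("int", "float"):
        return str(node.value)
    if node.kind == "string":
        return node.value
    if node.kind == "null":
        return ""
    return "<" + node.name + ">"   # tentative idea for container nodes


def get_integer(node: Node) -> int:
    """Always return an integer; unparseable strings become 0."""
    if node.kind == "int":
        return node.value
    if node.kind == "float":
        return int(node.value)     # assumption: truncate
    if node.kind == "string":
        try:
            return int(node.value.strip())
        except ValueError:
            return 0               # matches the "you get a 0" rule above
    return 0                       # assumption for null and children


print(get_string(Node("gold", "int", 42)))
print(get_integer(Node("count", "string", "23523")))
```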
Bob the Hamster: GetString() == cool. Also, GetInteger on something that doesn't parse right should probably be an error, not a zero. Only sloppy dynamically typed languages do that :)
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> n = int("foo")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'foo'
Actually, since freebasic provides pretty weak exception handling, it would probably be best to return a 0 as you suggested, and just log it as a warning in g_debug.txt
Mike C.: Well, the rationale behind 0 is that the FB library does it. And, error handling does suck, yes. That said, the only type I'm not sure about is the null type. In RELOAD, a null type (that is, a node without any content) is useful only for flags/bitsets. So, maybe converting that to an int should be 1, instead of 0? I dunno.
Also, note that in python, "foo" is a valid number in base 25 and up ;)
The Mad Cacti: Actually, 0xFF would not be interpreted as an integer in the way that you propose to parse input. Well, I mean, it would be stored as a string in RELOAD. I guess that might be a good thing: you can tell 0xFF and 255 apart.
Mike C.: First off, it's not a proposal, since I already do it.
Second, I was thinking about that. Well, not that specifically, but about whether I should bother to write a custom str2int routine. I decided "no". Why? FB already has one. These are all valid numbers, which SHOULD be recognized as such:
- 1234
- &h4D2
- &o2322
- &b10011010010
That said, zero in these bases doesn't work. FB does not have an "IsInt" function like .Net does, and since Val() returns 0 on an invalid string, detecting a real zero from an invalid zero is... tough. I have a very fragile setup right now, but it's hardly perfect.
On the other hand, this will also parse as a number:
"1234 is a rather unusual number."
I... have no idea how I would handle this situation.
if ValLng(node->str) <> 0 || Str(ValLng(node->str)) = "0" then
I don't understand what Str(ValLng(node->str)) = "0" is supposed to do. Maybe you mean node->str = "0"?
Here's my suggestion:
if ValLng(node->str) <> ValLng(node->str & "1") then
This doesn't optimise strings which are number followed by nonnumber, or floating point, or integers too large to store, but does catch numbers (including 0) in other bases. Also, if the appended "1" forms a number larger than 64-bits, vallng returns -1, so it works too!
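The append-"1" trick can be tried out in Python against a simplified model of ValLng. The `val_lng` function below is an assumption, not FreeBASIC's implementation: it parses the longest valid numeric prefix (decimal, &h, &o or &b), returns 0 when there is none, and returns -1 when the number does not fit in 64 bits, as claimed in the comment above.

```python
import re

# Leading whitespace, then either a prefixed literal or a signed decimal.
_PREFIX = re.compile(
    r"\s*(?:&[hH]([0-9a-fA-F]*)|&[oO]([0-7]*)|&[bB]([01]*)|([+-]?[0-9]*))"
)


def val_lng(s: str) -> int:
    """Simplified, hypothetical model of FreeBASIC's ValLng."""
    hex_d, oct_d, bin_d, dec_d = _PREFIX.match(s).groups()
    if hex_d is not None:
        digits, base = hex_d, 16
    elif oct_d is not None:
        digits, base = oct_d, 8
    elif bin_d is not None:
        digits, base = bin_d, 2
    else:
        digits, base = dec_d, 10
    if digits in ("", "+", "-"):
        return 0                   # no number found
    value = int(digits, base)
    if not -(2**63) <= value < 2**63:
        return -1                  # overflow behaviour as described above
    return value


def is_integer(s: str) -> bool:
    """Appending "1" only changes the result if the whole string was numeric."""
    return val_lng(s) != val_lng(s + "1")


for sample in ["1234", "0", "&h4D2", "1234 is a rather unusual number.", "foo"]:
    print(repr(sample), is_integer(sample))
```

Under this model a bare "&h" or "-" would also pass the test, because appending "1" turns it into a valid number. Whether the real ValLng behaves the same way has not been checked here.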
NeoTA: Give this man a regexp :) If the regexp (something like '^(&[bho])?[0-9a-fA-F-]+$') matches, it's an int. This doesn't handle cases like 'beeff001' (i.e. it wrongly allows hex characters without a &h prefix). Restructuring the regexp can fix that.
If regexps are not available to you, the simplest criterion is 'anything with a space in it cannot be a number'.
Bob the Hamster: The regexp I would use is '/^([0-9]+|&h[0-9A-F]+|&o[0-7]+|&b[01]+)$/i' which would be perfectly accurate for all cases of all three types of numbers. FreeBasic does come with headers for PCRE, but that would be yet another library dependence :(
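For comparison, the same pattern in Python's `re` module, as a sketch. The function name `looks_like_int` is made up, and `fullmatch` stands in for the `^...$` anchors. As written, the pattern does not accept a leading minus sign and does not check whether the number fits in 64 bits.

```python
import re

# Decimal, &h hexadecimal, &o octal or &b binary literals, case-insensitive.
INT_LITERAL = re.compile(
    r"([0-9]+|&h[0-9A-F]+|&o[0-7]+|&b[01]+)", re.IGNORECASE
)


def looks_like_int(text: str) -> bool:
    """True if the whole string is one integer literal in a supported base."""
    return INT_LITERAL.fullmatch(text) is not None


for sample in ["1234", "&h4D2", "&o2322", "&b10011010010",
               "1234 is a rather unusual number.", "beeff001"]:
    print(repr(sample), looks_like_int(sample))
```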
Mike C.: I have nothing against Regular expressions. However, I don't really want to add them for something which should be as easy as this...
So, I'll combine the suggestions:
IsNumber = ValLng(node->str) <> ValLng(node->str & "1") AND instr(node->str, " ") = 0
Spaces at the beginning and end of the node are removed by a previous function, so I only have to worry about embedded spaces.
The Mad Cacti: Why would instr(node->str, " ") = 0 be necessary? vallng doesn't seem to skip over any spaces except those at the beginning. The reason that my suggestion would work perfectly afaik (even better than James' regex, which will cause strings of numbers too big to store to be translated to -1) is that appending "1" will only change the result if vallng has reached the end of the string with everything so far being a valid 64-bit number.
On the other hand, what about translating doubles?
RELOAD != ASN.1
NeoTA :
As far as I can see, RELOAD is a slight rehash of [ASN.1 DER], especially given the way you separate out the xml headers and built a tag table. IMO, a dependency on [libtasn1] would be a lot less work, and less bug prone, than reimplementing this particular wheel :) And of course would be more generally readable.
Mike, what do you say to this? Did you check out ASN.1 originally, is there a good reason not to use it?
Bob the Hamster: No important code uses reload format yet, so there is still time to make changes. A dependency on libtasn1 seems like more work while we are working in freebasic, but less work for after we have a working c++ translation (and less work for porters or authors of other tools)
NeoTA: Actually, after closer inspection, ASN.1 appears to be oriented towards predefined structures; it does have support for tagged data, but this introduces a certain amount of overhead. There may not be a direct equivalent in standards (only non-standards such as the various binary XML variations, some of which handle typing.)
Arrays
I also noticed that arrays / lists are not specified -- and currently not possible (read element type 7 spec -- it looks like it might suit, but the definition of an element thwarts it). Is this a problem (e.g. for TMC's dynamic type plans)?
The Mad Cacti: I had thought that arrays in RELOAD had been discussed, but it's not on this talk page, and there's not much in the IRC chat log Mike posted on the mailinglist either, aside from two snippets:
<MikeCaron> an array is an element with a bunch of children
<tmc> arrays of homogeneous data
<tmc> like, an array of 10 ints
<tmc> oh
<tmc> store as a string?
<tmc> yeah you already covered that
<MikeCaron> not really pretty, but yeah
<MikeCaron> the other way of expressing that would be to have a seprate "array" type for each data type
<MikeCaron> so, you have Byte and Byte Array
<MikeCaron> etc
<MikeCaron> I'm not adverse to doing this, but I don't know if we have a need for it
<tmc> you have inline strings already
<MikeCaron> yes, but looking at it from the other size, encoding blobs of data like that is rather inefficient
(I'm not sure what Mike was talking about in that last line.)
I think it would be great to be able to serialize HS values as RELOAD trees, in save files and for specifying script arguments in Custom.
I assume you said "the definition of an element thwarts it" because each child needs a tag name. Yes, I think that some minor changes would have to be made to RELOAD to make it more suitable for arrays and associative arrays: make tag names optional. And if they're optional, we can do better than just setting them to "" or 0; add a bit to the element type that indicates whether the tag is present. Now you've got one byte of overhead per array element!
I think the best way to store an associative array is as an array with an even number of elements, implicitly paired up.
<array> <>1</> <>2.0</> <>foo</> </array>
<dict> <>1</> <>foo</> <>bar</> <>42</> </dict>
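Reading such an implicitly paired list back into a mapping is straightforward. A minimal Python sketch, with the function name `pairs_to_dict` chosen for illustration:

```python
def pairs_to_dict(items):
    """Pair up a flat [key, value, key, value, ...] list into a dict."""
    if len(items) % 2 != 0:
        raise ValueError("an associative array needs an even number of elements")
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(items[0::2], items[1::2]))


# The <dict> example above, with values already typed.
print(pairs_to_dict([1, "foo", "bar", 42]))
```

The only structural check needed is that the element count is even. Everything else, such as duplicate keys, is left to whoever consumes the data.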
However, this doesn't cover references to objects. How they are stored will probably depend on how they are implemented.
Long tag names
Currently, a string in the string table is specified as SHORT BYTEBYTEBYTE.. Do we need a short for tag names (even assuming they can be written in any language using UTF-8), considering that typically tag names are likely to be less than 32 characters in size (in english, 5-15 characters seems to be the normal range)? If we aren't going to use this ability in implementation of associative arrays (I agree with TMC about the sensible way to implement them), I see no need to support tag names >255 bytes in length, so I suggest using a BYTE instead of SHORT.
Bob the Hamster: But what do we gain from limiting them? Saving number-of-unique-tags bytes? Doesn't seem that big a deal.
RPath
Teh Em Sea: Firstly, very pleased that you are working on this (but not that I planned on using RPath any time too soon), however, comments:
How do you check for a node named foo with value bar? This seems rather important! You only mentioned checking for children!
Also, modifying one of your examples:
node1[node2=2]/node2[foo=1]
how would you search for a node1 with a node2 with value 2 and child foo=1? The way it's specified, the two matched node2's could be different.
Also, I assume this is allowed?:
node1[a=2][b=3]
Finally, have a look up the page at the arrays section for a suggestion.
Mike C.: Well, the key to remember is that a node that has a value cannot have children. A node with children cannot have a value. Keeping that in mind, many problems go away.
Another thing is that the query will select the last listed node. Think of it like a directory path. The file is the last one in the chain, right?
Also, eventually I was going to have this:
node1[a=2&&(foo="lalala"||bar=baz)]
Which would select node1s where
- a = 2, and
- either child foo = "lalala" or child bar = child baz
Regarding arrays, my stance hasn't changed. We don't need an explicit array mechanism, as we can just specify more than one node with the same name! Problem solved! (I'm also going to allow choosing, say, the third node of a given name, etc)
node1[a=2][b=3]
What would this do?
The Mad Cacti: I know container nodes don't have values (not personally the way I would design things, but I see that you won't be swayed), I meant, what if you are looking for a leaf node with a certain name and value and don't care what type its parent is? For example, scanning a RELOAD tree for all enemy=32 references.
Re: Arrays: no, actually, I wanted to draw attention to:
- ...make tag names optional. And if they're optional, we can do better than just setting them to "" or 0; add a bit to the element type that indicates whether the tag is present. Now you've got one byte of overhead per array element!
I think that's better than script data RELOAD files being a billion 'data' tags, wasting space.
... OK, you would write that as
node1[a=2&&b=3]
Mike C.: You're missing the point. If you're searching for a node which has a specific value, you're wasting time, since you already have the value!
(Note: obviously, if you wish to modify the value, this does not apply.)
Now, I'm not opposed to allowing value nodes to have children (or, more accurately, nodes with children to have values). Just, that it makes it less like XML.
As for arrays, I'm sticking to my guns. RELOAD is designed for semantic data, not arrays. If you need to store a large array, you're either doing it wrong, or you're using the wrong format.
The Mad Cacti: What, I'm not missing the point, you're just thinking much too narrowly about uses.
Shock. Wasn't expecting that. But why did you change the format to that? You could have used a bit in the Type to specify whether a node is a container or not. That saves 8 bytes of extra overhead on all leaf nodes - the majority. 12 bytes to store a node with value '0' instead of 4. So I would guess that it would double the size of a file containing mostly integers, as opposed to strings, and a low number of unique tag names. On-disk, a node with 0 children would have the same meaning as one without the children bit set at all; and they would both be loaded in-memory identically. It's just a storage optimisation - I don't see how you could disagree!
Maybe RELOAD isn't the right format for storing script data after all, although I mentioned giving all nodes children because it does make RELOAD more suitable. For example, weakly typed integers could be integer nodes with an 'annotation' child.
It would be a lost opportunity to have two different formats, but it looks like I would write my own RELOAD-loading and -saving functions anyway: it would be incredibly inefficient to translate script data <-> RELOAD tree <-> file. It would probably not be any more work to write them than to use the reload.bas functions anyway.
Mike C.: I took care of the efficiency problem by using variable-length integers. Unfortunately, I can't use them for the node size, as I have to fill it in after the fact...
Anyway, I don't know that RELOAD is a good choice for script data. It certainly wasn't what I was going for, anyway.
The Mad Cacti: There's an error in either your specification or your code:
|INT || Total size of content, not including this INT (If number of children > 0)
That VLI scheme is a good improvement as it reduces the typical extra overhead on leaf nodes from 8 to 4 bytes. You could use VLI for the node size too by using a buffer of 64 bytes or more, initially writing the node size assuming it'll take 4 bytes, and revising later by shifting the buffer if it's less. Assuming you are not keen on this short but complicated optimisation, it would still be good to be able to use VLI for the node size as well and lose an extra 3 bytes of overhead; so alternatively: for leaf nodes you can easily work out the size of the node before writing it, and just assume 4 bytes/27 bits needs to be used otherwise.
Or alternatively, you could use that bit, because it'd take about 4 lines of code to implement, and would outdo the improvements due to VLI. However, I now see that the way things are now, you can skip over a node without having to examine its type; and this bit would make you lose that.
I just remembered that storing multiple variables referencing the same object was going to be very difficult in RELOAD, so it looks like I should definitely change plans and bother you somewhat less.
Mike C.: Oops, that's a remnant from "all nodes have kids" v0.1. I'll fix it.
The proper way to solve the size issue is to have a buffer that each node writes in to. Then, you can get the size, and write the buffer. This is not very complicated, except that:
1. I have to worry about buffer-size management 2. Each "level" of the tree would require a buffer of its own
In Java, this is dead easy! Just whip out a class like MemoryStream, and go nuts...
I'm going to leave it alone, for now, maybe ruminate on it for a bit...
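The per-level buffer idea can be sketched in Python with `io.BytesIO`. The layout used here (a 1-byte name length, the name, a 4-byte little-endian content size, then the content) is invented for illustration and is not the actual RELOAD format. It also uses a fixed-width size rather than a VLI to keep the sketch short.

```python
import io
import struct


def write_node(out, name, content):
    """Write one node; content is bytes (leaf) or a list of (name, content)."""
    buf = io.BytesIO()             # one buffer per level of the tree
    if isinstance(content, bytes):
        buf.write(content)
    else:
        for child_name, child_content in content:
            write_node(buf, child_name, child_content)
    body = buf.getvalue()

    encoded = name.encode("utf-8")
    out.write(struct.pack("<B", len(encoded)))
    out.write(encoded)
    out.write(struct.pack("<I", len(body)))  # size is known before writing
    out.write(body)


def skip_node(data, offset):
    """Return the offset just past the node at `offset` without parsing it."""
    name_len = data[offset]
    size_at = offset + 1 + name_len
    (size,) = struct.unpack_from("<I", data, size_at)
    return size_at + 4 + size


tree = ("party", [("gold", b"\x2a"), ("name", b"Bob")])
out = io.BytesIO()
write_node(out, *tree)
data = out.getvalue()
print(len(data), skip_node(data, 0))
```

Because each level is serialised into its own buffer first, the size never has to be patched in after the fact. The stored size is also what lets a reader jump over a branch it does not need.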
(Also, references would be fairly simple in RELOAD:
<whatever>
  <something id="foo">...</something>
  <somethingelse>
    <ref>foo</ref>
  </somethingelse>
</whatever>
Or, whatever kind of scheme you would want.)
The Mad Cacti: If you're not against it, then I'd be happy to implement it. It's certainly more interesting than cleaning up old code.
(Hmm, that would work, but there would still be various complications while loading/saving. I guess I'm going to put off thinking about it until the interpreter internals are more concrete.)
Int size fixedness
NeoTA: Something that is *not* in this spec is whether it's OK for an 8bit integer to be read, and later output as a 16bit integer (for example). In short, is the bit size a hard limit or simply a convenience? If I write integers in the smallest possible type (eg values -128<v<128 as BYTE, -32768<v<32768 as SHORT, -(2**31)<v<(2**31) as INT, and larger values as BIGINT (what I call the 64bit int type), will that annoy OHRRPGCE code (eg, by causing it to try to store a 16bit value in an 32bit variable)?
Whatever the answer, I think the spec should be explicit about it.
(The reason I ask is that I have recently implemented a mostly-working RELOAD reader/writer, (http://gitorious.org/nohrio/nohrio/blobs/master/nohrio/reload.py) The present implementation autodetects the type that integers should be written with (see write_element()))
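As a sketch of the "smallest possible type" approach (not the nohrio implementation, which is not reproduced here), the following Python function picks the narrowest signed width that holds a value. The type names follow the ones used above. The ranges are the standard two's-complement ones, which differ by one value at the negative end from the strict inequalities quoted above.

```python
import struct

# (type name, struct format, bit width), narrowest first; all little-endian.
_INT_TYPES = (
    ("BYTE", "<b", 8),
    ("SHORT", "<h", 16),
    ("INT", "<i", 32),
    ("BIGINT", "<q", 64),
)


def pack_smallest(value):
    """Return (type name, packed bytes) using the narrowest signed type."""
    for name, fmt, bits in _INT_TYPES:
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return name, struct.pack(fmt, value)
    raise OverflowError("value does not fit in 64 bits")


for sample in (118, 128, -128, 40000, 2**31):
    name, raw = pack_smallest(sample)
    print(sample, name, len(raw))
```

If the stored width is only a space-saving measure, a reader can widen every value to its largest integer type on load and let the writer choose the width again on save.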
The Mad Cacti: Neat! I'll look it over some other time.
The answer is that different integer sizes are a space-saving technique only. It's a bit messy that VLI encoding was not used instead (on the other hand, that would be slower)
Comments about file format
I like this file format but there are two things wrong:
- There should be separate string and blob types (this allows easier interaction with XML and SQL for example). So far it seems only strings are used and not blobs, so it shouldn't be a big problem. Programs that don't care about the difference (such as OHRRPGCE) can treat them as equivalent when reading.
- VLI numbers can be negative, although there is no use to make them negative since negative numbers aren't applicable to any place where they are used in this file. Therefore you could reuse the sign bit for other purposes if you need to I suppose?