Talk:Plan for gfx directx

The Mad Cacti 05:49, 16 May 2011 (PDT):

Some stuff I definitely would like to expose to plotscripting is the ability to scale (with non-integer multiples) and rotate sprites. Why stop there? More general (matrix) transformations are no extra work to code in software, i.e. without hardware acceleration (though implementations already exist of course for fbgfx and SDL, so this point is moot). An extra step is to actually project the surfaces as quads in 3D space -- I think the only difference is that this means perspective correction is performed on the texture, though I don't know all that much about 3D rendering. But that requires asking for the transformation in a completely different way, so I doubt that would happen...

About the suggested "Hardware Accelerated Rendering" interface:

At the moment, the engine is 2D and simply wouldn't use most of these suggested features -- in particular, the concepts of world and view transforms. However, gfx_directx is somewhat independent, and there is no reason not to add features to it which the OHR won't use, as they might be used in future (eg. if we get an OpenGL backend and ditch the non-3D backends), by forks, and by people reusing just the backend. I've already once (or twice?) created a fork of the engine for a specific game and have considered doing it again (and have been asked to!).

I know I will have fun playing around with any 3D capabilities in the backend, and the possibility of optional eye candy where lack of support by other backends wouldn't matter. But on the other hand I don't want to make the interface for the other graphics backends convoluted, which faking a bunch of 3D stuff certainly would do.

So I think there are two alternatives: give all graphics backends the same interface (ie. drastic changes to what you've proposed), or use a two-layer interface: One interface which abstracts the differences between OpenGL and DirectX, and is provided only by certain backends, on top of which we write a wrapper to implement the 2D general graphics interface, which gfx_directx.dll won't itself provide. I've written such a wrapper before, so I already know how not to write one ;). Also, we could use C++ instead of FreeBasic there.

The more I think about it, the more I like the second option. In which case I agree in spirit with the interface you suggested.

What's the benefit of ID3DXSprite? The gfx_Sprite* functions in the interface look like unnecessary overhead if surfaces don't contain multiple images. With the exception of tilesets, individual frames/images won't be stored in sprite sheets internally by the engine. Couldn't we have calls to directly draw (enqueue) one surface onto another? Maybe gfx_Sprite* is still useful for drawing map layers.

Would creating lots of 20x20 surfaces result in lots of overhead? But I guess "lots" is very unlikely to be more than, say, 4000, which at 32bpp is only about 6MB. Also, is fragmentation of video memory through creating and destroying surfaces in random order a problem?

I doubt we would use z buffers, because we already sort everything (ie. slices) into the right order before rendering: the order of children in the slice tree must match the order in which they're drawn, so we can't stop doing that. (Though if I'd implemented slices, I might not have done it that way. Z-ordering of slices is still somewhat awkward and seems like a tacked-on capability.) So a z-buffer seems totally useless to us right now, but if we're designing an abstraction around OpenGL and DirectX, it can stay in the interface.

I guess having lots of getters and setters is a better approach than our current gfx_getwindowstate, because it's simpler to add more functions than to expand a struct.

I notice the window isn't specified until the gfx_RenderPresent call, which I found surprising. But on other backends it might be necessary to specify which window you're drawing to before you draw anything, so despite it being potentially useful to be able to cheaply duplicate to multiple windows, we might have to change that.

I think I'll leave feedback on the window interface and other details to a separate post, this is getting pretty big.

What exactly would gfx_RenderGet/SetTarget do? Would these copy the render buffer to another surface, or just return pointers? How expensive actually is it to render the scene and then copy the rendered image into main memory?


Bob the Hamster 09:16, 16 May 2011 (PDT): Sorry to go off-topic here because this isn't really backend related, but since you mentioned sprite rotation, I was wondering if you have ever seen what RotSprite can do?

The Mad Cacti 14:51, 16 May 2011 (PDT): I haven't, I was thinking of just plain bad looking nearest-neighbour rotation. RotSprite is closed source, but there's information about a reimplementation here, and there will be alternatives. Also it's not intended for real time use ("80 times more memory as normal rotation" sounds pretty suspicious!), however we could probably get away with using it. Also, we could add a rotation tool to the sprite editor! Too bad 'R' is taken :P

Hieran Del8 10:06, 16 May 2011 (PDT):

"An extra step is to actually project the surfaces as quads in 3D space -- I think the only difference is that this means perspective correction is performed on the texture, though I don't know all that much about 3D rendering. But that requires asking for the transformation in a completely different way, so I doubt that would happen..."

I don't think I understand this. Are you referring to gfx_RenderSetTransform()?

Matrix transformations are not hardware accelerated by DirectX (unless performed in a shader, which is not the intent). They could potentially be done in FB, and I'd be fine with that, except OpenGL and DirectX order their matrices differently: row-major vs. column-major. If we could guarantee the backend would receive matrices in one of the forms (ie, we make all matrices in the row-major format (which I think is OpenGL? I get those mixed up)), the backend could either use the matrix directly or transpose the matrix before using it. That would remove a huge chunk of interfaces...
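
The transpose itself is trivial; something like this (just a sketch, names are mine) would let a backend adapt whichever layout we standardize on:

 /* Swap rows and columns of a 4x4 matrix stored as a flat array:
    element (r,c) becomes element (c,r). */
 void mat4_transpose(float out[16], const float in[16])
 {
     for (int r = 0; r < 4; r++)
         for (int c = 0; c < 4; c++)
             out[c * 4 + r] = in[r * 4 + c];
 }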

"At the moment, the engine is 2D and simply wouldn't use most of these suggested features"

True, but wouldn't it be nice if the engine could support the situation where you pass under a bridge while an NPC is passing over it? World and view matrices simplify that task.

"But on the other hand I don't want to make the interface for the other graphics backends convoluted, which faking a bunch of 3D stuff certainly would do."

No point in trying to fake it. I plan on working on an OpenGL version (the gfx_sdl++) after I prove the concept with the gfx_directx version.

"So I think there are two alternatives: give all graphics backends the same interface (ie. drastic changes to what you've proposed), or use a two-layer interface: One interface which abstracts the differences between OpenGL and DirectX, and is provided only by certain backends, on top of which we write a wrapper to implement the 2D general graphics interface, which gfx_directx.dll won't itself provide. I've written such a wrapper before, so I already know how not to write one ;). Also, we could use C++ instead of FreeBasic there."

I like the first option, giving all graphics backends the same interface (as an augmentation of interfaces v1.0). It won't require drastic changes to the interfaces I've proposed, as long as we treat everything as "handles", not pointers to actual data (except for initialization data, or updating a surface with some engine data, etc.). The exception to this might be matrices, in case we decide to have FB do the computations. No reason to abstract OpenGL and DirectX, the interfaces will do that well enough.

"What's the benefit of ID3DXSprite? The gfx_Sprite* functions in the interface look like unnecessary overhead if surfaces don't contain multiple images. With the exception of tilesets, individual frames/images won't be stored in sprite sheets internally by the engine. Couldn't we have calls to directly draw (enqueue) one surface onto another? Maybe gfx_Sprite* is still useful for drawing map layers."

You're right, without multiple images to the surface, gfx_Sprite* functions are unnecessary, and so gfx_Surface* should be used. In the short term, I anticipate gfx_Surface* being used by the engine, and gfx_Sprite* by the plotscripting for accessing sprite sheets that were created outside the engine's editors. Eventually, the engine might want to think in terms of sprite sheets instead from an efficiency standpoint.

"Would creating lots of 20x20 surfaces result in lots of overhead? But I guess "lots" is very unlikely to be more than, say, 4000, which is 6MB. Also, is fragmentation of video memory through creating and destroying surfaces in random order a problem?"

Yes. Lots. :P If it's possible to combine many of these images into common surfaces (like sprite sheets for walkabouts, attack animations, etc.), there's an obvious performance gain when the texture stages don't have to switch very frequently.

I have not run into fragmentation problems in video memory. Then again, I've never created 4000 20x20 surfaces and deleted/created them in random order. I think it's just wiser to "pool" surfaces into sprite sheets. Even if the engine thinks of them separately, as long as it has a sprite sheet "handle" that each frame belongs to, the backend can manage the pool.
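
As a rough sketch of the bookkeeping I mean (hypothetical names, not proposed interfaces): the engine keeps one small record per frame, and the backend resolves it to a pooled sheet plus a source rectangle.

 /* One record per frame; the backend owns the sheet pool. */
 typedef struct {
     int sheetHandle;   /* which pooled sprite sheet this frame lives in */
     int x, y, w, h;    /* source rectangle within that sheet */
 } FrameRef;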

"I doubt we would use z buffers, because we already sort everything (ie. slices) into the right order before rendering: the order of children in the slice tree must match the order in which they're drawn, so we can't stop doing that. (Though if I'd implemented slices, I might not have done it that way. Z-ordering of slices is still somewhat akward and seems like a tacked-on capability.) So a z-buffer seems totally useless to us right now, but if we're designing an abstraction around OpenDL and DirectX, it can stay in the interface."

In 3d rendering, it's desirable to sort through all your assets for rendering order, and then render them from front to back (transparent objects obviously in back to front). The depth buffer's purpose is to account for when assets are not parallel with each other and orthogonal to the look-at vector. So for the engine's current purposes, it is useless. Every surface is drawn parallel to the others and orthogonal to the look-at vector. Is there any instance where this might change?

"I notice the window isn't specified until the gfx_RenderPresent call, which I found surprising. But on other backends it might be necessary to specific which window you're drawing to before you draw anything, so despite it being potentially useful to be able to cheaply duplicate to multiple windows, we might have to change that."

Well, only swap chains can present, because they are collections of surfaces that rotate for presentation. Swap chains are also bound to windows, so it seems appropriate to treat a window as a swap chain whenever rendering is concerned. Swap chains can be used with only a few other gfx_Surface* functions (since swap chains are specialized surfaces), such as Fill, Stretch, GetCopy, SaveToFile, RenderSetTarget, and RenderGetTarget.

"What exactly would gfx_RenderGet/SetTarget do? Would these copy the render buffer to another surface, or just return pointers? How expensive actually is it to render the scene and then copy the rendered image into main memory?"

SetTarget will set the rendering target for a series of draw calls. A swap chain (window surface) can be the target, and so can a render target surface (dynamic surface) which is an intermediate target that will be used as input in another series of draw calls.

GetTarget returns a pointer (or handle) to the current render target. Using gfx_SurfaceGetCopy(), the contents can be copied out of it. This handle can be either to a render target surface or a swap chain backbuffer.

The Mad Cacti 16:35, 16 May 2011 (PDT):

I don't think I understand this. Are you referring to gfx_RenderSetTransform()?

No, I was talking about the fact that a rectangle rendered in 3D space looks like a quadrilateral. I made a mistake, I thought that a quadrilateral could be described by a 2x2 transformation matrix, but no, that's a parallelogram. It's not too expensive to draw textured parallelograms in software (I've written C code which managed about 1 pixel/6 CPU cycles). A rectangle rendered in 3D space requires completely different data: vertex coordinates and view transform matrix. Anyway, dumb aside.

I don't want to reimplement matrix transformations in FB. Sounds like a waste of time, plus FB is slow.

True, but wouldn't it be nice if the engine could support the situation where you pass under a bridge while an NPC is passing over it? World and view matrices simplify that task.

I don't really see the argument, since first we would have to give everything in-game (including layers and "map objects") z values. And once we do that, we can just sort the slice tree and get layering without backend involvement. However, sorting the slice tree is going to be more expensive than just sorting the onscreen sprites... (Also, we'd want multiple wallmap layers. Anyway, off-topic.)

No point in trying to fake it. I plan on working on an OpenGL version (the gfx_sdl++) after I prove the concept with the gfx_directx version.

Awesome! And of course SDL 2/1.3 looks very close to official release, after a decade. But my point was that I don't want to drop any of the existing backends (except allegro). Backends always seem to have unsolvable bugs, and I wouldn't assume everyone has a working OpenGL or DirectX implementation (not uncommon on Linux with broken graphics drivers). There's also a lot of effort invested in fixing platform-specific bugs in gfx_sdl. And to keep gfx_sdl and gfx_fb around, they will need to support the new interface, aside from any optional 3D functions it might contain. It really shouldn't be too much work to update them. So this is why I proposed wrapping the 3D interface to produce a lowest-common-denominator 2D one, which would look like:

gfx_RenderPresent( pWindow )
gfx_RenderSetTarget( pSurface )
gfx_RenderGetTarget( pSurfaceOut )

gfx_SurfaceCreate( width, height, usage, format, pInitialData, pSurfaceOut )
gfx_SurfaceDestroy( pSurfaceIn )
gfx_SurfaceUpdate( pInputData, pSurfaceDest )
gfx_SurfaceFill( color, pSurfaceDest )
gfx_SurfaceDraw( pRectSrc, pSurfaceSrc, pRectDest, pSurfaceDest, alpha, blendType )
// Where pMatrix is a 2x2 matrix, allowing rotation and scaling
gfx_SurfaceDrawTransformed( pRectSrc, pSurfaceSrc, pRectDest, pSurfaceDest, alpha, blendType, pMatrix )
gfx_SurfaceGetCopy( pSurface, pOutputData )

Funny, now that I sketch it out I realise that the only difference is the restriction to 2D transformations. This would be a very small wrapper.

You're right, without multiple images to the surface, gfx_Sprite* functions are unnecessary, and so gfx_Surface* should be used.

You didn't describe a function for drawing one surface onto another, aside from gfx_SurfaceStretch. Also, transparency isn't included in the sketch interface.

In the short term, I anticipate gfx_Surface* being used by the engine, and gfx_Sprite* by the plotscripting for accessing sprite sheets that were created outside the engine's editors. Eventually, the engine might want to think in terms of sprite sheets instead from an efficiency standpoint.

I don't see the point of supporting a "foreign" graphics interface in addition to the native slices/sprite sets one, with the sprite editor for organising graphics, which will be flexible enough to import arbitrary sprite sheets.

Yes. Lots. :P If it's possible to combine many of these images into common surfaces (like sprite sheets for walkabouts, attack animations, etc.), there's an obvious performance gain when the texture stages don't have to switch very frequently.
I have not run into fragmentation problems in video memory. Then again, I've never created 4000 20x20 surfaces and deleted/created them in random order. I think it's just wiser to "pool" surfaces into sprite sheets. Even if the engine thinks of them separately, as long as it has a sprite sheet "handle" that each frame belongs to, the backend can manage the pool.

We could glue all the frames in each sprite set into a single sprite sheet. (Note that these might be all different sizes.) Well, someone could, and this is pretty reasonable to do in the engine; I probably want to do that anyway because the current sprite set implementation is a mess. If textures must be square, then we'd better not glue them into long thin strips. For example, 500 160x20 32bpp surfaces = 125MB instead of 6MB if each is expanded to a square power-of-two texture (256x256)! If 500 textures (this is the number of currently allowed NPC definitions per map) is still too many, then I would prefer the backend handle further gluing into larger textures, because that's a pretty low-level detail.

(transparent objects obviously in back to front)

Oh right, I forgot transparency required that.

Every surface is drawn parallel to the others and orthogonal to the look-at vector. Is there any instance where this might change?

Nope.

SetTarget will set the rendering target for a series of draw calls. A swap chain (window surface) can be the target, and so can a render target surface (dynamic surface) which is an intermediate target that will be used as input in another series of draw calls.

Ah, so this means that missing from the interface is a function to get a window surface for a window which you can then pass to gfx_RenderSetTarget? I assume you meant a swap chain can be treated as a single surface: the next one to be drawn on. As I understand it, drawing would work like this:

gfx_RenderSetTarget(gfx_WindowGetSurface(targetWindow));
gfx_RenderBegin();
... gfx_SurfaceDraw calls...
gfx_RenderEnd();
gfx_RenderPresent(targetWindow);

Also, what's gfx_RenderSetTransform? Set the view transformation?

Also, is a "system" surface one that's in system memory?

Hieran Del8 06:44, 17 May 2011 (PDT):

first we would have to give everything in-game (including layers and "map objects") z values. And once we do that, we can just sort the slice tree and get layering without backend involvement. However, sorting the slice tree is going to be more expensive than just sorting the onscreen sprites.

Good point. I was thinking without the source data, which was dumb. :S As regards the slice tree sorting with z values, I don't know how intensive that sorting might be--but I can't imagine that'd be the bottleneck.

the only difference is the restriction to 2D transformations.

You've convinced me: there's really no point to 3d transformations for this engine. It might look cool for a surface to come tumbling towards the screen, but it's a gimmick, and the point of the engine is 2d. I was thinking about some games like Chrono Trigger and Terranigma that used some 3d effects, but I keep forgetting the engine has already been well defined. So I concur: there shouldn't be 3d effects.

We could glue all the frames in each sprite set into a single sprite sheet.

I'm in line with that.

so this means that missing from the interface is a function to get a window surface for a window which you can then pass to gfx_RenderSetTarget? I assume you meant a swap chain can be treated as a single surface

Sure. I actually meant the window can be treated as a surface, because the swap chain is bound to it. Since everything is a handle, passing a window handle could be translated to a surface handle in those functions, and the appropriate resource be retrieved from the handle.

what's gfx_RenderSetTransform? Set the view transformation?

It was to set either world, view, or projection transforms, specified by the passed in 'type'.

is a "system" surface one that's in system memory?

Yes. There's not a point to it, as far as I can tell. I think I'll remove it from the list of usages.

There's still use for a separate world, view, and projection transform. The world transform (even with the restriction to 2d) will scale, rotate, and translate the surface into the world. The view transform will position the camera so the camera can do intersection tests to determine what gets rendered, and then sort it. The projection transform will define the dimensions of the visible area (width and height in pixels).

It'd be easier if all those things were calculated in the engine (which they are right now, just without matrices), which would guarantee a consistent activity across the backends. I don't know how the culling tests are working in the engine, but perhaps there are some efficiencies that can be discussed, such as frustum-AABB intersection tests (since everything is a quad, this is pretty easy). Perhaps the code can be written in C/C++ and linked. Then the backend would be dealing with only transformed quads, which I would probably lean towards for consistency. That way, the previous paragraph would be a moot point, as would all matrix discussions.

Actually, the more I think about it, the more I like the idea of only receiving transformed draw instructions. It'd be something like:

gfx_SurfaceDraw( pSurfaceSrc, pCornersVec2, argbModifier, pPalette, pRect )

where pCornersVec2 is an array of 4 2d points describing the transformed corners of the object, argbModifier is some array of 4 floats that scales the color components, pPalette is whatever palette is being referenced, and pRect is the rectangle of the image being sampled. If pRect is specified, the upper-left corner of pRect will be considered the origin. This same method would apply for sprite sheets as well.

A note about transformed vertices: we need to define what the ranges are. The device coordinates of the client area are between -1 and 1. The upper left corner of the client is (-1, 1), and the bottom right corner of the client is (1, -1). When specifying what part of the client area to draw, we can either transform the coordinates to that frame of reference, or use some arbitrary range that the backend then transforms. For forwards-planning, I think it's better if the backend receives the said frame of reference.
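
For concreteness, the mapping from client-area pixels into that frame of reference would be something like this (my names, just an illustration):

 /* Map a pixel position ((0,0) = top-left of the client area) to
    normalized device coordinates, where the top-left corner is (-1, 1)
    and the bottom-right corner is (1, -1). */
 void pixel_to_ndc(float px, float py, float clientW, float clientH,
                   float* ndcX, float* ndcY)
 {
     *ndcX = (px / clientW) * 2.0f - 1.0f;   /* 0..W maps to -1..1 */
     *ndcY = 1.0f - (py / clientH) * 2.0f;   /* 0..H maps to 1..-1 (y flips) */
 }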

So right now, I'm leaning towards:

  • Engine performs all tree-sorting
  • Engine queues objects for rendering
  • Engine transforms all objects into device coordinates
  • Backend renders transformed objects
  • Backend does not perform depth-tests

I'm going to start modifying the plans with this in mind.

Hieran Del8 09:33, 18 May 2011 (PDT): After planning out the internal transform functions, I've realized the corners of the objects must be 3d, but not xyz: xyw. The w component identifies whether the vector is a position vector or a normal. Also realized the pRect parameter of gfx_SurfaceDraw(), which must be specified whenever the entire surface is not being used, will only be useful in specifying the texture coordinates. So the comment about "the upper-left corner of pRect will be considered the origin" is bogus.
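
To illustrate the xyw distinction (a made-up example, not from the plans): with w = 1 for positions and w = 0 for normals, the translation part of a 3x3 matrix moves positions but leaves normals untouched.

 typedef struct { float x, y, w; } Vec3;

 /* Assumes an affine matrix whose bottom row is (0, 0, 1). */
 Vec3 transform(const float m[3][3], Vec3 v)
 {
     Vec3 r;
     r.x = m[0][0]*v.x + m[0][1]*v.y + m[0][2]*v.w;  /* m[0][2] is tx: applied only when w == 1 */
     r.y = m[1][0]*v.x + m[1][1]*v.y + m[1][2]*v.w;  /* m[1][2] is ty */
     r.w = v.w;   /* positions stay positions, normals stay normals */
     return r;
 }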

The Mad Cacti 10:16, 21 May 2011 (PDT):

I was thinking about some games like Chrono Trigger and Terranigma that used some 3d effects, but I keep forgetting the engine has already been well defined. So I concur: there shouldn't be 3d effects

I like to think that we're expanding the engine's scope little by little though. I didn't really mean that there shouldn't be 3D effects at all (and it's probably inevitable that I'll attempt Wolfenstein 3D graphics once scripting can draw to surfaces :) but they should be optional for the sake of graphics backends. For example, the 3D perspective option in Diablo 2. And that almost pointless and unnoticeable 3D effect on FF6's overworlds. Oh now I remember CT's 3D effects: rotating primitive solids. Also a pretty pointless use of 3D. OK I guess this is a bad argument: something minor and optional is not worth that much effort. Diablo 2-style perspective is the only good example I can think of.

It was to set either world, view, or projection transforms, specified by the passed in 'type'.

Heh, didn't notice that argument.


Wait, hold on. Receiving an array of four arbitrary vertex positions has problems. How are non-3D graphics backends meant to handle that? They would have to do a tricky test to see if this draw is just a plain untransformed blit, in which case they can be fast; or if it's described by a 2D transformation (rotate, scale, skew), which can be special cased to be reasonably efficient; or whether it's a general equilateral, in which case they would have to fall back to some slow and complicated software polygon renderer, probably pulled out of TinyGL or something like that, or perhaps just fail (if we never get around to using this possibility). I think these different cases should be split into different backend functions because it makes life much easier. The third case is definitely not going to be commonly used.

So right now, I'm leaning towards:
  • Engine performs all tree-sorting
  • Engine queues objects for rendering
  • Engine transforms all objects into device coordinates
  • Backend renders transformed objects
  • Backend does not perform depth-tests

I totally agree, except maybe about the device coordinates: those depend on the graphics library (which is what you're really targeting), so I'm not sold on -1.0,-1.0 to 1.0,1.0. For other than DirectX and OpenGL, 0.0,0.0 top left to window_width_pixels - 1.0,window_height_pixels - 1.0 would be more natural (still floating point for smooth transformations).

Speaking of windows, we need a function for querying whether the backend actually supports creating multiple windows, either due to backend limitations (SDL 1.2) or due to running fullscreen (or on a handheld). Also, user-resizeable (by dragging on the window) windows would be very nice (especially for something like a script debugger window displaying a big table). Of course this clashes with gfx_directx's zoom feature, but I think it would be really useful. The main window would probably only be marked resizeable temporarily (eg. just in the map editor), so maybe zooming could just be temporarily totally disabled, or some override key added...

Hieran Del8 00:25, 22 May 2011 (PDT):

didn't really mean that there shouldn't be 3D effects at all

Oh, I see. Then there'd still be use for the backend performing the vertex transforms, though I like the idea of the engine being able to do all the transforms itself. I'm a little timid about exposing too much 3d capability to the engine, but we can create functions that create and fill vertex buffers to render with hardware transform and texturing.

Receiving an array of four arbitrary vertex positions has problems. How are non-3D graphics backends meant to handle that?

Hmm. I was reading through Windows GDI (for Windows computers that can't support DirectX 9 or OpenGL) and I found PlgBlt(), which blits a bitmap image onto an arbitrarily transformed parallelogram. It has issues with the homogeneous, normalized device coordinates (which I address later in this talk page). I'm sure there are other APIs that expose features like those, and if not, what system are you thinking of in particular that doesn't support a form of OpenGL? I don't see the point for those per-pixel transforms to happen in the engine.

0.0,0.0 top left to window_width_pixels - 1.0,window_height_pixels - 1.0 would be more natural (still floating point for smooth transformations).

I agree. I almost want to say from (0.0, 0.0) to (1.0, 1.0), but as long as the engine is kept up to date on the window size (through callbacks, etc.), your suggestion is good.

Of course this clashes with gfx_directx's zoom feature, but I think it would be really useful.

I'm afraid to say it'd clash with the mouse input, too (which was an absolute beast to manage!). I think it's actually pretty easy to adjust now, but we definitely need callbacks for sizing, fullscreen, etc, as you stated. And actually, we could then get the engine to maintain aspect ratio for the frame presentation at that point (for backcompat)! :D

The Mad Cacti 05:16, 24 May 2011 (PDT):

I'm a little timid about exposing too much 3d capability to the engine, but we can create functions that create and fill vertex buffers to render with hardware transform and texturing.

Me too. Vertex buffers? I'm not really familiar with them. Would this mean that the GPU does the vertex transforming (which I don't think would help, since we want to calculate and store the screen space coordinates of all slices), or is this just a method to get vertex data to the GPU more efficiently than drawing individual quads -- and couldn't the backend do that itself?

I found PlgBlt(), which blits a bitmap image onto an arbitrarily transformed parallelogram.

Well as you said, it draws parallelograms. I was complaining about general equilaterals, which are much harder to draw.

As for platforms not supporting OpenGL, the GP32 is one which I wanted to target, though it's getting increasingly out of date, and the CPU may be too slow anyway (Edit: well, the 8MB of RAM is a bigger concern). Fringe OSes (eg. Haiku), or even Linux without the correct drivers (one of my computers until recently) may not support OpenGL. Also, the graphics backend abstracts a graphics library, in all probability not raw hardware, and there are many 2D graphics libraries we could currently potentially run on top of.

But the best argument for not ditching support for 2D-only graphics libraries is that we shouldn't suddenly replace old graphics backends. There will be problems, so we should try to keep old, currently working backends around for at least a release cycle. Interface functions which assume 3D hardware acceleration (by which I am mostly referring to the equilaterals) can be added later (or just not used).

It has issues with the homogeneous, normalized device coordinates (which I address later in this talk page)

Err... did you?

I'm afraid to say it'd clash with the mouse input

Ouch. Yes, it would be good to move much of the window management into the engine, but preferably none of these nightmares ;)

Hieran Del8 10:36, 24 May 2011 (PDT):

Vertex buffers? I'm not really familiar with them. Would this mean that the GPU does the vertex transforming

It can do the transforms, but the point of them here is to send a batch list of primitives to be drawn in a certain state. As an example, a 3d model of Bob the Hamster. Another example, 100 identical sprites (at least, from the same sprite sheet) being rendered on screen. Regardless of whether they're exposed to the engine or not, the backend will use them for sprite rendering.

I was complaining about general equilaterals, which are much harder to draw.

You mean like a rhombus? Isn't that a specialized type of parallelogram? Or do you mean n-sided equilaterals? None of the functions I'm proposing even consider a non-quad type. Is that a goal, though?

there are many 2D graphics libraries we could currently potentially run on top of.

I'm going to go out on a limb here, suggesting that all useful graphics libraries can render pre-transformed coordinates of quads (or at least triangles). So I don't think the engine should perform per-pixel transformations (except, of course, in the editors).

Err... did you?

I did, but I didn't say too much. :P It's my agreement with you on throwing out the device coordinates in favor of some measurement from (0,0) to (clientSize.x-1, clientSize.y-1). That just makes more logical sense since, as you suggested, other graphics libraries will be used, and they don't necessarily measure in device coordinates.

The mouse input won't be hard to adjust: I can just report actual client position of the cursor (0,0)-(clientSize.x-1, clientSize.y-1).

The Mad Cacti 01:44, 26 May 2011 (PDT): Oh, I see. It sounds like vertex buffers would be an obvious optimisation for displaying map layers, since the grid never changes.

You mean like a rhombus? Isn't that a specialized type of parallelogram?

(Oops, I meant to say quadrilateral, not equilateral.) Parallelograms are more specific: 3 vertices is exactly the amount of information needed to specify them (or one point and a 2x2 transform matrix), as opposed to 4 for quadrilaterals.
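
To spell that out (my own illustration): given three corners of a parallelogram, the fourth is forced, which is exactly why 3 vertices (or a point plus a 2x2 matrix) pin it down, while a general quadrilateral needs all 4.

 typedef struct { float x, y; } Vec2;

 /* The fourth corner of a parallelogram: v3 = v1 + v2 - v0. */
 Vec2 fourth_corner(Vec2 v0, Vec2 v1, Vec2 v2)
 {
     Vec2 v3 = { v1.x + v2.x - v0.x, v1.y + v2.y - v0.y };
     return v3;
 }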

I'm going to go out on a limb here, suggesting that all useful graphics libraries can render pre-transformed coordinates of quads (or at least triangles)

For example SDL (not counting actually using OpenGL) and fbgfx don't support any kind of transforming. Turns out allegro apparently does support blitting with 2D transforms (it's pretty advanced and can also be used with OpenGL or DirectX, while remaining very DOS-centric.) And it depends about which transforms we're talking about: rotation and scaling support is pretty ubiquitous (but in the case of SDL and fbgfx, only as 3rd party addons (software implementations)).

So I don't think the engine should perform per-pixel transformations

More confusion: I was talking about doing software transforms in the backends, I'm not sure what you mean by "engine".

I did, but I didn't say too much.

Heh, OK.

Hieran Del8 06:59, 26 May 2011 (PDT):

I meant to say quadrilateral, not equilateral.

Ok, you're right, GDI is not flexible in that respect, and SDL is not helpful at all there. Why isn't that at least implemented in SDL? Oh well... we'll have to implement our own.

For example SDL (not counting actually using OpenGL) and fbgfx don't support any kind of transforming.

We'll have to use OpenGL, or write our own pixel transforming rendering, which the engine does already. The proposal to use per-vertex transforms means all the backends must be rewritten. With appropriate planning and a timely development of interfaces, the implementation could be pretty quick. The SDL/OpenGL backend is going to be developed first because of its cross-platform goodness.

I was talking about doing software transforms in the backends, I'm not sure what you mean by "engine".

All the pixel rendering should take place in the backends, in case they can benefit from hardware acceleration. Otherwise, it'll be software rasterizing as it is now--just in the backends instead, which is in line with your statement. I got confused in the discussions, but initially I was stating that the pixel rasterization currently occurs in the engine, not the backends, which is the reverse of how it should eventually be. And this isn't in reference to the editors, where individual pixels are drawn onto a surface, etc. In that case, the engine will do the drawing to a temporary surface and update a backend surface with that data.

The Mad Cacti 02:24, 28 May 2011 (PDT):

Why isn't that at least implemented in SDL? Oh well... we'll have to implement our own.

Because it's very difficult to write such an (efficient) general purpose polygon renderer in software, and anyway it's way outside of SDL's purpose of providing the lowest-common-denominator. If you want to attempt an implementation, best of luck.

The proposal to use per-vertex transforms means all the backends must be rewritten.

I think that's a little harsh. Almost none of the existing code would have to be thrown out, rather it would require writing new functions.

And this isn't in reference to the editors, where individual pixels are drawn onto a surface, etc. In that case, the engine will do the drawing to a temporary surface and update a backend surface with that data.

Yes, we will keep all of the existing graphics functions in allmodex.bas, for the sprite editors and a couple other tasks.

Hieran Del8 16:43, 28 May 2011 (PDT):

If you want to attempt an implementation, best of luck.

Heh, it is challenging, but we could just model the pixel rasterization. It's heavily reliant on floating point, but there's a way to do it with integer decimals... What are those called again, where an integer represents a decimal increment? Well, it's using those. Actually, I'm quite intrigued in doing this.... dang it Ralph! Enticing me! :P

I think that's a little harsh. Almost none of the existing code would have to be thrown out, rather it would require writing new functions.

You're right, poor word choice on my part. 'Retrofit' would be a more fitting word.

The Mad Cacti 22:30, 28 May 2011 (PDT):

I've always wanted to code such a thing myself -- so I can't help but become totally distracted by this discussion. :) They're called fixed point numbers. And that's what I used to write my fast blitting routine (the task was to draw an 8-bit 100x100 rect by reading from a parallelogram-shaped portion of an 8-bit texture). First, here's a straightforward use of fixed point arithmetic in the inner loop:

   for (ox = 0; ox < 100; ox++) {
     *dest++ = *src;
     x_accul += x_dfrac;                    /* step the 8.8 fixed point x accumulator */
     src += (x_accul >> 8);                 /* whole-pixel carry in x */
     x_accul &= 0xff;                       /* keep only the fraction */
     y_accul += y_dfrac;
     src += sprite->pitch * (y_accul >> 8); /* whole-pixel carry in y: skip a row */
     y_accul &= 0xff;
   }

Here's a trick: if you use 8 bits for the fractional part and 8 bits for the integer part (use a 16-bit integer for x/y_accul), then on x86 the shifts can be accomplished by reading from 8 bit registers (ah, bh, etc), and C compilers are clever enough to do this! Here's my optimised version of the above:

 unsigned int offsets[2];
 offsets[0] = y_delta * sprite->pitch + x_delta;        /* per-pixel step without y carry */
 offsets[1] = (y_delta + 1) * sprite->pitch + x_delta;  /* per-pixel step with y carry */
   for (ox = 0; ox < 100; ox++) {
     *dest++ = *src;
     accul += dfrac;                               /* step both packed 8.8 accumulators at once */
     src += (accul >> 24);                         /* integer carry of the x accumulator */
     src += offsets[(unsigned char)(accul >> 8)];  /* y carry (0 or 1) selects the offset */
     accul &= 0xff00ff;                            /* keep both fractions */
   }

This loop compiles to 12 instructions, including the loop counter decrement and branch. I just timed functions using the above two blocks on this Athlon 7750 (tweaking optimisation options for maximum speed), and the top function draws the 10,000 pixels in 131,000 cycles, and the bottom runs in 49,000 cycles, including all overheads. In comparison, a version which uses SSE and draws 4 pixels at once runs in 37,000 cycles: but my code's not too efficient, it's written in C with xmmintrin.h instead of assembly (and this old GCC version isn't too good at optimising that), and it's crippled by not using SSE2 (SSE/MMX can't read from non-consecutive memory locations in parallel, so you still need to draw one pixel at a time). Anyway, you can find all my code and 3 other slower implementations here: [1]

Anyway, that's the easy part. :) The hard part is drawing quadrilaterals, and all the edge calculations, bilinear interpolation, and edge cases that involves.

Hieran Del8 11:06, 31 May 2011 (PDT):

Wow. Frankly, that's so far out of my region of understanding that, even after I implemented 32bit fixed point integer mathematics, built a texture sampler, and a rasterizer, I still can't understand what's going on in that code. In fact, it's so far out I'm afraid of actually finding out why it's fast. :P

Anyway, I did write up a texture sampler and rasterizer for software implementations of the hardware acceleration of arbitrary transformations (I've yet to test it, though!). The restriction with rendering, however, is that the quads cannot be concave. But I want to allow concave quads, so I'm sketching an intermediate approach--basically it involves breaking a quad into 4 triangles that all share a common generated "center" vertex, as sketched below. Before, I was breaking the quads into 2 triangles, but there's a graphical error with that. The 4-triangle approach will fix it all, just leads to 4 draws per quad (not too much more expensive than 2 draws, though, since the rasterizer only evaluates pixels that will be drawn to).
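
Concretely, the split would look something like this (a sketch with made-up names, not the committed code):

 typedef struct { float x, y; } Vec2;

 void split_quad(const Vec2 q[4], Vec2 tris[4][3])
 {
     /* Generated "center" vertex shared by all 4 triangles. */
     Vec2 c = { (q[0].x + q[1].x + q[2].x + q[3].x) * 0.25f,
                (q[0].y + q[1].y + q[2].y + q[3].y) * 0.25f };
     for (int i = 0; i < 4; i++) {
         tris[i][0] = q[i];
         tris[i][1] = q[(i + 1) % 4];  /* each edge of the quad... */
         tris[i][2] = c;               /* ...plus the shared center */
     }
 }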

Is there a better approach to concave quads?

The Mad Cacti 00:28, 1 June 2011 (PDT):

Done already? Great. I'd be very happy to attempt to optimize your code. :)

That second loop sticks two 16 bit (each with 8 bits fractional) accumulators into one 32 bit word. The number of pixels moved in the x direction each step is x_delta + x_frac/256, and in the y direction is y_delta + y_frac/256, and dfrac is equal to x_frac<<16 + y_frac. (I just noticed that 'accul' is a spelling mistake: I meant to write 'accum'!) The "(unsigned char)(accul >> 8)" is equivalent to "(accul >> 8) & 0xff" but is the magic sauce that tells the compiler it can just read an 8 bit register. The lookup table is to save one multiplication and one addition.
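
In other words, the declarations that loop assumes would be along these lines (reconstructed from the description above, not copied from my original file; x_frac and y_frac are the fractional step components):

 /* Two 8.8 fixed point accumulators packed into one 32-bit word:
    the x accumulator occupies bits 16..31, the y accumulator bits 0..15. */
 unsigned int dfrac = ((unsigned int)x_frac << 16) + y_frac;
 unsigned int accul = 0;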

Even if a quad is convex, the two ways of splitting it into two triangles can give very different results. Were you referring to that at all, or only to the fact that for concave quads, only one of these splittings is valid (which you could work around by checking which splitting is valid), or something else?

I'm not sure how textured polygons are normally split into textured triangles to get around these problems. I'd have to do some research to find out.

Hieran Del8 08:19, 1 June 2011 (PDT):

Done already? Great. I'd be very happy to attempt to optimize your code. :)

I've submitted it to the repository. fpInt.h is the fixed point integer math. The rasterizer is still untested, but you can see what I've got so far in rasterizer.h and rasterizer.cpp. surface.h defines structures important to a surface.

Regarding the FPInt's, if you can optimize it, that'd be great. I don't want to lose the template overloads, if that's possible. I definitely don't want to lose the 16bit precision on the fraction. Other than that, tear it apart if you like. :P

Interesting about the optimization in your rasterization procedure, though if you're familiar with the low-level code, why not code assembly?

Even if a quad is convex, the two ways of splitting it into two triangles can give very different results. Were you referring to that at all,

Yes, that point exactly. The solution as I can tell is simply generating 4 triangles for the quad. The overhead for drawing 4 instead of 2 is not much greater.

The Mad Cacti 01:47, 5 June 2011 (PDT):

Regarding the FPInt's, if you can optimize it, that'd be great. I don't want to lose the template overloads, if that's possible.

Template overloads? You mean the templated operator overloads?

That's an odd statement; there's nothing to optimise in FPInt (except for possibly the division function?). I finally had a look through all of your code, and was pretty surprised at how many features it supports (and you'll add transparency as well?). It would be time better spent to only optimise cases which will actually be used (rasterTexture, ignoring rasterColor and rasterTextureColor). Tex2DSampler::sample will be the focus of attention. Transparency might be best done with a separate simple pass which is more amenable to (automatic, hopefully) vectorisation.

Interesting about the optimization in your rasterization procedure, though if you're familiar with the low-level code, why not code assembly?

It was an exercise in translating assembly to C :)

The overhead for drawing 4 instead of 2 is not much greater.

Actually I would not be surprised if I manage to optimise rasterTexture to the point that most of the time is spent in calculateRasterPixels. I've seen people go to a lot of trouble to avoid calculating that directly (evaluating the 3 halfspace functions over the bounding rectangle of the triangle and using them as a bitmask).

Hieran Del8 08:59, 8 June 2011 (PDT):

you'll add transparency as well?

Oh yeah.. I forgot about that! Should be pretty easy to add.

You mean the templated operator overloads?

Yes, that's what I meant.

there's nothing to optimise in FPInt

Oh. Well, good. :)

only optimise cases which will actually be used (rasterTexture, ignoring rasterColor and rasterTextureColor)

Hmm, after thinking about this for a while, I agree that the other two aren't really useful. That will save several interpolation calculations per line! Of course, the intent of rasterTextureColor can be combined with rasterTexture, that is, allowing an argb modifier on the pixels. Probably better if that data wasn't interpolated between every vertex.

Tex2DSampler::sample will be the focus of attention

Ah, so that's the slowest part? If you can improve it, that'd be great. I'll be focusing on removing the extra baggage being interpolated, namely the color and position. Color should be a modifier on the whole object (not individual vertices), and position should be set--not interpolated as it is right now.

Mike mentioned getting pixel shaders. They'd operate on each pixel, so any operation here would tremendously impact performance. The 'shader' that's running now is simply sampling the texture. A simple shader could be of the form:

Color TriRasterizer::psNegative(const TexCoord& tex, const Surface* pTexture)
{
   return ( 0xffffffff - m_sampler.sample(pTexture, tex.u, tex.v) );
}

What do you think? Then shaders could be callable by the plotscripts.

The Mad Cacti 04:10, 9 June 2011 (PDT):

Ah, so that's the slowest part? If you can improve it, that'd be great.

Specifically, manually inlining it into the raster functions. A quick timing showed that for a 32x32 quad, it's spending nearly half the time in sample (the FPInt math seems to translate into a surprisingly large number of instructions).

Hmm, after thinking about this for a while, I agree that the other two aren't really useful. That will save several interpolation calculations per line!

They are useful. It immediately occurs to me that per-vertex colors would be great for drawing first-person perspective dungeons. So I suggest keeping everything you currently have in case we end up exposing more of these flexible drawing operations to scripting other than just rotation and scaling of slices. I imagine it would be best added as an additional 'Quad' slice type. And I imagine it'll happen once everyone gets used to the idea :)

However it's quite true that if we're not actually using the per-vertex colour data, interpolating them is a waste of time. So I suggest templating the relevant classes by Vertex type! We can have Vertex variants with and without colour information. By implementing the interpolation with Vertex operator overloads, you might not need to specialise anything else.
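
A rough shape of what I mean (a sketch, not code from your repository):

 struct VertexPos {
     float x, y;
 };
 inline VertexPos operator+(VertexPos a, VertexPos b)
     { VertexPos r = { a.x + b.x, a.y + b.y }; return r; }
 inline VertexPos operator*(VertexPos a, float t)
     { VertexPos r = { a.x * t, a.y * t }; return r; }

 struct VertexPosCol {
     VertexPos pos;
     float r, g, b, a;
 };
 inline VertexPosCol operator+(VertexPosCol p, VertexPosCol q)
     { VertexPosCol v = { p.pos + q.pos, p.r + q.r, p.g + q.g, p.b + q.b, p.a + q.a }; return v; }
 inline VertexPosCol operator*(VertexPosCol p, float t)
     { VertexPosCol v = { p.pos * t, p.r * t, p.g * t, p.b * t, p.a * t }; return v; }

 // The interpolation itself is written once; instantiating it with
 // VertexPos skips all the colour work at compile time.
 template <class V>
 V lerp(const V& a, const V& b, float t) { return a * (1.0f - t) + b * t; }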

Of course, per-quad tints are also useful, and also faster, but still a lot of overhead if you're not using it. One possibility could be to split transparency and tinting into a separate pass over each raster line (so, draw the raster line to a temporary buffer, and then draw that onto the image). This makes sense since they are both very similar highly vectorisable operations (modern C++ compilers really should be able to do the vectorisation automatically). If a quad has either transparency or tinting, then the 2nd pass can be performed. Far less work than coding multiple versions, and still probably almost as fast.
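
The second pass itself would then just be one tight loop per raster line, something like this (made-up names; alpha here is 0..256):

 /* Blend one buffered raster line onto the destination, one colour
    component (byte) at a time; simple enough to auto-vectorise. */
 void blend_line(unsigned char* dst, const unsigned char* line,
                 int nbytes, unsigned alpha)   /* alpha: 0..256 */
 {
     for (int n = 0; n < nbytes; n++)
         dst[n] = (unsigned char)((line[n] * alpha + dst[n] * (256 - alpha)) >> 8);
 }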

Oh, one more thing: what about colour-keyed transparency (colour 0)? It looks like this is currently missing.

Then shaders could be callable by the plotscripts.

I'm not sure what you mean. That you could specify a shader for a 'Quad' slice? Would there be a small library of available shaders (in which case they can be portable across all backends), like inverting colours and Phong-like shading? Could they include smoothing algorithms (similar to hq4x, but something intended to clean up scaled/transformed pixel art - the smoother in gfx_sdl works like this: it's applied to a scaled image rather than scaling it itself)? I admit, I know very little about pixel shaders.

Hieran Del8 22:02, 9 June 2011 (PDT):

A quick timing showed that for a 32x32 quad, it's spending nearly half the time in sample (the FPInt math seems to translate into a surprisingly large number of instructions).

Hmm. I don't think the assembly for sampling should be written in the raster functions, but if it was coded in assembly, it'd be appropriate to be encapsulated in a function, and perhaps that function set to inline. Maybe that's what you were saying... such is the ambiguous nature of my communication skills. :P

It immediately occurs to me that per-vertex colors would be great for drawing

Ah. I just deleted it all, but it can be reconstructed. However, your suggestion of a so-called "flexible vertex format" is an interesting idea, and would definitely be more efficient than interpolating all data. It'd require a rewrite of a lot of the code anyways.

So I suggest templating the relevant classes by Vertex type! We can have Vertex variants with and without colour information. By implementing the interpolation with Vertex operator overloads, you might not need to specialise anything else.

That's a brilliant idea! I like the notion, but honestly, I'm going back and forth on using it. I've spent literally the whole day thinking about it (well, and thinking of other things too :P). Would you like to suggest a possible pseudo-implementation, so that I can see what you have in mind?

I think there's probably 4 vertex formats we'd readily use:

  • Position
  • Position and Color
  • Position and Texture
  • Position and Texture and Color

The rasterizer, in calculating raster lines, only needs positional data. For shading each pixel, the other data would be necessary. I'll consider this later (in time, not here).

split transparency and tinting into a separate pass over each raster line
Would there be a small library of available shaders
like inverting colours and Phong-like shading? Could they include smoothing algorithms

Sounds good, yes, and yes. Alpha blending (transparency other than color-key) should be dealt with separately, but not in a separate buffered pass. It'll operate per-successful pixel (one not clipped by color-keying or some masking tests). Also, it'll only work on 32bit render targets, not 8bit.

I'd like to make available a small set of shaders: perhaps we can talk about that on the mailing list.

The Mad Cacti 06:45, 10 June 2011 (PDT):

Hmm. I don't think the assembly for sampling should be written in the raster functions, but if it was coded in assembly, it'd be appropriate to be encapsulated in a function, and perhaps that function set to inline.

Highly optimised code often doesn't resemble the original code. That's the nature of things. Always keep the original around, commented out, to compare!

Alpha blending (transparency other than color-key) should be dealt with separately, but not in a separate buffered pass. It'll operate per-successful pixel

Oops, I forgot about colour-keying. I'm not familiar with all the blending modes, but for all the ones that I know, it would still be possible to buffer a line before blending (if that does turn out to be significantly faster) by using a certain value for excluded pixels which causes no effect. I suspect this is false for more exotic ones? And if multiply is implemented as r1 * r2 / 256 instead of r1 * r2 / 255, for example, you're in (a small but probably visible amount of) trouble. Anyway, by "separately", I assume you mean "not in the sampler/shader".
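
For reference, there's a standard shifts-and-adds identity that rounds exactly as /255 does (a well-known trick, not something from our code): dividing by 256 decays full brightness (255*255/256 = 254), while this stays exact:

 /* r1 * r2 / 255, exactly rounded, with only shifts and adds. */
 unsigned mul_div255(unsigned r1, unsigned r2)   /* r1, r2 in 0..255 */
 {
     unsigned t = r1 * r2 + 128;
     return (t + (t >> 8)) >> 8;
 }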

The rasterizer, in calculating raster lines, only needs positional data.

So the bare positional Vertex could be a superclass of the others, with non-virtual operator overloads. That seems quite convenient.

Ah. I just deleted it all, but it can be reconstructed. ... It'd require a rewrite of a lot of the code anyways.

Yes, templating may allow merging all the raster* and draw* functions anyway.

Would you like to suggest a possible pseudo-implementation, so that I can see what you have in mind?

I'm not sure what you have in mind. :) Implementation of what? Operator overloads? Interpolation code? For the latter, see the sample code I posted on a mailing list a couple days ago (which wasn't pseudocode).

Hieran Del8 16:35, 10 June 2011 (PDT):

still be possible to buffer a line before blending

I've never looked into vectorization before, but now studying it, there's a lot of potential optimizations to make. And now I understand what you were getting at. :)

In hardware, the pixel processing is usually done across multiple ROP units. The color ROP units are what we're imitating: if it's possible to divide the workload by giving each "ROP" unit a line to raster, then when a unit finishes its line, it can request a new one to process. The requests for new lines to process would be the bottleneck with critical sections. With that in mind, I think the blending should still be done as per-successful pixel. So not multiple passes.

And if multiply is implemented as r1 * r2 / 256 instead of r1 * r2 / 255, for example, you're in (a small but probably visible amount of) trouble.

I don't know what you're talking about. Alpha blending 8bit surfaces? No way. It wouldn't make any sense.

Anyway, by "separately", I assume you mean "not in the sampler/shader".

Yes, that's what I meant. :)

templating may allow merging all the raster* and draw* functions anyway.

Before we start templating these, I'd like to shift the focus to this vectorization you were talking about. Have you done much with it? My only experience is a brief study of compute shaders, and what I've now pulled up from wikipedia.

Implementation of what? Operator overloads? Interpolation code?

Not important to me right now, but I was asking about how you would design it, like struct Vertex is the base, with only positional data. These members get implemented. Etc.

The Mad Cacti 23:06, 10 June 2011 (PDT):

With that in mind, I think the blending should still be done as per-successful pixel.

I don't see why the implementation of hardware-accelerated versions of the shaders should affect the software implementation, unless you prefer to make their code as similar as possible.

But I'm not really attached to this separate pass, it was just an idea.

The requests for new lines to process would be the bottleneck with critical sections.

I'm lost here. Are you talking about GPU or software shaders? For the former, isn't the GPU responsible for assigning pixels to ROPs, and for the latter, doesn't that imply you're talking about multithreading?

I don't know what you're talking about. Alpha blending 8bit surfaces? No way.

Sorry, by r1, r2, I meant the 8 bit red components of the source and destination pixels. Clearly it was unreasonable to omit that explanation!

I'd like to shift the focus to this vectorization you were talking about. Have you done much with it?

OK, I admit, nearly nothing: that C rasterisation code I linked to up this page is pretty much the limit. It uses xmmintrin.h, which contains function definitions which are translated into individual MMX and SSE instructions. I'm not fond of actually writing assembly, I just enjoy reading through assembly manuals, and feel somewhat familiar with the instruction set.

If you want the compiler to automatically vectorize, you need to specify a minimum CPU model on which your code will run which actually has some vector instructions. Pentium 4 introduced SSE2, for example. The default is normally only a 486! The Intel C Compiler can apparently automatically generate different versions of functions for different CPUs (but watch out, they've been caught purposely generating slow code for AMD CPUs before!), but with other compilers if you want to remain portable, you have to manually compile and link multiple versions and use some runtime CPU detection code.
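
The dispatch part is only a few lines, e.g. with GCC (a sketch: __builtin_cpu_supports only exists in newer GCC versions, older compilers need a hand-rolled cpuid check, and the two blend_line variants are assumed to be compiled separately):

 typedef void (*blend_fn)(unsigned char*, const unsigned char*, int);

 void blend_line_c(unsigned char* dst, const unsigned char* src, int n);
 void blend_line_sse2(unsigned char* dst, const unsigned char* src, int n);

 blend_fn pick_blend(void)
 {
 #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
     if (__builtin_cpu_supports("sse2"))
         return blend_line_sse2;   /* compiled separately with -msse2 */
 #endif
     return blend_line_c;          /* portable fallback */
 }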

Hieran Del8 07:05, 11 June 2011 (PDT):

I don't see why the implementation of hardware-accelerated versions of the shaders should affect the software implementation, unless you prefer to make their code as similar as possible.

That's not what I meant. I was working on the ROP units and the ROP controller. I think I understand why you were recommending storing lines in buffers, then performing passes on them. The issue with that is the first pass doesn't have any data--it generates it from an interpolation. So only the second pass would benefit from adjacent memory location lookups. That being the case, it makes more sense to perform all operations per-pixel instead of buffering them. (And since we won't be doing complex shaders, all the easier.)

Are you talking about GPU or software shaders? For the former, isn't the GPU responsible for assigning pixels to ROPs, and for the latter, doesn't that imply you're talking about multithreading?

Software. And yes, I was referring to multithreading. The ROP units, of which I think there should be 4, will each be given a raster line from the ROP controller to process. Once they complete it, they can notify the controller of the completion. In this way, the controller can assign tasks to idle ROP units while other ROP units may still be busy.

Another approach is to fill all ROP units with tasks, and wait for all of them to complete before setting new tasks for all the ROP units.

Which approach works better for vectorization? Is vectorization the same thing as multithreading?

The Mad Cacti 18:43, 12 June 2011 (PDT): No, multithreading and vector operations are completely orthogonal methods of parallelism.

The issue with that is the first pass doesn't have any data--it generates it from an interpolation. So only the second pass would benefit from adjacent memory location lookups.

Yes. But it should be computation-bound, not IO-bound, so I'd still guess there would be some benefit, but I can't be sure. MMX would allow blending two pixels (6 colour components) at once, and SSE2 would allow blending four pixels (12 colour components) at once, instead of just vectorising the blending of a single pixel. However... it might be possible to vectorise the entire raster loop, including sampling (sampling, blending, and drawing four pixels at once).
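To make the shape of that concrete, a 50/50 average of four pixels at a time needs just one SSE2 instruction per group (a sketch only; a real alpha blend would unpack each pixel to 16-bit lanes, multiply by alpha, and repack):

#include <emmintrin.h>  // SSE2 intrinsics

// Average src into dest, four 32-bit pixels (16 bytes) per iteration, using
// PAVGB. Assumes n is a multiple of 4 and both pointers are 16-byte aligned.
void averagePixels( unsigned *dest, const unsigned *src, int n )
{
    for ( int i = 0; i < n; i += 4 )
    {
        __m128i d = _mm_load_si128( (const __m128i *)(dest + i) );
        __m128i s = _mm_load_si128( (const __m128i *)(src + i) );
        _mm_store_si128( (__m128i *)(dest + i), _mm_avg_epu8( s, d ) );
    }
}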

Software. And yes, I was referring to multithreading. The ROP units, which I think there should be 4, will each be given a raster line from the ROP controller to process.

Wow, OK. I'm worried that one raster line is too small a unit of work. Many sprites in the OHR are pretty small. For example, what if you're drawing a map layer zoomed out a bit so that each tile is 12x12? I don't know how much the overhead of the communication will be, but I assume it'll be nontrivial. (The overhead of drawing a zoomed out map layer looks like it's going to be huge anyway.)

Also, I'm worried that multi-core CPUs are waaaay above our minimum system requirements even if those go up significantly, so multithreading is not really going to help the machines that need it most. Also, on virtual cores of hyperthreaded CPUs (as opposed to true multicore CPUs, which have only been available a few years), the two threads would be fighting over execution units (especially vector processing units), so things slow down considerably. If you want to work on it, go ahead, but we will probably want to disable multithreading on CPUs without (true) multiple cores.

Hieran Del8 07:56, 13 June 2011 (PDT):

No, multithreading and vector operations are completely orthogonal methods of parallelism.

Ah. Then ignore the discussion of multithreading.

four pixels (12 colour components) at once, instead of just vectorising the blending of a single pixel

Hmm. The more I think about it, the more I agree that maybe the vectorization should be on a raster pixel--not a raster line. So 2 or 4 pixels could be vectorized. I'd like to use MMX, since that's more compatible. If you think SSE2 would be compatible enough, maybe we can take advantage of that. (Of course, I have no idea how this will affect PowerPCs.)

The overhead of drawing a zoomed out map layer looks like it's going to be huge anyway.

Yes. Any surfaces rendered at 320x200 will be more expensive--I just hope the vectorization could make it less so.

The Mad Cacti 16:09, 13 June 2011 (PDT):

The more I think about it, the more I agree that maybe the vectorization should be on a raster pixel--not a raster line. So 2 or 4 pixels could be vectorized.

I'm not clear what you mean, but I guess you mean sampling/writing multiple pixels in parallel.

We can definitely assume MMX on x86 at this point, and maybe SSE too. (Recently someone said they were using an AMD K6-2, and complained that the engine was really slow. I think all K6-2s and Pentium 2s -- max 550MHz -- (the last CPUs not supporting SSE) are probably too slow to run the engine at reasonable speed. But I guess hardware accelerated graphics could make these reasonable again.) MMX is missing lots of instructions you would hope to have; IMO SSE2 (which adds integer support to SSE, superseding MMX) seems to be the point at which you can actually get things done without having to continually transfer results back to non-vector registers.

But I'd like to run different versions of the code on different CPUs, mostly for the benefit of Atoms, which are awfully slow but support lots of recent vector instruction sets. We should always keep a CPU-independent version around for portability. I have to say though, gcc 4.4's automatic vectorisation on x86 seems disappointingly unable to do much, while on x86-64 it's pretty impressive. So we might have to hand-code the vectorisation on x86.

Any surfaces rendered at 320x200 will be more expensive

Actually, I was imagining the overhead of drawing each tile as a quad using the rasterizer -- 64k pixels should be no problem for the rasterizer. Vectorising over multiple pixels at once even adds extra overhead, which will be significant if a raster line is only 3 pixels long. Even at an enormous 200 cycles/pixel, a 1GHz CPU would draw 64k pixels in 13ms, low enough for 60fps.

Anyway, a way around this could be to instead render the map layer scaled (so that you don't end up rendering a 3200x2000 surface if that's how much of the map you're viewing) but unrotated to a temporary surface somewhat larger than the screen resolution, and then use the rasterizer to rotate it and apply transparency. But then you need to write special purpose fast scaling-only blitting code :( (that wouldn't happen to be gfx_SurfaceStretch, would it?). Or alternatively, we could generate a mipmap of the tileset in advance and render the map layer unscaled from that, so that the temporary surface would be not much wider/taller than twice the screen resolution. (The mipmap would also make the scaling look better.) But while this would have to be done partially in the engine, obviously you would not want to do this when using hardware accelerated quad rendering.
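In pseudo-ish code, the two-pass idea would be something like this (sizes and names illustrative; drawRotatedQuad() stands in for whatever the rasterizer's quad entry point turns out to be):

// Pass 1: scale (but don't rotate) the visible part of the map layer onto a
// temporary surface somewhat larger than the 320x200 screen.
Surface *tempLayer;
gfx_surfaceCreate( 400, 260, SF_32bit, SU_RenderTarget, &tempLayer );
SurfaceRect visible = computeVisibleMapRect( zoom );  // hypothetical helper
gfx_surfaceStretch( &visible, mapLayerSurface, 0, tempLayer, masterPalette );

// Pass 2: rotate tempLayer onto the screen as a single textured quad,
// applying transparency at the same time.
drawRotatedQuad( tempLayer, angle, screenSurface );

gfx_surfaceDestroy( tempLayer );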

Hieran Del8 19:34, 13 June 2011 (PDT):

I guess you mean sampling/writing multiple pixels in parallel.

Oh, you know what, I'm mixing up some concepts--I'm thinking about compute shaders, which run several thousand streams of code simultaneously. That's not what MMX or SSE does, right? If so, I'm back to not having a clear plan about using it. I think I've got to focus on first getting this to work, then optimize.

I was imagining the overhead of drawing each tile as a quad using the rasterizer

Since the bottleneck seems to be more the pixel calculation, not as much the raster line boundary calculation, I think the overhead will be more logarithmic in its demand on the CPU as the number of tile quads increases and quad sizes decrease. (And this time, I really do mean logarithmic ;) ).

a way around this could be to instead render the map layer scaled [...] but unrotated to a temporary surface somewhat larger than the screen resolution, then use the rasterizer to rotate it and apply transparency.

I think I understand: you want to use a faster blitting method (like gfx_surfaceCopy() or gfx_surfaceStretch()) and then sample that surface in a single rendered quad. That would be way more efficient than using the rasterizer to draw every tile as a separate quad, for instance. Even in hardware, I think it'd be faster.

need to write special purpose fast scaling-only blitting code :( (that wouldn't happen to be gfx_SurfaceStretch, would it?)

Yes, it is. :) It's a simple digital differential analyzer (DDA), though I've yet to write it. I think you've done something like this already--care to write it?
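The core of it is tiny -- something like this, with 16.16 fixed-point stepping instead of a divide per pixel (a sketch of the idea, not the final gfx_surfaceStretch):

// Nearest-neighbour stretch of a srcW x srcH 32-bit image into destW x destH.
void stretchBlit32( const unsigned *src, int srcW, int srcH,
                    unsigned *dest, int destW, int destH )
{
    unsigned xStep = ((unsigned)srcW << 16) / destW;  // source step per dest pixel
    unsigned yStep = ((unsigned)srcH << 16) / destH;
    unsigned ySrc = 0;
    for ( int y = 0; y < destH; y++, ySrc += yStep )
    {
        const unsigned *srcRow = src + (ySrc >> 16) * srcW;
        unsigned xSrc = 0;
        for ( int x = 0; x < destW; x++, xSrc += xStep )
            dest[y * destW + x] = srcRow[xSrc >> 16];
    }
}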

generate a mipmap of the tileset in advance and render the map layer unscaled from that

Sampling does that already. I don't think hardware or software would benefit from mipmaps, honestly (unless we were doing linear pixel interpolations).

Hieran Del8 22:34, 13 June 2011 (PDT):

(And this time, I really do mean logarithmic ;) )

Doh! I meant exponential! Like time per frame is dependent on the size and number of quads:

f(s, n) = (s*n)^1.25 + n^1.85

And yes, I'm just pulling these numbers out of thin air. :P

The Mad Cacti 03:12, 14 June 2011 (PDT): No, actually you mean linear :). I think it's obvious, but here's a detailed argument anyway :). The overhead time (other than drawing pixels) for a single quad can be modelled as c1 + c2 * num_raster_lines. The time to rasterise (draw pixels for) a single quad can be modelled as c3 * num_pixels + c4 * num_raster_lines (the c4 term is per-line overhead, eg. popping a DrawingRange). If you wish, you might argue that smaller raster lines mean more L1 cache misses, but this is a constant factor because all textures will fit in L2 cache (and IMO this would be a really small effect; there's still lots of data locality). The total number of raster lines is at most (drawing at a 45 degree angle) sqrt(2) * screen_height * num_quads_across_screen, since every screen row crosses each column of quads. Allow constant overhead c6 for drawing a map layer. So, as a function of the zoom,

num_quads_across_screen = c5 * zoom
t(zoom) = num_quads * (c1 + c2 * num_raster_lines + c3 * num_pixels + c4 * num_raster_lines) + c6
        = (c5 * zoom)^2 * c1 + c3 * total_num_pixels + (c2 + c4) * total_num_raster_lines + c6
        <= c1 * c5^2 * zoom^2 + c3 * total_num_pixels + (c2 + c4) * sqrt(2) * screen_height * c5 * zoom + c6

The total number of pixels is constant, so clearly the total time is O(zoom^2), ie. quadratic in the zoom, and therefore linear in the number of quads (which is itself quadratic in the zoom).

That's not what MMX or SSE does, right?

Vector instructions are parallel only across the width of a vector register (or rather, across its subregisters): 64 bits for MMX and 128 bits for SSE/SSE2 (the new AVX extensions widen that to 256 bits; x86_64 doesn't widen the registers, it just doubles their number to 16).

Since the bottleneck seems to be more the pixel calculation, not as much the raster line boundary calculation

I admit that I only performed a very crude timing, and that was before the recent changes, and I can't remember the results anyway :). But it was certainly a large fraction, though it may in fact be easier to speed up calculateRasterLines than the rasterisation itself, who knows.

Yes, it is. :) It's a simple digital differential analyzer (DDA), though I've yet to write it. I think you've done something like this already--care to write it?

Oh, I figured that this function would be hardware accelerated on DirectX/OpenGL backends (by drawing quads), thus making it useless for speeding up map layer rendering.

Yes, I'd enjoy writing that. I'm concentrating on writing my dissertation, but that has the effect that I'm avoiding all forms of procrastination except checking my emails and replying here :) (Actually, I'm spending waaay too much time writing this, as you can tell!) This should be a really quick job (quicker than this reply :( ), I'll just adapt that rasteriser code I posted earlier; and maybe even include the MMX version. I can optimise it further later when I have the luxury of time.

Sampling does that already. I don't think hardware or software would benefit from mipmaps, honestly (unless we were doing linear pixel interpolations).

The intended benefits were, firstly, that you could do a plain unscaled blit instead of sampling (however, for a mipmap with 2x step size, that means the temporary surface could be up to 4 times larger than the screen (modulo rotation), so blitting would have to be over 4 times faster than sampling for this to be faster. You could use a smaller step size though.) The second was, yes, interpolation. If you look at the in-game minimap of a large map (hint: F1 debug key), it looks pretty ugly (like high frequency noise). I've strongly considered implementing tileset mipmaps before, just to get better looking minimaps. Doing the interpolation only when generating the mipmap and not in the rasteriser proper is a halfway step, but I imagine it would be a pretty decent improvement. (We could think about (maybe optional) quick and dirty interpolation in the rasteriser too, if it turns out we have CPU cycles to spare.)
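Generating one mipmap level is just a box filter, by the way -- something like this for a 32bit surface (a sketch; colour-key transparency would presumably need special handling rather than being averaged like an ordinary channel):

// Halve a 32-bit image by averaging each 2x2 block of pixels (one mipmap
// step). Assumes even width and height; pixels are 4 bytes (e.g. BGRA).
void halveImage32( const unsigned char *src, int w, int h, unsigned char *dest )
{
    for ( int y = 0; y < h / 2; y++ )
        for ( int x = 0; x < w / 2; x++ )
            for ( int c = 0; c < 4; c++ )
            {
                int sum = src[ ( 2*y     * w + 2*x     ) * 4 + c ]
                        + src[ ( 2*y     * w + 2*x + 1 ) * 4 + c ]
                        + src[ ( (2*y+1) * w + 2*x     ) * 4 + c ]
                        + src[ ( (2*y+1) * w + 2*x + 1 ) * 4 + c ];
                dest[ ( y * (w/2) + x ) * 4 + c ] = (sum + 2) / 4;  // rounded average
            }
}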

Hieran Del8 08:38, 14 June 2011 (PDT):

No, actually you mean linear :).

Alright, alright, you win! :P I was thinking of a curve, and a quadratic equation fits that description quite well. I'm amused by the detailed analysis--thanks for taking the time to write it up. :)

Vector instructions are parallel only across the width of a vector register (or rather, across its subregisters).

Ah, ok. So can multiple operations occur in a stream? I'm afraid I don't know enough about assembly to conclude that myself. I wish I did, but... well...

I figured that this function would be hardware accelerated on DirectX/OpenGL backends (by drawing quads), thus making it useless for speeding up map layer rendering.

An advantage with it is the device context isn't needed for the stretching (unlike quad rendering). Hardware quad rendering would be faster, since thousands of stream processors and multiple specialized color ROP units would be running and writing simultaneously. I also checked the DirectX API again, and it looks like a source surface cannot be a target for this method. It makes some sense that a surface receiving this operation is a render target, but it would have been nice to target a source surface, too.

Yes, I'd enjoy writing that.

Great. No hurry, just when you're available.

firstly, that you could do a plain unscaled blit instead of sampling

That would definitely be faster until some threshold of zoom level. The stretching algorithm might make up for that, though, especially since the size of the image would not need to be as large.

The second was, yes, interpolation. If you look at the in-game minimap of a large map (hint: F1 debug key), it looks pretty ugly (like high frequency noise).

Ok. Obviously only 32bit surfaces are being targeted, but I would be happy with this. Perhaps something akin to:

Surface *miniMap, *miniMap_midSurface;

gfx_surfaceCreate( 100, 100, SF_32bit, SU_RenderTarget, &miniMap );
gfx_surfaceCreate( 200, 200, SF_32bit, SU_RenderTarget, &miniMap_midSurface );

SurfaceRect placement;

// map_width/map_height are the map size in tiles; tile[] holds each tile's
// source rect within the tileset surface tileMap.
for ( int i = 0; i < map_height; i++ )
   for ( int j = 0; j < map_width; j++ )
   {
      // compute each edge directly, to avoid the rounding drift (and
      // integer-division-to-zero bug) of a precomputed wScale/hScale
      placement.left   = (200 * j) / map_width;
      placement.top    = (200 * i) / map_height;
      placement.right  = (200 * (j+1)) / map_width;
      placement.bottom = (200 * (i+1)) / map_height;
      gfx_surfaceStretch( &tile[i * map_width + j], tileMap, &placement, miniMap_midSurface, masterPalette );
   }

// shrink the 200x200 intermediate down to the final 100x100 minimap
// (null rects mean the whole surface)
gfx_surfaceStretch( 0, miniMap_midSurface, 0, miniMap, 0 );

gfx_surfaceDestroy( miniMap_midSurface );

If gfx_surfaceStretch() were expanded to include an argument for how to interpolate, such as:

gfx_surfaceStretch( surfaceInterpolation, pRectSrc, pSurfaceSrc, pRectDest, pSurfaceDest, pPalette )

then the first time gfx_surfaceStretch is called (with the tiles) it could be SI_Point (as in point sampling, as it's doing now), and the last time could be SI_Linear. I suppose we could apply this to all the rendering methods per your suggestion, but it'd only work with 32bit targets. What do you think?
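So the two stages of the minimap example above would become:

// Tiles into the 200x200 intermediate: plain point sampling, as now.
gfx_surfaceStretch( SI_Point, &tile[i * map_width + j], tileMap,
                    &placement, miniMap_midSurface, masterPalette );

// Final shrink to the 100x100 minimap: linear interpolation.
gfx_surfaceStretch( SI_Linear, 0, miniMap_midSurface, 0, miniMap, 0 );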

The Mad Cacti 23:47, 17 June 2011 (PDT):

Ah, ok. So can multiple operations occur in a stream?

A little unclear on what is meant by 'stream'. Whole 64/128(/256) bit vector registers are normally operated on at once. The number of calculations which can happen concurrently is limited by the number of vector execution units on the chip. Each instruction creates a one-register result. These units seem to take up a vast number of transistors on the die, so it seems there's usually only one of each. "Intel's literature states that Core can, for example, execute a 128-bit packed multiply, 128-bit packed add, 128-bit packed load, 128-bit packed store, and a macro-fused cmpjcc (a compare + a jump on condition code) all in the same cycle." Interestingly, Core was the first Intel CPU to actually have 128 bit instead of 64 bit vector units (in other words, on older CPUs throughput for SSE was not much better than MMX: even the most trivial operation took two cycles). However, the SSE2+ instruction set is much better than MMX, and you can use the FPU at the same time.

An advantage with it is the device context isn't needed for the stretching (unlike quad rendering).

Why is that? Is it completely different to rendering a quad?

It makes some sense that a surface receiving this operation is a render target, but it would have been nice to target a source surface, too.

That's OK, but not allowing plain surface copies from source to source surfaces is more disappointing. I guess gfx_directx could hide that detail by doing the blit in software and updating the texture... if we really wanted this badly.

then the first time gfx_surfaceStretch is called (with the tiles) it could be SI_Point (as in point sampling, as it's doing now), and the last time could be SI_Linear. I suppose we could apply this to all the rendering methods per your suggestion, but it'd only work with 32bit targets. What do you think?

Looks like it's sampling and averaging 4 pixels from the tileset for every pixel in the minimap. Would definitely be an improvement... but seems a little inflexible and indirect (what if you zoom out to 1 pixel per tile, and want more samples per tile?) But probably good enough.

Mipmapping might be useful in general to improve shrunk sprites... I imagine nearest-neighbour would look pretty bad for sprites with black outlines, for example. Maybe there even exists something like hq4x, but for scaling down sprites instead, which might still look better when resampled (with a less drastic scaling) with nearest neighbour. If there isn't, it sounds like a fun project to create one...

But mipmapping seems like something worth putting off until later :)

Also, wow, lots of activity on the rasterizer. I see some templating. Aside from optimisation, nearing completion?

Hieran Del8 00:52, 18 June 2011 (PDT):

Whole 64/128(/256) bit vector registers are normally operated on at once.

I think I finally get it. It'd be advantageous when multiple components must be calculated simultaneously.

Why is that? Is it completely different to rendering a quad?

Yes, but I've changed the restriction to encompass all operations, since the model is now a source->submix->master setup. It's much easier that way, since everything can be a render operation instead of a texture operation (it's faster that way, though less flexible).

I think the render targets should all be 32bit. It'd be easier, mostly for the hardware backends. (Specifically, coding for OpenGL 1.1 could be done without extensions! :O )

Mipmapping might be useful in general to improve shrunk sprites

With the view that the image would be static (a minimap), it would be a fine idea, though I'm sure one can generate the image once per map load the same way, by just rendering and linearly sampling. I don't know which way is better right now, too tired to think about it.... :P

lots of activity on the rasterizer. I see some templating. Aside from optimisation, nearing completion?

Yes, I did some templating (now that pixel shaders and the ROP units are out, it's so much easier), and it's nearing completion. I haven't duplicated the triangle alignment errors yet. Can you post an image?

Also, there's the bug relating to the center vertex, where the raster line isn't drawn out from the center. I can't find the solution yet.

The Mad Cacti 03:00, 18 June 2011 (PDT):

OK, here you go: 2 3 4. (Each pixel of the source image is a different colour.) It actually looks like the left and right triangles are curving outwards by one pixel!

I think the render targets should all be 32bit.

Heh, I was actually assuming that all along, and when I saw that you'd implemented it, I totally forgot it was only ever an assumption :)

I think I'll want to add faster specialised raster function(s) which don't support per-vertex colours; along with the different architecture-specific versions, it seems like there could be an explosion of variants, so trimming some of them is helpful. This is the reason that you see projects like [2]. Another use of a JIT: swiftshader, a software Direct3D 9 and OpenGL 2.0 implementation.

Hmmm... coincidentally I'd like to write a JIT compiler for the script interpreter :P

It's funny, shortly before you started on rasterizer.cpp, I was thinking of embedding some other rasterizer in the engine, like TinyGL. I'm kind of glad I didn't suggest it: working on one yourself is so much more fun :) I'm looking forward to spending time on it in a couple of weeks. (I would just fix the texture misalignment bugs myself if I had time. I won't be writing the surface stretcher before then.)

the model is now a source->submix->master setup

I have no idea what that is.

With the view that the image would be static (a minimap), it would be a fine idea, though I'm sure one can generate the image once per map load the same way by just rendering and linearly sampling.

Oops, I forgot about map layers. (Colour-key) transparency does not play nicely with mipmapping: shrinking tiles separately and then combining them could give completely the wrong result. So I guess your suggestion of a fixed number of samples per destination pixel is best.

(Still doesn't rule out mipmapping as a beneficial feature.)

Hieran Del8 09:51, 18 June 2011 (PDT):

It actually looks like the left and right triangles are curving outwards by one pixel!

Hmm. I'll have to study this in more detail later today. Thanks for the pics, it helps a lot. :)

Heh, I was actually assuming that all along, and when I saw that you'd implemented it, I totally forgot it was only ever an assumption :)

Oh, ok. :) Well, consider them 32bit then. I'll adjust the new render plan later.

I think I'll want to add faster specialised raster function(s) which don't support per-vertex colours

Yeah, that's a good idea, and maybe one that doesn't have a color argb modifier. Hmm. Maybe we should include a dword parameter in the function with flag bits saying which operations to apply? Or would separate functions be faster (albeit it sounds like there may be a lot of functions here)?

a software Direct3D 9 and OpenGL 2.0 implementation.

That is impressive!

working on one yourself is so much more fun

It is! And it's your fault too that we're having so much fun! :P

I have no idea what that is.

It's actually an audio graph design, but applies well to any scene/filter graph. Source is whatever content, submix is a render target, doing format conversion if necessary, and master is the backbuffer.

The Mad Cacti 21:09, 18 June 2011 (PDT):

Swiftshader used to be an LGPL project, hosted on sourceforge, so the source must be floating around somewhere...

Yeah, that's a good idea, and maybe one that doesn't have a color argb modifier.

Actually, if we have only two alternatives, one with vertex colours+argb modifier and one without either would make the most sense. Couldn't we preblend the argb modifier into the vertex colours instead of applying them separately? I assume that interpolating the vertex colours at a pixel would be quite a small fraction of the time (a single addition? No wrap-around required), making the extra cost of vertex colours over argb modification minor.
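The preblending itself would be a one-off per draw call, something like this (names hypothetical):

// Fold the per-call argb modifier into a vertex colour, so the inner raster
// loop only ever deals with vertex colours. Both are packed 0xAARRGGBB;
// each component is scaled as c * m / 255.
unsigned modulateColour( unsigned vertexCol, unsigned argbModifier )
{
    unsigned result = 0;
    for ( int shift = 0; shift < 32; shift += 8 )
    {
        unsigned c = (vertexCol    >> shift) & 0xff;
        unsigned m = (argbModifier >> shift) & 0xff;
        result |= ((c * m + 127) / 255) << shift;
    }
    return result;
}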

Maybe we should include a dword parameter in the function with flag bits saying which operations to apply?

That might be better than exposing the implementation detail of which combinations have specialised implementations. But either way, I think it would be cleaner to separate the actual implementations into separate functions (necessary anyway for CPU-specific code), meaning a combined function would just be a wrapper. What bits would we actually have? I think it's very unlikely that we would ever use a colour key colour other than 0, so that could be turned into a bit.
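For illustration, the bits might be something like this (names invented), with the combined function just a wrapper that dispatches on them:

// Hypothetical feature bits for a combined raster entry point.
enum RasterFlags
{
    RF_ColourKey0    = 1 << 0,  // treat colour 0 as transparent
    RF_VertexColours = 1 << 1,  // interpolate per-vertex colours
    RF_ArgbModifier  = 1 << 2   // apply a whole-triangle argb modifier
};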

Hieran Del8 22:14, 18 June 2011 (PDT):

Couldn't we preblend the argb modifier into the vertex colours instead of applying them separately?

Oh, that's a great idea! Of course, it won't work if the vertex doesn't have color data, though the argument I quote next solves this:

two alternatives, one with vertex colours+argb modifier and one without either would make the most sense.

Hmm, that might just work out better! I've contemplated it for a while now, and I agree, that should be the difference! So it sounds like three formats:

  • Pos, col (VertexPC)
  • Pos, tex (VertexPT)
  • Pos, tex, col (VertexPTC)

And three triangle drawing routines:

draw( VertexPC* pTri, ui32 argbModifier, Surface* pDest )
draw( VertexPT* pTri, Surface* pTexture, Palette* pPalette, bool useColorKey0, Surface* pDest )
draw( VertexPTC* pTri, Surface* pTexture, Palette* pPalette, bool useColorKey0, ui32 argbModifier, Surface* pDest )
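With the vertex structs being something like (exact field names and types are a guess):

struct VertexPC  { float x, y; ui32 col; };             // position + colour
struct VertexPT  { float x, y; float u, v; };           // position + texture coords
struct VertexPTC { float x, y; float u, v; ui32 col; }; // position + tex + colour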

Eh?

The Mad Cacti 01:21, 19 June 2011 (PDT):

Of course, it won't work if the vertex doesn't have color data, though the argument I quote next solves this:

Well... I don't think it does, unless the intent was to force use of VertexPTC and thus setting the colour components on the vertices. A boolean flag argument to ignore (or initialise) the colour components of the vertices could work... but I suspect now I'm just being lazy!

This sounds pretty good otherwise.