NetSurf's parser isn't suckless.
Here are some issues with their tokeniser.c
- Using a switch statement with explicit case numbers to implement the
tokeniser FSM instead of simple gotos and an implicit, position-based FSM.
This means it has to go through that switch on every pass through the
main loop, even when the next state is the same, instead of just jumping
to the right place.
- Making every damn state a function. The amount of work that has to
be done in each state is tiny, but somehow they've managed to bloat
each one enough to make it look nontrivial. There is a ton of
duplicate code as a result - for example, compare
hubbub_tokeniser_handle_attribute_value_dq(),
hubbub_tokeniser_handle_attribute_value_sq(), and
hubbub_tokeniser_handle_attribute_value_uq(). (The names are
ridiculously long too.) Each one is over 40 lines and does
basically the same thing... can you spot the differences between them?
Here's the corresponding part of my parser's tokeniser; it handles all
3 types of attribute values.
attrv_unquoted_loop:
        cmp     al, '>'
        jz      finish_attr_done
        call    [ebp+htmlp.func_iswhs]
        jnz     attrv_loop
finish_attr_done:
        ; omitted code here finishes an attribute and emits a tag
        ; or starts a new one (~10 lines)
        ...
find_attrv:
        xor     bl, bl                  ; presume unquoted
        ; omitted code here scans for start of an attribute value
        ; (<10 lines)
        ...
        cmp     al, '"'
        jz      attrv_quoted
        cmp     al, "'"
        jnz     attrv_unquoted
attrv_quoted:
        mov     bl, al                  ; save quote char
        inc     edx                     ; skip quote char
attrv_loop:
        call    [ebp+htmlp.func_getchar] ; EOF - ignore
attrv_unquoted:
        or      bl, bl
        jz      attrv_unquoted_loop
        cmp     al, bl
        jnz     attrv_loop
        jmp     finish_attr_done
I've shown quite a bit more than the actual unquoted/quoted attribute
value states, just so you can follow the structure around them, and yet
this code, in x86 asm, is still shorter than any one of Hubbub's
functions, in C, each of which does a tiny fraction of the work this
does! The actual quoted/unquoted attribute value states are just these
ten instructions:
attrv_unquoted_loop:
        cmp     al, '>'
        jz      finish_attr_done
        call    [ebp+htmlp.func_iswhs]
        jnz     attrv_loop
        ...
attrv_loop:
        call    [ebp+htmlp.func_getchar] ; EOF - ignore
attrv_unquoted:
        or      bl, bl
        jz      attrv_unquoted_loop
        cmp     al, bl
        jnz     attrv_loop
        jmp     finish_attr_done
I'm not claiming to be an expert on HTML parsing by any means, but
this code is so short because what it does really is that simple:
unquoted attribute values are terminated by '>' or whitespace, and
quoted ones are terminated by their quote character. There's nothing at
all deep or complex about this, yet Hubbub's version makes it look like
even "lightweight" code (which is what they claim to be) needs to do
all that shit, and then it gets used as "evidence" by all the people
claiming that browsers need to be complex!
- Character references don't need to be handled by the tokeniser,
since the presence or absence of any '&'s doesn't affect tokenisation.
- DOCTYPE tokens, same thing, already mentioned above.
- EOF: this is another bloater. Hubbub checks for EOF in the code of
every state(!), but if you read the spec you'll see that the possible
actions on EOF come down to two: do nothing, or "emit
the current token". So my EOF handling is done in getchar. The states
that do nothing need no code at all; the ones that do are
recorded in an array (there are only 16 entries), and the right place is
jumped to when the EOF function is called.
- Partial input handling: this is all handled in getchar, which
returns "up two levels" to whatever called the parser when it runs out
of input (try doing that in C!). The input is one big buffer that gets
resized and moved as needed. I'm not sure what Hubbub does here, but
it's probably more complex than this.
- Output interface: Hubbub stuffs everything into a "token" structure
and then calls one function, token_handler(), to push the results out.
It has functions named emit_current_tag(), emit_current_comment(),
emit_current_chars(), etc., but all of them have to go through that one
token_handler() function. On the other side, the tree construction then
has to switch on the token type just to call the right function, when
they could've made the tokeniser call those different functions in the
first place. How stupid is that? I take the sane route with my
tokeniser interface:
extrn _AT_emit_doctype_AT_4:near
extrn _AT_emit_text_AT_4:near
extrn _AT_emit_comment_AT_4:near
extrn _AT_emit_tag_AT_4:near
One function for each token type, as it should be. No need for a
"token type" either, because the function being called implicitly
determines what token was emitted, and it can directly read from the
right fields of the parser/tokeniser structure to get its info. The
pointer to the parser structure is in a register already. Simple and
efficient.
On Sun, Feb 15, 2015 at 5:30 PM, Marc Collin <marc.collin7_AT_gmail.com> wrote:
>
> On Sun, Feb 15, 2015 at 5:27 PM, Ralph Eastwood <tcmreastwood_AT_gmail.com>
> wrote:
>>
>> On 15 February 2015 at 19:02, Calvin Morrison <mutantturkey_AT_gmail.com>
>> wrote:
>> > check out dillo
>> >
>> > On 15 February 2015 at 13:52, Ralph Eastwood <tcmreastwood_AT_gmail.com>
>> > wrote:
>> >> On 15 February 2015 at 18:23, Marc Weber <marco-oweber_AT_gmx.de> wrote:
>> >>> http://www.netsurf-browser.org/ could be close.
>> >>> also have a look at: http://de.wikipedia.org/wiki/Servo_(Software)
>> >>> In any case it looks like to be a lot of effort.
>> >>>
>> >>> Marc Weber
>> >>>
>> >>
>> >> I've glanced at netsurf in passing (source code/design) etc. It is by
>> >> far more suckless than any other implementation I've come across.
>> >>
>> >> --
>> >> Tai Chi Minh Ralph Eastwood
>> >> tcmreastwood_AT_gmail.com
>> >>
>> >
>>
>> I stand corrected! I had forgotten about dillo.
>>
>> --
>> Tai Chi Minh Ralph Eastwood
>> tcmreastwood_AT_gmail.com
>>
>
Received on Sun Feb 15 2015 - 20:30:49 CET