Class: Arrow::HTMLTokenizer

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/arrow/htmltokenizer.rb

Overview

Arrow::HTMLTokenizer is a simple HTML parser that breaks HTML source down into tokens.

Some of the code and design were stolen from the excellent HTMLTokenizer library by Ben Giddings.

VCS Id

 $Id$

Authors

  • Michael Granger


Please see the file LICENSE in the top-level directory for licensing details.

Constant Summary

SVNRev =

SVN Revision

%q$Rev$
SVNId =

SVN Id

%q$Id$

Instance Attribute Summary

Instance Method Summary

Methods inherited from Object

deprecate_class_method, deprecate_method, inherited

Methods included from Loggable

#log

Constructor Details

- (HTMLTokenizer) initialize(source)

Create a new Arrow::HTMLTokenizer object that tokenizes the given HTML source string.



# File 'lib/arrow/htmltokenizer.rb', line 41

def initialize( source )
  @source = source
  @scanner = StringScanner.new( source )
end

Instance Attribute Details

- (Object) scanner (readonly)

The StringScanner doing the tokenizing



# File 'lib/arrow/htmltokenizer.rb', line 55

def scanner
  @scanner
end
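
As a rough illustration of what this scanner does during tokenizing, the sketch below exercises the StringScanner calls that #each relies on (peek, scan_until, scan, eos? and reset). The sample string and the variable names are assumptions made for the example; they are not taken from the Arrow sources.

```ruby
require 'strscan'

# Hypothetical sample input, not taken from the Arrow sources.
scanner = StringScanner.new( "<p>Hi</p>" )

first_char = scanner.peek( 1 )         # look ahead one character without consuming it
open_tag   = scanner.scan_until( />/ ) # consume up to and including the next '>'
text       = scanner.scan( /[^<]+/ )   # consume a run of characters that are not '<'
close_tag  = scanner.scan_until( />/ ) # consume the closing tag the same way
finished   = scanner.eos?              # true once all input has been consumed

scanner.reset                          # rewind to the start of the string
rewound_pos = scanner.pos
```

Because reset rewinds the scan pointer, the same scanner can be walked over the source more than once.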

- (Object) source (readonly)

The HTML source being tokenized



# File 'lib/arrow/htmltokenizer.rb', line 52

def source
  @source
end

Instance Method Details

- (Object) each

Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.



# File 'lib/arrow/htmltokenizer.rb', line 60

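def each
  @scanner.reset

  # StringScanner#empty? is obsolete; #eos? is the supported equivalent.
  until @scanner.eos?
    if @scanner.peek(1) == '<'
      # Consume through the next '>'. If the markup is never closed, take
      # the rest of the input so the scanner always advances.
      tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )

      # Classify the markup by its opening characters.
      case tag
      when /^<!--/
        token = HTMLComment.new( tag )
      when /^<!/
        token = DocType.new( tag )
      when /^<\?/
        token = ProcessingInstruction.new( tag )
      else
        token = HTMLTag.new( tag )
      end
    else
      # Everything up to the next '<' is plain text.
      text = @scanner.scan( /[^<]+/ )
      token = HTMLText.new( text )
    end

    yield( token )
  end
end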
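
To show how defining #each and including Enumerable fit together, here is a minimal, self-contained sketch. MiniTokenizer is a hypothetical stand-in, not the Arrow class: it yields plain [kind, text] pairs rather than the HTMLComment, DocType, ProcessingInstruction, HTMLTag and HTMLText token objects. The sample HTML string and the enum_for fallback are likewise assumptions made for the example.

```ruby
require 'strscan'

# Hypothetical stand-in for Arrow::HTMLTokenizer: it yields [kind, text]
# pairs instead of the token objects used by the real class.
class MiniTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  # Enumerable only requires #each; map, select, to_a and so on build on it.
  def each
    return enum_for( :each ) unless block_given?
    @scanner.reset

    until @scanner.eos?
      if @scanner.peek( 1 ) == '<'
        # Fall back to the remaining input if the tag is never closed.
        tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )
        kind = case tag
               when /^<!--/ then :comment
               when /^<!/   then :doctype
               when /^<\?/  then :processing_instruction
               else              :tag
               end
        yield [ kind, tag ]
      else
        yield [ :text, @scanner.scan( /[^<]+/ ) ]
      end
    end
  end
end

html   = "<!DOCTYPE html><!-- note --><p>Hello</p>"
tokens = MiniTokenizer.new( html )
kinds  = tokens.map {|kind, _text| kind }
texts  = tokens.select {|kind, _text| kind == :text }.map {|_kind, text| text }
```

Since #each resets the scanner on every call, Enumerable methods can be applied to the same tokenizer object repeatedly, as the map and select calls above do.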