Class: Arrow::HTMLTokenizer

Inherits:

Object

Object
Object
Arrow::HTMLTokenizer

Includes:

Enumerable

Defined in:

lib/arrow/htmltokenizer.rb

Overview

Arrow::HTMLTokenizer is a simple HTML parser that breaks HTML source down into tokens: comments, doctype declarations, processing instructions, tags, and text.

Some of the code and design were stolen from the excellent HTMLTokenizer library by Ben Giddings.

VCS Id

 $Id$

Authors

Michael Granger

Please see the file LICENSE in the top-level directory for licensing details.

Constant Summary

SVNRev = %q$Rev$
SVN Revision.

SVNId = %q$Id$
SVN Id.

Instance Attribute Summary

- (Object) scanner readonly
The StringScanner doing the tokenizing.
- (Object) source readonly
The HTML source being tokenized.

Instance Method Summary

- (Object) each
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
- (HTMLTokenizer) initialize(source) constructor
Create a new Arrow::HTMLTokenizer object for the given HTML source.

Methods inherited from Object

deprecate_class_method, deprecate_method, inherited

Methods included from Loggable

#log

Constructor Details

- (HTMLTokenizer) initialize(source)

Create a new Arrow::HTMLTokenizer object that will tokenize the given source string.

# File 'lib/arrow/htmltokenizer.rb', line 41

def initialize( source )
  @source = source
  @scanner = StringScanner.new( source )
end
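
The constructor only stores the source and wraps it in a StringScanner; nothing is tokenized until each is called. As a rough illustration of that starting state, the sketch below uses only Ruby's standard strscan library. The variable names and the sample HTML are assumptions for the example, not part of Arrow.

```ruby
require 'strscan'

# Illustrative input; any HTML string would do.
source  = "<p>Hello</p>"

# The same call the constructor makes with its argument.
scanner = StringScanner.new( source )

scanner.pos      # 0: nothing has been consumed yet
scanner.eos?     # false: there is still input left to scan
scanner.peek(1)  # "<": peek looks ahead without advancing
scanner.pos      # still 0 after the peek
```

Because the scanner keeps its own position, each calls reset before scanning so that repeated iterations start from the beginning of the source.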

Instance Attribute Details

- (Object) scanner (readonly)

The StringScanner doing the tokenizing.



# File 'lib/arrow/htmltokenizer.rb', line 55

def scanner
  @scanner
end

- (Object) source (readonly)

The HTML source being tokenized.



# File 'lib/arrow/htmltokenizer.rb', line 52

def source
  @source
end

Instance Method Details

- (Object) each

Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.

# File 'lib/arrow/htmltokenizer.rb', line 60

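def each
  # Rewind so every iteration starts at the beginning of the source
  @scanner.reset

  # eos? replaces the obsolete StringScanner#empty? alias
  until @scanner.eos?
    if @scanner.peek(1) == '<'
      # Markup: consume everything through the next '>'
      tag = @scanner.scan_until( />/ )

      # Classify by prefix; '<!--' is tested before the more general '<!'
      case tag
      when /^<!--/
        token = HTMLComment.new( tag )
      when /^<!/
        token = DocType.new( tag )
      when /^<\?/
        token = ProcessingInstruction.new( tag )
      else
        token = HTMLTag.new( tag )
      end
    else
      # Character data: everything up to the next '<'
      text = @scanner.scan( /[^<]+/ )
      token = HTMLText.new( text )
    end

    yield( token )
  end
end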
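
To make the scanning loop easier to try without loading Arrow, here is a self-contained sketch that follows the same steps. SketchTokenizer and the Struct-based token classes are hypothetical stand-ins that only wrap the raw text; Arrow's real token classes are not shown on this page and may behave differently. The fallback for an unclosed tag is an addition in this sketch and is not in the source above.

```ruby
require 'strscan'

# Hypothetical stand-ins for Arrow's token classes; each just wraps the raw text.
HTMLComment           = Struct.new( :raw )
DocType               = Struct.new( :raw )
ProcessingInstruction = Struct.new( :raw )
HTMLTag               = Struct.new( :raw )
HTMLText              = Struct.new( :raw )

# Hypothetical stand-in that mirrors the scanning steps of #each.
class SketchTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  def each
    @scanner.reset

    until @scanner.eos?
      if @scanner.peek(1) == '<'
        # Added guard (not in the original): if no '>' follows,
        # take the rest of the input so the loop always advances.
        tag = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )

        token = case tag
                when /^<!--/ then HTMLComment.new( tag )
                when /^<!/   then DocType.new( tag )
                when /^<\?/  then ProcessingInstruction.new( tag )
                else              HTMLTag.new( tag )
                end
      else
        token = HTMLText.new( @scanner.scan(/[^<]+/) )
      end

      yield token
    end
  end
end

html   = "<!DOCTYPE html><p>Hi <!-- note --> there</p>"
tokens = SketchTokenizer.new( html ).to_a

tokens.map {|t| t.class }
# => [DocType, HTMLTag, HTMLText, HTMLComment, HTMLText, HTMLTag]

# Enumerable methods come for free once each is defined.
tokens.grep( HTMLText ).map( &:raw )
# => ["Hi ", " there"]
```

Since markup is consumed up to the first '>', a comment or attribute value that itself contains '>' would be split at that character; this follows directly from the scan_until( />/ ) call.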