Class: Arrow::HTMLTokenizer
- Inherits:
- Object
- Object
- Arrow::HTMLTokenizer
- Includes:
- Enumerable
- Defined in:
- lib/arrow/htmltokenizer.rb
Overview
Arrow::HTMLTokenizer is a simple HTML parser that breaks HTML source down into tokens: comments, doctype declarations, processing instructions, tags, and text.
Some of the code and design were stolen from the excellent HTMLTokenizer
library by Ben Giddings.
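To make "breaking HTML down into tokens" concrete, here is a minimal standalone sketch using only StringScanner from the Ruby standard library. It is not the Arrow implementation: `tokenize_html` is a hypothetical helper, and it returns plain `[kind, text]` pairs instead of Arrow's token objects. The fallback for an unterminated tag is also an assumption added so the loop always advances.

```ruby
require 'strscan'

# Hypothetical helper (not part of Arrow): split HTML into [kind, text] pairs.
def tokenize_html( source )
  scanner = StringScanner.new( source )
  tokens = []

  until scanner.eos?
    if scanner.peek( 1 ) == '<'
      # Consume up to and including the next '>'. If the tag is never
      # closed, take the rest of the input so the loop still terminates.
      text = scanner.scan_until( />/ ) || scanner.scan( /.+/m )
      kind = case text
             when /\A<!--/ then :comment
             when /\A<!/   then :doctype
             when /\A<\?/  then :processing_instruction
             else               :tag
             end
    else
      # Everything up to the next '<' is one run of text.
      text = scanner.scan( /[^<]+/ )
      kind = :text
    end
    tokens << [ kind, text ]
  end

  tokens
end

result = tokenize_html( '<!DOCTYPE html><p>Hi <!-- note --></p>' )
result.each { |kind, text| puts "#{kind}: #{text.inspect}" }
```

Because the scan stops at the first `>`, a comment or attribute value that itself contains `>` would be cut short in this sketch.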
VCS Id
$Id$
Authors
Michael Granger
Please see the file LICENSE in the top-level directory for licensing details.
Constant Summary
- SVNRev =
SVN Revision
%q$Rev$
- SVNId =
SVN Id
%q$Id$
Instance Attribute Summary
-
- (Object) scanner
readonly
The StringScanner doing the tokenizing.
-
- (Object) source
readonly
The HTML source being tokenized.
Instance Method Summary
-
- (Object) each
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
-
- (HTMLTokenizer) initialize(source)
constructor
Create a new Arrow::HTMLTokenizer object.
Methods inherited from Object
deprecate_class_method, deprecate_method, inherited
Methods included from Loggable
Constructor Details
- (HTMLTokenizer) initialize(source)
Create a new Arrow::HTMLTokenizer object that tokenizes the given source string.
# File 'lib/arrow/htmltokenizer.rb', line 41
def initialize( source )
  @source = source
  @scanner = StringScanner.new( source )
end
Instance Attribute Details
- (Object) scanner (readonly)
The StringScanner doing the tokenizing
# File 'lib/arrow/htmltokenizer.rb', line 55
def scanner
  @scanner
end
- (Object) source (readonly)
The HTML source being tokenized
# File 'lib/arrow/htmltokenizer.rb', line 52
def source
  @source
end
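Since the scanner attribute hands back the underlying StringScanner, its state can in principle be inspected while tokenizing. The sketch below uses a bare StringScanner rather than an Arrow::HTMLTokenizer, and `scan_progress` is a hypothetical helper; it only illustrates the standard `pos`, `rest`, and `reset` calls that such inspection would rely on.

```ruby
require 'strscan'

# Hypothetical helper: report how a StringScanner's state changes as it
# consumes one tag and one run of text.
def scan_progress( source )
  scanner = StringScanner.new( source )

  scanner.scan_until( />/ )      # consume the first tag
  after_tag = scanner.pos        # offset just past that tag

  scanner.scan( /[^<]+/ )        # consume the text that follows
  remaining = scanner.rest       # whatever has not been scanned yet

  scanner.reset                  # rewind to the start of the source
  { after_tag: after_tag, remaining: remaining, after_reset: scanner.pos }
end

p scan_progress( '<p>Hello</p>' )
```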
Instance Method Details
- (Object) each
Enumerable interface: Iterates over parsed tokens, calling the supplied block with each one.
# File 'lib/arrow/htmltokenizer.rb', line 60
def each
  @scanner.reset

  until @scanner.empty?
    if @scanner.peek(1) == '<'
      tag = @scanner.scan_until( />/ )

      case tag
      when /^<!--/
        token = HTMLComment.new( tag )
      when /^<!/
        token = DocType.new( tag )
      when /^<\?/
        token = ProcessingInstruction.new( tag )
      else
        token = HTMLTag.new( tag )
      end
    else
      text = @scanner.scan( /[^<]+/ )
      token = HTMLText.new( text )
    end

    yield( token )
  end
end
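Because the class includes Enumerable and defines each, methods such as select, map, and count should work on a tokenizer directly. The following sketch shows the pattern with a stand-in class so it runs without Arrow installed: `SketchTokenizer` and `SketchToken` are hypothetical names, the tokens are plain Structs rather than Arrow's HTMLTag and HTMLText objects, and all markup is lumped into a single `:tag` kind for brevity.

```ruby
require 'strscan'

# Hypothetical token type standing in for Arrow's token classes.
SketchToken = Struct.new( :kind, :text )

# Hypothetical stand-in for Arrow::HTMLTokenizer, reduced to two token kinds.
class SketchTokenizer
  include Enumerable

  def initialize( source )
    @scanner = StringScanner.new( source )
  end

  # Yield one token per tag or text run. The scanner is rewound first,
  # so the tokenizer can be iterated more than once.
  def each
    @scanner.reset
    until @scanner.eos?
      if @scanner.peek( 1 ) == '<'
        # Assumption: an unterminated tag swallows the rest of the input.
        text = @scanner.scan_until( />/ ) || @scanner.scan( /.+/m )
        yield SketchToken.new( :tag, text )
      else
        yield SketchToken.new( :text, @scanner.scan( /[^<]+/ ) )
      end
    end
  end
end

tokenizer = SketchTokenizer.new( '<b>bold</b> and <i>italic</i>' )

# Enumerable methods are built on top of #each.
text_only = tokenizer.select { |t| t.kind == :text }.map( &:text ).join
tag_count = tokenizer.count { |t| t.kind == :tag }

puts text_only
puts tag_count
```

Rewinding at the top of each is what lets the two Enumerable calls above run back to back on the same object.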