Class WordNet::Lexicon
In: lib/wordnet/lexicon.rb  (CVS)
Parent: Object

WordNet lexicon class - abstracts access to the WordNet lexical databases, and provides factory methods for looking up and creating new WordNet::Synset objects.

Included Modules

WordNet::Constants CrossCase

Constants

SvnId = %q$Id: lexicon.rb 79 2007-02-20 19:07:32Z deveiant $   Subversion Id
SvnRev = %q$Rev$   Subversion revision
DefaultDbEnv = File::join( Config::CONFIG['datadir'], "ruby-wordnet" )   The default path to the WordNet BerkeleyDB Env: the "ruby-wordnet" directory under Ruby's configured data directory (Config::CONFIG['datadir']).
EnvOptions = { :set_timeout => 50, :set_lk_detect => 1, :set_verbose => false, }   Options for the creation of the Env object
EnvFlagsRW = BDB::CREATE|BDB::INIT_TRANSACTION|BDB::RECOVER|BDB::INIT_MPOOL   Flags for the creation of the Env object (read-write and read-only)
EnvFlagsRO = BDB::INIT_MPOOL
TableNames = { :index => "index", :data => "data", :morph => "morph", }   Table names (actually database names in BerkeleyDB)

Attributes

data_db  [R]  The handle to the synset data table
env  [R]  The BDB::Env object which contains the WordNet lexicon's databases.
index_db  [R]  The handle to the index table
morph_db  [R]  The handle to the morph table

Public Class methods

Create a new WordNet::Lexicon object that will read its data from the given dbenv (a BerkeleyDB env directory). The database will be opened with the specified mode, which can be either a numeric octal mode (e.g., 0444) or one of :readonly, :readwrite, or :writable. Raises an ArgumentError if dbenv is not a directory.

[Source]

# File lib/wordnet/lexicon.rb, line 97
    def initialize( dbenv=DefaultDbEnv, mode=:readonly )
        raise ArgumentError, "Cannot find data directory '#{dbenv}'" unless
            File::directory?( dbenv )

        @mode = normalize_mode( mode )
        debug_msg "Mode is: %04o" % [ @mode ] if $DEBUG

        unless self.readonly?
            debug_msg "Using read/write flags"
            envflags = EnvFlagsRW
            dbflags = BDB::CREATE
        else
            debug_msg "Using readonly flags"
            envflags = EnvFlagsRO
            dbflags = 0
        end

        debug_msg "Env flags are: %0s, dbflags are %0s" %
            [ envflags.to_s(2), dbflags.to_s(2) ]

        begin
            @env = BDB::Env::new( dbenv, envflags, EnvOptions )
            @index_db = @env.open_db( BDB::BTREE, "index", nil, dbflags, @mode )
            @data_db = @env.open_db( BDB::BTREE, "data", nil, dbflags, @mode )
            @morph_db = @env.open_db( BDB::BTREE, "morph", nil, dbflags, @mode )
        rescue StandardError => err
            msg = "Error while opening Ruby-WordNet data files: #{dbenv}: %s" % 
                [ err.message ]
            raise err, msg, err.backtrace
        end
    end

Public Instance methods

Return a list of archival logfiles that can be removed safely. (BerkeleyDB-specific).

[Source]

# File lib/wordnet/lexicon.rb, line 174
    def archlogs
        return @env.log_archive( BDB::ARCH_ABS )
    end

Checkpoint the database (BerkeleyDB-specific). The bytes and minutes arguments are accepted but are not passed on to the environment.

[Source]

# File lib/wordnet/lexicon.rb, line 167
    def checkpoint( bytes=0, minutes=0 )
        @env.checkpoint
    end

Remove any archival logfiles for the lexicon's database environment. Does nothing unless the lexicon was opened read-write. (BerkeleyDB-specific).

[Source]

# File lib/wordnet/lexicon.rb, line 181
    def clean_logs
        return unless self.readwrite?
        self.archlogs.each do |logfile|
            File::chmod( 0777, logfile )
            File::delete( logfile )
        end
    end

Close the lexicon's database environment.

[Source]

# File lib/wordnet/lexicon.rb, line 161
    def close
        @env.close if @env
    end

Factory method: Creates and returns a new WordNet::Synset object in this lexicon for the specified word and part_of_speech.

[Source]

# File lib/wordnet/lexicon.rb, line 290
    def create_synset( word, part_of_speech )
        return WordNet::Synset::new( self, '', part_of_speech, word )
    end

Returns the familiarity/polysemy count for word as a part_of_speech, as an integer, or nil if the word is not in the index. The polyCount argument is accepted but unused. The same count can be obtained by counting the synsets returned by lookup_synsets.

[Source]

# File lib/wordnet/lexicon.rb, line 193
    def familiarity( word, part_of_speech, polyCount=nil )
        wordkey = self.make_word_key( word, part_of_speech )
        return nil unless @index_db.key?( wordkey )
        @index_db[ wordkey ].split( WordNet::SubDelimRe ).length
    end
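
The count is simply the number of synset offsets stored in the word's index entry. A minimal sketch of that idea, assuming a plain Hash in place of the index table and "|" as the offset separator (both are stand-ins for illustration; the real value of WordNet::SubDelim is not shown on this page):

```ruby
# Hypothetical stand-ins: a Hash plays the role of the index table, and
# SKETCH_SUB_DELIM is an assumed separator, not the library's constant.
SKETCH_SUB_DELIM    = "|"
SKETCH_SUB_DELIM_RE = Regexp.new( Regexp.escape(SKETCH_SUB_DELIM) )

SKETCH_INDEX = {
    "bank%n" => "00001|00002|00003",
    "bank%v" => "00004",
}

# Count the synset offsets stored under a word key; nil if the key is absent.
def familiarity_sketch( index, wordkey )
    return nil unless index.key?( wordkey )
    index[ wordkey ].split( SKETCH_SUB_DELIM_RE ).length
end

familiarity_sketch( SKETCH_INDEX, "bank%n" )    # => 3
familiarity_sketch( SKETCH_INDEX, "xyzzy%n" )   # => nil
```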

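Returns an array of the index keys (words with their part-of-speech suffix) that begin with text, which makes it useful for finding compound words. Returns an empty array if text is empty.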

[Source]

# File lib/wordnet/lexicon.rb, line 269
    def grep( text )
        return [] if text.empty?
        
        words = []
        
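        # Grab a cursor into the database and fetch while the key matches
        # the target text. The text is escaped so that it is matched
        # literally, and the loop stops if the cursor runs off the end.
        pattern = /^#{Regexp.escape( text )}/
        cursor = @index_db.cursor
        rec = cursor.set_range( text )
        while rec && pattern =~ rec[0]
            words.push rec[0]
            rec = cursor.next
        end
        cursor.close

        return words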
    end
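
The prefix scan relies on the index being sorted: position at the first key that is not less than the text, then walk forward while keys still start with it. A rough sketch of the same walk over a sorted Array (an assumption for illustration; the method itself uses a BDB cursor, and the keys below are made up):

```ruby
# A sorted Array of keys stands in for the BTREE index.
def grep_sketch( sorted_keys, text )
    return [] if text.empty?

    # Find the first key >= text, which is where the cursor would be positioned
    start = sorted_keys.bsearch_index {|key| key >= text }
    return [] if start.nil?

    # Collect keys for as long as they still begin with the text
    sorted_keys[ start..-1 ].take_while {|key| key.start_with?( text ) }
end

keys = %w[ tea%n tea_bag%n tea_party%n teach%v zebra%n ].sort
grep_sketch( keys, "tea_" )   # => ["tea_bag%n", "tea_party%n"]
```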

Look up synsets (WordNet::Synset objects) matching word as a part_of_speech, where part_of_speech is one of WordNet::Noun, WordNet::Verb, WordNet::Adjective, or WordNet::Adverb. Without sense, lookup_synsets returns all matches that are a part_of_speech. If sense (a 1-based sense number) is specified, only the synset that matches that particular part_of_speech and sense is returned. Returns nil if no index entry is found, even after morphological conversion.

[Source]

# File lib/wordnet/lexicon.rb, line 206
    def lookup_synsets( word, part_of_speech, sense=nil )
        wordkey = self.make_word_key( word, part_of_speech )
        pos = self.make_pos( part_of_speech )
        synsets = []

        # Look up the index entry, trying first the word as given, and if
        # that fails, trying morphological conversion.
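        entry = @index_db[ wordkey ]
        if entry.nil? && (word = self.morph( word, part_of_speech ))
            # Rebuild the key from the morphed word before retrying
            wordkey = self.make_word_key( word, part_of_speech )
            entry = @index_db[ wordkey ]
        end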

        # If the lookup failed both ways, just abort
        return nil unless entry

        # Make synset keys from the entry, narrowing it to just the sense
        # requested if one was specified.
        synkeys = entry.split( SubDelimRe ).collect {|off| "#{off}%#{pos}" }
        if sense
            return lookup_synsets_by_key( synkeys[sense - 1] )
        else
            return [ lookup_synsets_by_key(*synkeys) ].flatten
        end
    end
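
The lookup order described above (word as given, then its morphological base form, then narrowing by sense) can be sketched without a database. This is only an illustration under stated assumptions: Hashes stand in for the index and morph tables, the morph values are assumed to be bare base forms, "|" is an assumed offset separator, and every entry is made up.

```ruby
# Returns synset keys rather than Synset objects, to keep the sketch small.
def lookup_keys_sketch( index, morph, word, pos, sense=nil )
    wordkey = "#{word.gsub(/\s+/, '_')}%#{pos}"
    entry = index[ wordkey ]

    # Fall back to the base form from the morph table, rebuilding the key
    if entry.nil? && (base = morph[ wordkey ])
        entry = index[ "#{base}%#{pos}" ]
    end
    return nil unless entry

    # Turn each offset into an "offset%pos" synset key
    synkeys = entry.split( "|" ).collect {|off| "#{off}%#{pos}" }
    sense ? synkeys[ sense - 1 ] : synkeys
end

index = { "goose%n" => "01858313|10151261" }
morph = { "geese%n" => "goose" }

lookup_keys_sketch( index, morph, "geese", "n" )      # => ["01858313%n", "10151261%n"]
lookup_keys_sketch( index, morph, "geese", "n", 2 )   # => "10151261%n"
```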
lookup_synsetsByOffset( *keys )

Returns the WordNet::Synset objects corresponding to the keys specified. Each key is made up of the target synset's "offset" and syntactic category joined by a '%' character. Raises a WordNet::LookupError if a key is not present in the data table.

[Source]

# File lib/wordnet/lexicon.rb, line 235
    def lookup_synsets_by_key( *keys )
        synsets = []

        keys.each {|key|
            raise WordNet::LookupError, "Failed lookup of synset '#{key}': "\
                "No such synset" unless @data_db.key?( key )

            data = @data_db[ key ]
            offset, part_of_speech = key.split( /%/, 2 )
            synsets << WordNet::Synset::new( self, offset, part_of_speech, nil, data )
        }

        return *synsets
    end
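
To make the key layout concrete, here is a small sketch that splits "offset%pos" keys and fails on unknown ones. It assumes a Hash in place of the data table and uses a hypothetical error class and record format; it does not build real WordNet::Synset objects.

```ruby
# Hypothetical error class standing in for WordNet::LookupError
class LookupErrorSketch < StandardError; end

# Fetch the raw records for one or more "offset%pos" keys
def fetch_records_sketch( data, *keys )
    keys.collect do |key|
        raise LookupErrorSketch, "Failed lookup of synset '#{key}': No such synset" unless
            data.key?( key )

        # Split only on the first '%' so the key yields exactly two parts
        offset, pos = key.split( /%/, 2 )
        { :offset => offset, :pos => pos, :data => data[ key ] }
    end
end

data = { "01858313%n" => "made-up serialized synset" }
fetch_records_sketch( data, "01858313%n" ).first[ :pos ]   # => "n"
```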

Returns a form of word as a part of speech part_of_speech, as found in the WordNet morph files. The lookup_synsets method performs morphological conversion automatically, so a call to morph is not required before a lookup.

[Source]

# File lib/wordnet/lexicon.rb, line 256
    def morph( word, part_of_speech )
        return @morph_db[ self.make_word_key(word, part_of_speech) ]
    end
new_synset( word, part_of_speech )

Alias for create_synset

Returns true if the lexicon was opened in read-only mode.

[Source]

# File lib/wordnet/lexicon.rb, line 149
    def readonly?
        ( @mode & 0200 ).nonzero? ? false : true
    end

Returns true if the lexicon was opened in read-write mode.

[Source]

# File lib/wordnet/lexicon.rb, line 155
    def readwrite?
        ! self.readonly?
    end
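
Both predicates come down to testing the owner-write bit (0200) of the normalized file mode. A standalone sketch of that check, using hypothetical helper names:

```ruby
# True when the owner-write bit is absent from the mode
def readonly_mode_sketch?( mode )
    ( mode & 0200 ).zero?
end

# The read-write test is just the negation
def readwrite_mode_sketch?( mode )
    ! readonly_mode_sketch?( mode )
end

readonly_mode_sketch?( 0444 )    # => true
readonly_mode_sketch?( 0644 )    # => false
readwrite_mode_sketch?( 0666 )   # => true
```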

Remove the specified synset (a WordNet::Synset object) from the lexicon. Returns true once the synset has been removed, or nil if the synset was never stored in the database.

[Source]

# File lib/wordnet/lexicon.rb, line 338
    def remove_synset( synset )
        # If it's not in the database (ie., doesn't have a real offset),
        # just return.
        return nil if synset.offset == 1

        # The part of speech is the second half of the synset's key
        # ("offset%pos"), the same layout lookup_synsets_by_key splits.
        pos = synset.key.split( /%/, 2 ).last

        # Start a transaction on the data table
        @env.begin( BDB::TXN_COMMIT, @data_db ) do |txn,datadb|

            # First remove the index entries for this synset by iterating
            # over each of its words
            txn.begin( BDB::TXN_COMMIT, @index_db ) do |txn,indexdb|
                synset.words.collect {|word| word + "%" + pos }.each {|word|

                    # If the index contains an entry for this word, either
                    # splice out the offset for the synset being deleted if
                    # there are more than one, or just delete the whole
                    # entry if it's the only one.
                    if indexdb.key?( word )
                        offsets = indexdb[ word ].
                            split( SubDelimRe ).
                            reject {|offset| offset == synset.offset}

                        unless offsets.empty?
                            indexdb[ word ] = offsets.join( SubDelim )
                        else
                            indexdb.delete( word )
                        end
                    end
                }
            end

            # :TODO: Delete synset from pointers of related synsets

            # Delete the synset from the main db, which is keyed by synset key
            datadb.delete( synset.key )
        end
        end

        return true
    end

Returns the result of looking up word in the inverse of the WordNet morph files. (This is undocumented in Lingua::Wordnet.)

[Source]

# File lib/wordnet/lexicon.rb, line 263
    def reverse_morph( word )
        @morph_db.invert[ word ]
    end
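
If the morph table's invert behaves like Ruby's Hash#invert (an assumption; only the Hash behavior is shown here), the reverse lookup maps a base form back to an inflected key, and when several inflected forms share a base form only one of them survives. The entries below are made up:

```ruby
# A Hash stands in for the morph table
morph = { "geese%n" => "goose", "mice%n" => "mouse" }
morph.invert[ "goose" ]   # => "geese%n"

# Two inflected forms with the same base: the later pair wins after invert
verbs = { "ran%v" => "run", "running%v" => "run" }
verbs.invert              # => {"run"=>"running%v"}
```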

Store the specified synset (a WordNet::Synset object) in the lexicon, assigning it a new offset first if it has not been stored before. Returns the offset of the stored synset.

[Source]

# File lib/wordnet/lexicon.rb, line 298
    def store_synset( synset )
        # Start a transaction
        @env.begin( BDB::TXN_COMMIT, @data_db ) do |txn,datadb|

            # If this is a new synset, generate an offset for it
            if synset.offset == 1
                synset.offset =
                    (datadb['offsetcount'] = datadb['offsetcount'].to_i + 1)
            end

            # The part of speech is the second half of the synset's key
            # ("offset%pos"), the same layout lookup_synsets_by_key splits.
            pos = synset.key.split( /%/, 2 ).last

            # Write the data entry
            datadb[ synset.key ] = synset.serialize

            # Write the index entries
            txn.begin( BDB::TXN_COMMIT, @index_db ) do |txn,indexdb|

                # Make word/part-of-speech pairs from the words in the synset
                synset.words.collect {|word| word + "%" + pos }.each {|word|

                    # If the index already has this word, but not this
                    # synset, add it and write the entry back
                    if indexdb.key?( word )
                        offsets = indexdb[ word ].split( SubDelimRe )
                        unless offsets.include?( synset.offset )
                            indexdb[ word ] =
                                (offsets << synset.offset).join( SubDelim )
                        end
                    else
                        indexdb[ word ] = synset.offset
                    end
                }
            end # transaction on @index_db
        end # transaction on @data_db

        return synset.offset
    end
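
Storing and removing a synset both amount to editing a delimited list of offsets under each word key. A minimal sketch of that bookkeeping, assuming a Hash in place of the index table and "|" as the separator (both stand-ins for illustration, with made-up offsets and hypothetical helper names):

```ruby
# Add an offset to a word's index entry unless it is already listed
def add_offset_sketch( index, wordkey, offset, delim="|" )
    offsets = index.key?( wordkey ) ? index[ wordkey ].split( delim ) : []
    offsets << offset unless offsets.include?( offset )
    index[ wordkey ] = offsets.join( delim )
end

# Remove an offset, deleting the whole entry once it is empty
def remove_offset_sketch( index, wordkey, offset, delim="|" )
    return unless index.key?( wordkey )
    offsets = index[ wordkey ].split( delim ).reject {|off| off == offset }
    if offsets.empty?
        index.delete( wordkey )
    else
        index[ wordkey ] = offsets.join( delim )
    end
end

index = {}
add_offset_sketch( index, "dog%n", "0001" )
add_offset_sketch( index, "dog%n", "0002" )
index[ "dog%n" ]                              # => "0001|0002"
remove_offset_sketch( index, "dog%n", "0001" )
index[ "dog%n" ]                              # => "0002"
```

Comparing whole offsets after splitting avoids a false match when one offset happens to be a substring of another.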

Protected Instance methods

Normalize the various ways of specifying a part of speech into the WordNet part-of-speech indicator. The original value may be the name (e.g., "noun"); nil, which defaults to the indicator for a noun; or the indicator character itself, which is returned unmodified. Returns nil for anything else.

[Source]

# File lib/wordnet/lexicon.rb, line 388
    def make_pos( original )
        return WordNet::Noun if original.nil?
        osym = original.to_s.intern
        return WordNet::SyntacticCategories[ osym ] if
            WordNet::SyntacticCategories.key?( osym )
        return original if SyntacticSymbols.key?( original )
        return nil
    end

Make a lexicon key out of the given word and part of speech (pos).

[Source]

# File lib/wordnet/lexicon.rb, line 400
    def make_word_key( word, pos )
        pos = self.make_pos( pos )
        word = word.gsub( /\s+/, '_' )
        return "#{word}%#{pos}"
    end
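
The two helpers above can be tried out in isolation. In this sketch the lookup tables are hypothetical: the contents of WordNet::SyntacticCategories and SyntacticSymbols are not shown on this page, so the name-to-indicator pairs below are assumptions made for illustration.

```ruby
# Assumed name => indicator table, and its inverse
SKETCH_CATEGORIES = { :noun => "n", :verb => "v", :adjective => "a", :adverb => "r" }
SKETCH_SYMBOLS    = SKETCH_CATEGORIES.invert

# Map a name, nil, or an indicator onto an indicator character
def make_pos_sketch( original )
    return SKETCH_CATEGORIES[ :noun ] if original.nil?
    osym = original.to_s.intern
    return SKETCH_CATEGORIES[ osym ] if SKETCH_CATEGORIES.key?( osym )
    return original if SKETCH_SYMBOLS.key?( original )
    nil
end

# Join the word (runs of whitespace become one underscore) and indicator with '%'
def make_word_key_sketch( word, pos )
    "#{word.gsub( /\s+/, '_' )}%#{make_pos_sketch( pos )}"
end

make_pos_sketch( "verb" )                      # => "v"
make_word_key_sketch( "tea  party", :noun )    # => "tea_party%n"
```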

Private Instance methods

Output the given msg to STDERR if $DEBUG is turned on.

[Source]

# File lib/wordnet/lexicon.rb, line 427
    def debug_msg( *msg )
        return unless $DEBUG
        $deferr.puts msg
    end

Turn the given origmode into an octal file mode such as that given to File.open.

[Source]

# File lib/wordnet/lexicon.rb, line 413
    def normalize_mode( origmode )
        case origmode
        when :readonly
            0444 & ~File.umask
        when :readwrite, :writable
            0666 & ~File.umask
        when Fixnum
            origmode
        else
            raise ArgumentError, "unrecognized mode %p" % [origmode]
        end
    end
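
A standalone variant of the same normalization can make the umask arithmetic easier to see. This sketch differs from the method above in two labeled ways: the umask is passed in as a parameter (hypothetical, so the result is predictable) and Integer is used in place of Fixnum so that it also runs on current Ruby versions.

```ruby
# Symbolic modes are masked by the umask; numeric modes pass through as given
def normalize_mode_sketch( origmode, umask=File.umask )
    case origmode
    when :readonly
        0444 & ~umask
    when :readwrite, :writable
        0666 & ~umask
    when Integer
        origmode
    else
        raise ArgumentError, "unrecognized mode %p" % [origmode]
    end
end

"%04o" % normalize_mode_sketch( :readonly, 0022 )    # => "0444"
"%04o" % normalize_mode_sketch( :readwrite, 0022 )   # => "0644"
"%04o" % normalize_mode_sketch( :writable, 0077 )    # => "0600"
```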
