| Class | WordNet::Lexicon |
| In: | lib/wordnet/lexicon.rb |
| Parent: | Object |
WordNet lexicon class - abstracts access to the WordNet lexical databases, and provides factory methods for looking up and creating new WordNet::Synset objects.
| SvnId | = | %q$Id: lexicon.rb 79 2007-02-20 19:07:32Z deveiant $ | Subversion Id | |
| SvnRev | = | %q$Rev$ | Subversion revision | |
| DefaultDbEnv | = | File::join( Config::CONFIG['datadir'], "ruby-wordnet" ) | The default path to the WordNet BerkeleyDB Env: the "ruby-wordnet" directory under Ruby's configured datadir. | |
| EnvOptions | = | { :set_timeout => 50, :set_lk_detect => 1, :set_verbose => false, } | Options for the creation of the Env object | |
| EnvFlagsRW | = | BDB::CREATE|BDB::INIT_TRANSACTION|BDB::RECOVER|BDB::INIT_MPOOL | Flags for the creation of the Env object (read-write and read-only) | |
| EnvFlagsRO | = | BDB::INIT_MPOOL | ||
| TableNames | = | { :index => "index", :data => "data", :morph => "morph", } | Table names (actually database names in BerkeleyDB) |
| data_db | [R] | The handle to the synset data table |
| env | [R] | The BDB::Env object which contains the wordnet lexicon's databases. |
| index_db | [R] | The handle to the index table |
| morph_db | [R] | The handle to the morph table |
Create a new WordNet::Lexicon object that will read its data from the given dbenv (a BerkeleyDB env directory). The database will be opened with the specified mode, which can either be a numeric octal mode (e.g., 0444) or one of (:readonly, :readwrite).
    # File lib/wordnet/lexicon.rb, line 97
    def initialize( dbenv=DefaultDbEnv, mode=:readonly )
        raise ArgumentError, "Cannot find data directory '#{dbenv}'" unless
            File::directory?( dbenv )
        @mode = normalize_mode( mode )

        # Format the normalized numeric mode; the raw argument may be a Symbol
        debug_msg "Mode is: %04o" % [ @mode ] if $DEBUG

        unless self.readonly?
            debug_msg "Using read/write flags"
            envflags = EnvFlagsRW
            dbflags = BDB::CREATE
        else
            debug_msg "Using readonly flags"
            envflags = EnvFlagsRO
            dbflags = 0
        end

        debug_msg "Env flags are: %0s, dbflags are %0s" %
            [ envflags.to_s(2), dbflags.to_s(2) ]

        begin
            @env = BDB::Env::new( dbenv, envflags, EnvOptions )
            @index_db = @env.open_db( BDB::BTREE, "index", nil, dbflags, @mode )
            @data_db = @env.open_db( BDB::BTREE, "data", nil, dbflags, @mode )
            @morph_db = @env.open_db( BDB::BTREE, "morph", nil, dbflags, @mode )
        rescue StandardError => err
            msg = "Error while opening Ruby-WordNet data files: #{dbenv}: %s" %
                [ err.message ]
            raise err, msg, err.backtrace
        end
    end
Return a list of archival logfiles that can be removed safely (BerkeleyDB-specific).

    # File lib/wordnet/lexicon.rb, line 174
    def archlogs
        return @env.log_archive( BDB::ARCH_ABS )
    end
Checkpoint the database (BerkeleyDB-specific). The bytes and minutes arguments are accepted but are not passed on to the environment.

    # File lib/wordnet/lexicon.rb, line 167
    def checkpoint( bytes=0, minutes=0 )
        @env.checkpoint
    end
Remove any archival logfiles for the lexicon's database environment (BerkeleyDB-specific).

    # File lib/wordnet/lexicon.rb, line 181
    def clean_logs
        return unless self.readwrite?
        self.archlogs.each do |logfile|
            File::chmod( 0777, logfile )
            File::delete( logfile )
        end
    end
Close the lexicon's database environment.

    # File lib/wordnet/lexicon.rb, line 161
    def close
        @env.close if @env
    end
Factory method: Creates and returns a new WordNet::Synset object in this lexicon for the specified word and part_of_speech.
    # File lib/wordnet/lexicon.rb, line 290
    def create_synset( word, part_of_speech )
        return WordNet::Synset::new( self, '', part_of_speech, word )
    end
Returns the familiarity (polysemy) count for word as a part_of_speech: the number of synsets listed in the word's index entry, or nil if the word is not in the index. The polyCount argument is unused. Note that polysemy can also be identified for a given word by counting the synsets returned by lookup_synsets.

    # File lib/wordnet/lexicon.rb, line 193
    def familiarity( word, part_of_speech, polyCount=nil )
        wordkey = self.make_word_key( word, part_of_speech )
        return nil unless @index_db.key?( wordkey )
        @index_db[ wordkey ].split( WordNet::SubDelimRe ).length
    end
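
The count described above is just the number of offsets in one index entry. A minimal sketch, assuming a plain Hash in place of the BerkeleyDB index table, "|" as the sub-delimiter, and made-up offsets (`FAMILIARITY_INDEX` and `familiarity_of` are hypothetical names, not part of the library):

```ruby
# Stand-in for WordNet::SubDelimRe; the "|" delimiter is an assumption.
FAMILIARITY_DELIM_RE = /\|/

# Hash standing in for the index table; keys are "word%pos" and the
# offsets are invented for illustration.
FAMILIARITY_INDEX = {
  "run%v" => "00000001|00000002|00000003",
  "jog%v" => "00000002",
}

# Count the synset offsets stored under a "word%pos" key, or return nil
# when the key is absent.
def familiarity_of(index, wordkey)
  return nil unless index.key?(wordkey)
  index[wordkey].split(FAMILIARITY_DELIM_RE).length
end
```

With this data, "run%v" lists three offsets and "jog%v" one, while an unknown key yields nil rather than 0.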
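Returns an array of the index keys (including compound words) that begin with text.

    # File lib/wordnet/lexicon.rb, line 269
    def grep( text )
        return [] if text.empty?

        words = []
        # Escape the text so that it is matched literally
        pattern = /^#{Regexp.escape( text )}/

        # Grab a cursor into the database and fetch while the key matches
        # the target text
        cursor = @index_db.cursor
        rec = cursor.set_range( text )
        # Stop at the end of the table (nil record) as well as on a mismatch
        while rec && pattern =~ rec[0]
            words.push rec[0]
            rec = cursor.next
        end
        cursor.close

        return words
    end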
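
Because the index is a BTREE, its keys are kept in sorted order, so a prefix search can jump to the first key at or after the text and then walk forward until the prefix stops matching. The sketch below imitates that with a sorted Array and `Array#bsearch_index` (Ruby 2.3 or later) in place of a cursor; `GREP_KEYS` and `prefix_scan` are hypothetical names and the keys are invented:

```ruby
# Sorted keys standing in for the BTREE index; entries are made up.
GREP_KEYS = ["hot%a", "hot_dog%n", "hot_rod%n", "hotel%n", "house%n"].sort

# Return every key that starts with +text+, scanning from the first key
# that sorts at or after +text+ (the role cursor.set_range plays above).
def prefix_scan(sorted_keys, text)
  return [] if text.empty?
  start = sorted_keys.bsearch_index { |key| key >= text }
  return [] if start.nil?            # text sorts after every key
  sorted_keys[start..-1].take_while { |key| key.start_with?(text) }
end
```

The `start.nil?` guard corresponds to running off the end of the table, which a cursor reports as a nil record.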
Look up synsets (WordNet::Synset objects) matching text as a part_of_speech, where part_of_speech is one of +WordNet::Noun+, +WordNet::Verb+, +WordNet::Adjective+, or +WordNet::Adverb+. Without sense, lookup_synsets will return all matches that are a part_of_speech. If sense (a 1-based sense number) is specified, only the synset object that matches that particular part_of_speech and sense is returned. Returns nil if no index entry is found for the word or its morphological base form.

    # File lib/wordnet/lexicon.rb, line 206
    def lookup_synsets( word, part_of_speech, sense=nil )
        wordkey = self.make_word_key( word, part_of_speech )
        pos = self.make_pos( part_of_speech )

        # Look up the index entry, trying first the word as given, and if
        # that fails, trying morphological conversion.
        entry = @index_db[ wordkey ]
        if entry.nil? && (word = self.morph( word, part_of_speech ))
            # Rebuild the key from the morphed word before retrying
            wordkey = self.make_word_key( word, part_of_speech )
            entry = @index_db[ wordkey ]
        end

        # If the lookup failed both ways, just abort
        return nil unless entry

        # Make synset keys from the entry, narrowing it to just the sense
        # requested if one was specified.
        synkeys = entry.split( SubDelimRe ).collect {|off| "#{off}%#{pos}" }
        if sense
            return lookup_synsets_by_key( synkeys[sense - 1] )
        else
            return [ lookup_synsets_by_key(*synkeys) ].flatten
        end
    end
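
The lookup order described above (the word as given, then its morphological base form, then optional narrowing to one sense) can be sketched without a database. This is an illustration only: the two Hashes stand in for the index and morph tables, "|" is an assumed sub-delimiter, the offsets are invented, and `synset_keys_for` is a hypothetical helper that returns synset keys rather than WordNet::Synset objects:

```ruby
# Stand-ins for the index and morph tables; contents are made up.
LOOKUP_INDEX = { "goose%n" => "01858313|10148165" }
LOOKUP_MORPH = { "geese%n" => "goose" }

# Return the synset keys for +word+ as part of speech +pos+, falling back
# to the morphological base form. With +sense+ (1-based), return one key.
def synset_keys_for(word, pos, sense = nil)
  wordkey = "#{word}%#{pos}"
  entry = LOOKUP_INDEX[wordkey]
  if entry.nil? && (base = LOOKUP_MORPH[wordkey])
    # The retry must use a key built from the base form, not the original.
    entry = LOOKUP_INDEX["#{base}%#{pos}"]
  end
  return nil unless entry

  keys = entry.split("|").map { |offset| "#{offset}%#{pos}" }
  sense ? keys[sense - 1] : keys
end
```

Here "geese" has no index entry of its own, so the lookup succeeds only through the morph table.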
Returns the WordNet::Synset objects corresponding to the keys specified. The keys are made up of the target synset's "offset" and syntactic category concatenated together with a '%' character. Raises WordNet::LookupError if any key is not present in the data table.

    # File lib/wordnet/lexicon.rb, line 235
    def lookup_synsets_by_key( *keys )
        synsets = []

        keys.each {|key|
            unless @data_db.key?( key )
                raise WordNet::LookupError,
                    "Failed lookup of synset '#{key}': No such synset"
            end

            data = @data_db[ key ]
            offset, part_of_speech = key.split( /%/, 2 )
            synsets << WordNet::Synset::new( self, offset, part_of_speech, nil, data )
        }

        return *synsets
    end
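
To make the key format concrete, the following sketch splits "offset%category" keys and fails loudly on an unknown key. It assumes a Hash in place of the data table and returns plain Hashes instead of WordNet::Synset objects; `SketchLookupError` and `fetch_records` are hypothetical names, and the record contents are invented:

```ruby
# Hypothetical error class standing in for WordNet::LookupError.
class SketchLookupError < StandardError; end

# Fetch the record for each "offset%pos" key from +data+ (a Hash standing
# in for the data table), raising if any key is missing.
def fetch_records(data, *keys)
  keys.map do |key|
    unless data.key?(key)
      raise SketchLookupError, "Failed lookup of synset '#{key}': No such synset"
    end
    # Limit the split to 2 fields so only the first '%' separates them.
    offset, pos = key.split(/%/, 2)
    { :offset => offset, :pos => pos, :data => data[key] }
  end
end
```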
Returns a form of word as a part of speech part_of_speech, as found in the WordNet morph files. The lookup_synsets method performs morphological conversion automatically, so a call to morph is not required.

    # File lib/wordnet/lexicon.rb, line 256
    def morph( word, part_of_speech )
        return @morph_db[ self.make_word_key(word, part_of_speech) ]
    end
Returns true if the lexicon was opened in read-only mode.
    # File lib/wordnet/lexicon.rb, line 149
    def readonly?
        ( @mode & 0200 ).nonzero? ? false : true
    end
Returns true if the lexicon was opened in read-write mode.
    # File lib/wordnet/lexicon.rb, line 155
    def readwrite?
        ! self.readonly?
    end
Remove the specified synset (a WordNet::Synset object) from the lexicon. Returns true once the synset has been removed, or nil if the synset was never stored in the database.

    # File lib/wordnet/lexicon.rb, line 338
    def remove_synset( synset )
        # If it's not in the database (ie., doesn't have a real offset),
        # just return.
        return nil if synset.offset == 1

        # The part of speech is the portion of the synset's key after the '%'
        pos = synset.key.split( /%/, 2 ).last

        # Start a transaction on the data table
        @env.begin( BDB::TXN_COMMIT, @data_db ) do |txn,datadb|

            # First remove the index entries for this synset by iterating
            # over each of its words
            txn.begin( BDB::TXN_COMMIT, @index_db ) do |txn,indexdb|
                synset.words.collect {|word| word + "%" + pos }.each {|word|

                    # If the index contains an entry for this word, either
                    # splice out the offset for the synset being deleted if
                    # there are more than one, or just delete the whole
                    # entry if it's the only one.
                    if indexdb.key?( word )
                        offsets = indexdb[ word ].
                            split( SubDelimRe ).
                            reject {|offset| offset == synset.offset.to_s}

                        unless offsets.empty?
                            indexdb[ word ] = offsets.join( SubDelim )
                        else
                            indexdb.delete( word )
                        end
                    end
                }
            end

            # :TODO: Delete synset from pointers of related synsets

            # Delete the synset from the main db, where it is stored
            # under its key
            datadb.delete( synset.key )
        end

        return true
    end
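
The index clean-up step is the subtle part of removal: an entry shared with other synsets loses one offset, while an entry that only listed this synset is deleted. A minimal sketch of that step alone, assuming a Hash in place of the index table and "|" as the sub-delimiter (`unindex_offset` is a hypothetical helper and the offsets are invented):

```ruby
# Remove +offset+ from the entry for each key in +wordkeys+, deleting the
# entry outright when no offsets remain. Keys not in the index are skipped.
def unindex_offset(index, wordkeys, offset)
  wordkeys.each do |key|
    next unless index.key?(key)
    remaining = index[key].split("|").reject { |other| other == offset }
    if remaining.empty?
      index.delete(key)
    else
      index[key] = remaining.join("|")
    end
  end
  index
end
```

For example, removing offset "002" from `{ "dog%n" => "001|002", "hound%n" => "002" }` leaves only `{ "dog%n" => "001" }`.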
Store the specified synset (a WordNet::Synset object) in the lexicon. Returns the offset of the stored synset.

    # File lib/wordnet/lexicon.rb, line 298
    def store_synset( synset )
        # Start a transaction
        @env.begin( BDB::TXN_COMMIT, @data_db ) do |txn,datadb|

            # If this is a new synset, generate an offset for it
            if synset.offset == 1
                synset.offset =
                    (datadb['offsetcount'] = datadb['offsetcount'].to_i + 1)
            end

            # Write the data entry
            datadb[ synset.key ] = synset.serialize

            # The part of speech is the portion of the synset's key after the '%'
            pos = synset.key.split( /%/, 2 ).last
            offset = synset.offset.to_s

            # Write the index entries
            txn.begin( BDB::TXN_COMMIT, @index_db ) do |txn,indexdb|

                # Make word/part-of-speech pairs from the words in the synset
                synset.words.collect {|word| word + "%" + pos }.each {|word|

                    # If the index already has this word, but not this
                    # synset, add it. The entry is assigned back explicitly
                    # so that the change is written to the table.
                    if indexdb.key?( word )
                        offsets = indexdb[ word ].split( SubDelimRe )
                        unless offsets.include?( offset )
                            indexdb[ word ] = (offsets << offset).join( SubDelim )
                        end
                    else
                        indexdb[ word ] = offset
                    end
                }
            end # transaction on @index_db
        end # transaction on @data_db

        return synset.offset
    end
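
Storing is the mirror image of removal on the index side: each of the synset's words gains the offset unless its entry already lists it. A sketch of that step only, under the same assumptions as before (a Hash in place of the index table, "|" as the sub-delimiter, invented offsets, and a hypothetical helper name `index_offset`):

```ruby
# Add +offset+ to the entry for each key in +wordkeys+, creating the entry
# if needed and leaving it unchanged if the offset is already listed.
def index_offset(index, wordkeys, offset)
  wordkeys.each do |key|
    offsets = index.key?(key) ? index[key].split("|") : []
    offsets << offset unless offsets.include?(offset)
    # Assign the joined value back; mutating a fetched string would not
    # update a real database table.
    index[key] = offsets.join("|")
  end
  index
end
```

Calling it twice with the same arguments leaves the index unchanged the second time, so a synset can be stored again without duplicating its offset.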
Normalize various ways of specifying a part of speech into the WordNet part of speech indicator from the original representation, which may be the name (e.g., "noun"); nil, in which case it defaults to the indicator for a noun; or the indicator character itself, in which case it is returned unmodified.
    # File lib/wordnet/lexicon.rb, line 388
    def make_pos( original )
        return WordNet::Noun if original.nil?
        osym = original.to_s.intern
        return WordNet::SyntacticCategories[ osym ] if
            WordNet::SyntacticCategories.key?( osym )
        return original if SyntacticSymbols.key?( original )
        return nil
    end
Make a lexicon key out of the given word and part of speech (pos).
    # File lib/wordnet/lexicon.rb, line 400
    def make_word_key( word, pos )
        pos = self.make_pos( pos )
        word = word.gsub( /\s+/, '_' )
        return "#{word}%#{pos}"
    end
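
The two helpers above combine to turn a word and a loosely specified part of speech into an index key. The sketch below reproduces that logic in isolation. The mapping from category names to one-letter indicators is an assumption based on WordNet's usual conventions, not taken from this file, and `SKETCH_POS`, `sketch_make_pos`, and `sketch_word_key` are hypothetical names:

```ruby
# Assumed mapping from category names to one-letter WordNet indicators.
SKETCH_POS = { :noun => "n", :verb => "v", :adjective => "a", :adverb => "r" }

# Normalize a name ("noun", :verb), nil (defaults to noun), or an
# indicator character ("v") into an indicator; nil if unrecognized.
def sketch_make_pos(original)
  return SKETCH_POS[:noun] if original.nil?
  sym = original.to_s.to_sym
  return SKETCH_POS[sym] if SKETCH_POS.key?(sym)
  return original if SKETCH_POS.value?(original)
  nil
end

# Build a "word%pos" key, replacing runs of whitespace with underscores.
def sketch_word_key(word, pos)
  "#{word.gsub(/\s+/, '_')}%#{sketch_make_pos(pos)}"
end
```

Under these assumptions, "hot dog" as a noun becomes the key "hot_dog%n", whether the part of speech is given as :noun, "noun", "n", or nil.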
Output the given msg to STDERR if $DEBUG is turned on.

    # File lib/wordnet/lexicon.rb, line 427
    def debug_msg( *msg )
        return unless $DEBUG
        # $stderr replaces the obsolete $deferr global
        $stderr.puts msg
    end
Turn the given origmode into an octal file mode such as that given to File.open.
    # File lib/wordnet/lexicon.rb, line 413
    def normalize_mode( origmode )
        case origmode
        when :readonly
            0444 & ~File.umask
        when :readwrite, :writable
            0666 & ~File.umask
        # Integer also matches on Rubies where Fixnum no longer exists
        when Integer
            origmode
        else
            raise ArgumentError, "unrecognized mode %p" % [origmode]
        end
    end
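
Mode normalization and the readonly? test interact through a single bit: a lexicon counts as read-only when the owner-write bit (0200) of the normalized mode is clear. The sketch below shows both steps together. It takes the umask as an explicit argument so the result is predictable (the method above reads File.umask instead), and `sketch_normalize_mode` and `sketch_readonly?` are hypothetical names:

```ruby
# Turn a symbolic or numeric mode into a numeric file mode, masking the
# symbolic forms with +umask+ (passed in here; an assumption of the sketch).
def sketch_normalize_mode(origmode, umask = 0022)
  case origmode
  when :readonly             then 0444 & ~umask
  when :readwrite, :writable then 0666 & ~umask
  when Integer               then origmode
  else raise ArgumentError, "unrecognized mode %p" % [origmode]
  end
end

# A mode is treated as read-only when the owner-write bit is clear.
def sketch_readonly?(mode)
  (mode & 0200).zero?
end
```

One consequence worth noting: with a umask that masks the owner-write bit (for example 0222), :readwrite normalizes to 0444 and is then treated as read-only.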