
Chris and Janet's website


Plucene (MAX_FIELD_LENGTH)

I had a problem where large PDF files (7-8 MB) were not being fully indexed. For context, I was indexing PDF files attached to my TWiki site using the PluceneSearch plugin.

It turns out that Plucene's index writer stops after 10,000 words per field, so anything beyond that point in a long document never reaches the index. The fix I found was to modify the following line:

use constant MAX_FIELD_LENGTH => 10_000;

in /usr/share/perl5/Plucene/Index/Writer.pm, to:

use constant MAX_FIELD_LENGTH => 50_000;

Be warned: this slows down the indexing process considerably. I'm prepared to live with that because I want the index to be comprehensive. According to the documentation for Lucene (the Java version), raising this number can cause out-of-memory errors. I haven't experienced this problem.

Hopefully, the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the Java Lucene code, which uses a runtime property (set via -D).

As an aside, I found an online version of the Porter Stemming algorithm. I'm not sure whether the standard analyser in Plucene uses this algorithm, but it's interesting to enter words with apostrophes and see what comes out. It certainly shows the importance of using the same analyser for both index creation and query parsing.


Search and replace in multiple files:

perl -e 's/gopher/World Wide Web/gi' -p -i.bak *.html
