<<O>>  Difference Topic Perl (r1.4 - 12 Dec 2005 - ChrisJones)

META TOPICPARENT TechStuff

Plucene (MAX_FIELD_LENGTH)

Line: 20 to 20

As an aside, I found an online version of the Porter Stemming algorithm. I'm not sure whether the standard analyser in Plucene uses this algorithm, but it's interesting to enter words with apostrophes. It certainly shows the importance of using the same analyser for both index creation and query parsing.

Added:
>
>

Search and replace in multiple files:

perl -p -i.bak -e 's/gopher/World Wide Web/gi' *.html

link

 <<O>>  Difference Topic Perl (r1.3 - 12 Jul 2005 - ChrisJones)

META TOPICPARENT TechStuff

Plucene (MAX_FIELD_LENGTH)

Line: 7 to 7

It turns out that the Index writer has a limit of 10000 words per field. I found the answer was to modify the following line:
Changed:
<
<
use constant MAX_FIELD_LENGTH => 10_000;
>
>
use constant MAX_FIELD_LENGTH => 10_000;

in /usr/share/perl5/Plucene/Index/Writer.pm, to:

Changed:
<
<
use constant MAX_FIELD_LENGTH => 50_000;
>
>
use constant MAX_FIELD_LENGTH => 50_000;

Be warned: this slows down the indexing process considerably (I'm prepared to live with this as I want the index to be comprehensive). According to the Lucene (Java version) documentation, raising this number can cause out-of-memory errors. I haven't experienced this problem.

Changed:
<
<
Hopefully, the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the java Lucene code that uses a property (ala -D).
>
>
Hopefully, the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the Java Lucene code, which uses a runtime property (via -D).

As an aside, I found an online version of the Porter Stemming algorithm. I'm not sure whether the standard analyser in Plucene uses this algorithm, but it's interesting to enter words with apostrophes. It certainly shows the importance of using the same analyser for both index creation and query parsing.


 <<O>>  Difference Topic Perl (r1.2 - 12 Jul 2005 - ChrisJones)

META TOPICPARENT TechStuff

Plucene (MAX_FIELD_LENGTH)

Line: 14 to 14

use constant MAX_FIELD_LENGTH => 50_000;

Be warned, this slows down the indexing process considerably (I'm prepared to live with this as I want the index to be comprehensive).

Changed:
<
<
According to the Lucene (java version) documentation upping this number can cause out of memory errors. I didn't experience problem.
>
>
According to the Lucene (Java version) documentation, raising this number can cause out-of-memory errors. I haven't experienced this problem.

Hopefully, the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the java Lucene code that uses a property (ala -D).

 <<O>>  Difference Topic Perl (r1.1 - 09 Jul 2005 - ChrisJones)
Line: 1 to 1
Added:
>
>
META TOPICPARENT TechStuff

Plucene (MAX_FIELD_LENGTH)

I had a problem where large PDF files (7-8MB) were not being thoroughly indexed. FYI, I was trying to index PDF files attached to my TWiki site using the PluceneSearch plugin.

It turns out that the Index writer has a limit of 10000 words per field. I found the answer was to modify the following line:

use constant MAX_FIELD_LENGTH => 10_000;

in /usr/share/perl5/Plucene/Index/Writer.pm, to:

use constant MAX_FIELD_LENGTH => 50_000;

Be warned, this slows down the indexing process considerably (I'm prepared to live with this as I want the index to be comprehensive). According to the Lucene (java version) documentation upping this number can cause out of memory errors. I didn't experience problem.

Hopefully, the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the java Lucene code that uses a property (ala -D).

Revision r1.1 - 09 Jul 2005 - 03:20 - ChrisJones
Revision r1.4 - 12 Dec 2005 - 15:06 - ChrisJones