Chris and Janet's website
---+ Plucene (MAX_FIELD_LENGTH)

I had a problem where large PDF files (7-8 MB) were not being fully indexed. I was indexing PDF files attached to my TWiki site using the !PluceneSearch plugin. It turns out that the index writer has a limit of 10,000 words per field.

The fix was to change the following line in =/usr/share/perl5/Plucene/Index/Writer.pm=:

=use constant MAX_FIELD_LENGTH => 10_000;=

to:

=use constant MAX_FIELD_LENGTH => 50_000;=

Be warned, this slows down indexing considerably. I'm prepared to live with that because I want the index to be comprehensive. According to the documentation for Lucene (the Java version), raising this number can cause out-of-memory errors. I haven't experienced this problem.

Hopefully the Plucene developers will provide a cleaner way to configure this parameter in the future, along the same lines as the Java Lucene code, which uses a runtime property (set via =-D=).

As an aside, I found an [[http://maya.cs.depaul.edu/~classes/ds575/porter.html][online]] version of the [[http://www.tartarus.org/~martin/PorterStemmer/][Porter Stemming]] algorithm. I'm not sure whether the standard analyser in Plucene uses this algorithm, but it's interesting to enter words with apostrophes into the online version. It certainly shows the importance of using the same analyser for both index creation and query parsing.

---

Search and replace in multiple files:

=perl -e 's/gopher/World Wide Web/gi' -p -i.bak *.html=

[[http://www.cclabs.missouri.edu/things/instruction/perl/perlcourse.html#taste][link]]