Real life problems solved

Install Solr and Nutch on Fedora 14 using Sun Java

This is a follow-up to an earlier post in which I installed Solr and Nutch on a Fedora 14 box using Fedora's OpenJDK. I have since found that it works better with Sun's JDK, so this post details the steps required to set that up.

Install Sun Java

yum install java-1.6.0-openjdk-devel
cd /var/tmp

Get a download link for the JDK RPM installer from http://java.sun.com/javase/downloads/index.jsp

wget http://www.link-from-earlier
chmod +x jdk-6u23-*-rpm.bin
./jdk-6u23-*-rpm.bin
alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_23/bin/java 20000
alternatives --config java
vi /etc/profile.d/java.sh
export JAVA_HOME="/usr/java/jdk1.6.0_23"
export JAVA_PATH="$JAVA_HOME"
export PATH="$PATH:$JAVA_HOME/bin"
chmod +x /etc/profile.d/java.sh
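
If you would rather not edit the profile script by hand, the same three lines can be written from the shell. This is only a minimal sketch: write_java_profile is a helper name made up for illustration, and the JDK path is the one used above, so adjust it if your install differs.

```shell
# Write a profile script that exports JAVA_HOME for a given JDK directory.
# write_java_profile is a hypothetical helper, not part of any package.
write_java_profile() {
    jdk_dir=$1    # e.g. /usr/java/jdk1.6.0_23
    out_file=$2   # e.g. /etc/profile.d/java.sh
    # Unquoted heredoc: $jdk_dir is expanded now, \$ stays literal in the file.
    cat > "$out_file" <<EOF
export JAVA_HOME="$jdk_dir"
export JAVA_PATH="\$JAVA_HOME"
export PATH="\$PATH:\$JAVA_HOME/bin"
EOF
}
```

Run as root it would be called as: write_java_profile /usr/java/jdk1.6.0_23 /etc/profile.d/java.sh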

Reboot the system

reboot

Install Tomcat 6

yum install tomcat6 tomcat6-admin-webapps tomcat6-webapps
service tomcat6 start
chkconfig --add tomcat6
chkconfig tomcat6 on

Open Firewall

/sbin/iptables -I INPUT 1 -p tcp --dport 8080 -j ACCEPT
/sbin/service iptables save
service iptables restart
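
To confirm the rule took effect, you can scan the output of iptables -L INPUT -n for an ACCEPT line on the port. The sketch below assumes the usual listing format (a line ending in "tcp dpt:8080"); has_accept_rule is a made-up helper name.

```shell
# Succeed if the `iptables -L INPUT -n` listing read from stdin contains
# an ACCEPT rule for the given TCP destination port.
# has_accept_rule is a hypothetical helper; the listing format is assumed.
has_accept_rule() {
    port=$1
    # The trailing group stops port 80 from matching dpt:8080.
    grep -Eq "^ACCEPT[[:space:]]+tcp[[:space:]].*dpt:${port}([^0-9]|\$)"
}
```

Usage, as root: /sbin/iptables -L INPUT -n | has_accept_rule 8080 && echo "port 8080 is open"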

You should now be able to see the Tomcat welcome page at http://xxx.xxx.xxx.xxx:8080, where xxx.xxx.xxx.xxx is your server's IP address.
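
Tomcat can take a few seconds to come up, so a scripted check is sometimes handier than reloading the browser. A rough sketch, assuming curl is installed; tomcat_url and wait_for_http are names invented here.

```shell
# Build the URL of a Tomcat web application (path defaults to "/").
tomcat_url() {
    host=$1; port=$2; path=${3:-/}
    printf 'http://%s:%s%s\n' "$host" "$port" "$path"
}

# Poll a URL once a second until it answers HTTP 200 or attempts run out.
# Assumes curl is available; both function names are hypothetical.
wait_for_http() {
    url=$1; attempts=${2:-30}
    while [ "$attempts" -gt 0 ]; do
        # curl prints 000 when it cannot connect; never abort on its failure.
        code=$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)
        [ "$code" = "200" ] && return 0
        attempts=$((attempts - 1))
        sleep 1
    done
    return 1
}
```

For example: wait_for_http "$(tomcat_url localhost 8080)" && echo "Tomcat is up"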

Next, find the URL for the latest Solr release here: http://www.apache.org/dyn/closer.cgi/lucene/solr/

Use this to download the latest version:

mkdir /var/solr
mkdir /var/nutch
wget http://mirror.ox.ac.uk/sites/rsync.apache.org//lucene/solr/1.4.1/apache-solr-1.4.1.zip
unzip -q apache-solr-1.4.1.zip
cp -r apache-solr-1.4.1 /var/solr
wget http://apache.mirror.rbftpnetworks.com/nutch/apache-nutch-1.2-bin.zip
unzip -q apache-nutch-1.2-bin.zip
cp -r nutch-1.2 /var/nutch

Copy the Nutch schema provided in nutch-1.2/conf to apache-solr-1.4.1/example/solr/conf, overwriting the existing file:

cp /var/nutch/nutch-1.2/conf/schema.xml /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml

Change schema.xml so that the stored attribute of field “content” is true.

vi /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
<field name="content" type="text" stored="true" indexed="true" />
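
The same change can be made without opening an editor. This sketch assumes GNU sed (for -i), that the field is declared on a single line, and that it currently reads stored="false"; enable_content_storage is a made-up name.

```shell
# Flip stored="false" to stored="true" on the "content" field only.
# Assumes GNU sed and a one-line <field .../> declaration.
# enable_content_storage is a hypothetical helper.
enable_content_storage() {
    schema=$1
    # The address restricts the substitution to the content field's line.
    sed -i '/<field name="content"/ s/stored="false"/stored="true"/' "$schema"
}
```

Usage: enable_content_storage /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml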

Edit config to work with Nutch

vi /var/solr/apache-solr-1.4.1/example/solr/conf/solrconfig.xml
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf"> content^0.5 anchor^1.0 title^1.2 </str>
    <str name="pf"> content^0.5 anchor^1.5 title^1.2 site^1.5 </str>
    <str name="fl"> url </str>
    <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

Set Nutch Home

vi /etc/profile.d/nutch.sh
export NUTCH_HOME="/var/nutch/nutch-1.2"
chmod +x /etc/profile.d/nutch.sh
. /etc/profile.d/nutch.sh

Option 1 – Use Solr with Tomcat

mkdir /var/lib/tomcat6/webapps/solr
cp /var/solr/apache-solr-1.4.1/dist/apache-solr-1.4.1.war /var/lib/tomcat6/webapps/solr
cd /var/lib/tomcat6/webapps/solr/
jar xvf apache-solr-1.4.1.war

Add config File

vi /etc/tomcat6/Catalina/localhost/solr.xml

Add the following:

<Context docBase="/usr/share/tomcat6/webapps/solr/apache-solr-1.4.1.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/var/solr/apache-solr-1.4.1/example/solr" override="true" />
</Context>

Restart Tomcat

service tomcat6 restart

You should be able to see Solr at (substitute your server IP):

http://xxx.xxx.xxx.xxx:8080/solr/

Test everything starts up OK by rebooting the server:

reboot

Set up Nutch with Tomcat

cd /var/nutch/nutch-1.2
mkdir /var/lib/tomcat6/webapps/nutch
cp nutch*.war /var/lib/tomcat6/webapps/nutch
cd /var/lib/tomcat6/webapps/nutch/
jar xvf nutch-1.2.war
service tomcat6 restart

You can now access Nutch at: http://xxx.xxx.xxx.xxx:8080/nutch/

Add site to Nutch

cd /var/nutch/nutch-1.2
mkdir seed
vi seed/url.txt

Add a list of root URLs to crawl, one per line.
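
The seed file is just plain text with one URL per line, so it can also be generated from the command line. A small sketch; write_seed_file is a hypothetical helper and the example.com and example.org URLs are placeholders, not sites from this setup.

```shell
# Write each URL argument on its own line of the seed file,
# creating the parent directory if needed.
# write_seed_file is a hypothetical helper name.
write_seed_file() {
    out=$1; shift
    mkdir -p "$(dirname "$out")"
    # printf repeats its format once per remaining argument.
    printf '%s\n' "$@" > "$out"
}
```

Usage from /var/nutch/nutch-1.2: write_seed_file seed/url.txt http://www.example.com/ http://www.example.org/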

Limit Nutch to certain domains (optional)

If you want Nutch to crawl only certain domains, edit crawl-urlfilter.txt:

vi /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt
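
Each rule in this file is a regular expression prefixed with + (accept) or - (reject). The stock file typically ships with a placeholder accept line for MY.DOMAIN.NAME that you replace with your own domain. The sketch below prints such a line for a given domain; urlfilter_line is a name invented here, and example.com is only a placeholder.

```shell
# Print an "accept" rule matching a domain and any of its subdomains,
# in the style of the placeholder line in crawl-urlfilter.txt.
# urlfilter_line is a hypothetical helper name.
urlfilter_line() {
    # Escape dots so they match literally in the regular expression.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    printf '+^http://([a-z0-9]*\\.)*%s/\n' "$escaped"
}
```

Running urlfilter_line example.com prints a line you can paste into the file in place of the MY.DOMAIN.NAME one.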

Set Nutch Crawler Options

vi /var/nutch/nutch-1.2/conf/nutch-site.xml

Add the following inside the <configuration> element:

<property>
  <name>http.agent.name</name>
  <value>YourName</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.</description>
</property>

Set Nutch Config Options

vi /var/lib/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml

Add the following inside the <configuration> element:

<property>
  <name>searcher.dir</name>
  <value>/var/nutch/nutch-1.2/crawl</value>
</property>

Start Nutch Crawl

cd /var/nutch/nutch-1.2
bin/nutch crawl seed/url.txt -dir crawl -depth 3 -topN 50
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
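
Once the index step finishes, a quick search through the /nutch request handler defined earlier shows whether documents made it into Solr. This is a sketch only, assuming curl is installed and Solr is served at the same base URL passed to solrindex; nutch_handler_url and search_nutch are made-up names.

```shell
# Build the URL of the /nutch request handler from the Solr base URL.
# Both function names here are hypothetical.
nutch_handler_url() {
    base=${1%/}    # drop one trailing slash, if present
    printf '%s/nutch\n' "$base"
}

# Send a search term to the handler; curl URL-encodes the q parameter.
search_nutch() {
    curl -s -G --data-urlencode "q=$2" "$(nutch_handler_url "$1")"
}
```

For example: search_nutch http://127.0.0.1:8080/solr/ "some word from your site"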