Real life problems solved


Install Solr and Nutch on Fedora 14

NOTE: I have created a better version of this guide using Sun’s Java, with some other improvements. Find it here: http://www.tech-problems.com/install-solr-and-nutch-on-fedora-14-using-sun-java/

This post details how to install Solr and Nutch on Fedora using the Rackspace Cloud; the instructions for CentOS should be identical. If you have suggestions for improving it, please comment below and I’ll update the post.

Install Java and Tomcat 6

yum install java-1.6.0-openjdk-devel
yum install tomcat6 tomcat6-admin-webapps tomcat6-webapps
service tomcat6 start
chkconfig tomcat6 on

/sbin/iptables -I INPUT 1 -p tcp --dport 8080 -j ACCEPT
/sbin/iptables -I INPUT 1 -p tcp --dport 8983 -j ACCEPT
/sbin/service iptables save
service iptables restart

Set JAVA_HOME

vi /etc/profile.d/java.sh

Add the following lines and save:

export JAVA_HOME="/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre"
export JAVA_PATH="$JAVA_HOME"
export PATH="$PATH:$JAVA_HOME/bin"
chmod +x /etc/profile.d/java.sh
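
As the first comment at the end of this post points out, javac lives in a bin directory one level above jre, so a hard-coded path can end up pointing at the wrong place. Below is a minimal sketch of an alternative java.sh that derives JAVA_HOME from wherever javac actually resolves to. It assumes javac is already on the PATH, and the helper name jdk_home_from is made up for this example.

```shell
#!/bin/bash
# Sketch for /etc/profile.d/java.sh: derive JAVA_HOME instead of hard-coding it.
# jdk_home_from is a hypothetical helper name, not part of any package.

# Strip the trailing /bin/javac from a resolved javac path to get the JDK root.
jdk_home_from() {
    printf '%s\n' "${1%/bin/javac}"
}

# Only set the variables when javac can actually be found on the PATH.
if command -v javac >/dev/null 2>&1; then
    # readlink -f follows any symlinks (e.g. /etc/alternatives) to the real binary.
    JAVA_HOME=$(jdk_home_from "$(readlink -f "$(command -v javac)")")
    export JAVA_HOME
    export PATH="$PATH:$JAVA_HOME/bin"
fi
```

After logging in again, echo $JAVA_HOME should print the JDK directory rather than its jre subdirectory.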

You should now be able to see the Tomcat welcome page at http://xxx.xxx.xxx.xxx:8080, where xxx.xxx.xxx.xxx is your server IP.

Next, find the URL for the latest Solr release here: http://www.apache.org/dyn/closer.cgi/lucene/solr/

Use it to download the release (the commands below use Solr 1.4.1 and Nutch 1.2):

mkdir /var/solr
mkdir /var/nutch
cd /var/solr
wget http://apache.mirror.rbftpnetworks.com/lucene/solr/1.4.1/apache-solr-1.4.1.zip
unzip -q apache-solr-1.4.1.zip
cd /var/nutch
wget http://mirrors.dedipower.com/ftp.apache.org//nutch/apache-nutch-1.2-bin.zip
unzip -q apache-nutch-1.2-bin.zip
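
Since the mirror and version will change over time, it may help to keep them in variables rather than editing the wget line by hand. The sketch below only builds and prints the Solr download URL, following the path layout of the URL used above. The helper name solr_zip_url and the two variables are assumptions for this example, and other mirrors may lay out their paths differently.

```shell
# Hypothetical helper: build the Solr download URL for a given mirror and version.
solr_zip_url() {
    mirror=${1%/}      # drop one trailing slash so the URL has no "//"
    version=$2
    printf '%s/lucene/solr/%s/apache-solr-%s.zip\n' "$mirror" "$version" "$version"
}

# Example values taken from this post; swap in your own mirror and version.
SOLR_MIRROR="http://apache.mirror.rbftpnetworks.com"
SOLR_VERSION="1.4.1"
echo "Would fetch: $(solr_zip_url "$SOLR_MIRROR" "$SOLR_VERSION")"

# To download for real:
#   cd /var/solr && wget "$(solr_zip_url "$SOLR_MIRROR" "$SOLR_VERSION")"
```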

Copy the provided Nutch schema from directory nutch-1.2/conf to directory apache-solr-1.4.1/example/solr/conf (overwriting the existing file)

cp /var/nutch/nutch-1.2/conf/schema.xml /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml

Change schema.xml so that the stored attribute of field “content” is true.

vi /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
<field name="content" type="text" stored="true" indexed="true" />
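
If you would rather not make this change by hand in vi, the same edit can be sketched with sed. This assumes GNU sed (as shipped with Fedora), that the content field sits on a single line, and that it currently carries stored="false". The function name set_content_stored is made up for this example.

```shell
# Hypothetical helper: set stored="true" on the "content" field of a schema.xml.
# Only lines containing <field name="content" are touched; other fields are left alone.
set_content_stored() {
    sed -i '/<field name="content"/ s/stored="false"/stored="true"/' "$1"
}

# Usage (take a backup first):
#   cp /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml /root/schema.xml.bak
#   set_content_stored /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
```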

*TODO*
Edit DisMax – http://wiki.apache.org/solr/DisMaxQParserPlugin

Option 1 – Use Solr with Tomcat

mkdir /var/lib/tomcat6/webapps/solr
cp /var/solr/apache-solr-1.4.1/dist/apache-solr-1.4.1.war /var/lib/tomcat6/webapps/solr
cd /var/lib/tomcat6/webapps/solr/
jar xvf apache-solr-1.4.1.war

Add a config file

vi /etc/tomcat6/Catalina/localhost/solr.xml

Add the following:


<Context docBase="/usr/share/tomcat6/webapps/solr/apache-solr-1.4.1.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="/var/solr/apache-solr-1.4.1/example/solr" override="true" />
</Context>

Restart Tomcat

service tomcat6 restart

You should be able to see Solr at (substitute your server IP):

http://xxx.xxx.xxx.xxx:8080/solr/
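
Tomcat can take a little while to deploy the webapp after a restart, so the page may not respond straight away. A small polling helper, sketched below, can confirm from the server itself that the URL answers. It assumes curl is installed, and the name wait_for_http is made up for this example.

```shell
# Hypothetical helper: poll a URL until it returns a successful HTTP response.
# $1 = URL, $2 = number of attempts (default 10). Returns 0 on success, 1 otherwise.
wait_for_http() {
    url=$1
    tries=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -f makes curl fail on HTTP errors; -s keeps it quiet; body is discarded.
        if curl -s -f -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        # Pause between attempts, but not after the last one.
        if [ "$i" -lt "$tries" ]; then
            sleep 2
        fi
    done
    return 1
}

# Usage:
#   wait_for_http http://127.0.0.1:8080/solr/ 15 && echo "Solr is up"
```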

Option 2 – Set Solr to start on startup using Jetty

mkdir /var/scripts
cd /var/scripts
vi start.sh

Copy the following into the file, then save and quit with ESC :x

#!/bin/bash
cd /var/solr/apache-solr-1.4.1/example
/usr/bin/java -jar start.jar

Make the script executable

chmod +x start.sh

Test the script to make sure it works; quit by pressing Ctrl+C

./start.sh

You should be able to see Solr at (substitute your server IP):

http://xxx.xxx.xxx.xxx:8983/solr/

If it does, set your script to run at startup:

vi /etc/rc.local

Add the following to the bottom and save (the trailing & runs Solr in the background so it does not hold up the rest of the boot)

/var/scripts/start.sh &

Test everything starts up OK by rebooting the server:

reboot

Set up Nutch with Tomcat

cd /var/nutch/nutch-1.2
mkdir /var/lib/tomcat6/webapps/nutch
cp nutch*.war /var/lib/tomcat6/webapps/nutch
cd /var/lib/tomcat6/webapps/nutch/
jar xvf nutch-1.2.war
service tomcat6 restart

You can now access Nutch at: http://xxx.xxx.xxx.xxx:8080/nutch/

Add site to Nutch

cd /var/nutch/nutch-1.2
mkdir seed
vi seed/url.txt

Add a list of root URLs to crawl, one per line

Limit Nutch to certain domains (optional)

If you want Nutch to crawl only certain domains, edit crawl-urlfilter.txt

vi /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt
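
As far as I recall, the stock crawl-urlfilter.txt contains a placeholder line of the form +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/ which you replace with your own domain. Treat that as an assumption and check your copy of the file. If the placeholder is there, the edit can be sketched with sed as below. The helper name limit_to_domain and the domain example.com are made up for this example.

```shell
# Hypothetical helper: swap the MY.DOMAIN.NAME placeholder for a real domain.
# $1 = path to crawl-urlfilter.txt, $2 = domain to allow (e.g. example.com).
# Assumes GNU sed and that the placeholder line is still present in the file.
limit_to_domain() {
    file=$1
    domain=$2
    sed -i "s/MY\.DOMAIN\.NAME/$domain/" "$file"
}

# Usage:
#   limit_to_domain /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt example.com
```

The dots in the inserted domain are not escaped, so within the filter's regular expression they match any character. That is slightly looser than a literal match but usually harmless.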

Set Nutch Crawler Name

vi /var/nutch/nutch-1.2/conf/nutch-site.xml

Add the following between the <configuration> and </configuration> tags:

<property>
<name>http.agent.name</name>
<value>YourName</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.
</description>
</property>

Start Nutch Crawl – Method 1

Method 1 is a quick way to get a crawl going. Run the command below from /var/nutch/nutch-1.2:

bin/nutch crawl seed/url.txt -dir crawl -depth 3 -topN 50

Pass Nutch data to Solr

Run this from /var/nutch/nutch-1.2 as well. If you chose Option 2 (Jetty), use port 8983 instead of 8080.

bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

Start Nutch Crawl – Method 2 (Preferred)

Method 2 breaks down the crawl process into separate parts, allowing for greater control – *TODO*

mkdir -p /var/scripts
cd /var/scripts
vi crawl.sh

Use the following code:

#!/bin/bash
cd /var/nutch/nutch-1.2

bin/nutch inject crawl/crawldb seed/url.txt
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1 

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2

bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3 

bin/nutch invertlinks crawl/linkdb -dir crawl/segments 

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/* 

bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
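
The three generate/fetch/updatedb rounds in the script above repeat the same steps, so they can also be written as a loop. The sketch below is one way to do that, not a tested replacement. The function names latest_segment and crawl_rounds are made up for this example, and unlike the script above it applies -topN to every round, including the first.

```shell
#!/bin/bash
# Sketch: the generate/fetch/updatedb rounds from crawl.sh expressed as a loop.
# latest_segment and crawl_rounds are hypothetical names for this example.

# Print the newest segment directory. Segment names are timestamps, so the
# last entry in sorted order is the most recent one.
latest_segment() {
    ls -d "$1"/2* | tail -1
}

# Run $1 rounds with -topN $2 against the crawl/ layout used above,
# stopping at the first command that fails.
crawl_rounds() {
    rounds=$1
    topn=$2
    i=0
    while [ "$i" -lt "$rounds" ]; do
        bin/nutch generate crawl/crawldb crawl/segments -topN "$topn" || return 1
        seg=$(latest_segment crawl/segments)
        echo "Round $((i + 1)): $seg"
        bin/nutch fetch "$seg" || return 1
        bin/nutch updatedb crawl/crawldb "$seg" || return 1
        i=$((i + 1))
    done
}

# Usage, from /var/nutch/nutch-1.2 after the inject step:
#   crawl_rounds 3 1000
```

Stopping on the first failure avoids running invertlinks and solrindex against a half-finished crawl.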

You can now go to your Solr admin page and try out a search. Hopefully you will get results back!
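
The same check can be made from the command line by querying Solr's /select handler. This is a minimal sketch assuming curl is installed. The helper name solr_search and the sample query are made up for this example.

```shell
# Hypothetical helper: run a query against Solr's /select handler and print the response.
# $1 = Solr base URL (e.g. http://127.0.0.1:8080/solr), $2 = query string.
solr_search() {
    base=${1%/}
    query=$2
    # -G sends the parameters as a GET query string;
    # --data-urlencode escapes special characters in the query.
    curl -s -f --max-time 10 -G "$base/select" \
        --data-urlencode "q=$query" \
        --data "rows=5"
}

# Usage (use port 8983 if Solr runs under Jetty):
#   solr_search http://127.0.0.1:8080/solr 'content:nutch'
```

A numFound value greater than zero in the response means the crawl data reached the index.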

  • Jerry Gallagher

    If you install Tomcat from scratch, the JAVA_HOME variable in this post will not work: javac is in the bin directory one level higher.

  • Tech Problems

    Thanks for the comment. I think I am going to update the advice to use Sun Java instead; it seems to be a little more stable than the open one.
