Install Solr and Nutch on Fedora 14
NOTE: I have created a better version of this guide using Sun’s Java and some other improvements. Find it here: http://www.tech-problems.com/install-solr-and-nutch-on-fedora-14-using-sun-java/
This post details how to install Solr and Nutch on Fedora 14 using the Rackspace Cloud. The instructions for CentOS should be identical. If you have any suggestions on how to improve this, please comment below and I’ll update the post.
Install Java and Tomcat 6
yum install java-1.6.0-openjdk-devel
yum install tomcat6 tomcat6-admin-webapps tomcat6-webapps
service tomcat6 start
chkconfig --add tomcat6
/sbin/iptables -I INPUT 1 -p tcp --dport 8080 -j ACCEPT
/sbin/iptables -I INPUT 1 -p tcp --dport 8983 -j ACCEPT
/sbin/service iptables save
service iptables restart
Set JAVA_HOME
vi /etc/profile.d/java.sh
Add the following:
export JAVA_HOME="/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/jre"
export JAVA_PATH="$JAVA_HOME"
export PATH="$PATH:$JAVA_HOME/bin"
Save the file, then make it executable:
chmod +x /etc/profile.d/java.sh
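The JAVA_HOME path above is hard-coded to one particular OpenJDK build, so it may differ on your server. As a rough sketch (not something this guide was tested with), the path can be derived from the resolved location of the java binary instead. The helper name java_home_from_binary is made up for this example.

```shell
# Sketch: derive JAVA_HOME from the fully resolved path of the java binary.
# java_home_from_binary is a hypothetical helper, not part of any package.
java_home_from_binary() {
    # Strip the trailing /bin/java, e.g. /path/to/jre/bin/java -> /path/to/jre
    dirname "$(dirname "$1")"
}

# Possible use inside /etc/profile.d/java.sh (assumes java is on the PATH):
#   export JAVA_HOME="$(java_home_from_binary "$(readlink -f "$(command -v java)")")"
```

Here readlink -f follows the symlinks that point at the real JVM directory, and the helper then drops the last two path components.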
You should now be able to see the Tomcat welcome page at http://xxx.xxx.xxx.xxx:8080 where xxx.xxx.xxx.xxx is your server IP
Next, find the URL for the latest Solr release here: http://www.apache.org/dyn/closer.cgi/lucene/solr/
Use it to download the latest version (the commands below use Solr 1.4.1 and Nutch 1.2):
mkdir /var/solr
mkdir /var/nutch
cd /var/solr
wget http://apache.mirror.rbftpnetworks.com/lucene/solr/1.4.1/apache-solr-1.4.1.zip
unzip -q apache-solr-1.4.1.zip
cd /var/nutch
wget http://mirrors.dedipower.com/ftp.apache.org//nutch/apache-nutch-1.2-bin.zip
unzip -q apache-nutch-1.2-bin.zip
Copy the provided Nutch schema from the directory nutch-1.2/conf to the directory apache-solr-1.4.1/example/solr/conf (overwriting the existing file):
cp /var/nutch/nutch-1.2/conf/schema.xml /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
Change schema.xml so that the stored attribute of field “content” is true.
vi /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
<field name="content" type="text" stored="true" indexed="true" />
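If you prefer not to edit the file by hand, the same change could be scripted. The sketch below assumes the copied schema declares the content field on a single line with stored="false"; check your copy first, since that is an assumption. The helper name set_content_stored is made up for this example.

```shell
# Sketch: flip stored="false" to stored="true" on the "content" field only.
# set_content_stored is a hypothetical helper name.
set_content_stored() {
    schema=$1
    # Keep a backup so the original can be restored if the edit goes wrong.
    cp "$schema" "$schema.bak" || return 1
    # Only lines that declare the "content" field are touched.
    sed '/<field name="content"/ s/stored="false"/stored="true"/' \
        "$schema.bak" > "$schema"
}

# Possible use:
#   set_content_stored /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
```

The original file is left next to it as schema.xml.bak.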
*TODO*
Edit DisMax – http://wiki.apache.org/solr/DisMaxQParserPlugin
Option 1 – Use Solr with Tomcat
mkdir /var/lib/tomcat6/webapps/solr
cp /var/solr/apache-solr-1.4.1/dist/apache-solr-1.4.1.war /var/lib/tomcat6/webapps/solr
cd /var/lib/tomcat6/webapps/solr/
jar xvf apache-solr-1.4.1.war
Add config file
vi /etc/tomcat6/Catalina/localhost/solr.xml
Add the following:
<Context docBase="/usr/share/tomcat6/webapps/solr/apache-solr-1.4.1.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
<Environment name="solr/home" type="java.lang.String" value="/var/solr/apache-solr-1.4.1/example/solr" override="true" />
</Context>
Restart Tomcat
service tomcat6 restart
You should be able to see Solr at (substitute your server IP):
http://xxx.xxx.xxx.xxx:8080/solr/
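To check from the command line rather than a browser, something along these lines may help. It is only a sketch: it assumes curl is installed, and solr_is_up is a made-up helper name.

```shell
# Sketch: succeed only if the given URL answers without an HTTP error status.
# solr_is_up is a hypothetical helper; it assumes curl is installed.
solr_is_up() {
    # -s: quiet, -f: treat HTTP errors (4xx/5xx) as failure,
    # -o /dev/null: discard the page body, --max-time: give up after 5 seconds
    curl -s -f -o /dev/null --max-time 5 "$1"
}

# Possible use on the server itself:
#   solr_is_up http://127.0.0.1:8080/solr/ && echo "Solr is responding"
```

If the check fails, the Tomcat logs are the first place to look.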
Option 2 – Set Solr to start on startup using Jetty
mkdir /var/scripts
cd /var/scripts
vi start.sh
Copy the following into the file, then save and exit (press Esc, then type :x):
#!/bin/bash
cd /var/solr/apache-solr-1.4.1/example
/usr/bin/java -jar start.jar
Give the script execute permission:
chmod +x start.sh
Test the script to make sure it works; quit by pressing Ctrl+C:
./start.sh
You should be able to see Solr at (substitute your server IP):
http://xxx.xxx.xxx.xxx:8983/solr/
If it does, set your script to run at startup:
vi /etc/rc.local
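Add the following to the bottom and save (the trailing & runs the script in the background so that rc.local does not block while Solr is running):
/var/scripts/start.sh &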
Test everything starts up OK by rebooting the server:
reboot
Set up Nutch with Tomcat
cd /var/nutch/nutch-1.2
mkdir /var/lib/tomcat6/webapps/nutch
cp nutch*.war /var/lib/tomcat6/webapps/nutch
cd /var/lib/tomcat6/webapps/nutch/
jar xvf nutch-1.2.war
service tomcat6 restart
You can now access Nutch at: http://xxx.xxx.xxx.xxx:8080/nutch/
Add site to Nutch
cd /var/nutch/nutch-1.2
mkdir seed
vi seed/url.txt
Add a list of root URLs to crawl, one per line.
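A typo in the seed list is easy to miss, so a quick sanity check before crawling can save a wasted run. The sketch below assumes a web-only crawl where every seed should start with http:// or https://. The helper name check_seeds is made up for this example.

```shell
# Sketch: report seed lines that are not http(s) URLs.
# check_seeds is a hypothetical helper; it assumes a web-only crawl.
check_seeds() {
    [ -r "$1" ] || return 2
    # Blank lines and http(s) URLs are accepted; anything else is collected.
    bad=$(grep -v -E '^(https?://[^[:space:]]+)?$' "$1")
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
}

# Possible use:
#   check_seeds seed/url.txt || echo "fix the lines listed above"
```

The function prints the offending lines and returns non-zero when it finds any.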
Limit Nutch to certain domains (optional)
If you want Nutch to crawl only a certain domain, edit crawl-urlfilter.txt:
vi /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt
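As far as I recall, the stock crawl-urlfilter.txt contains an accept rule with a MY.DOMAIN.NAME placeholder. Assuming your copy still has that placeholder, the edit could be scripted roughly as follows. The helper name restrict_to_domain and the domain example.com are made up for this example.

```shell
# Sketch: point the stock accept rule at a single domain.
# Assumes the file still contains the MY.DOMAIN.NAME placeholder;
# restrict_to_domain is a hypothetical helper name.
restrict_to_domain() {
    filter=$1
    domain=$2
    # Keep a backup so the original rules can be restored.
    cp "$filter" "$filter.bak" || return 1
    sed "s/MY\.DOMAIN\.NAME/$domain/" "$filter.bak" > "$filter"
}

# Possible use:
#   restrict_to_domain /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt example.com
```

Check the result by eye afterwards, since the rules in this file are regular expressions and the order of the lines matters.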
Set Nutch Crawler Name
vi /var/nutch/nutch-1.2/conf/nutch-site.xml
Add the following between the <configuration> and </configuration> tags:
<property>
<name>http.agent.name</name>
<value>YourName</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.
</description>
</property>
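Since the description says the value must not be empty, it can be worth confirming the value Nutch will see before starting a crawl. This is a rough sketch that assumes the name and value elements each sit on their own line, as in the snippet above. The helper name agent_name is made up for this example.

```shell
# Sketch: print the value configured for http.agent.name.
# agent_name is a hypothetical helper; it assumes the <name> and <value>
# elements are each written on a single line.
agent_name() {
    awk '
        /<name>http\.agent\.name<\/name>/ { found = 1 }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*/, "")    # drop the closing tag and anything after
            print
            exit
        }
    ' "$1"
}

# Possible use:
#   [ -n "$(agent_name conf/nutch-site.xml)" ] || echo "http.agent.name is not set"
```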
Start Nutch Crawl – Method 1
Method 1 is a quick way to get a crawl going. From /var/nutch/nutch-1.2, run the command below:
bin/nutch crawl seed/url.txt -dir crawl -depth 3 -topN 50
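Pass Nutch data to Solr (this assumes NUTCH_HOME is set to /var/nutch/nutch-1.2; if you chose Option 2, Solr listens on port 8983 rather than 8080)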
bin/nutch solrindex http://127.0.0.1:8080/solr/ $NUTCH_HOME/crawl/crawldb $NUTCH_HOME/crawl/linkdb $NUTCH_HOME/crawl/segments/*
Start Nutch Crawl – Method 2 (Preferred)
Method 2 breaks the crawl process down into separate parts, allowing for greater control – *TODO*
mkdir -p /var/scripts
cd /var/scripts
vi crawl.sh
Use the following code:
cd /var/nutch/nutch-1.2
bin/nutch inject crawl/crawldb seed/url.txt
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
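The three generate/fetch/updatedb rounds in the script are almost identical, so they could be folded into a loop. The following is only a sketch of that idea, not a tested replacement. The variables NUTCH, ROUNDS and TOPN and the function crawl_rounds are names made up for this example. Unlike the script above, it passes -topN on every round, including the first.

```shell
# Sketch: run the generate/fetch/updatedb cycle in a loop.
# NUTCH, ROUNDS, TOPN and crawl_rounds are hypothetical names for this example.
NUTCH=${NUTCH:-bin/nutch}
ROUNDS=${ROUNDS:-3}
TOPN=${TOPN:-1000}

crawl_rounds() {
    i=1
    while [ "$i" -le "$ROUNDS" ]; do
        "$NUTCH" generate crawl/crawldb crawl/segments -topN "$TOPN" || return 1
        # Segment directories are named by timestamp, so the last one listed is the newest.
        segment=$(ls -d crawl/segments/2* | tail -1)
        echo "round $i: $segment"
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
        i=$((i + 1))
    done
}

# Possible use, from /var/nutch/nutch-1.2 after the inject step:
#   crawl_rounds && bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```

Changing ROUNDS then controls the crawl depth without copying and pasting another block, and the loop stops at the first step that fails.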
You can now go to your Solr admin page and try out a search. Hopefully you will get results back!