Install Solr and Nutch on Fedora 14 using Sun Java
This is a follow-up to an earlier post in which I installed Solr and Nutch on a Fedora 14 box. For that install I used Fedora’s OpenJDK, but I have since found that things work better with Sun’s Java, so this post details the steps required to set that up.
Install Sun Java
yum install java-1.6.0-openjdk-devel
cd /var/tmp
Get a download link from http://java.sun.com/javase/downloads/index.jsp
wget http://www.link-from-earlier
chmod +x jdk-6u23-*-rpm.bin
./jdk-6u23-*-rpm.bin
alternatives --install /usr/bin/java java /usr/java/jdk1.6.0_23/bin/java 20000
alternatives --config java
vi /etc/profile.d/java.sh
Add the following to the file:
export JAVA_HOME="/usr/java/jdk1.6.0_23"
export JAVA_PATH="$JAVA_HOME"
export PATH="$PATH:$JAVA_HOME/bin"
Then make it executable:
chmod +x /etc/profile.d/java.sh
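Before moving on, it can be worth confirming that JAVA_HOME really points at the Sun JDK. The snippet below is only a minimal sketch: check_java_home is a hypothetical helper name, not part of any package, and it merely tests that bin/java exists and is executable under the given directory.

```shell
# Hypothetical helper: succeed only if DIR/bin/java exists and is executable.
check_java_home() {
    [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Usage, after logging in again so /etc/profile.d/java.sh has been read:
#   check_java_home "$JAVA_HOME" && "$JAVA_HOME/bin/java" -version
```

If the check fails, the path in /etc/profile.d/java.sh probably does not match the directory the RPM actually installed into.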
Reboot the system
reboot
Install Tomcat 6
yum install tomcat6 tomcat6-admin-webapps tomcat6-webapps
service tomcat6 start
chkconfig --add tomcat6
Open Firewall
/sbin/iptables -I INPUT 1 -p tcp --dport 8080 -j ACCEPT
/sbin/service iptables save
service iptables restart
You should now be able to see the Tomcat welcome page at http://xxx.xxx.xxx.xxx:8080 where xxx.xxx.xxx.xxx is your server IP
You should now find the URL for the latest Solr release from here: http://www.apache.org/dyn/closer.cgi/lucene/solr/
Use this to download the latest version:
mkdir /var/solr
mkdir /var/nutch
wget http://mirror.ox.ac.uk/sites/rsync.apache.org//lucene/solr/1.4.1/apache-solr-1.4.1.zip
unzip -q apache-solr-1.4.1.zip
cp -r apache-solr-1.4.1 /var/solr
wget http://apache.mirror.rbftpnetworks.com/nutch/apache-nutch-1.2-bin.zip
unzip -q apache-nutch-1.2-bin.zip
cp -r nutch-1.2 /var/nutch
Copy the Nutch schema provided in nutch-1.2/conf to apache-solr-1.4.1/example/solr/conf, overwriting the existing file:
cp /var/nutch/nutch-1.2/conf/schema.xml /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
Change schema.xml so that the stored attribute of field “content” is true.
vi /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
<field name="content" type="text" stored="true" indexed="true" />
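If you would rather not edit the file by hand, the same change can be sketched with sed. This assumes the copied schema declares the content field on a single line containing stored="false"; store_content_field is a hypothetical helper name, and it is worth checking the file afterwards.

```shell
# Hypothetical helper: flip stored="false" to stored="true",
# but only on the line that declares the "content" field.
store_content_field() {
    sed -i '/<field name="content"/ s/stored="false"/stored="true"/' "$1"
}

# Usage:
#   store_content_field /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
#   grep '<field name="content"' /var/solr/apache-solr-1.4.1/example/solr/conf/schema.xml
```

Restricting the substitution to the matching line leaves the stored attribute of every other field untouched.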
Edit config to work with Nutch
vi /var/solr/apache-solr-1.4.1/example/solr/conf/solrconfig.xml
Add the following request handler. Note that each < inside the mm value must be written as &lt; or the file will not be valid XML:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="fl">url</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <bool name="hl">true</bool>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>
Set Nutch Home
vi /etc/profile.d/nutch.sh
Add the following to the file:
export NUTCH_HOME="/var/nutch/nutch-1.2"
Then make it executable and load it into the current shell:
chmod +x /etc/profile.d/nutch.sh
. /etc/profile.d/nutch.sh
Option 1 – Use Solr with Tomcat
mkdir /var/lib/tomcat6/webapps/solr
cp /var/solr/apache-solr-1.4.1/dist/apache-solr-1.4.1.war /var/lib/tomcat6/webapps/solr
cd /var/lib/tomcat6/webapps/solr/
jar xvf apache-solr-1.4.1.war
Add config File
vi /etc/tomcat6/Catalina/localhost/solr.xml
Add the following:
<Context docBase="/usr/share/tomcat6/webapps/solr/apache-solr-1.4.1.war" debug="0" privileged="true" allowLinking="true" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/var/solr/apache-solr-1.4.1/example/solr" override="true" />
</Context>
Restart Tomcat
service tomcat6 restart
You should be able to see Solr at (substitute your server IP):
http://xxx.xxx.xxx.xxx:8080/solr/
Test everything starts up OK by rebooting the server:
reboot
Set up Nutch with Tomcat
cd /var/nutch/nutch-1.2
mkdir /var/lib/tomcat6/webapps/nutch
cp nutch*.war /var/lib/tomcat6/webapps/nutch
cd /var/lib/tomcat6/webapps/nutch/
jar xvf nutch-1.2.war
service tomcat6 restart
You can now access Nutch at: http://xxx.xxx.xxx.xxx:8080/nutch/
Add site to Nutch
cd /var/nutch/nutch-1.2
mkdir seed
vi seed/url.txt
Add the root URLs to crawl, one per line.
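As an alternative to typing the list into vi, the seed file can be written from the command line. The sketch below is illustrative only: write_seeds is a hypothetical helper, and the example.com and example.org addresses are placeholders for your own root URLs.

```shell
# Hypothetical helper: write each URL argument on its own line in SEED_FILE,
# creating the parent directory if needed and replacing any existing file.
write_seeds() {
    seed_file="$1"
    shift
    mkdir -p "$(dirname "$seed_file")"
    printf '%s\n' "$@" > "$seed_file"
}

# Usage, from /var/nutch/nutch-1.2 (placeholder URLs):
#   write_seeds seed/url.txt http://www.example.com/ http://www.example.org/
```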
Limit Nutch to certain domains (optional)
If you want Nutch to crawl only certain domains, edit crawl-urlfilter.txt:
vi /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt
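As far as I recall, the stock crawl-urlfilter.txt shipped with Nutch 1.2 contains an accept rule with the placeholder MY.DOMAIN.NAME; treat that as an assumption and check your copy. If it is there, a sketch like the following fills in the domain. restrict_to_domain is a hypothetical helper and example.com is a placeholder.

```shell
# Hypothetical helper: replace the MY.DOMAIN.NAME placeholder in FILE with DOMAIN.
# DOMAIN must not contain "/" or "&", which are special in the sed replacement.
restrict_to_domain() {
    sed -i "s/MY\.DOMAIN\.NAME/$2/" "$1"
}

# Usage (placeholder domain):
#   restrict_to_domain /var/nutch/nutch-1.2/conf/crawl-urlfilter.txt example.com
# The accept rule should then read:
#   +^http://([a-z0-9]*\.)*example.com/
```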
Set Nutch Crawler Options
vi /var/nutch/nutch-1.2/conf/nutch-site.xml
Add the following between the <configuration> tags:
<property>
  <name>http.agent.name</name>
  <value>YourName</value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - please set this to a single word uniquely related to your organization.</description>
</property>
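The crawl depends on http.agent.name being non-empty, so a quick check before starting can save a wasted run. This is a rough sketch rather than a real XML parser: agent_name_set is a hypothetical helper, and it assumes the value element sits on the same line as the name element or on the line directly after it.

```shell
# Hypothetical helper: succeed if FILE sets a non-empty http.agent.name value.
# Looks only at the matching line and the one after it.
agent_name_set() {
    grep -A1 '<name>http.agent.name</name>' "$1" | grep -Eq '<value>[^<]+</value>'
}

# Usage:
#   agent_name_set /var/nutch/nutch-1.2/conf/nutch-site.xml || echo "http.agent.name is not set"
```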
Set Nutch Config Options
vi /var/lib/tomcat6/webapps/nutch/WEB-INF/classes/nutch-site.xml
Add the following between the <configuration> tags:
<property>
  <name>searcher.dir</name>
  <value>/var/nutch/nutch-1.2/crawl</value>
</property>
Start Nutch Crawl
bin/nutch crawl seed/url.txt -dir crawl -depth 3 -topN 50
bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
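Once solrindex has finished, one way to confirm that documents reached Solr is to ask the standard select handler for a match-all query with zero rows and read the numFound attribute in the response. The sketch below only builds that URL; solr_count_url is a hypothetical helper name, and the fetch itself is left as a usage comment.

```shell
# Hypothetical helper: build a Solr select URL that matches every document
# but returns no rows, so the response carries only the total count.
solr_count_url() {
    base="${1%/}"   # drop one trailing slash, if present
    printf '%s/select?q=*:*&rows=0\n' "$base"
}

# Usage:
#   wget -qO- "$(solr_count_url http://127.0.0.1:8080/solr/)"
# A numFound value above zero means the crawl results were indexed.
```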