Bash Scripts to Automate Nutch
Author : Jbuenol
From TechnologicalWiki
Crawling data
This article shows how to create a Linux shell script that automates the crawling process:
1. Create a folder named scripts (for example) in the search folder:
<NUTCH_PATH>/search/scripts
2. Create a file named crawl.sh (for example) in the scripts folder:
<NUTCH_PATH>/search/scripts/crawl.sh
Note: indexing is done using Solr.
#!/bin/sh
#The basedir for nutch installation
NUTCH_HOME=.
#The basedir used in storing the crawl content
BASEDIR=$1
#Number of repetitions
LOOP=$2
#Number of documents to fetch per round
NUMDOCS=1000
# Both the base directory and the number of rounds are required
if [ $# -lt 2 ]
then
echo "Usage is \"crawl.sh basedir loops\""
exit 1
fi
checkStatus ()
{
if test "$?" != "0"; then
echo Command exited with abnormal status, bailing out.
exit 1;
fi
}
cd $NUTCH_HOME
bin/nutch inject $BASEDIR/crawldb urls
checkStatus
for i in `seq 1 $LOOP`;
do
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
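# Name of the newest segment: last path component of the last listed entry (patterns quoted to prevent shell globbing)
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments/ | tail -1 | grep -o '[a-zA-Z0-9/-]*' | tail -1 | grep -o '[^/]*' | tail -1`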
echo processing segment $SEGMENT
bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
checkStatus
done
bin/nutch invertlinks $BASEDIR/linkdb $BASEDIR/segments/*
checkStatus
echo indexing into Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ $BASEDIR/crawldb $BASEDIR/linkdb $BASEDIR/segments/*
checkStatus
echo deleting duplicates from Solr
bin/nutch solrdedup http://127.0.0.1:8983/solr/
checkStatus
echo copying to searcher
bin/hadoop dfs -copyToLocal $BASEDIR /nutchsearch/local/
checkStatus
echo SUCCESS !!!
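The pipeline that computes SEGMENT inside the loop is dense, so here is a minimal sketch of the same idea in isolation. It assumes the listing prints the path as the last field of each row and that the timestamp-named segments are listed in ascending order, with at least one segment present. The helper name latest_segment and the sample listing are hypothetical and only illustrate the technique.

```shell
#!/bin/sh
# Hypothetical helper: read a directory listing on stdin and print the last
# path component of the final row (the newest segment, under the
# assumptions stated above).
latest_segment() {
    # Remember the last field (the path) of every non-empty row, print the
    # final one, then strip everything up to and including the last slash.
    awk 'NF { path = $NF } END { if (path != "") print path }' | sed 's|.*/||'
}

# Illustrative listing (made up for this example, not real output).
printf '%s\n' \
    'Found 2 items' \
    'drwxr-xr-x   - nutch supergroup 0 2011-01-01 10:00 /user/nutch/crawl/segments/20110101100000' \
    'drwxr-xr-x   - nutch supergroup 0 2011-01-02 10:00 /user/nutch/crawl/segments/20110102100000' \
    | latest_segment
```

In the script, the listing would be piped in from the segments directory instead of from printf.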
USAGE
From the <NUTCH_PATH>/search/ folder:
Command: ../scripts/crawl.sh <FOLDER_TO_STORE_DATA> <DEPTH>
Usage example: ../scripts/crawl.sh crawl 5
In the example, the crawl data is copied to the /nutchsearch/local/crawl/ folder.
- <FOLDER_TO_STORE_DATA>: the folder in which the data from the crawling process is stored (BASEDIR in the script).
- <DEPTH>: the number of generate/fetch/update rounds the script runs (LOOP in the script), which corresponds to the number of link jumps followed from the seed URLs.
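Since both parameters are required, the script could check them before starting a crawl. The following is a minimal sketch, not part of the original script: the function name valid_crawl_args is hypothetical, and it assumes the depth should be a positive integer.

```shell
#!/bin/sh
# Hypothetical check: succeed only when a base directory and a positive
# integer round count are both supplied.
valid_crawl_args() {
    [ "$#" -eq 2 ] || return 1
    [ -n "$1" ] || return 1
    case "$2" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$2" -gt 0 ]
}

if valid_crawl_args crawl 5; then
    echo "arguments look valid"
else
    echo "Usage: crawl.sh <FOLDER_TO_STORE_DATA> <DEPTH>"
fi
```

Inside crawl.sh the call would be valid_crawl_args "$@", followed by exit 1 in the failure branch.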
Merging data
The next script merges the data from two indexed data sources.
1. Create a folder named scripts (for example) in the search folder:
<NUTCHSEARCH_PATH>/search/scripts
2. Create a file named merge.sh (for example) in the scripts folder:
<NUTCHSEARCH_PATH>/search/scripts/merge.sh
Note: indexing is done using Solr.
#!/bin/sh
# Nutch merge crawls script.
# Based on recrawl script
#
# The script merges 2 or more nutch crawls into a single crawl
#
# USE ABSOLUTE PATHS for the script args
# e.g. bin/merge_crawls.sh /home/ren/nutch/trunk/build/crawl /home/ren/nutch/trunk/build_f/crawl/ /home/ren/nutch/trunk/build_w/crawl/
if [ -n "$1" ]
then
crawl_dir=$1
if [ -d "$1" ]; then
echo "error: crawl already exists: '$1'"
exit 1
fi
else
echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
exit 1
fi
if [ -n "$2" ]
then
crawl_1=$2
else
echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
exit 1
fi
if [ -n "$3" ]
then
crawl_2=$3
else
echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
exit 1
fi
#Sets the path to bin
nutch_dir=`dirname $0`
nutch_dir=$nutch_dir/bin
echo "Creating new crawl in: " $crawl_dir
mkdir $crawl_dir
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index
echo Merge linkdb
echo $nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
$nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
echo Merge crawldb
$nutch_dir/nutch mergedb $webdb_dir $crawl_1/crawldb $crawl_2/crawldb
echo Merge segments
segments_1=`ls -d $crawl_1/segments/*`
echo 1 $segments_1
segments_2=`ls -d $crawl_2/segments/*`
echo 2 $segments_2
$nutch_dir/nutch mergesegs $segments_dir $segments_1 $segments_2
# From there, identical to recrawl.sh
echo Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir
echo Index segments
$nutch_dir/nutch solrindex http://127.0.0.1:8983/solr/ $crawl_dir/crawldb $crawl_dir/linkdb $crawl_dir/segments/*
echo De-duplicate indexes
$nutch_dir/nutch solrdedup http://127.0.0.1:8983/solr/
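The header comment says the script merges two or more crawls, but the argument handling only accepts two. One way to generalize it is to build the list of input paths from however many crawl folders are given. The sketch below is an assumption-laden illustration: crawl_subdirs is a hypothetical helper, the /data/... paths are made up, and it assumes paths contain no spaces and that the merge commands accept more than two inputs.

```shell
#!/bin/sh
# Hypothetical helper: print "<crawl>/<subdir>" for every crawl folder given.
# Usage: crawl_subdirs <subdir> <crawl-path>...
crawl_subdirs() {
    subdir=$1
    shift
    list=
    for crawl in "$@"; do
        # Drop a trailing slash so the result does not contain "//".
        list="$list ${crawl%/}/$subdir"
    done
    # Trim the leading space added by the loop.
    echo "${list# }"
}

# Only prints the command that would merge three link databases; nothing is run.
echo bin/nutch mergelinkdb /data/merged/linkdb $(crawl_subdirs linkdb /data/crawl1 /data/crawl2/ /data/crawl3)
```

The same helper could supply the crawldb inputs by passing crawldb as the first argument.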
USAGE
From the <NUTCHSEARCH_PATH>/search/ folder:
Command: ../scripts/merge.sh <NEW_CRAWL> <CRAWL_1> <CRAWL_2>
Usage example: ../scripts/merge.sh /nutchsearch/local/crawl /nutchsearch/local/crawl1 /nutchsearch/local/crawl2
Note: use absolute paths.
In the example, the merged data is stored in the /nutchsearch/local/crawl/ folder.
- <NEW_CRAWL>: the folder in which the merged crawl is created. It must not already exist; the script exits with an error if it does.
- <CRAWL_1>, <CRAWL_2>: the folders of the two existing crawls to merge. Each must contain crawldb, linkdb and segments subfolders.
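The script asks for absolute paths but does not enforce it. A minimal sketch of such a check follows; the function name all_absolute is hypothetical and the check is not part of the original script.

```shell
#!/bin/sh
# Hypothetical check: succeed only if every argument starts with "/".
all_absolute() {
    for p in "$@"; do
        case "$p" in
            /*) ;;            # absolute path, keep checking
            *) return 1 ;;    # relative or empty path
        esac
    done
    return 0
}

if all_absolute /nutchsearch/local/crawl /nutchsearch/local/crawl1 /nutchsearch/local/crawl2; then
    echo "all paths are absolute"
else
    echo "error: USE ABSOLUTE PATHS"
fi
```

In merge.sh the call would be all_absolute "$@" before any folder is created.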


