Bash Scripts to Automate Nutch

Author : Jbuenol

From TechnologicalWiki

Crawling data

This article describes how to create a Linux shell script that automates the crawling process:

1. Create a folder named scripts (for example) in the search folder:

<NUTCH_PATH>/search/scripts

2. Create a file named crawl.sh (for example) in the scripts folder:

<NUTCH_PATH>/search/scripts/crawl.sh
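
The two steps above can be sketched as a few shell commands. This is only a minimal sketch: NUTCH_PATH is assumed to be a shell variable holding the Nutch installation path (the <NUTCH_PATH> placeholder above), and the fallback to a temporary directory is there only so the snippet can be tried safely.

```shell
#!/bin/sh
# Assumption: NUTCH_PATH points at the Nutch installation; a scratch
# directory is used when it is unset so the sketch can be tried safely.
NUTCH_PATH=${NUTCH_PATH:-$(mktemp -d)}

# Step 1: create the scripts folder inside the search folder.
mkdir -p "$NUTCH_PATH/search/scripts"

# Step 2: create an empty crawl.sh and make it executable.
touch "$NUTCH_PATH/search/scripts/crawl.sh"
chmod +x "$NUTCH_PATH/search/scripts/crawl.sh"
```

The script body shown in this article would then be pasted into that file.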

Note: indexing is done using Solr.

#!/bin/sh
#The basedir for nutch installation
NUTCH_HOME=.

#The basedir used in storing the crawl content
BASEDIR=$1

#Number of repetitions
LOOP=$2

#Number of documents to fetch per round
NUMDOCS=1000

#Both arguments are required: the basedir and the number of repetitions
if [ $# -lt 2 ]
then
echo "Usage is \"crawl.sh basedir loops\""
exit 1
fi

checkStatus ()
{
if test "$?" != "0"; then
  echo Command exited with abnormal status, bailing out.
  exit 1;
fi

}

cd "$NUTCH_HOME" || exit 1

bin/nutch inject $BASEDIR/crawldb urls
checkStatus

for i in `seq 1 $LOOP`;
do

bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
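#Take the last listed segment and keep only its final path component
#(the grep patterns are quoted so the shell does not expand them as globs)
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments/ | tail -1 | grep -o '[a-zA-Z0-9/-]*' | tail -1 | grep -o '[^/]*' | tail -1`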
echo processing segment $SEGMENT
bin/nutch fetch $BASEDIR/segments/$SEGMENT -threads 10
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $BASEDIR/segments/$SEGMENT -filter
checkStatus
done

bin/nutch invertlinks $BASEDIR/linkdb $BASEDIR/segments/*
checkStatus
echo indexing into Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ $BASEDIR/crawldb $BASEDIR/linkdb $BASEDIR/segments/*
checkStatus
echo deleting duplicates from Solr
bin/nutch solrdedup http://127.0.0.1:8983/solr/
checkStatus
echo copying to searcher
bin/hadoop dfs -copyToLocal $BASEDIR /nutchsearch/local/
checkStatus

echo SUCCESS !!!
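
The SEGMENT line in the script takes the last entry of the segment listing and keeps only its final path component. A minimal sketch of the same idea with awk and basename follows. The function name last_segment and the sample listing line are illustrative assumptions, and the real output format of bin/hadoop dfs -ls may differ between Hadoop versions.

```shell
#!/bin/sh
# last_segment (hypothetical helper): read a directory listing on stdin,
# take the last line, and print the final component of its last field.
last_segment () {
  path=$(tail -n 1 | awk '{ print $NF }')
  basename "$path"
}

# Hypothetical listing line, for illustration only.
sample='drwxr-xr-x   - nutch supergroup   0 2010-01-01 12:00 /crawl/segments/20100101120000'
printf '%s\n' "$sample" | last_segment
```

Inside crawl.sh, such a helper might be used as SEGMENT=$(bin/hadoop dfs -ls "$BASEDIR/segments/" | last_segment).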

USAGE

From the <NUTCH_PATH>/search/ folder:

Command: ../scripts/crawl.sh <FOLDER_TO_STORE_DATA> <DEPTH>

Usage example: ../scripts/crawl.sh crawl 5

In the example, the crawl data is copied to the /nutchsearch/local/crawl/ folder.

  • <FOLDER_TO_STORE_DATA> : The folder in which the data from the crawling process is stored.
  • <DEPTH> : The number of link jumps in the crawling process, that is, the number of generate/fetch/update rounds the script runs.
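
Since <DEPTH> is used as a loop counter, it can help to reject anything that is not a positive integer before the crawl starts. A minimal sketch, where is_positive_int is a hypothetical helper name that is not part of the script above:

```shell
#!/bin/sh
# is_positive_int (hypothetical helper): succeed only for 1, 2, 3, ...
is_positive_int () {
  case $1 in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  [ "$1" -gt 0 ]
}

# Example check, using 5 as in the usage example above.
if is_positive_int 5; then
  echo "depth ok"
fi
```

In crawl.sh, the check could be applied to $2 right after the usage test.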

Merging data

The next script merges the data from two crawls and indexes the result.

1. Create a folder named scripts (for example) in the search folder:

<NUTCHSEARCH_PATH>/search/scripts

2. Create a file named merge.sh (for example) in the scripts folder:

<NUTCHSEARCH_PATH>/search/scripts/merge.sh

Note: indexing is done using Solr.

#!/bin/sh
# Nutch merge crawls script.
# Based on recrawl script
#
# The script merges 2 or more nutch crawls into a single crawl
#
# USE ABSOLUTE PATHS for the script args
# e.g. bin/merge_crawls.sh /home/ren/nutch/trunk/build/crawl /home/ren/nutch/trunk/build_f/crawl/ /home/ren/nutch/trunk/build_w/crawl/


if [ -n "$1" ]
then
  crawl_dir=$1
  if [ -d "$1" ]; then
    echo "error: crawl already exists: '$1'"
    exit 1
  fi
else
  echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
  exit 1
fi

if [ -n "$2" ]
then
  crawl_1=$2
else
  echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
  exit 1
fi

if [ -n "$3" ]
then
  crawl_2=$3
else
  echo "Usage: ../scripts/merge.sh newcrawl-path crawl1-path crawl2-path, USE ABSOLUTE PATHS"
  exit 1
fi


#Sets the path to bin
nutch_dir=`dirname $0`
nutch_dir=$nutch_dir/bin

echo "Creating new crawl in: " $crawl_dir
mkdir $crawl_dir
webdb_dir=$crawl_dir/crawldb
segments_dir=$crawl_dir/segments
linkdb_dir=$crawl_dir/linkdb
index_dir=$crawl_dir/index

echo Merge linkdb
echo $nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb
$nutch_dir/nutch mergelinkdb $linkdb_dir $crawl_1/linkdb $crawl_2/linkdb

echo Merge crawldb
$nutch_dir/nutch mergedb $webdb_dir $crawl_1/crawldb $crawl_2/crawldb

echo Merge segments
segments_1=`ls -d $crawl_1/segments/*`
echo 1 $segments_1
segments_2=`ls -d $crawl_2/segments/*`
echo 2 $segments_2
$nutch_dir/nutch mergesegs $segments_dir $segments_1 $segments_2


# From there, identical to recrawl.sh

echo Update segments
$nutch_dir/nutch invertlinks $linkdb_dir -dir $segments_dir

echo Index segments
$nutch_dir/nutch solrindex http://127.0.0.1:8983/solr/ $crawl_dir/crawldb $crawl_dir/linkdb $crawl_dir/segments/*

echo De-duplicate indexes
$nutch_dir/nutch solrdedup http://127.0.0.1:8983/solr/
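
The header comment mentions merging two or more crawls, while the script reads only two. One possible way to generalize it is to build the input lists from all the crawl arguments. The sketch below shows only that list-building step. with_suffix is a hypothetical helper, and it assumes that crawl paths contain no whitespace, as the script above also does.

```shell
#!/bin/sh
# with_suffix (hypothetical helper): print every path argument after the
# first with "/<suffix>" appended, separated by single spaces.
with_suffix () {
  suffix=$1
  shift
  out=
  for crawl in "$@"; do
    crawl=${crawl%/}                      # drop one trailing slash, if any
    out="${out:+$out }$crawl/$suffix"
  done
  printf '%s\n' "$out"
}

# Example with illustrative paths.
with_suffix linkdb /nutchsearch/local/crawl1 /nutchsearch/local/crawl2/
```

After shifting past the new crawl path, the result could be passed to the merge commands in place of the two fixed inputs, assuming the installed Nutch version accepts more than two inputs for mergelinkdb and mergedb.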

USAGE

From the <NUTCHSEARCH_PATH>/search/ folder:

Command: ../scripts/merge.sh <NEW_CRAWL> <CRAWL_1> <CRAWL_2>

Usage example: ../scripts/merge.sh /nutchsearch/local/crawl /nutchsearch/local/crawl1 /nutchsearch/local/crawl2

Note: use absolute paths.

In the example, the merged data is stored in the /nutchsearch/local/crawl/ folder.

  • <NEW_CRAWL> : The folder in which the merged crawl is created. It must not exist yet; the script exits with an error if it does.
  • <CRAWL_1>, <CRAWL_2> : The folders of the two existing crawls to merge.
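
Because the script expects absolute paths, a small check before merging can catch relative arguments early. A minimal sketch, where is_absolute is a hypothetical helper that is not part of merge.sh:

```shell
#!/bin/sh
# is_absolute (hypothetical helper): succeed when the path starts with "/".
is_absolute () {
  case $1 in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Example: the first path is absolute, the second is not.
for p in /nutchsearch/local/crawl1 crawl2; do
  if is_absolute "$p"; then
    echo "ok: $p"
  else
    echo "not absolute: $p"
  fi
done
```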

See also

Nutch
