Yes!!! It’s running. You can check the report on the cluster at this address in your web browser: http://<your droplet's ip>:50070
On the terminal,
jps
will show you that the NameNode, DataNode, and SecondaryNameNode are running. When you want to stop the cluster, run
stop-dfs.sh
Now go check the report address in your browser again. It shouldn't be there anymore!
Alright, let’s put some data in.
Let’s make a directory for these text files:
mkdir -p /home/hduser/textdata
First we’ll start by putting the data into our normal (local) file system. If you have some text files, you can use them for this. If not, here are three ebooks (plain text, utf-8 encoding) you can use:
Ulysses by James Joyce
The Notebooks of Leonardo Da Vinci
The Outline of Science by J. Arthur Thomson
For example, to get these, you can do this:
cd /home/hduser/textdata
wget http://www.gutenberg.org/cache/epub/4300/pg4300.txt
wget http://www.gutenberg.org/cache/epub/5000/pg5000.txt
wget http://www.gutenberg.org/cache/epub/20417/pg20417.txt
First, let’s start the cluster again:
start-dfs.sh
Now we can make some directories in the Hadoop distributed file system!
hdfs dfs -mkdir /user/
hdfs dfs -mkdir /user/irmak/
Let’s check that they exist:
hdfs dfs -ls /
hdfs dfs -ls /user/
Ok, let’s put some data in:
hdfs dfs -put /home/hduser/textdata/* /user/irmak
Check and make sure it is in the HDFS:
hdfs dfs -ls /user/irmak
Our mapper, count_mapper.py, includes the following code:
#!/usr/bin/env python
import sys
from textblob import TextBlob

# Read lines of text from stdin, tokenize each line into words
# with TextBlob, and emit one tab-separated "word 1" pair per word.
for line in sys.stdin:
    line = line.decode('utf-8')
    words = TextBlob(line).words
    for word in words:
        word = word.encode('utf-8')
        print "%s\t%i" % (word, 1)
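By the way, if you are curious what TextBlob's tokenization does to a line, you can check it right on the command line (just a quick sanity check with an arbitrary sentence, not part of the job):
python -c 'from textblob import TextBlob; print TextBlob(u"Yes!!! It is running.").words'
It splits the line into words and strips out the punctuation, which is exactly what we want for counting.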
And our reducer, count_reducer.py, looks like this:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# The input arrives sorted by key, so all the counts for a given
# word show up on consecutive lines.
for line in sys.stdin:
    word, count = line.split('\t')
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        # We reached a new word, so emit the finished one.
        if current_word:
            print '%s\t%i' % (current_word, current_count)
        current_word = word
        current_count = count

# Don't forget the very last word.
if current_word == word:
    print '%s\t%i' % (current_word, current_count)
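The reducer can get away with only comparing consecutive lines because Hadoop sorts the mapper's output by key before handing it to the reducer. You can imitate that shuffle-and-sort step with a plain Unix pipe and test the whole pipeline locally, before involving Hadoop at all (a sketch assuming the book and script paths we used above):
cat /home/hduser/textdata/pg4300.txt | python /home/hduser/count_mapper.py | sort -k1,1 | python /home/hduser/count_reducer.py | less
If this looks right, the streaming job should behave the same way on the cluster.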
Before running these scripts, we need to make sure that textblob has its nltk corpora downloaded, so that it can work without an error. To do that, execute this on the command line (as the hduser):
python -m textblob.download_corpora
Before giving the following command, don't forget to replace the /user/irmak path (in the hdfs) with your own version, and the paths to count_mapper.py and count_reducer.py (in your droplet's local filesystem) with your own versions. (The -file options ship the two scripts out with the job, so the cluster nodes can run them.)
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -file /home/hduser/count_mapper.py -mapper /home/hduser/count_mapper.py \
    -file /home/hduser/count_reducer.py -reducer /home/hduser/count_reducer.py \
    -input /user/irmak/* -output /user/irmak/book-output
Booom! It's running.
Once it's done,
hdfs dfs -ls /user/irmak/book-output
should show that there is a _SUCCESS file (showing we did it!) and another file called part-00000, which is our output. To look inside it:
hdfs dfs -cat /user/irmak/book-output/part-00000
or
hdfs dfs -cat /user/irmak/book-output/*
will show the output of our job!
If you want to see the most common words, run:
hdfs dfs -cat /user/irmak/book-output/* | sort -rnk2 | less
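If you'd rather have the results as a regular file on the droplet instead of inside the HDFS, hdfs dfs -get copies them out (the local filename here is just an example):
hdfs dfs -get /user/irmak/book-output/part-00000 /home/hduser/wordcounts.txt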
If something went wrong when you ran your mapreduce job, and you fix it and want to run it again, Hadoop will throw a different error, saying that the book-output directory already exists in the hdfs. This error is thrown to avoid overwriting previous results. If you want to rerun the job anyway, you need to delete the output directory first, so it can be created again:
hdfs dfs -rm -r /user/irmak/book-output