Once you have the basics of Hadoop functioning, the obvious question is: how do you put data into the system? The corollary: how is the data distributed among the various data nodes? Since the system can grow to thousands of data nodes, it is pretty obvious that you should not have to tell HDFS anything more than the bare minimum.
Your first test can be to take a large text file and load it into HDFS. So, start the Hadoop cluster, as last time, with three nodes. For the first test, stop the datanode service on h-mstr and copy a large file into HDFS; it will be stored on the two slave nodes. (To create a text file for testing, take a hexdump of a 1 GB tar file; you should get a text file of around 3 GB.)
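For instance (the file names are placeholders, and the service name depends on how Hadoop was packaged on your nodes; use whatever matches your setup from the earlier articles):

```shell
# A plain hexdump roughly triples the size of the input file.
hexdump big.tar > big.txt

# On h-mstr, stop only the datanode, then load the file into HDFS.
# The service name is an assumption; adjust for your installation.
sudo systemctl stop hadoop-datanode
hdfs dfs -put big.txt big.txt
```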
You can verify that the data has been copied, and how much is on which node, as follows:
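Two plausible checks (the local data directory path is an assumption; use the dfs.data.dir value from your hdfs-site.xml):

```shell
# Run on each node: how much space is the local datanode using?
du -sh /var/lib/hadoop-hdfs/cache/hdfs/dfs/data

# Or ask the namenode for a report covering every datanode,
# including the capacity used on each.
hdfs dfsadmin -report
```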
The data file is distributed across h-slv1 and h-slv2. Now, start the datanode on h-mstr as well, and run the above commands again.
The new data seems to be stored on h-mstr only. If you run the command from h-slv2, the data is likely to be stored on the h-slv2 datanode only. That is, HDFS tries to optimise by storing the data close to its origin. You can find out the status of HDFS, including the location of data blocks, by browsing http://h-mstr:50070/.
The file you want to put into HDFS may be on your desktop. You can, of course, access that file using NFS. However, if you have installed the HDFS binaries on your desktop, you can copy the file by running the following command on the desktop:
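The command would look something like the following; the namenode host h-mstr and the /user/fedora directory are from this series' setup, while the file name is a placeholder:

```shell
# A full HDFS URI lets the desktop find the remote namenode.
hdfs dfs -put big.txt hdfs://h-mstr/user/fedora/big.txt
```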
The commands '-put' and '-copyFromLocal' are synonyms.
You should stop the Hadoop services on all the nodes. In /etc/hadoop/hdfs-site.xml on each of the nodes, change the value of dfs.replication from 1 to 2, and the data blocks will be replicated on one additional datanode.
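The relevant property in hdfs-site.xml would look like this:

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```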
Restart the Hadoop services, increase the replication for the existing files, and explore http://h-mstr:50070/.
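Files written before the change keep their old replication factor; it can be raised with the setrep command:

```shell
# Recursively set the replication of existing files to 2;
# -w waits until the re-replication is complete.
hdfs dfs -setrep -R -w 2 /
```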
These experiments illustrate the resilience and flexibility of the Hadoop distributed file system and the implicit optimisation to minimize the transfer of data across the network. You don't need to worry about distribution of the data.
During your experiments, in case you come across an error stating that the namenode is in safe mode, run the following command:
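The command to force the namenode out of safe mode is:

```shell
hdfs dfsadmin -safemode leave
```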
Loading Data Programmatically
It is more likely that you will need to get data from existing sources, such as files and databases, and put what you need into HDFS. As the next experiment, you can take the document files you may have collected over the years, convert each into text and load it. Hadoop works best with large files, not a collection of small files. So, all the files will go into a single HDFS file, where each file becomes a single record with the name of the file as its prefix. You don't need to create a local file: the output of the program can be piped to the 'hdfs dfs -put' command.
The following Python program load_doc_files.py illustrates the idea.
You may run the above program as follows from the desktop:
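A plausible invocation, assuming the HDFS path used earlier in this series and a ~/Documents directory of collected files:

```shell
# '-' makes -put read standard input, so no local copy is created.
python load_doc_files.py ~/Documents | \
    hdfs dfs -put - hdfs://h-mstr/user/fedora/doc_files.txt
```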
The above program can easily be extended to convert and load PDF files as well, or another program can be used to append to the existing HDFS file.
The obvious next question is: once the data is in Hadoop, how do you use it, and what can you do with it?
Hadoop MapReduce offers a very convenient streaming framework. It takes input lines from an HDFS file and passes them via standard input to one or more mapper programs. The standard output of the mapper programs is routed to the standard input of the reducer programs. The mapper and reducer programs can be written in any language.
Each line of a mapper's output is expected to be a key-value pair separated by a tab. All lines with the same key are routed to the same reducer.
A common first example is a word count on a text file. A mapper splits each line of the text file into words, and writes each word as a key with the value 1. The reducer counts the number of times a particular word is received. Since the word is the key, all occurrences of the same word will be routed to the same reducer.
In doc_files.txt, each line is actually an entire file, so it is very likely that words will be repeated within a line. Hence, it is better if the mapper counts the words in a line before writing to stdout, as shown in the following mapper.py:
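A version along these lines (the exact original listing may differ):

```python
#!/usr/bin/env python
# mapper.py: each input line is a whole document, so words repeat
# often. Count the words within the line first and emit each word
# only once, as 'word<TAB>count'.
import sys
from collections import Counter


def map_line(line):
    """Return sorted (word, count) pairs for one input line."""
    return sorted(Counter(line.split()).items())


if __name__ == '__main__':
    for line in sys.stdin:
        for word, count in map_line(line):
            print('%s\t%d' % (word, count))
```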
There is no need to change the reducer function, as long as it has not assumed that the count of each word is 1. The corresponding reducer.py:
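A matching reducer sketch. Hadoop streaming delivers the mapper output sorted by key, so all the counts for a word arrive consecutively and can be summed as a running total:

```python
#!/usr/bin/env python
# reducer.py: sum the counts for each word from sorted
# 'word<TAB>count' lines on stdin.
import sys


def reduce_stream(lines):
    """Yield (word, total) pairs from key-sorted 'word<TAB>count' lines."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip('\n').split('\t')
        if word != current:
            if current is not None:
                yield current, total
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield current, total


if __name__ == '__main__':
    for word, total in reduce_stream(sys.stdin):
        print('%s\t%d' % (word, total))
```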
So, having signed into fedora@h-mstr, you may run these commands to count the words and examine the result:
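The commands would be along the following lines; the location of the streaming jar varies with the Hadoop distribution and version, so treat that path as an assumption:

```shell
# The scripts must be executable so streaming can run them directly.
chmod +x mapper.py reducer.py

# Ship the two scripts to the cluster and run the streaming job.
hadoop jar /usr/share/hadoop/mapreduce/hadoop-streaming.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py -reducer reducer.py \
    -input doc_files.txt -output word_count

# Examine the first few results.
hdfs dfs -cat word_count/part-00000 | head
```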
As you have seen in these experiments, you can store the data in an HDFS file using any programming language. Then, you can write fairly simple map and reduce programs, again in any programming language, without worrying about any issues related to distributed processing. Hadoop will make the effort to optimise the distribution of the data across the nodes and feed the data to the appropriate mapper programs.