Purpose: Get hands-on experience with MapReduce on a PaaS from any major provider (e.g., Amazon AWS, MS Azure, Google App Engine).
Problem: Apply a brute-force version of motif search using median strings (details of this problem can be found here). An alternative project is to implement and test my new clustering algorithm using a UCI dataset on the cloud (details will be discussed individually with interested students).
Description: The median string motif search algorithm is a good example of a problem that can be transformed to fit the MapReduce programming model. MapReduce was designed to speed up a time-consuming computation on a large data set by splitting it into a large number of smaller instances (via a map function) that execute in parallel, each running on a portion of the data set, and then merging the intermediate results (name-value pairs) via a reduce function, again in parallel, to consolidate the results. This project requires you to use MapReduce to solve the median string motif search problem on a given gene sequence data set. You may use either Java or Python, with the corresponding MapReduce API.
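To make the map, group, and reduce stages concrete, here is a minimal single-process sketch in Python. It is not a real MapReduce API: `run_mapreduce`, `count_bases`, and `sum_counts` are hypothetical names chosen for illustration, and a real framework would run the map and reduce calls in parallel across many workers rather than in one loop.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> group-by-key -> reduce in one process (illustration only)."""
    groups = defaultdict(list)
    # Map stage: each record yields zero or more (name, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce stage: consolidate all values that share the same name.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_bases(sequence):
    """Toy map function: emit (base, 1) for every character in a sequence."""
    for base in sequence:
        yield base, 1


def sum_counts(key, values):
    """Toy reduce function: add up the counts emitted for one base."""
    return sum(values)


result = run_mapreduce(["ACGT", "AACC"], count_bases, sum_counts)
print(result)  # {'A': 3, 'C': 3, 'G': 1, 'T': 1}
```

The same driver shape carries over to the project: only the map and reduce functions change, while the grouping by name is handled by the framework.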
Requirement: You must genuinely exploit the parallel processing power of MapReduce by properly designing your map and reduce functions so that they spawn many running instances, and you must correctly compute the result of this motif search task on the given sequence data set.
Hints: Ideally, each running instance of your map function works on one sequence of the input data set. Your map function loops over all the enumerated median words (of length L in general, and 8 in particular for this project), steps through the input sequence, and finds the best match (i.e., the one with minimum matching distance) for each word; the intermediate name-value pairs it produces are median word and matching-distance pairs. Your reduce function loops over the median words and sums up the total matching distance for each one (you need to retain the corresponding matches in the input sequences and their position indexes for the final output). The final result is the median word with the minimum total matching distance.
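The hints above can be sketched in plain Python as follows. This is one possible reading, not the required design. It assumes that the matching distance is the Hamming distance, that the alphabet is A, C, G, T, that positions are 0-based, and that every sequence is at least L characters long. All function names and the tuple layout `(distance, seq_id, position, substring)` are hypothetical. The grouping loop in `median_string` stands in for the shuffle step that a real MapReduce framework would perform.

```python
from collections import defaultdict
from itertools import product


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_match(word, sequence):
    """Return (distance, position, substring) of the first best match of word."""
    L = len(word)
    best = None
    for pos in range(len(sequence) - L + 1):
        window = sequence[pos:pos + L]
        d = hamming(word, window)
        if best is None or d < best[0]:  # strict '<' keeps the first best match
            best = (d, pos, window)
    return best


def map_sequence(seq_id, sequence, L=8):
    """Map: for one input sequence, emit (word, match info) for every L-mer."""
    for letters in product("ACGT", repeat=L):
        word = "".join(letters)
        d, pos, window = best_match(word, sequence)
        yield word, (d, seq_id, pos, window)


def reduce_word(word, values):
    """Reduce: sum distances for one word, retaining the per-sequence matches."""
    values = list(values)
    return word, sum(v[0] for v in values), values


def median_string(sequences, L=8):
    """Local stand-in for the framework: group by word, reduce, take the minimum."""
    groups = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for word, value in map_sequence(seq_id, seq, L):
            groups[word].append(value)
    reduced = [reduce_word(w, vs) for w, vs in groups.items()]
    return min(reduced, key=lambda r: r[1])


# Tiny demonstration with L=3 so it runs instantly (the project uses L=8).
word, total, matches = median_string(["TTACGTT", "GGACGGG", "ACGAAAA"], L=3)
print(word, total, matches)
```

Note that with L = 8 each map call enumerates 4^8 = 65,536 candidate words, which is why distributing the sequences across many map instances pays off.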
Output: Your generated output shall include at least the following four pieces of information (listed as four columns): the found motif(s) of length 8, the (first, if more than one) found best matching subsequence from each input sequence, the matching score (between the motif and the best matching subsequence), and the corresponding position of the found best matching subsequence in each original input sequence.
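As a rough illustration of the four-column layout, the sketch below formats one row per input sequence. The function name `format_report`, the tab-separated layout, the column headers, and the `(distance, seq_id, position, substring)` tuple layout are all assumptions made for this example. The assignment itself only fixes which four pieces of information must appear.

```python
def format_report(motif, matches):
    """Build a tab-separated report: motif, best match, score, position.

    `matches` is a list of (distance, seq_id, position, substring) tuples,
    one per input sequence (hypothetical layout for this sketch).
    """
    lines = ["motif\tbest_match\tscore\tposition"]
    # Order rows by input sequence so the report follows the data set order.
    for distance, seq_id, pos, window in sorted(matches, key=lambda m: m[1]):
        lines.append(f"{motif}\t{window}\t{distance}\t{pos}")
    return "\n".join(lines)


report = format_report("ACG", [(1, 1, 4, "ACC"), (0, 0, 2, "ACG")])
print(report)
```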
Submission Requirement: (1) a brief report with a description of your program structure and a discussion (short or long) of the insight you gained from doing this project; (2) the required output described above (as a separate file); (3) the source code of your program (only the code that you wrote) with necessary comments; (4) a detailed readme file describing the steps for preparing and running your program (so that we can easily test your program without asking you for any further details). Please submit via email attachment to me with the subject line "CS425/591 Project", in addition to submitting a printed screenshot of your result output.
MapReduce Overview – Google App Engine
Word Frequency Count Example in Java