Algorithm Question to Repartition Data, I/O Cost
A telecommunications company offers a service in which users can exchange short messages with each
other. These messages are recorded in a relation msg(userid, time, msg_txt). You are a data
analyst who got an extract of the msg relation and wishes to derive valuable insights from this data.
However, the data is very large. For performance, you store it in the distributed main memory of a
cluster of N nodes. The extract of the relation you got consists of M pages, and each node stores M/N
pages of the relation (M >> N). These M/N pages almost saturate the main memory of an individual node;
however, there are enough pages of main memory left for I/O with the node’s disk subsystem and
communication operations with all other nodes (2N pages). In addition, there is enough remaining
memory for any small data structures.
The msg relation is given to you sorted by the time attribute, and thus each node stores a range of time
values and the contents of each node’s main memory is sorted by time. After discussing possible
analyses with your team, you decide that you would like to repartition the msg relation instead by
userid, but keep the time sorting within each partition. In other words, each node in the cluster should
store in main memory a well-defined set of userids; however, all records in the memory of a single
node should still be sorted by time, and not by userid.
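The target layout described above can be stated as a small check. The following is a minimal sketch, not part of the original question: the function name `is_valid_repartitioning` and the tuple representation of records are hypothetical, and the data is assumed to fit in ordinary Python lists for illustration only.

```python
from typing import Dict, List, Tuple

# A record mirrors msg(userid, time, msg_txt); this tuple layout is an assumption.
Record = Tuple[int, int, str]


def is_valid_repartitioning(nodes: List[List[Record]]) -> bool:
    """Check the target layout: every userid lives on exactly one node,
    and each node's records are ordered by time (not by userid)."""
    owner: Dict[int, int] = {}
    for node_id, records in enumerate(nodes):
        for userid, _, _ in records:
            # setdefault remembers the first node on which this userid was seen.
            if owner.setdefault(userid, node_id) != node_id:
                return False
        times = [time for _, time, _ in records]
        # Adjacent pairs must be non-decreasing in time.
        if any(earlier > later for earlier, later in zip(times, times[1:])):
            return False
    return True
```

Note that a node may interleave several userids, as long as the time order holds across all of its records.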
You may assume that the number of records per userid is small enough that there is no need to break
down the same userid across different nodes. In addition, you may assume that there is no significant
skew in the number of messages per userid, so that a partitioning by userid will again leave each
node’s main memory with roughly M/N pages of the msg relation. Finally, you may assume
communication and I/O at a node are perfectly overlapped and parallel, given that the disk subsystem can
also sustain a degree of parallelism of N. In other words, N-1 transfers of pages in main memory from
different nodes to the disk of a given node k take the time of 1 I/O, but have a cost of N-1 I/Os to write the
pages to the disk subsystem at k. Similarly, up to N page reads from the disk subsystem at k take 1 I/O
time and incur N I/O cost. The network has enough bandwidth to support parallel all-to-all node transfers
without any further delays. In effect, you may assume network transfer to have zero cost.
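The distinction between I/O cost and I/O time in this model can be sketched in a few lines. This is an illustrative reading of the assumptions above, not a prescribed formula: the `IOStep` class and its field names are hypothetical, and the sketch assumes that a disk subsystem handles up to `parallelism` pages per unit of I/O time and that all participating nodes work concurrently.

```python
from dataclasses import dataclass


@dataclass
class IOStep:
    pages_per_node: int  # pages read or written at each participating node
    nodes: int           # nodes performing this step at the same time
    parallelism: int     # pages one disk subsystem handles per unit of I/O time


def io_cost(step: IOStep) -> int:
    """Total pages moved to or from disk, summed over all nodes."""
    return step.pages_per_node * step.nodes


def io_time(step: IOStep) -> int:
    """Sequential time units: each node needs ceil(pages / parallelism),
    and the nodes are assumed to proceed concurrently."""
    return -(-step.pages_per_node // step.parallelism)  # ceiling division


# With N = 4: node k writes the N-1 = 3 pages it received from the other nodes.
write_at_k = IOStep(pages_per_node=3, nodes=1, parallelism=4)
# Node k reads N = 4 pages back from its disk subsystem.
read_at_k = IOStep(pages_per_node=4, nodes=1, parallelism=4)
```

Under these assumptions, the write at node k has a cost of 3 I/Os and a time of 1, and the read has a cost of 4 I/Os and a time of 1, matching the two cases described in the text.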
By answering the questions below, you will describe and analyze an algorithm involving parallelism and
external memory to efficiently achieve this repartitioning.
1. State an algorithm to repartition the data of msg as specified above. Argue for the algorithm’s
correctness and efficiency. Clearly state which steps of the algorithm can be performed in parallel.
2. State the total I/O cost and the total I/O time of the algorithm you designed in part 1 above in
terms of M and N. Explain why the algorithm has the costs stated.
NOTE 1: The total I/O cost corresponds to the total number of pages read or written to disk at all nodes,
assuming network communication costs can be fully overlapped or are minimal by comparison. The total
I/O time corresponds to the sequential I/O cost of the algorithm taking into account that several I/Os
happening perfectly in parallel take the time of a single sequential I/O (cf. parallel work and depth).
NOTE 2: To state an algorithm, you can reference existing sort-based or hash-based external memory
algorithms. You should not state all the steps of these existing algorithms from scratch again, but you
should clearly state the steps that you need to change in the algorithms you reference, and also how you
change these steps. To describe how you change a step, refer to the step and list the sub-steps that need to
be executed to achieve your goal.
NOTE 3: Instead of just using several existing external memory algorithms in sequence as black boxes,
you should design a single algorithm that addresses the whole task holistically. That is why in NOTE 2
we expect that you will need to show changes to steps of existing algorithms, if necessary.