Applying data mining techniques over big data
Al-Hashemi, Idrees Yousef
MetadataShow full item record
The rapid development of information technology in recent decades means that data appear in a wide variety of formats — sensor data, tweets, photographs, raw data, and unstructured data. Statistics show that there were 800,000 Petabytes stored in the world in 2000. Today’s internet has about 0.1 Zettabytes of data (ZB is about 1021 bytes), and this number will reach 35 ZB by 2020. With such an overwhelming flood of information, present data management systems are not able to scale to this huge amount of raw, unstructured data—in today’s parlance, Big Data. In the present study, we show the basic concepts and design of Big Data tools, algorithms, and techniques. We compare the classical data mining algorithms to the Big Data algorithms by using Hadoop/MapReduce as a core implementation of Big Data for scalable algorithms. We implemented the K-means algorithm and A-priori algorithm with Hadoop/MapReduce on a 5 nodes Hadoop cluster. We explore NoSQL databases for semi-structured, massively large-scaling of data by using MongoDB as an example. Finally, we show the performance between HDFS (Hadoop Distributed File System) and MongoDB data storage for these two algorithms.
Thesis (M.S.C.S.) PLEASE NOTE: Boston University Libraries did not receive an Authorization To Manage form for this thesis or dissertation. It is therefore not openly accessible, though it may be available by request. If you are the author or principal advisor of this work and would like to request open access for it, please contact us at firstname.lastname@example.org. Thank you.