Idiro has migrated its social network analysis (SNA) platform from a traditional database-centric model to one built on the Hadoop processing framework. Hadoop is an open-source software framework written in Java and supported by the Apache Software Foundation under the Apache v2 open-source licence. The emphasis on ‘big data’ applications has been driven by the possibilities demonstrated in large business environments such as Yahoo! and Facebook, building on the MapReduce approach originally published by Google. In essence, Hadoop takes a simple divide-and-conquer approach to processing very large datasets (terabytes, petabytes or more): the data is split into more manageable chunks, each processed by individual servers acting together as a single distributed cluster. In the case of Idiro’s SNA data, the Hadoop approach allows us to provide linearly scaled solutions to our customers, and to integrate with Hadoop-based platforms such as Hive and HBase, which offer database and data-warehouse functionality on top of the distributed processing environment.
Take the trivial case of an application that counts the occurrences of each word within a piece of text (i.e. computes a word-frequency table for a given text). Traditionally, each new word would be held in memory, or within a database table, and the text processed line by line. Within Hadoop, discrete sections of the text are distributed across a number of servers, each of which counts the words in its own section (the Map phase); the various partial result sets are then merged together to form the final answer (the Reduce phase). In Hadoop, this MapReduce process is the basis for all processing, and it is especially well suited to domains requiring the summarisation of very large volumes of structured and unstructured data.
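The word-counting example above can be sketched in a few lines of plain Python. This is an illustrative simulation of the Map and Reduce phases on a single machine, not Idiro's production code and not the actual Hadoop Java API; the chunking and function names are our own.

```python
from collections import Counter
from functools import reduce

def map_chunk(chunk):
    """Map phase: each server counts the words in its own section of text."""
    return Counter(chunk.split())

def merge_counts(a, b):
    """Reduce phase: merge two partial result sets into one."""
    return a + b

text = "the quick brown fox jumps over the lazy dog the end"

# Split the text into sections of four words each, standing in for the
# input splits Hadoop would distribute across the cluster.
words = text.split()
chunks = [" ".join(words[i:i + 4]) for i in range(0, len(words), 4)]

partials = [map_chunk(c) for c in chunks]            # run in parallel on each server
totals = reduce(merge_counts, partials, Counter())   # merge the partial results
print(totals["the"])  # -> 3
```

In a real Hadoop job, the Map and Reduce functions are written against the MapReduce API and the framework handles the chunking, distribution, and merging; the logic, however, is exactly this.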
Since Idiro can receive raw social network data amounting to gigabytes of information on a daily basis, the adoption of Hadoop was a natural progression away from the hardware and software limitations of single-server database deployments. In the current Idiro Hadoop-based system, adding capacity, processing additional social network features, or evaluating subscriber behaviour over longer periods can all be supported simply by adding low-cost commodity servers to the existing cluster. Because processing takes advantage of data locality within each node, and data is replicated between nodes, networking can be handled by modest off-the-shelf switches and routers, with node-level fail-over in the case of connection loss or even site failure. Added to this is the linear performance scaling of Hadoop, where the throughput of the system increases as more servers are added to the cluster. For medium-sized deployments serving up to several hundred million subscribers, Idiro recommends no more than one or two cabinets of commodity servers.
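The effect of linear scaling can be illustrated with a back-of-the-envelope calculation. The per-node throughput figure below is a hypothetical placeholder for illustration only, not a measured Idiro benchmark, and real clusters lose some efficiency to replication and coordination overhead.

```python
# Hypothetical figure: assume each commodity node processes 50 GB of raw
# social network data per hour. This is an illustrative assumption, not
# a measured benchmark.
PER_NODE_GB_PER_HOUR = 50

def hours_to_process(data_gb, nodes):
    """Under ideal linear scaling, cluster throughput grows with node count."""
    return data_gb / (PER_NODE_GB_PER_HOUR * nodes)

# Doubling the cluster roughly halves the processing time for the same load.
print(hours_to_process(1000, 10))  # -> 2.0
print(hours_to_process(1000, 20))  # -> 1.0
```

The same relationship runs the other way: as daily data volumes grow, throughput can be restored by adding nodes rather than replacing the existing hardware.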
In summary, Hadoop provides Idiro with the capability to scale processing to match any network size, and offers reduced overall capital costs compared with licensing a well-known enterprise database.