26/06/2015
By Ziad Nejmeldeen, Chief Scientist, Dynamic Science Labs
There is no problem that Big Data cannot solve, or so the story goes today. But is Big Data really the answer to the data problems facing today’s organisations?
Almost all modern disciplines involve massive amounts of data. Looking back to 1995, however, areas like Political Science, Cultural Studies, and Sociology did not wrestle with such vast quantities. Today’s advancements allow content from humans and devices to be captured and stored rapidly, influencing nearly every field.
When the term Big Data is invoked (and today, that is rather frequently), however, the real capability of storing and then intelligently extracting meaningful information is being marginalised by hype. The topic is transformative, driving important advancements in storage, networking, and processing. But even some prominent experts believe a Big Data winter is approaching: a bubble in which results fail to meet the promises.
This is certainly true in Enterprise Software, where former DBAs are now Data Scientists, Excel macros are branded as Machine Learning algorithms, and vendors whose bread and butter is on-premise software licenses embrace terms like ‘SaaS’, ‘Cloud’, and ‘Analytics’. Many of these assertions are not based on an understanding of what the real problems are or how they should be solved; they are, in fact, major engineering and mathematical challenges.
Addressing these scientific problems requires experts from a range of quantitative disciplines, such as Mathematics, Computer Science, and Operations Research, to tackle the challenges that cut across customers and industries.
Instead of speaking in such broad-brush terms about Big Data with regard to data management and data analysis, the conversation should be about the specifics of the problem, in the context of the industry or discipline, using methods from statistics, computation, and optimisation. Even these are enormous areas, but at least they orient the focus towards a real discussion about solving problems through analysis.
One major area which has seen less hype is Big Computation and the overall complexity of the algorithms crunching large amounts of data. This complexity is measured using Big O notation, which describes how an algorithm’s running time and space requirements grow with the size of its input. Much research has been devoted to analysing common algorithms and determining their worst, average, and best-case time complexity. Big O notation is simply a way to present these estimates so that they can be compared. When building a model or applying a function to data, it is vitally important to consider how it will perform.
Common algorithms are used for things like performing calculations, sorting, and searching. Constant time is expressed as O(1), which means the algorithm requires the same fixed number of steps regardless of the amount of data involved; plotted against input size, it is a completely horizontal line. At the other end of the spectrum is O(n!), a factorial expression indicating that even a slight increase in task size multiplies the number of operations required enormously. You want to avoid methods like this.
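As a rough illustration of that contrast, here is a minimal sketch (my own, not from the article; the function names and the toy tour problem are assumptions for illustration) comparing a constant-time dictionary lookup with a factorial-time brute-force search:

```python
import math
from itertools import permutations

def constant_time_lookup(table, key):
    # O(1): a dictionary lookup takes the same number of steps
    # whether the table holds 10 entries or 10 million.
    return table[key]

def brute_force_shortest_tour(distances):
    # O(n!): trying every ordering of n stops. Adding even one more
    # stop multiplies the work by another factor of n.
    n = len(distances)
    best = None
    for order in permutations(range(n)):
        length = sum(distances[order[i]][order[i + 1]] for i in range(n - 1))
        if best is None or length < best:
            best = length
    return best

# Factorial growth: 8 stops -> 40,320 orderings; 12 stops -> 479,001,600.
for n in (8, 10, 12):
    print(n, math.factorial(n))
```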
Combinatorial complexity can have a huge impact even on small sample sizes when time is constrained. This matters with the move to form factors like mobile and tablet, which has fundamentally shifted how humans interact with computers. Instead of hitting calculate and heading to the water cooler, people demand real-time responses delivered to their mobile device. Furthermore, the proliferation of sensors, and their incorporation into larger systems, has altered how machines behave. Consider the use of sensor data in automobiles, which can, in near real time, issue safety instructions to the vehicle when it senses that a crash is imminent.
One area seeing major benefits through this approach is the optimal assignment of patients to nurses at the start of a shift. There are myriad considerations a charge nurse weighs when performing this manual ritual. One important consideration is room location, which determines which rooms a nurse will visit on their shift. Of course, the acuity and type of patient in a room, the type of nurse who might cover it, and so on all factor into that decision. But if the rooms are too far apart, you probably don’t want nurses running from one end of the unit to the other. To computationally determine this factor of room proximity, the algorithm needs to investigate ALL rooms for EACH nurse in conjunction with the other constraints. In a unit with 32 patients and 8 nurses, there are 10.5M combinations. If the shift requires just two more nurses, this quickly jumps to 64M combinations to evaluate. Evaluate multiple units and you can see how this problem becomes very large very quickly.
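For what it is worth, the figures above line up with a simple combination count: C(32, 8) ≈ 10.5M and C(32, 10) ≈ 64.5M. Whether that is exactly the counting model the scheduling algorithm uses is an assumption on my part, but the arithmetic is easy to check:

```python
from math import comb  # Python 3.8+

# Assumption: the count is "ways of choosing k of the 32 rooms",
# one per nurse on the shift; the real model has more constraints.
print(comb(32, 8))   # 10,518,300 -> the "10.5M combinations" for 8 nurses
print(comb(32, 10))  # 64,512,240 -> the "64M combinations" for 10 nurses
```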
If you look at the amount of data used in this example, you are still looking at fewer than 100 records. Ultimately, while this problem does not run in constant time, it is not an issue for modern processors, which execute millions if not billions of instructions per second. If the problem were altered to evaluate several large units at once, approximation methods such as heuristics could be used to reduce complexity and running time. In its current form, however, this problem has little if anything to do with Big Data.
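As an example of what such a heuristic might look like, here is a minimal greedy sketch (my own illustration, not the algorithm described above): each room is handed to the nurse with spare capacity whose current rooms are nearest, which takes roughly patients × nurses steps instead of exploring every combination, at the cost of guaranteed optimality.

```python
def greedy_assign(rooms, nurses, distance, capacity):
    """rooms / nurses: lists of ids; distance(a, b): proximity between two
    rooms (a hypothetical helper supplied by the caller); capacity: maximum
    rooms per nurse."""
    assignment = {n: [] for n in nurses}
    for room in rooms:
        candidates = [n for n in nurses if len(assignment[n]) < capacity]
        # Prefer the nurse whose existing rooms are nearest on average;
        # a nurse with no rooms yet scores 0.0 and so picks up work first.
        best = min(
            candidates,
            key=lambda n: (
                sum(distance(room, r) for r in assignment[n]) / len(assignment[n])
                if assignment[n]
                else 0.0
            ),
        )
        assignment[best].append(room)
    return assignment

# Toy usage: 32 rooms along one corridor, proximity = how far apart they are.
plan = greedy_assign(
    rooms=list(range(32)),
    nurses=list(range(8)),
    distance=lambda a, b: abs(a - b),
    capacity=4,
)
```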
As near real-time solutions become more important, both in how we interact with devices and in how smart sensors interact with each other, it is increasingly important that computational complexity is evaluated. To really solve challenging problems, it is vital we stop treating everything as a general Big Data problem; the term itself can be misleading. Many of the more mature organisations would agree that not everything is a nail that needs to be smashed by the Big Data hammer.