Big Data? Data Scientists? Random Buzzword?

At Cadre we prefer to avoid buzzwords. Rather than jumping on the catchphrase of the moment and over-promising what it can deliver, we'd rather give our clients well-grounded analysis that yields key business and research insight. So in the paragraphs that follow, instead of riding the "Big Data" bandwagon, we'll simply give you an overview of Big Data and the role of the Data Scientist.

Lots of Data

Today's computers, sensors, and information systems generate lots of data, lots and lots of data: sales figures, patient responses, customer feedback, manufacturing data, employee performance, sensor readings, experimental measurements, and screening results. When these datasets become too large for conventional analysis, they are referred to as "Big Data". Working with high-volume, high-velocity, and/or high-variety data at this scale requires advanced analytic techniques. In most cases, big data contains key pieces of actionable intelligence -- knowledge that could provide insight, opportunity, and advantage -- if only you could get at it.

Deep Learning

Deep Learning is emerging as an extremely promising approach to pattern recognition. The newest variant of artificial neural networks, it builds a learning framework from a multi-layered hierarchy of nodes: lower levels of the network learn a representation of the raw data, and higher levels learn to recognize informative features (or patterns) within it. In this way, deep learning exploits the sheer volume of big data to automatically learn features that would otherwise have to be specified by a human expert. Deep learning has achieved great success in speech and image recognition; Google's voice recognition software uses this approach. The unofficial birthplace of deep learning is the University of Toronto, where several of Cadre's researchers were faculty or students.
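The layered hierarchy described above can be sketched in a few lines of NumPy. This is a toy illustration, not production code: the layer sizes are invented, and the random weights stand in for what training on real data would actually produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # Simple nonlinearity applied between layers.
    return np.maximum(0.0, x)

# Toy raw data: 4 samples, each with 8 measurements.
X = rng.normal(size=(4, 8))

# Lower layer: re-represents the raw inputs (random stand-in weights).
W1 = rng.normal(size=(8, 16))
h1 = relu(X @ W1)

# Higher layer: combines lower-level outputs into informative features.
W2 = rng.normal(size=(16, 4))
h2 = relu(h1 @ W2)

# Output layer: e.g. scores for 3 categories in a recognition task.
W3 = rng.normal(size=(4, 3))
scores = h2 @ W3

print(scores.shape)  # one row of category scores per sample
```

Each matrix multiplication plus nonlinearity is one "layer"; stacking several of them is what puts the "deep" in deep learning.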

Finding the Signal in the Noise

Data Scientists come from a range of disciplines but are generally statistics and machine learning gurus. A key component of success in the analysis of Big Data is matching the science with the math. In other words, the Data Scientist should not treat the dataset as a mere collection of numbers. To build an accurate model, the researcher needs to fully understand where the data came from and how it relates to the measurable outcomes of interest. Collaboration between domain scientists and data scientists is therefore critical for success.

Delicate Work

While the datasets are large, it's important to handle them carefully. Each application is different, but there are generally three steps.
  •  First, prior to analysis, data should be appropriately normalized and outliers handled.
  •  Second, the team selects an appropriate data model and fits model parameters.
  •  Finally, a number of methods are used to interrogate the model and identify findings of statistical significance.
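The three steps above can be sketched on a small synthetic dataset. The specific choices here -- z-score normalization with 3-sigma clipping, a least-squares linear model, and a permutation test -- are illustrative examples of each step, not a prescription for every problem.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic dataset: one predictor with a real (but noisy) effect.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)

# Step 1: normalize (z-score) and handle outliers (clip at 3 sigma).
def preprocess(v):
    z = (v - v.mean()) / v.std()
    return np.clip(z, -3.0, 3.0)

x_p, y_p = preprocess(x), preprocess(y)

# Step 2: select and fit a model -- here, a simple least-squares line.
slope = np.polyfit(x_p, y_p, 1)[0]

# Step 3: interrogate the model -- a permutation test estimates how
# often a slope this large appears when the pairing is shuffled
# (i.e. when there is no real signal).
null_slopes = np.array([abs(np.polyfit(rng.permutation(x_p), y_p, 1)[0])
                        for _ in range(1000)])
p_value = np.mean(null_slopes >= abs(slope))
print(f"slope={slope:.3f}, permutation p-value={p_value:.3f}")
```

A small estimated p-value here means the fitted relationship is unlikely to be an accident of sampling, which is exactly the kind of finding the third step is meant to surface.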

Of course, the analysis is not always straightforward, and the steps above don't apply to every area or problem within machine learning. In most cases, however, the two most insidious problems to strike the novice data scientist are multiple hypothesis testing and over-fitting. Without correcting for multiple hypothesis testing, a data scientist is likely to find patterns in a large dataset that are not really there and that will not hold up under future analysis. If a data scientist over-fits the data, you are likely left with a model that performs poorly when deployed in a real-world environment. Both problems will cost you time and money -- not to mention possible embarrassment.
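Both pitfalls are easy to demonstrate on invented data. In the first part of this sketch, all 100 hypotheses are truly null, yet a naive 0.05 cutoff still "discovers" patterns; a Bonferroni correction (one standard remedy) tightens the cutoff. In the second part, a needlessly flexible polynomial fits its training points almost perfectly but generalizes poorly. The test counts, sample sizes, and polynomial degree are all made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Multiple hypothesis testing ---
n_tests = 100
# Under the null hypothesis, p-values are uniformly distributed, so a
# naive 0.05 cutoff yields roughly 5 false "discoveries" per 100 tests.
p_values = rng.uniform(size=n_tests)
naive_hits = np.sum(p_values < 0.05)
bonferroni_hits = np.sum(p_values < 0.05 / n_tests)  # corrected cutoff
print(f"naive: {naive_hits} 'discoveries', Bonferroni: {bonferroni_hits}")

# --- Over-fitting ---
# True relationship is a simple line; the model is a degree-12 polynomial.
x_train = rng.uniform(-1, 1, size=15)
y_train = x_train + rng.normal(scale=0.2, size=15)
x_test = rng.uniform(-1, 1, size=100)
y_test = x_test + rng.normal(scale=0.2, size=100)

coeffs = np.polyfit(x_train, y_train, deg=12)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE={train_err:.4f}, test MSE={test_err:.4f}")
```

The gap between training error and test error is the signature of over-fitting: the model has memorized the noise in the training set rather than learning the underlying relationship.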