TLDR: A few things to consider before adding new data to your decision process.
Everywhere one looks there are thought pieces about more data, big data, hot data, and fast data. But more does not necessarily mean better. It is critically important to start with a thorough understanding of why the data is needed: in the decision process, what actions are being chosen between, and what are the implications, stakeholders, success factors, and risks of bringing new data into the process?
Not all data contains something useful or unique; it is quite normal for most data to be unhelpful. This may be part of why “big data” solutions seem to be on the downward slope of the hype cycle. While the technology to store anything and everything has improved, the uses for all that data have not expanded at the same extreme rate. This has led many a “data lake” to turn into a “data swamp.”
When considering a new dataset for some decision process, it will be important to thoroughly examine it for a number of issues:
- Is it from a trusted source?
- Are there gaps in the data which must be ignored or treated in some way?
- How many transformations has it undergone up to this point?
- Could a more “pure” version be sourced further upstream?
- Does it cover the full population, proportionately?
- Will the information in this data be useful under all decision circumstances, or only in certain segments?
- How does this information link to data already available to the process?
- Are the values providing information that can’t already be derived? (E.g. If I know birthdate, I don’t need a data field for age)
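The last point about derivable values can be made concrete with a minimal sketch. The function below (a hypothetical helper, not from the original) derives age from a birthdate, illustrating why a separate age field would add no new information:

```python
from datetime import date

def derive_age(birthdate: date, as_of: date) -> int:
    """Derive age in whole years from a birthdate.

    If birthdate is already in the dataset, a stored 'age' field is
    redundant: it can always be recomputed as of any reference date.
    """
    years = as_of.year - birthdate.year
    # Subtract one if the birthday has not yet occurred this year.
    if (as_of.month, as_of.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years
```

A stored age field is worse than redundant: it silently goes stale, while a derived value is always current as of the date you choose.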
Linking data can be a difficult task, particularly when datasets come from different places or contexts. Transformations and assumptions are necessary to bring data together. Observations about individuals need to map to their associated accounts. Observations about events in time need to map against information that remains static. Locations can be expressed in a number of different ways, and may require a third dataset to map together (e.g., latitude and longitude mapped into regional boundaries). Each change of data away from its native structure risks introducing new errors.
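A minimal sketch of that linking step, using made-up mapping tables and field names: event records about individuals are joined to static account data through an intermediate mapping, and records that fail to link are tracked explicitly rather than silently dropped.

```python
# Static account data and a person-to-account mapping table (both assumed).
accounts = {"A1": {"region": "West"}, "A2": {"region": "East"}}
person_to_account = {"p1": "A1", "p2": "A2"}

# Event observations about individuals (hypothetical records).
events = [
    {"person": "p1", "amount": 120.0},
    {"person": "p3", "amount": 40.0},  # no mapping exists for p3
]

linked, unmatched = [], []
for ev in events:
    acct = person_to_account.get(ev["person"])
    if acct is None:
        # Keep link failures visible; they measure how lossy the join is.
        unmatched.append(ev)
    else:
        linked.append({**ev, **accounts[acct]})
```

Counting the unmatched records is the point: every join like this is one of the transformations that can quietly introduce errors, and the unmatched rate tells you how much of the new dataset actually connects to what you already have.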
Even without mapping errors, data can be collected in specific contexts that are not relevant to the analysis at hand. Sensors can be calibrated differently. Forms can have pre-filled answers. Values can be aggregated or summarized in ways that obscure other useful details. Skepticism must always be applied to detect suspicious patterns and to ensure that data which may contain new information is incorporated properly.
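One cheap skepticism check, sketched below with a hypothetical helper: when a single value dominates a field, it may reflect a pre-filled form default rather than genuine responses.

```python
from collections import Counter

def dominant_value_share(values):
    """Return the most common value and its share of all records.

    A very high share (e.g. 90%+) is a red flag that the field may be
    a pre-filled default, a sensor stuck at one reading, or a value
    imputed upstream, rather than real observed variation.
    """
    counts = Counter(values)
    top_value, top_count = counts.most_common(1)[0]
    return top_value, top_count / len(values)
```

The threshold for "suspicious" is a judgment call that depends on the field; a country code dominated by one value may be fine, while a free-response survey field with the same pattern probably is not.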
Once the data has been acquired and linked, it can still be of questionable value. For example, values that are perfectly correlated with each other provide no new information, while values that vary completely randomly with respect to the target decision metrics may convey little or no value. Exploratory data analysis should always be part of the process, and techniques like principal component analysis can help clarify the value of new fields.
After all of the above, it is worth considering whether the new information is actually worth the effort. That worth can be evaluated against the motivating purpose of the analysis. While it is fun to build new data analyses, sometimes proven alternatives already exist. Sometimes a vendor can provide the dataset in a cleaner, more consistent fashion than internal efforts can. Data vendors have an economic incentive to do a good job with data collection, where other parts of your organization may not.
As an example, FICO scores used in credit risk analytics are themselves the result of many transformations and calculations, factoring in income, credit history, outstanding debt, and other credit risk factors. Lenders use these scores because they do not have the breadth of information needed to rate everyone in a country. Lenders only have data about their active customers' accounts, and nothing about what happens at competitors or other kinds of financial institutions. A credit score is not a perfect measure, and a lender can improve on it with in-house data, but it is a very good baseline.