There has been a lot of talk about AI over the last few months, mostly because of generative AI platforms such as ChatGPT, GitHub Copilot, Stable Diffusion, and DALL-E, to name a few. But how exactly are these technologies being used in the data industry?
We’ll answer that question and more in this multi-part series titled “How is AI being used in the data industry?”
First, let’s cover the basics. What is Machine Learning?
Machine learning, or ML for short, is any kind of computer algorithm that learns and adapts from data without being given explicit, step-by-step instructions for every situation.
That’s a bit vague, so perhaps an example is best here:
You could write a program that knew every arrangement of X’s and O’s on a tic-tac-toe board. That’s a lot of conditions to code into your program: there are 3 to the 9th power, or 19,683, ways to mark nine squares as X, O, or blank, and 5,478 of those can actually occur in a legal game. The program would, however, know the exact right move to make in any situation. But that’s a lot of overhead.
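To make those numbers concrete, here is a small sketch that counts board layouts two ways. It assumes X always moves first and that play stops as soon as someone wins; the function names are made up for this illustration.

```python
# Every row, column, and diagonal that wins the game.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reachable_positions():
    """Collect every board that can occur in a legal game (X moves first)."""
    seen = set()
    stack = [(" " * 9, "X")]
    while stack:
        board, player = stack.pop()
        if board in seen:
            continue
        seen.add(board)
        # A finished game (win or full board) has no further moves.
        if winner(board) or " " not in board:
            continue
        next_player = "O" if player == "X" else "X"
        for i, cell in enumerate(board):
            if cell == " ":
                stack.append((board[:i] + player + board[i + 1:], next_player))
    return seen


all_layouts = 3 ** 9                      # each of 9 squares is X, O, or blank
legal_layouts = len(reachable_positions())
print(all_layouts, legal_layouts)
```

Either way, hand-coding a rule for every layout is far more work than the game deserves.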
What if instead you allowed the program to make random choices about where to play its X pieces? If it places a piece and ends up winning, that choice was a good one. If it loses, it was a bad choice. It then ranks that choice by how good or bad it was. Later, when it comes across a similar state in a different game, it knows which choices tend to work better than others. The collection of these rankings is called a “model.”
A model is the set of ranked cause and effect pairings (or choice and outcome pairings) that the program uses to determine the best move. Models are generally created by observing millions, even billions, of inputs and outputs.
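Here is a minimal sketch of that random-play idea. It assumes X moves first, both sides pick random squares during training, and each of X’s choices is scored +1 for a win, -1 for a loss, and 0 for a draw. The names train and best_move are invented for this example, and real systems use far more sophisticated techniques.

```python
import random
from collections import defaultdict

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def train(games=20000, seed=0):
    """Play random games and rank each (board, move) choice X made."""
    rng = random.Random(seed)
    model = defaultdict(lambda: [0, 0])  # (board, move) -> [total score, times tried]
    for _ in range(games):
        board, player, x_choices = [" "] * 9, "X", []
        while winner(board) is None and " " in board:
            move = rng.choice([i for i, c in enumerate(board) if c == " "])
            if player == "X":
                x_choices.append(("".join(board), move))
            board[move] = player
            player = "O" if player == "X" else "X"
        result = winner(board)
        score = 1 if result == "X" else -1 if result == "O" else 0
        # Every choice X made in this game shares the game's outcome.
        for key in x_choices:
            model[key][0] += score
            model[key][1] += 1
    return model


def best_move(model, board):
    """Pick the empty square with the best average score seen so far."""
    legal = [i for i, c in enumerate(board) if c == " "]

    def average(move):
        total, tries = model.get((board, move), (0, 0))
        return total / tries if tries else 0.0

    return max(legal, key=average)


model = train()
print(best_move(model, " " * 9))  # the opening square the model currently ranks highest
```

Notice that nobody told the program which squares are good. The rankings come entirely from the outcomes it observed.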
This is how your credit card company spots fraud. They watch millions of transactions per day. Any time a customer disputes a transaction as fraud, the company uses the purchasing patterns around that transaction to train a model. That model can later be used to predict whether other customers’ transactions are fraudulent and to alert those customers.
This kind of ML is called “Predictive Analytics.” More formally, predictive analytics is the process of using existing data to predict future outcomes.
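As a rough illustration, the sketch below learns a fraud rate for each spending pattern from a handful of labeled transactions, then scores new ones. The transactions, the $400 cutoff, and the function names are all invented for this example; an actual card issuer would use far more data and a far richer model.

```python
from collections import defaultdict

# Made-up labeled history: (merchant category, amount in dollars, disputed as fraud?)
history = [
    ("grocery", 42.10, False),
    ("grocery", 18.75, False),
    ("electronics", 950.00, True),
    ("electronics", 120.00, False),
    ("gift_cards", 500.00, True),
    ("gift_cards", 480.00, True),
    ("gift_cards", 25.00, False),
    ("fuel", 60.00, False),
]


def pattern(category, amount):
    """Reduce a transaction to a coarse pattern (the cutoff is arbitrary)."""
    return (category, "large" if amount >= 400 else "small")


def train(transactions):
    """Learn the share of past transactions with each pattern that were fraud."""
    counts = defaultdict(lambda: [0, 0])  # pattern -> [fraud count, total count]
    for category, amount, is_fraud in transactions:
        key = pattern(category, amount)
        counts[key][0] += is_fraud
        counts[key][1] += 1
    return {key: fraud / total for key, (fraud, total) in counts.items()}


def fraud_score(model, category, amount, default=0.0):
    """Predict how likely a new transaction is to be fraud; unseen patterns get the default."""
    return model.get(pattern(category, amount), default)


model = train(history)
print(fraud_score(model, "gift_cards", 510.00))
print(fraud_score(model, "grocery", 30.00))
```

The existing data here is the disputed and undisputed history, and the future outcome is whether the next transaction turns out to be fraud.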
The main pitfall with predictive analytics is that the engineer who designs the model needs to have an “answer key” to properly train the model. For instance, if a data company wanted to determine when you were likely to buy a new car, have a baby, or move houses, it could look at all of your data: your credit history, your age, your location, whether you’re married, etc. But unless something is fed back into the algorithm to tell it, “yes, this person moved houses recently”, it’s not going to know whether the patterns in your data led to a certain outcome. It therefore cannot rank that prediction’s accuracy.
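A tiny sketch shows why the answer key matters. It assumes each prediction is simply True or False and that a missing answer is recorded as None; the accuracy function is hypothetical, written only for this example.

```python
def accuracy(predictions, answer_key):
    """Score predictions against known outcomes, skipping records with no known answer.

    Returns None when no outcomes are known, because there is nothing to score against.
    """
    scored = [(p, truth) for p, truth in zip(predictions, answer_key) if truth is not None]
    if not scored:
        return None
    return sum(p == truth for p, truth in scored) / len(scored)


# Did each person move houses? The model guessed; we only know some of the answers.
guesses = [True, False, True, True]
known = [True, None, False, None]

print(accuracy(guesses, known))                     # scored on the two known answers only
print(accuracy(guesses, [None, None, None, None]))  # no answer key, so no score
```

Without at least some known outcomes, there is no way to tell a good prediction from a bad one.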
This is why data companies often seek “ground truth” data – things they know 100% are true. We may see two people of a similar age living in the same household with the same last name, but we’ll never know for sure if they’re siblings or if they’re married – not without some sort of ground truth like a marriage certificate.
Property records, court records, purchasing habits, payroll records, and surveys are all good sources of ground truth data. If you’re looking for bulk court data to use in your ML model or for statistical research, please contact our sales team.
In our next article we’ll talk about generative AI and image recognition and how those might be used in the data industry.