What data do I need?
The best course of action is often to think about your problem before considering the data. “I think AI should go through all of our data and identify problems” is a sentiment we hear often, but it makes for an unfocused objective. In an earlier blog post, we discussed how a clear problem definition helps you create better solutions.
Once the problem is clearly defined, it is time to think about the data. The quality of the data used to create your algorithmic solution directly affects the quality of its output. Put shortly: garbage in, garbage out. One simple way to test whether your data is up to the task is to check whether an expert could solve the problem given the data.
If a human can solve the problem with the given data, a computer usually can too. A computer is generally faster and more efficient, which is why we want to automate things in the first place, but a human can often tell at a glance whether the data is garbage or not.

There is, however, one major area where computers outperform human experts: combining information from several data sources. A senior mechanic might be able to tell how an engine is performing based on the sound alone. That skill comes from years of experience and is hard to teach to a new hire. So if an expert can judge performance from sound alone, what if we outfit the engine with not only a microphone but also sensors that measure pressure, temperature, and vibration? This additional information would let an AI system estimate engine quality far more precisely, picking up on signals that would normally be invisible to the engineer.
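To make the idea concrete, here is a minimal sketch in Python using scikit-learn. All sensor readings and the "engine health" score are made up for illustration; the point is only that features from several sensors can be fused into one feature vector and fed to a single model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical sensor readings: one row per engine measurement window.
# All values here are synthetic, generated purely for illustration.
rng = np.random.default_rng(42)
n_samples = 1000
X = np.column_stack([
    rng.normal(0.5, 0.1, n_samples),    # audio feature (e.g. RMS energy)
    rng.normal(2.0, 0.3, n_samples),    # pressure (bar)
    rng.normal(80.0, 5.0, n_samples),   # temperature (deg C)
    rng.normal(0.02, 0.01, n_samples),  # vibration amplitude (mm)
])
# A made-up "engine health" score that depends on all four signals.
y = 1.0 - (0.3 * X[:, 0] + 0.2 * X[:, 1] / 3 + 0.3 * (X[:, 2] - 75) / 20
           + 0.2 * X[:, 3] / 0.05) + rng.normal(0, 0.05, n_samples)

# One model sees all sensors at once, unlike the mechanic who hears only sound.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"R^2 on held-out data: {model.score(X_test, y_test):.2f}")
```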
Collecting data the smart way
So what if I know exactly what data I need but don’t have it yet? The most straightforward option is to simply start collecting it. However, some AI techniques require a large amount of data, which can be cost- or time-prohibitive to collect. Instead of collecting data manually, it may be possible to generate it in a simulated environment. This could mean simulating a full environment using programs such as Blender or Unity. For other applications, we can build statistical models from which we can sample as much data as we need.
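As a rough sketch of the statistical-model route, suppose you only have a handful of real measurements. You can fit a simple distribution to them (here a Gaussian, chosen purely for illustration) and then sample as much synthetic data as the downstream model needs:

```python
import numpy as np

# Stand-in for a small set of real measurements (generated here for the demo).
rng = np.random.default_rng(0)
real_measurements = rng.normal(loc=10.0, scale=2.0, size=30)

# Fit a simple statistical model -- a Gaussian -- to the real data...
mu, sigma = real_measurements.mean(), real_measurements.std(ddof=1)

# ...and sample as much synthetic data as we need from it.
synthetic_data = rng.normal(loc=mu, scale=sigma, size=100_000)
print(f"Fitted N({mu:.2f}, {sigma:.2f}); drew {synthetic_data.size} synthetic samples")
```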

Using Blender you can simulate complete environments. You are in control of the light, the background, the reflectivity of objects, and even their exact size and shape. If your goal is to detect objects in a warehouse, a person walking around with a camera taking pictures of objects all day would still not capture the variety of data you might need for the system to work in every environment. On top of that, simulated data immediately comes with the correct labels. Read more about this in our blog post about data augmentation.
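Here is a minimal Blender scripting sketch of that idea, run inside Blender's own Python (for example with `blender --background --python render_dataset.py`). The object and light names ("Cube", "Light") are assumptions matching Blender's default scene, not necessarily your own .blend file, and the script assumes a Blender 2.8+ API:

```python
import random
import bpy

obj = bpy.data.objects["Cube"]      # assumed name from the default scene
light = bpy.data.lights["Light"]    # assumed name from the default scene

for i in range(100):
    # Randomize lighting and object pose so the dataset covers many conditions.
    light.energy = random.uniform(100, 2000)
    obj.location = (random.uniform(-2, 2), random.uniform(-2, 2), 0)
    obj.rotation_euler = (0, 0, random.uniform(0, 6.283))

    # Render the frame. Since we placed the object ourselves, its exact
    # position and pose double as a free ground-truth label.
    bpy.context.scene.render.filepath = f"/tmp/warehouse_{i:03d}.png"
    bpy.ops.render.render(write_still=True)
```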
Do I need to label my data?
Utilizing synthetic data immediately produces ground-truth labels that can be used with supervised learning techniques. However, synthetic data cannot solve every problem. In those cases, it is worth considering techniques that do not require labeled data at all.
To illustrate, let's consider the task of detecting bad welds from pictures taken with a simple phone camera. One approach is to have a group of people label a large set of pictures as good or bad, but this is error-prone and time-consuming. Unsupervised learning techniques instead let us cluster the pictures based on the distinctive features of each weld, without any labels. Good welds would cluster together, away from bad welds, making it possible to tell the two apart.
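A minimal sketch of that clustering approach, assuming a hypothetical folder `weld_photos/` of phone-camera pictures. The features here are deliberately crude (downscaled grayscale pixels); a real pipeline would more likely use embeddings from a pretrained vision model:

```python
from pathlib import Path

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# "weld_photos/" is a hypothetical folder of phone-camera weld pictures.
paths = sorted(Path("weld_photos").glob("*.jpg"))

# Crude features: downscale each image to 32x32 grayscale and flatten.
features = np.array([
    np.asarray(Image.open(p).convert("L").resize((32, 32)), dtype=np.float32).ravel()
    for p in paths
])

# Cluster into two groups with no labels at all; ideally one cluster
# ends up corresponding to good welds and the other to bad ones.
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(features)
for path, label in zip(paths, labels):
    print(label, path.name)
```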
Lastly, foundational models such as large language models can also be used to build solutions without having to introduce any of your own data at all. The boom in language models such as ChatGPT, Gemini, and DeepSeek has made tasks such as summarizing or sentiment analysis much more accessible. Privacy concerns aside, you could give it your text and get a good result. This gives a chance to build low-cost experiments without having to train a whole model yourself.
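For example, here is a sketch using the Hugging Face `transformers` library, which downloads a small pretrained sentiment model on first use. No training or labeled data of your own is required, and since it runs locally it also sidesteps the privacy concern of sending text to a hosted API:

```python
from transformers import pipeline

# Loads a default pretrained sentiment model (downloaded on first use).
classifier = pipeline("sentiment-analysis")

reviews = [
    "The delivery was fast and the product works perfectly.",
    "Terrible experience, the package arrived broken.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(f"{result['label']} ({result['score']:.2f}): {review}")
```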

How to get started building my AI solution
When building an AI solution for your business problem, you should think about the data required to make it happen. Here is my suggestion for how to utilize data when starting to build an AI application.
- Check if foundation models can be used to solve your problem. This might mean using public-domain data or a small set of your own data, and it is often possible in the computer vision and language domains.
- If there is some data available specifically for your task, try building a simple model and check its quality (a minimal baseline sketch follows this list). If the model performs okay but does not meet the requirements, that is a good indicator that more quality data is needed to get the results you want.
- If, after testing with step 1 or 2, the results look okay but not good enough, it is worth exploring whether synthetic data can be employed to fit your model exactly to your problem.
- Lastly, push for the last improvement in accuracy by labeling a small set of data by hand. This can bridge the gap between synthetic and real-world data.
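As promised in step 2, here is a minimal baseline sketch in Python with scikit-learn. The dataset is randomly generated as a stand-in for your own, and the accuracy requirement is a hypothetical business threshold; the pattern is simply train, evaluate, and compare against the requirement:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for "some data available specifically for your task":
# a small, randomly generated binary classification set.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200) > 0).astype(int)

# Train a deliberately simple baseline and measure it on held-out data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

REQUIRED_ACCURACY = 0.95  # hypothetical business requirement
print(f"Baseline accuracy: {accuracy:.2f}")
if accuracy < REQUIRED_ACCURACY:
    print("Below requirement: a signal that more (or better) data is needed.")
```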
Want to know how this can be applied to your specific problem? Reach out to us at Emblica and we can discuss what would best fit your needs.
Emblica is not your average data team. We build customized solutions for collecting, processing, and utilizing data for all sectors, especially at the R&D interface. Whether our target is a factory line, an online store, or a field, you can find us busy at work, hands in the clay, at least at our office in Helsinki.