Is Your Data Really Cloud-Ready?

According to Forbes, 49% of IT decision-makers say that although they would love to deploy AI and ML technologies, their data is not ready to support the requirements. Data is the single most important commodity for businesses today, so your Machine Learning strategy has to be built on great data hygiene. Last time, we talked about setting organizational goals that allow you to think big. Data will be the conduit to get you to those goals.

Here are your 3 top tips for getting your data strategy up to scratch:

1. Collect everything

What Machine Learning can achieve today is nothing compared to its likely capabilities for tomorrow. Your business also needs time to mature into its AI strategy. Because of this, all data should be collected, even if you don’t have a current use for it. Data can be aggregated from multiple sources and should be stored in an S3 bucket.

The more training examples you have, the more accurate your models are likely to be. Some companies, such as Twitter or the NY Times, offer public APIs that allow you to gather data that can complement or add to your own. Make sure you put automated mechanisms in place to collect data, as manual collection methods are slow, error-prone, and can frustrate the staff whom you need to get on board.

2. Clean and Format Data

Before you can use data to establish patterns, you need to check it for consistency and integrity. Go through the data attribute by attribute and look for issues such as missing values, invalid outliers, or badly defined labels.
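As a rough illustration of an attribute-by-attribute check, here is a minimal Python sketch. The field names, the valid temperature range, and the allowed labels are all made up for the example, and `audit_records` is a hypothetical helper rather than part of any AWS tool.

```python
from collections import Counter


def audit_records(records, numeric_ranges, allowed_labels, label_key="label"):
    """Count data-quality issues per attribute across a list of dict records.

    numeric_ranges maps a field name to its (low, high) valid range.
    All names and thresholds here are illustrative assumptions.
    """
    issues = Counter()
    for row in records:
        for field, (low, high) in numeric_ranges.items():
            value = row.get(field)
            if value is None:
                issues[(field, "missing")] += 1
            elif not (low <= value <= high):
                issues[(field, "out_of_range")] += 1
        if row.get(label_key) not in allowed_labels:
            issues[(label_key, "bad_label")] += 1
    return issues


# Hypothetical sample data: one missing value, one invalid outlier, one bad label.
records = [
    {"temperature_c": 21.5, "label": "ok"},
    {"temperature_c": None, "label": "ok"},
    {"temperature_c": 900.0, "label": "fault"},
    {"temperature_c": 19.0, "label": "okay?"},
]

report = audit_records(
    records,
    numeric_ranges={"temperature_c": (-50.0, 60.0)},
    allowed_labels={"ok", "fault"},
)

for (field, problem), count in sorted(report.items()):
    print(f"{field}: {problem} x{count}")
```

A report like this tells you where to focus your cleaning effort before any model sees the data.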

This check will also surface inconsistent units of measurement, for example, one record measuring temperature in Fahrenheit and another in Celsius. A common issue involves financial information, where one dataset might write $400.00 while another says $400, or even 4 hundred dollars. Formatting these datasets consistently at this stage is important; otherwise, model training will be slower or less accurate.
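To make the formatting step concrete, here is a small Python sketch that brings the three dollar formats mentioned above into one representation and converts Fahrenheit readings to Celsius. The helper names and the short list of supported number words are assumptions for illustration; real-world data will need broader rules.

```python
import re
from decimal import Decimal

# Assumed, deliberately small set of number words for this example.
_WORD_MULTIPLIERS = {"hundred": 100, "thousand": 1000}


def parse_dollars(text):
    """Convert '$400.00', '$400', or '4 hundred dollars' to a Decimal amount."""
    cleaned = (
        text.lower()
        .replace("$", "")
        .replace(",", "")
        .replace("dollars", "")
        .strip()
    )
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(hundred|thousand)?", cleaned)
    if match is None:
        # Fail loudly so unknown formats are reviewed rather than guessed.
        raise ValueError(f"Unrecognized amount: {text!r}")
    amount = Decimal(match.group(1))
    if match.group(2):
        amount *= _WORD_MULTIPLIERS[match.group(2)]
    return amount.quantize(Decimal("0.01"))


def fahrenheit_to_celsius(degrees_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (degrees_f - 32) * 5 / 9


for raw in ["$400.00", "$400", "4 hundred dollars"]:
    print(raw, "->", parse_dollars(raw))
```

Raising an error on unrecognized input is a deliberate choice here: silently guessing at a value would put bad data straight back into the training set.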

3. Split your Data into Training and Test Sets

Creating an ML model requires two separate sets of data: one to train the model, and another to test it through evaluation. Take advice on the ratio of your split, which can differ depending on the type of data and the best practices for your specific technology. The important things to remember are that your test set must be representative of your entire dataset, and that it must be large enough to give you statistically meaningful results. AWS tools can support you with this, as well as with versioning and cataloging your original data so that you can track it alongside your algorithms as they evolve through the refinement process.
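A simple random split can be sketched in a few lines of Python. The 80/20 ratio and the fixed seed below are assumptions for the example, not a recommendation for your data, and `train_test_split` here is a hand-written helper rather than an AWS or library function.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the original data is left untouched
    random.Random(seed).shuffle(shuffled)
    test_size = max(1, round(len(shuffled) * test_fraction))
    return shuffled[test_size:], shuffled[:test_size]


rows = list(range(100))  # stand-in for 100 cleaned records
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), "training rows,", len(test_rows), "test rows")
```

Shuffling before splitting helps the test set resemble the whole dataset on average. If some classes are rare, or the data is ordered by time, a stratified or time-based split is usually a better fit than a purely random one.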

Your data is the foundation for seeing the benefits of Machine Learning models, but without the right preparation, its value will be heavily reduced. The right AWS partner can guide you in how to gather, aggregate, clean, filter and sort your data, getting you in the best shape to start your Machine Learning journey.

Now that you’ve got the data side sorted, our next blog will turn to the people who matter the most: your current employees and engineers.

Shlomi Itzhak

VP Delivery, Data And DevOps