Common Challenges in Machine Learning and How to Tackle Them
Machine learning continues to become more accessible every day, and one exciting development is the ready availability of machine learning models. Data is at the heart of any machine learning problem: it is used to train, validate, and test models, and a model’s performance must be reported on independent test data rather than on the training or validation sets. Finally, the data needs to be split so that all three datasets (training, validation, and test) have similar statistical characteristics.
The first crucial step in a standard machine learning workflow after data cleansing is training: the process of passing training data to a model so that it learns to identify patterns. After training, the next step is testing, where we examine how the model performs on data outside of the training set. This step is known as model evaluation.
We may need to run training and evaluation multiple times, perform additional feature engineering, and tweak the model architecture. Once the model’s performance is satisfactory during the evaluation phase, the model is deployed so that others can access it to make predictions.
As data scientists, it is vital to translate the product team’s needs into model requirements. For example, if the product team states that false negatives are five times more costly than false positives, the model should be optimized for recall over precision when it is designed. It is also essential to balance the product team’s goal of optimizing for accuracy with minimizing the model’s loss.
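To make this concrete, here is a minimal sketch (with synthetic data and made-up model scores, not a real model) of picking a decision threshold when false negatives are assumed to cost five times as much as false positives; lowering the threshold trades precision for recall:

```python
# Minimal sketch: picking a decision threshold when false negatives
# cost five times more than false positives. The data here is synthetic.
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=1000)                               # ground-truth labels
y_prob = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, 1000), 0, 1)   # fake model scores

FN_COST, FP_COST = 5.0, 1.0
best_threshold, best_cost = None, np.inf
for threshold in np.linspace(0.05, 0.95, 19):
    y_pred = (y_prob >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    cost = FN_COST * fn + FP_COST * fp
    if cost < best_cost:
        best_threshold, best_cost = threshold, cost

print(f"threshold={best_threshold:.2f}, expected cost={best_cost:.0f}")
```

Because missed positives are so expensive here, the cost-minimizing threshold tends to sit well below 0.5, which is exactly the recall-over-precision trade-off described above.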
Tools for the Data and Model
A variety of products provide tools for solving data and machine learning problems. These are a few of them:
BigQuery
BigQuery is an enterprise data warehouse designed for analyzing large datasets quickly with SQL. In BigQuery, data is organized into datasets, and a dataset can contain multiple tables.
BigQuery ML
BigQuery ML is a tool for building models from data stored in BigQuery. With BigQuery ML, we can train, evaluate, and generate predictions on our models using SQL. It supports classification and regression models, along with unsupervised clustering models. It’s also possible to import previously trained TensorFlow models to BigQuery ML for prediction.
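As an illustration, here is a minimal sketch of training and evaluating a BigQuery ML model from Python using the google-cloud-bigquery client; the project, dataset, table, and column names are hypothetical placeholders:

```python
# Minimal sketch of training and evaluating a BigQuery ML model from Python.
# The project, dataset, table, and column names below are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumes credentials are configured

train_sql = """
CREATE OR REPLACE MODEL `my-project.my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `my-project.my_dataset.customers`
"""
client.query(train_sql).result()  # runs the training job and waits for completion

eval_sql = "SELECT * FROM ML.EVALUATE(MODEL `my-project.my_dataset.churn_model`)"
for row in client.query(eval_sql).result():
    print(dict(row))  # precision, recall, accuracy, and other evaluation metrics
```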
Challenges in Machine Learning
The process of building ML systems presents many different challenges that influence ML architecture design. By identifying these challenges, we can address them with the approaches described below.
These are some fundamental challenges in machine learning:
Data Quality
Machine learning models are only as reliable as the data they are trained on, and they must generalize well: a model should be neither overfitted nor underfitted. Data is a significant factor in the reliability of any model. Suppose a model is trained on a deficient dataset, on poorly selected features, or on data that doesn’t accurately represent the population using the model. In that case, the model’s predictions will directly reflect that data. Data quality should be assessed along four dimensions: accuracy, completeness, consistency, and timeliness.
Data Accuracy
Data accuracy refers to both the training data’s features and the ground truth labels that correspond to those features. Suppose a machine learning model is trained on a deficient dataset, on data with inadequately selected features, or on data that doesn’t accurately represent the population using the model. In that case, the model’s predictions will be a direct reflection of that data, and the model will either overfit or underfit.
Duplicates in the training dataset, for example, can cause an ML model to inaccurately assign more weight to those data points.
These are the operations to perform to maintain data quality (a short sketch of such checks follows the list):
- Understanding where the data came from and any potential errors in the data collection steps, which helps ensure feature accuracy.
- Screening for typos.
- Identifying duplicate entries.
- Measuring inconsistencies in tabular data.
- Analyzing missing features.
- Identifying any other errors that may affect data quality.
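As a starting point, here is a minimal pandas sketch of a few of these checks; the column names (“age”, “city”) are hypothetical and stand in for your own schema:

```python
# Minimal sketch of basic data-quality checks with pandas; the column names
# ("age", "city") are hypothetical and stand in for your own schema.
import pandas as pd

df = pd.DataFrame({
    "age":  [34, 34, None, 29, 151],          # includes a missing value and an outlier
    "city": ["Austin", "Austin", "austin", "Boston", "Boston"],
})

print(df.duplicated().sum())                  # identify exact duplicate rows
print(df.isna().sum())                        # count missing values per column
print(df["city"].str.lower().value_counts())  # surface inconsistent spelling/casing
print(df["age"].describe())                   # summary stats help flag implausible values
```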
Accurate data labels are just as crucial as feature accuracy: the model relies only on the ground truth labels in the training data to update its weights and minimize loss, so incorrectly labeled training examples can produce misleading model accuracy.
For instance:
Let’s say you are developing a sentiment analysis model, and 25% of your “positive” training examples have been incorrectly labeled as “negative.” Your model will have an inaccurate picture of what should count as negative sentiment, and this will be directly reflected in its predictions.
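The effect is easy to demonstrate on synthetic data; in this minimal sketch (using scikit-learn with an arbitrary generated dataset and a simple logistic regression, not a sentiment model), flipping 25% of the positive training labels noticeably hurts test accuracy:

```python
# Minimal sketch of how mislabeled training data degrades a model; the data
# is synthetic and the 25% flip rate mirrors the example above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Flip 25% of the positive training labels to negative.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_train == 1)
y_noisy = y_train.copy()
y_noisy[rng.choice(pos, size=len(pos) // 4, replace=False)] = 0

clean_acc = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)
noisy_acc = LogisticRegression(max_iter=1000).fit(X_train, y_noisy).score(X_test, y_test)
print(f"clean labels: {clean_acc:.3f}, noisy labels: {noisy_acc:.3f}")
```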
Data Completeness
Data completeness is easiest to understand with an example. Let’s take a model that is being trained to identify cat breeds. You train the model on a large dataset of cat images, and the resulting model can classify these images into 1 of 10 possible cat breed categories, such as Bengal or Siamese, with 99% accuracy.
Now you deploy this model to production and find that, in addition to uploading cat photos for classification, many users are uploading pictures of dogs and are frustrated with the model’s results.
Because the model was trained only to identify 10 distinct cat breeds, it will slot any input into one of these 10 categories, no matter what you feed it. It may even do so with high confidence for an image that looks nothing like a cat. There is no way for it to say “not a cat” if that data and label weren’t included in the training dataset.
An essential aspect of data completeness is ensuring that the training data contains a diverse representation of each label. For example, suppose you are developing a model to predict the price of real estate in a particular city but only include training examples of houses larger than 3,000 square feet. In that case, your resulting model will perform poorly on smaller houses.
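Before training, it is worth checking both label coverage and feature coverage; here is a minimal sketch, with hypothetical columns drawn from the two examples above:

```python
# Minimal sketch of checking label and feature coverage before training;
# the columns ("breed", "sqft") are hypothetical examples from the text above.
import pandas as pd

df = pd.DataFrame({
    "breed": ["Bengal", "Siamese", "Bengal", "Persian", "Bengal"],
    "sqft":  [3200, 4100, 3500, 3800, 5000],
})

# Is every expected label represented, and is any class badly under-represented?
print(df["breed"].value_counts(normalize=True))

# Does the numeric feature cover the range we expect in production?
print(df["sqft"].agg(["min", "max"]))  # here, nothing below 3,000 sqft is covered
```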
Data Consistency
Data inconsistencies can be observed in both data features and labels, and there should be standards in place to help ensure consistency across datasets. For example, let’s say the government is collecting atmospheric data from temperature sensors. If each sensor has been calibrated to a different standard, this will lead to inaccurate and misleading model predictions [1]. Data can contain inconsistencies such as the following (a small normalization sketch follows the list):
- Differences in measurement units, such as miles versus kilometers.
- Inconsistent location data, where some people write out a full street name such as “Main Street” while others abbreviate it as “Main St.”
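Here is a minimal sketch of normalizing both issues with pandas; the conversion factor and abbreviation rule are illustrative only:

```python
# Minimal sketch of enforcing consistency: converting units and normalizing
# street abbreviations. The conversion factor and regex rule are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "distance": [5.0, 8.05, 3.2],
    "unit":     ["miles", "km", "miles"],
    "address":  ["12 Main St.", "98 Oak Street", "7 Pine St"],
})

# Convert everything to kilometers so the feature has a single unit.
KM_PER_MILE = 1.60934
df["distance_km"] = df["distance"].where(
    df["unit"] == "km", df["distance"] * KM_PER_MILE
)

# Expand a trailing "St"/"St." abbreviation so it matches "Street".
df["address_norm"] = df["address"].str.replace(r"\bSt\.?$", "Street", regex=True)

print(df[["distance_km", "address_norm"]])
```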
Data Timeliness
Timeliness in data refers to the latency between when an event happens and when it is added to the database.
For example, in a dataset capturing credit card transactions, it might take one day from when a transaction happens until it is reported in the system. To handle timeliness, it is helpful to record as much information as possible about a particular data point and to make sure that information is reflected when you transform your data into features for a machine learning model.
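A minimal sketch of tracking that latency, and of restricting features to data that was actually available at prediction time, might look like this (the timestamps are made up):

```python
# Minimal sketch of tracking timeliness: record both when an event happened and
# when it landed in the database, then measure the lag. Timestamps are made up.
import pandas as pd

df = pd.DataFrame({
    "transaction_time": pd.to_datetime(["2021-05-01 10:00", "2021-05-02 14:30"]),
    "ingested_time":    pd.to_datetime(["2021-05-02 09:00", "2021-05-02 15:00"]),
})

df["latency"] = df["ingested_time"] - df["transaction_time"]
print(df["latency"])  # how stale each record was when it became available

# When building features "as of" a prediction time, only use rows already ingested.
prediction_time = pd.Timestamp("2021-05-02 12:00")
available = df[df["ingested_time"] <= prediction_time]
print(len(available), "of", len(df), "rows were actually available")
```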
Data Reproducibility
Machine learning models have an inherent element of randomness. During training, an ML model’s weights are initialized with random values, and these weights then converge as the model iterates and learns from the data. Because of this, the same model code given the same training data can produce different results across training runs. This variation introduces a challenge of reproducibility: if you train a model to 98.1% accuracy, a repeated training run is not guaranteed to reach the same result, which makes it hard to compare results across experiments [1].
To address this problem of repeatability, it is common to set the random seed used by the model to guarantee that the same randomness is applied each time training is run.
Beyond the seed, the following aspects of training an ML model also need to be fixed to ensure reproducibility (see the sketch after this list):
- The data used.
- The splitting mechanism used to generate the training and validation datasets.
- Data preparation and model hyperparameters.
- Training variables such as the batch size.
- The learning rate schedule.
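Here is a minimal sketch of pinning these down in Python; the seed value and the config values are arbitrary, and the TensorFlow call assumes TensorFlow 2.x is installed:

```python
# Minimal sketch of pinning down the sources of randomness listed above so a
# training run can be repeated; the seed value itself is arbitrary.
import random

import numpy as np
import tensorflow as tf  # assumes TensorFlow 2.x
from sklearn.model_selection import train_test_split

SEED = 42
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy (data shuffling, some weight initializers)
tf.random.set_seed(SEED)   # TensorFlow weight initialization and shuffling

# Fix the splitting mechanism so train/validation sets are identical on every run.
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)

# Hyperparameters and training variables such as batch size and the learning
# rate should also be recorded, e.g. in a config dict kept under version control.
config = {"batch_size": 64, "learning_rate": 1e-3, "epochs": 10}
```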
Data Drift
Machine learning models typically represent a static relationship between inputs and outputs, but the data can change significantly over time. Data drift is the challenge of ensuring that machine learning models stay relevant and that model predictions accurately represent the environment in which they are being used.
Example:
Suppose a model is being trained to classify news article headlines into categories like “politics,” “business,” and “technology.” If you train and evaluate your model on historical news articles from the 20th century, it likely won’t perform well on current data. Today, we know that an article with the word “smartphone” in the headline is probably about technology, but a model trained on historical data would not know this word. This phenomenon is known as data drift.
Ways to address data drift (a drift-detection sketch follows the list):
- Continually update your training dataset.
- Retrain the model.
- Modify the weights the model assigns to particular groups of input data.
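To know when to retrain, it helps to monitor incoming data. Here is a minimal sketch that compares a feature’s training-time distribution with recent serving data using a two-sample Kolmogorov-Smirnov test; the data and the 0.05 threshold are illustrative choices:

```python
# Minimal sketch of flagging data drift by comparing a feature's training-time
# distribution with recent serving data; the 0.05 threshold is a common but
# arbitrary choice, and the data here is synthetic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)   # distribution at training time
recent_feature = rng.normal(loc=0.7, scale=1.0, size=5000)  # distribution in production

statistic, p_value = ks_2samp(train_feature, recent_feature)
if p_value < 0.05:
    print(f"Drift detected (KS statistic={statistic:.3f}); consider retraining.")
else:
    print("No significant drift detected.")
```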
Scale
When ingesting and preparing data for a machine learning model, the size of the dataset dictates the tooling required for your solution. It is frequently the job of data engineers to build out data pipelines that can scale to handle datasets with millions of rows.
For model training, ML engineers are accountable for managing the necessary infrastructure for a specific training job. Depending on the type and size of the dataset, model training can be time-consuming and computationally expensive, requiring infrastructure (like GPUs) explicitly designed for ML workloads. Image models, for example, typically require much more training infrastructure than models trained entirely on tabular data.
Lack of scaling also influences the efficacy of L1 or L2 regularization. The magnitude of the weights for a feature depends on the magnitude of that feature’s values, so different features will be affected differently by regularization. By scaling all features to lie between [–1, 1], we ensure that there is not much of a difference in the relative magnitudes of different features.
Developers and ML engineers are typically accountable for handling the scaling challenges associated with model deployment and serving prediction requests.
Scaling can be further categorized as follows (a short sketch follows this list):
- Linear Scaling.
- Non-linear Transformation.
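Here is a minimal sketch of both categories on a synthetic, heavily skewed feature: linear scaling to [–1, 1] with scikit-learn’s MinMaxScaler, and a non-linear log transformation:

```python
# Minimal sketch of the two categories above: linear scaling to [-1, 1] and a
# non-linear (log) transformation for a heavily skewed feature. Data is synthetic.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=(1000, 1))  # skewed, large magnitudes

# Linear scaling: map values to the range [-1, 1].
linear_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(income)

# Non-linear transformation: compress the long tail before (or instead of) scaling.
log_transformed = np.log1p(income)

print(linear_scaled.min(), linear_scaled.max())   # -1.0 ... 1.0
print(income.std(), log_transformed.std())        # the spread shrinks after the log
```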
Summary
We have seen that designing, building, and deploying machine learning systems are essential steps in a machine learning workflow. Building production machine learning models continues to become more of an engineering discipline, taking advantage of ML methods established in research environments and applying them to business problems.
As machine learning becomes more mainstream, practitioners need to benefit from tried-and-proven methods to address recurring problems. We are lucky to work with the TensorFlow, Keras, BigQuery ML, TPU, and Cloud AI Platform teams, which are driving the democratization of machine learning research and infrastructure.
Once you have collected your dataset and determined the features for your model, data validation is the process of computing statistics on your data, understanding your schema, and evaluating the dataset to identify problems like drift and training-serving skew. At the core of any machine learning (ML) model is a mathematical function defined to work on particular data types only.
Real-world machine learning models, however, need to run on data that may not be directly pluggable into that mathematical function. Most modern, large-scale machine learning models, such as random forests, support vector machines, and neural networks, work on numerical values, so if our input is numeric, we can pass it through to the model unchanged.
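When an input is not numeric, it first has to be encoded into numbers. As a minimal sketch, one common option is one-hot encoding; the column name here is hypothetical:

```python
# Minimal sketch of turning a non-numeric input into numbers a model can use;
# one-hot encoding is just one common option, and the column is hypothetical.
import pandas as pd

df = pd.DataFrame({"payment_type": ["card", "cash", "card", "transfer"]})

# pd.get_dummies produces one numeric indicator column per category.
encoded = pd.get_dummies(df["payment_type"], prefix="payment_type")
print(encoded.head())
```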
It is crucial to scale the features of ML models because some machine learning algorithms and techniques are sensitive to the relative magnitudes of the different features. For example, a k-means clustering algorithm that uses the Euclidean distance as its closeness measure will end up relying heavily on features with larger magnitudes.
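Here is a minimal sketch of that effect, using scikit-learn’s KMeans on two synthetic features with very different magnitudes (income in dollars versus age in years); standardizing the features lets both contribute to the distance measure:

```python
# Minimal sketch of why scaling matters for k-means: a feature measured in large
# units dominates the Euclidean distance until the features are standardized.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50_000, 20_000, 500),  # income in dollars: huge magnitude
    rng.normal(35, 10, 500),          # age in years: tiny by comparison
])

raw_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scaled_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

# Without scaling, the clusters are driven almost entirely by income; after
# standardization, age also influences how the points are grouped.
print(np.bincount(raw_labels), np.bincount(scaled_labels))
```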
DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of any company (directly or indirectly) associated with the author(s). This work is not intended to be a final product, but rather a reflection of current thinking, along with being a catalyst for discussion and improvement.
All images are from the author(s) unless stated otherwise.
References
[1] “Machine Learning Design Patterns,” O’Reilly Online Learning, 2021. https://www.oreilly.com/library/view/machine-learning-design/9781098115777/ch01.html
[2] “Google BigQuery: A Tutorial for Marketers,” Business 2 Community, 2019. https://www.business2community.com/marketing/google-bigquery-a-tutorial-for-marketers-02252216
[3] Twitter status by SFEIR, 2018. https://twitter.com/sfeir/status/1039135212633042945
[4] “Underfit and Overfit Explained,” Medium, 2020. https://medium.com/@minions.k/underfit-and-overfit-explained-8161559b37db
[5] “Data Consistency in Microservices Architecture,” Medium, 2019. https://ebaytech.berlin/data-consistency-in-microservices-architecture-bf99ba31636f