The focus on models has driven innovative and stable machine learning tools for research and enterprise. The rise of data-centric AI shifts the focus from models and code to the quality and context of your datasets.
In this ebook, you will find:
Computational approaches for improving datasets
Management and communication strategies for cultivating higher-quality data
Why machine learning operations can no longer afford to ignore data versioning and lineage
As enterprise applications of machine learning have matured, so have the needs of the scientists and engineers building these tools: the ingestion, transformation, and interpretation of data have evolved to suit specific audiences, data structures, and use cases.
Selecting a model is a solved problem for many scenarios. However, things are not so simple on the data side. Many data problems remain unsolved: for example, data science and analyst teams can't access the data they need for an application, pipeline tools can't handle streaming data, and datasets can be either over- or under-conforming.
In Practical Data-Centric AI for the Real World, Pachyderm’s Dan Jeffries presents common project delivery challenges when building machine learning applications for internal use cases like quality control and production line monitoring.
This ebook covers cases where a model's output is insufficient for its purpose and helps you reorient toward data-centric solutions instead of model-based fixes.
Data doesn’t appear out of thin air, especially usable data for enterprise machine learning. And for rich-media applications like image detection and natural language processing, generating more data may not be feasible from an operational or budgetary perspective. So what’s a data engineer to do?
This ebook describes methods to enhance, augment, or increase the amount of high-quality data to feed your model. It also covers best practices for managing complex data labeling and describes how Pachyderm’s data versioning and lineage ensure you don’t lose a high-performing dataset to an accidental overwrite.
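To make the augmentation idea concrete, here is a minimal sketch of one common technique for image data: deriving mirrored and rotated copies of existing samples so the dataset grows without new collection costs. It assumes Pillow is installed, and the "images" and "images_augmented" directory names are hypothetical placeholders, not anything prescribed by the ebook.

```python
# Minimal image-augmentation sketch: each source image yields a mirrored
# and a slightly rotated copy alongside the original.
from pathlib import Path
from PIL import Image, ImageOps

SRC = Path("images")            # hypothetical directory of source images
DST = Path("images_augmented")  # hypothetical output directory
DST.mkdir(exist_ok=True)

for path in SRC.glob("*.png"):
    img = Image.open(path)
    img.save(DST / path.name)                                       # original
    ImageOps.mirror(img).save(DST / f"mirror_{path.name}")          # horizontal flip
    img.rotate(10, expand=True).save(DST / f"rot10_{path.name}")    # small rotation
```

Real pipelines apply many more transformations (crops, color jitter, noise), but the principle is the same: cheap, label-preserving variations of data you already have.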
Understanding what your data has produced and how it got there makes your machine learning operations more efficient. In addition, it builds deeper trust across your organization in machine learning capabilities for your use cases.
Why focus on versioning and lineage for data as well as models? Because versioning models is only half of the equation. Knowing which dataset was used, when, and why is critical to understanding where things have broken down, and it lets you test the same dataset against different model versions. This report from Winder AI is an excellent overview of the importance of provenance and lineage in machine learning.
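As a tool-agnostic illustration of that idea, the sketch below fingerprints the exact files used for a training run and records that fingerprint next to the model version, so any model can be traced back to the dataset that produced it. Systems like Pachyderm handle this tracking automatically; the directory, manifest file, and version string here are hypothetical examples.

```python
# Sketch of dataset lineage: hash the training files into one stable dataset
# fingerprint and log it alongside the model version in a manifest.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def dataset_fingerprint(data_dir: Path) -> str:
    """Hash every file's relative path and contents into one dataset ID."""
    digest = hashlib.sha256()
    for path in sorted(data_dir.rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def record_lineage(data_dir: Path, model_version: str, manifest: Path) -> None:
    """Append a dataset-to-model link to a JSON-lines lineage manifest."""
    entry = {
        "dataset_fingerprint": dataset_fingerprint(data_dir),
        "model_version": model_version,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with manifest.open("a") as f:
        f.write(json.dumps(entry) + "\n")

# Hypothetical usage: run once per training job.
record_lineage(Path("training_data"), model_version="v2.3.0",
               manifest=Path("lineage.jsonl"))
```

With a record like this, "which data trained the model we shipped in March?" becomes a lookup rather than an investigation.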