The era of Industry 4.0 opens up the possibility of optimizing production systems in a data-driven way. To turn data into value, machine learning (ML) models are trained on production data with the aim of identifying patterns that can be used to optimize processes. A crucial prerequisite for performant ML models is the availability of high-quality data. Since raw data generated in production exhibits multiple quality issues, data preprocessing (DPP) is required to increase data quality. One of the key design decisions in any ML project is the choice of suitable DPP methods. The search space grows further when DPP methods are combined into DPP pipelines. Due to the high number of possible DPP pipelines, data scientists commonly select suitable pipelines manually via trial and error. For these reasons, DPP nowadays accounts for approximately 80 % of the time spent in ML projects. To guide data scientists, decision support systems (DSS) have been developed that assist in the selection of suitable DPP pipelines, but these systems do not cover production-specific requirements. Therefore, the main research question was: Can a DSS be developed that supports the recommendation of DPP pipelines for ML applications in production? To answer this question, a meta learning-based decision support system called Meta-DPP was developed. Meta-DPP relies on three core components: the meta target selector, the meta features database, and the meta model. The meta target selector chooses between two preselected sets of overall well-performing pipelines, called pipeline pools, for both classification and regression tasks. The meta features database stores learning task-specific information about the data set, e.g., the number of instances. The meta model then recommends a pipeline from the pipeline pool based on the meta features from the database.
When Meta-DPP is applied, a user interface enables the data scientist or production expert to input their data set, learning task, ML algorithm, and information about explainability. Given these four inputs, Meta-DPP provides a ranked recommendation of the DPP pipelines from the pool. Verification and validation confirmed that Meta-DPP was developed and implemented correctly. The validation on 324 production use cases further shows that Meta-DPP outperforms essential pipelines on average, where essential pipelines ensure the functioning of ML algorithms by performing only the minimum DPP. Thus, the main research question was answered in the affirmative.
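The interplay of pipeline pool, meta features, and meta model described above can be illustrated with a minimal sketch. It is not the implementation of Meta-DPP: the pipeline names, the three meta features, the structure of `meta_db`, and the use of a k-nearest-neighbour lookup as a stand-in for the meta model are all assumptions made for illustration. The sketch also assumes one pool per learning task and omits the ML algorithm and explainability inputs.

```python
from math import log10, sqrt

# Hypothetical pipeline pools, one per learning task (names are illustrative only).
PIPELINE_POOLS = {
    "classification": ["impute+scale", "impute+scale+select", "impute+encode"],
    "regression": ["impute+scale", "impute+outlier_removal+scale"],
}


def extract_meta_features(rows):
    """Return (log10 of instance count, feature count, share of missing cells)."""
    n_instances = len(rows)
    n_features = len(rows[0]) if rows else 0
    cells = n_instances * n_features
    missing = sum(value is None for row in rows for value in row)
    return (
        log10(n_instances) if n_instances else 0.0,
        float(n_features),
        missing / cells if cells else 0.0,
    )


def recommend(rows, task, meta_db, k=2):
    """Rank the pool for `task` by mean score on the k most similar past data sets."""
    pool = PIPELINE_POOLS[task]          # assumed role of the meta target selector
    query = extract_meta_features(rows)  # meta features of the new data set

    # Stand-in for the meta model: nearest neighbours in meta-feature space.
    candidates = [entry for entry in meta_db if entry["task"] == task]
    neighbours = sorted(
        candidates,
        key=lambda e: sqrt(
            sum((a - b) ** 2 for a, b in zip(query, e["meta_features"]))
        ),
    )[:k]

    def mean_score(pipeline):
        scores = [n["scores"][pipeline] for n in neighbours if pipeline in n["scores"]]
        return sum(scores) / len(scores) if scores else 0.0

    # Highest mean score first; ties keep the pool's original order.
    return sorted(pool, key=mean_score, reverse=True)


# Toy meta database with made-up scores of each pipeline on two past data sets.
meta_db = [
    {"task": "classification", "meta_features": (2.0, 3.0, 0.1),
     "scores": {"impute+scale": 0.80, "impute+scale+select": 0.85, "impute+encode": 0.70}},
    {"task": "classification", "meta_features": (4.0, 50.0, 0.0),
     "scores": {"impute+scale": 0.90, "impute+scale+select": 0.60, "impute+encode": 0.75}},
]

# New data set: 100 instances, 3 features, one missing value per row.
rows = [[1.0, None, 3.0] for _ in range(100)]
ranking = recommend(rows, "classification", meta_db, k=1)
print(ranking)
```

With `k=1`, the ranking follows the scores of the single most similar past data set. A real meta model would typically be trained on many more meta features and use cases, and unscaled Euclidean distance is used here only to keep the sketch short.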
Produktionsqualität und Messtechnik
Recommending Data Preprocessing Pipelines for Machine Learning Applications in Production
High data quality is the key to performant machine learning (ML) models in production. In practice, data is preprocessed using several data preprocessing (DPP) methods that are combined into DPP pipelines. The choice of a suitable DPP pipeline poses a major challenge. To guide data scientists, a meta learning-based decision support system (DSS) called Meta-DPP has been developed, which assists in the selection of suitable DPP pipelines and, unlike existing DSS, covers production-specific requirements.