Towards the assisted design of data science pipelines
Sergey Redyuk
Database Systems and Information Management Group, TU Berlin
https://sergred.github.io/
Existing automated machine learning solutions and intelligent discovery assistants are popular tools that help end users design data science (DS) pipelines. However, their applicability to many real-world use cases and application domains is limited because they (a) support only a limited range of DS tasks; (b) offer a small, static set of operators; and (c) are restricted to evaluation processes with quantifiable loss functions. We present DORIAN, a human-in-the-loop approach for the assisted design of data science pipelines that supports a large and growing set of DS tasks and operators, as well as arbitrary user-defined evaluation processes. Given a user query, i.e., a dataset and a DS task, DORIAN computes a ranked list of candidate pipelines that the end user can choose from, alter, execute, and evaluate. It stores executed pipelines in an experiment database and uses similarity-based search to identify relevant previously run pipelines in that database. DORIAN also takes user interaction into account to improve its suggestions over time. We show how users can interact with DORIAN to create and compare DS pipelines for various real-world DS tasks without writing any code.
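The similarity-based lookup described in the abstract can be illustrated with a minimal sketch. This is not DORIAN's actual implementation: the `ExperimentRecord` schema, the use of numeric dataset meta-features, cosine similarity as the similarity measure, and the `rank_candidates` function are all assumptions made here for illustration only.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ExperimentRecord:
    """One previously executed pipeline (hypothetical schema, not DORIAN's)."""
    task: str                    # DS task, e.g. "classification"
    meta_features: List[float]   # assumed numeric description of the dataset
    pipeline: List[str]          # operator names in execution order
    score: float                 # result of the user-defined evaluation, higher is better


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(records, task, meta_features, top_k=3):
    """Return up to top_k stored pipelines for the task, most similar dataset first.

    Ties in similarity are broken by the stored evaluation score.
    """
    matching = [r for r in records if r.task == task]
    ranked = sorted(
        matching,
        key=lambda r: (cosine_similarity(r.meta_features, meta_features), r.score),
        reverse=True,
    )
    return ranked[:top_k]


# Toy experiment database with made-up entries.
experiment_db = [
    ExperimentRecord("classification", [1.0, 0.0, 0.5], ["impute", "scale", "logreg"], 0.81),
    ExperimentRecord("classification", [0.0, 1.0, 0.0], ["one_hot", "random_forest"], 0.90),
    ExperimentRecord("regression", [1.0, 0.0, 0.5], ["scale", "ridge"], 0.70),
]

# User query: a DS task plus the (assumed) meta-features of the new dataset.
candidates = rank_candidates(experiment_db, "classification", [0.9, 0.1, 0.4])
for record in candidates:
    print(record.pipeline, record.score)
```

In this toy setup the query only retrieves pipelines recorded for the same task, and the user would then choose, alter, execute, and evaluate one of the returned candidates. Feeding the outcome back into `experiment_db` is one simple way the suggestions could improve over time.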
Talk slides
Talk video
References:
- B. Bilalli, A. Abelló, T. Aluja-Banet, and R. Wrembel. 2018. Intelligent assistance for data pre-processing. Computer Standards & Interfaces 57 (2018), 101--109.
- I. Drori et al. 2018. AlphaD3M: Machine learning pipeline synthesis. In AutoML Workshop at ICML.
- Y. Gil et al. 2018. P4ML: A phased performance-based pipeline planner for automated machine learning. In ICML'18 AutoML Workshop.
- X. He, K. Zhao, and X. Chu. 2021. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems 212 (2021), 106622.
- P. Nguyen, M. Hilario, and A. Kalousis. 2014. Using Meta-Mining to Support Data Mining Workflow Planning and Optimization. J. Artif. Int. Res. 51, 1 (Sept. 2014), 605--644.
- R.S. Olson and J.H. Moore. 2016. TPOT: A tree-based pipeline optimization tool for automating machine learning. In ICML'16 AutoML Workshop. JMLR, 66--74.
- Z. Shang et al. 2019. Democratizing Data Science through Interactive Curation of ML Pipelines. In SIGMOD'19 (Amsterdam, Netherlands). ACM, 1171--1188.
- M.D. Wever, F. Mohr, and E. Hüllermeier. 2018. ML-Plan for unlimited-length machine learning pipelines. In ICML'18 AutoML Workshop.
- S. Schelter, F. Biessmann, T. Januschowski, D. Salinas, S. Seufert, and G. Szarvas. 2015. On challenges in machine learning model management.
- N. Polyzotis, S. Roy, S.E. Whang, and M.A. Zinkevich. 2018. Data Lifecycle Challenges in Production Machine Learning. ACM SIGMOD Record 47, 17--28.