Machine Learning – Beyond the Hype

By MARIA EMILIA LOPEZ MARINO

Maria Emilia Lopez Marino (LGO ’19) received the 2019 Best Thesis Award for her work in collaboration with Amgen, Inc., advised by MIT professors Roy Welsch and Philip Gschwend. Her thesis developed a predictive framework using machine learning techniques to assess the impact of raw material variability on the performance of commercial biologic manufacturing processes. The resulting model achieved an average of 89% accuracy in predicting product quality attributes and other critical process performance variables, and it was implemented as an internal Amgen web tool [4].

The implementation of machine learning techniques in industry is still relatively nascent, and as a result many companies encounter unanticipated challenges when first exploring their potential applications. In this article, I identify three key challenges of developing and implementing a predictive model that emerged over the course of my project at Amgen and discuss how they can be avoided.

Earlier this year, LinkedIn co-founder Allen Blue reported that data science and machine learning-related jobs represent five of the top 15 fastest-growing jobs in America today [1]. This growth is driven by computational advances that have made long-existing artificial intelligence (AI) techniques practical at scale, as well as by a growing appreciation in the private sector of the value of automating processes. Machine learning (ML), a branch of AI, has already begun revolutionizing the field of operations. Today, companies are using ML to predict when a manufacturing part will fail, to decide whether or not to treat a crop with a pesticide by analyzing visual data, and to improve the customer experience in retail through virtual assistants [2]. The growth of applied ML is also reflected in the LGO program’s activities: ML was the research focus of 10% of theses in 2018, 22% in 2019, and 25% in 2020 [3]. Addressing the implementation challenges described below can help companies that are starting to apply ML-based tools in their business better prepare for success.

TIP: Defining the Scope of a Machine Learning Project
Challenge 1: Incorporating problem-specific input into the data preparation process 

Data extraction and clean-up is often one of the most time-consuming (and painful) steps of any analytics project, especially when dealing with “legacy” systems that were designed for now decades-old computational processes. One might assume that this task should fall to data scientists; however, a cross-functional approach that considers both ontology and feature engineering is necessary to develop a successful model. When I first began extracting data for the model, I spent countless hours scrubbing and curating it into a format compatible with the predictive algorithms. Afterward, I ran the data through the model, and the results were disappointing: the model’s accuracy was no better than random guessing. I realized that in preparing the data, I had focused primarily on ontology engineering.
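The kind of scrubbing described above can be sketched with pandas. This is a minimal, hypothetical example; the column names, values, and clean-up steps are illustrative assumptions, not the actual Amgen extract:

```python
import pandas as pd

# Hypothetical raw extract from a legacy system: stray whitespace,
# numbers stored as strings, invalid entries, and duplicated records.
raw = pd.DataFrame({
    "batch_id": [" A-001", "A-001", "A-002 ", "A-003"],
    "ph":       ["7.1",   "7.1",   "bad",    "6.9"],
})

# Scrub: trim identifiers, coerce numerics (invalid entries become NaN),
# then drop exact duplicates and rows with unusable measurements.
clean = (
    raw.assign(
        batch_id=raw["batch_id"].str.strip(),
        ph=pd.to_numeric(raw["ph"], errors="coerce"),
    )
    .drop_duplicates()
    .dropna(subset=["ph"])
)
print(clean)
```

The point is not the specific steps but that each one encodes a judgment call (is a blank a zero or a missing value?) that benefits from domain input, which is where the cross-functional approach pays off.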

After talking with some of Amgen’s experts in cell culture processing, I discovered that my data set was missing features (process and material attributes) with greater predictive power. For example, instead of using only the daily record of a process variable as a key feature, I could transform the data further, deriving relationships from the original features, and include the integral of the daily record, which proved to be a better predictor judging from model performance metrics. Although good ontology engineering is critical for developing a data set for a predictive model, collaboration with domain experts enables the development of richer, more accurate models. In other words, you are helping your algorithm focus on what your subject matter expert knows is important, based on mechanistic or first-principles modeling.
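A derived feature like the one above can be computed with a few lines of NumPy and pandas. This is a sketch under assumed data: the batch names, schema, and the trapezoidal-rule approximation are illustrative choices, not the thesis’s exact transformation:

```python
import numpy as np
import pandas as pd

# Hypothetical daily records of one process variable for two batches.
daily = pd.DataFrame({
    "batch": ["A", "A", "A", "B", "B", "B"],
    "day":   [1, 2, 3, 1, 2, 3],
    "value": [2.0, 3.5, 4.0, 1.0, 1.5, 2.5],
})

def integral_feature(group: pd.DataFrame) -> float:
    """Approximate the integral of the daily record with the trapezoidal rule."""
    g = group.sort_values("day")
    y, x = g["value"].to_numpy(), g["day"].to_numpy()
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

# One derived feature per batch, ready to join onto the training table.
features = {batch: integral_feature(g) for batch, g in daily.groupby("batch")}
print(features)
```

The raw daily values and the derived integral carry different information: the integral summarizes cumulative exposure over the run, which is exactly the kind of mechanistic insight a domain expert can point you toward.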

Challenge 2: Spending too much effort (with diminishing returns) refining an overly sophisticated model

After the data has been prepared, you are ready to begin the real “machine learning”: training and evaluating the model. A key challenge at this stage is striking the right balance between refining the model to achieve superior results and settling for a simpler model that meets the criteria for the specific application. I addressed this trade-off by defining a target quality measure for model performance up front and implementing ready-made ML algorithms from data science platforms [5]. We defined the bounds for model performance by implementing off-the-shelf methods known to perform well across a range of predictive problems, such as random forests. We then aimed to find simpler models that achieved similar or better performance. First, we implemented very simple models, such as linear regressions and simple regression trees, to achieve model convergence. We then increased model complexity by looping over different algorithms, searching for the model that performed best according to our selected quality metric (the coefficient of determination or the area under the ROC curve, depending on whether the problem at hand was regression or classification).
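The simple-to-complex search described above can be sketched with scikit-learn [5]. The data set here is synthetic and the candidate list and hyperparameters are assumptions for illustration; the approach, not the numbers, is the point:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared data set.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Candidates ordered from simplest to most complex; the random forest
# serves as the off-the-shelf performance bound.
candidates = {
    "linear_regression": LinearRegression(),
    "regression_tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Score each candidate on the chosen quality metric (here cross-validated
# R^2, the coefficient of determination, since this is a regression problem).
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
for name, r2 in scores.items():
    print(f"{name}: mean R^2 = {r2:.3f}")
```

If the simplest model clears the target defined up front, the search can stop there; the complex models earn their keep only when the gap in the metric is material.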

Although we could have further increased the complexity of the model by incorporating additional ML algorithms, such as neural networks, we recognized that it would not increase model performance radically and could possibly compromise interpretability. Another downside of developing an extremely sophisticated model is that it can be more difficult to generalize to other applications.

Challenge 3: Designing a user interface for your model that is suitable for non-experts

In the excitement of getting positive results from your model, it can be easy to forget the most important factor for long-term success: the end user. In many cases, the end users of ML models will not be data scientists. As such, it is essential to develop a user-friendly interface for the model, as well as a guide for using the tool, while keeping the client’s original motivation for the project in mind. The end users of the model we developed for Amgen were manufacturing process monitoring teams and upstream cell culture manufacturing teams, consisting mostly of engineers and technicians. Some of the teams had already started using ML tools in their daily routines, while others had little to no prior experience applying data science in their work. In addition to the interface design, running workshops with end users helped ensure the usability and sustainability of the tool.

To successfully develop an ML-based tool, it is important to take a cross-functional approach, use the simplest model possible, and ensure that the end product is user-friendly. This list of challenges is not exhaustive, but these challenges are likely to arise in any application of ML. By navigating them successfully, you can unlock the potential of predictive modeling to better achieve your business goals.

References

[1] What is driving the demand for data scientists?, Wharton, UPenn, March 8, 2019.

[2] Marr, Bernard, 27 Incredible Examples Of AI And Machine Learning In Practice, Forbes, April 30, 2018.

[3] LGO Program Office, September 8, 2019.

[4] María Emilia López Marino, Big Data Analysis Interrogating Raw Material Variability and the Impact on Process Performance, LGO thesis, May 8, 2019.

[5] Scikit-learn, https://scikit-learn.org

[6] Sekhar, Amit, What Is Feature Engineering for Machine Learning?, Medium, February 14, 2018.

[7] Descriptive, predictive, prescriptive: Transforming asset and facilities management with analytics, Watson IoT, IBM, December 2017.

About the Author

Trained as a chemical engineer in Argentina, Maria Emilia has served in roles ranging from logistics engineer to manufacturing engineer and manager at large consumer goods firms, including Procter & Gamble and Johnson & Johnson. Maria Emilia graduated from the LGO program in 2019 and recently joined Amgen’s Operations Leadership Program.