In July, we visited Carnegie Mellon University to participate in the 17th annual Simon Initiative’s LearnLab Summer School on Educational Data Mining. During the program, we gave a talk on feature engineering and helped teams complete projects using Featuretools, an open source library for automated feature engineering.
Introducing and teaching Featuretools at LearnLab marked the first time we attempted to help groups simultaneously build projects in real time using the library. We had two main objectives:
I. To help scientists who don’t measure success through predictive accuracy
The scientists at LearnLab and in the broader educational community explore the datasets they collect with a number of specific goals in mind:
- to measure and assess how well students are learning with technology
- to personalize objectives for each student, and
- to validate the efficacy of adaptive mechanisms that researchers have designed.
Existing tools that help data scientists achieve high predictive accuracy often do not help experts achieve their specific modeling goals. A domain expert will come up with a hypothesis in English, for example: “grit may predict persistence”. The expert then must translate both variables, grit and persistence, into calculable quantities for machine learning. Those quantities are called features. By looking at the data they’ve obtained, they decide how to perform these translations — for example, one way to quantify the ‘grit’ of a student might be to compute the average number of attempts the student makes per problem attempted. This process, called feature engineering, is the focus of the Featuretools library.
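As a concrete sketch of that translation, here is how the hypothetical ‘grit’ feature above could be computed by hand with pandas. The column names and values are illustrative, not an actual Datashop schema:

```python
import pandas as pd

# Hypothetical transaction log: one row per attempt a student makes on a problem.
transactions = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2"],
    "problem_id": ["p1", "p1", "p2", "p1", "p2"],
})

# "Grit" as sketched above: the average number of attempts per problem attempted.
attempts_per_problem = transactions.groupby(["student_id", "problem_id"]).size()
grit = attempts_per_problem.groupby("student_id").mean()

print(grit)
# s1 attempted p1 twice and p2 once -> (2 + 1) / 2 = 1.5
```

Writing one such translation by hand is easy; the difficulty is that every hypothesis needs its own, which is what motivates automating the process.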
II. To apply automation to a new dataset without prior preparation
Featuretools is designed to automatically prepare a wide range of temporal datasets for analysis and machine learning within a few hours. Running an event where we ask users to “bring your own data” (BYOD) is a substantial challenge for data science automation. The idea behind a BYOD workshop is that participants:
- bring their own data of varied shapes, sizes and types
- integrate their data with Featuretools and generate features to create predictive models,
- uncover actionable insights or identify clusters in their data, and finally
- leave with a proof of concept that can greenlight future work.
The first of these steps is notoriously difficult: since the participants had not communicated their needs ahead of time, the system had to accommodate an immense range of problems with minimal friction. Nevertheless, we were happy to find that Featuretools performed well at enabling all four steps.
LearnLab projects that were developed with Featuretools
“Feature generation and prediction: an iterative process”
Team A developed a model that correlated student behavior to the “enjoyment” of using a new platform intended to make learning fractions and decimals more exciting. The team used Featuretools to generate features from the students’ transactional data, and they used R to build models and analyze correlations. They also explored how the enjoyment label was defined and tried to predict these different definitions using the same features.
With features generated by Featuretools, Team A focused on (a) trying to analyze features and (b) trying different predictive models.
“Comparing what students say to what they do: a MOOC data analysis”
Team B wanted to understand whether a student’s motivation and goals before starting an edX course correlated with their behavior during the course. They used Featuretools to generate features and another open source library called Lifelines to build a student survival model. They concluded that there was no significant difference in how students behaved based on their survey answers at the beginning of the course.
Team B focused on deriving certain types of features from the data, such as the time students spent watching videos. To achieve this, they applied the TimeSince primitive from the library.
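The quantity such a time-based primitive derives can be approximated in plain pandas: for each student, take the elapsed time between consecutive events as a rough proxy for time spent on each segment. The event log below is hypothetical, not the team’s actual schema:

```python
import pandas as pd

# Hypothetical video-interaction log (illustrative columns and values).
events = pd.DataFrame({
    "student_id": ["s1", "s1", "s1"],
    "timestamp": pd.to_datetime(
        ["2018-07-01 09:00", "2018-07-01 09:10", "2018-07-01 09:40"]),
})

# Seconds elapsed since the student's previous event -- one rough proxy
# for time spent between interactions. The first event has no predecessor.
events = events.sort_values("timestamp")
events["seconds_since_last"] = (
    events.groupby("student_id")["timestamp"].diff().dt.total_seconds()
)
print(events["seconds_since_last"].tolist())
```

In Featuretools, primitives like this are applied automatically across every datetime column, so the team did not have to hand-write this logic for each event type.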
“To what extent can we predict students’ success within a cognitive tutor?”
Team C built a model to predict the number of lessons a student would complete by the end of a course using behavioral data from the beginning. By the time we introduced Featuretools, the team had already performed initial feature engineering. They still tested Featuretools and found that one feature, the number of attempts, was predictive of success. They had not previously extracted this feature because they had hypothesized that a different feature, the number of hints, would be predictive.
Team C relied on Featuretools to avoid the time it would have taken to generate features manually, and in doing so found a predictive feature that, though easy to calculate, had not been part of their original analysis.
What we learned
Having a standardized data model helped: Participants in this workshop used a repository of student activity data called Datashop. Regardless of where the data came from, it was formatted in a similar manner. This enabled us to create one prototypical example that could be modified by all of the teams.
The amount of data was perfect: Most of the teams had data that was between 200,000 and 1,000,000 transactions and between 300 and 3,000 instances of an entity that they were interested in modeling—in this case, online students. The data size meant that participants could interactively examine features, develop explanations and refine parameters using just their laptops. This interactivity led to an engaging experience.
Participants manipulated the data in different languages before using Featuretools: All of the teams applied standard preparation steps: fixing timestamps, dropping columns/rows, prediction engineering, etc. Not all teams were versed in Python or Pandas, and they preferred to do these operations in a language that they were comfortable using (most often R, Matlab or SAS).
Prediction engineering came a step before feature engineering: Before taking advantage of automated feature engineering, every team had to manually implement logic to extract labeled training data. This occupied most of the time the teams spent on their projects.
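To illustrate what that label-extraction logic looks like, here is a minimal prediction-engineering sketch in the spirit of Team C’s problem. Everything here is hypothetical: the log, the cutoff date, and the column names are invented for illustration. The key discipline is that features may only use behavior observed before a cutoff time, so the model never peeks at the future it is asked to predict:

```python
import pandas as pd

# Hypothetical lesson-completion log (illustrative columns and values).
log = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2"],
    "lesson": ["a", "b", "c", "a"],
    "completed_at": pd.to_datetime(
        ["2018-06-01", "2018-06-20", "2018-07-10", "2018-06-05"]),
})

cutoff = pd.Timestamp("2018-07-01")

# Label: lessons completed by the end of the course (the whole log).
labels = log.groupby("student_id")["lesson"].nunique().rename("total_lessons")

# Features: only behavior observed before the cutoff time.
history = log[log["completed_at"] < cutoff]
features = history.groupby("student_id").size().rename("lessons_before_cutoff")

training = pd.concat([features, labels], axis=1).fillna(0)
print(training)
```

Writing and validating logic like this for a real course log, with many label definitions to compare, is what consumed most of the teams’ project time.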
How Featuretools enabled domain experts
Automation inverted a classic workflow: Typically, domain experts iteratively brainstorm specific feature ideas and then implement them. Featuretools inverts this painstaking process and instead generates numerous potential features for a user to interpret or include in their models. This lowers the barrier to feature engineering, making it both easier and more interesting. The scientists working on all three projects took advantage of this inverse process in one way or another.
Establishing trust is required: Scientists in this domain push back against techniques like deep learning that don’t provide them with models they can interpret. Since they have the ability to understand the features generated by Featuretools and to tweak the algorithm to add new features, domain experts could audit, explain and trust this process.
Learning feature engineering as a methodological skill: Because features had historically been written in an ad hoc way, feature engineering was not generally recognized as a methodological skill. After our talk and the students’ own experimentation with Featuretools, all of the teams began using feature engineering in a systematic way that will enable better research in the future.