Rapid Feature Engineering through SQL

Swapnil Chougule
6 min read · Nov 17, 2021


Feature engineering is one of the most important aspects of the machine learning ecosystem. It has come into the limelight in the last couple of years, having been a largely undervalued layer in earlier machine learning life cycles.

A feature is an attribute or explanatory variable derived from data, based on business/domain expertise, that is useful for ML model training and inference (e.g. the average rating of a product). Feature engineering is the process of taking a dataset and constructing such explanatory variables (features) that can be used to train a machine learning model for a business problem. I won't go into much detail about features, feature engineering, or feature stores, as a lot of literature is already available on them. I will keep the scope of this blog to how feature engineering can be eased through SQL (a code-free way).

Computing any feature is a painful challenge, as it requires data access and retrieval from various systems, processing steps, storage, and domain expertise. A lot of effort is being put in by the community and by organizations to ease this process, and many automated feature engineering frameworks/tools have been developed. But there is still a lot of scope for improvement. Two major aspects to be addressed are flexibility and speed.

Need of the hour:

Most of the available solutions still need a complex process to compute any feature, especially writing code to carry out all the feature computation steps: reading data, applying transformations, and writing back to stores. It is a very time-consuming task. It becomes really complex when you work with around a hundred machine learning use cases that require a few hundred features. It adds overhead for an organization's data/ML teams to take care of feature computation, and if a data/ML team is not available, data scientists end up spending a lot of their time on this process. Organizations thus experience compute bottlenecks and an elongated iterative process, costing valuable time and resources. Feature engineering should therefore be made flexible enough to enable all stakeholders to create features on the fly in no time.

SQL (A code-free way):

Here SQL comes to our rescue. SQL is a very handy tool; everybody will agree on that. If we design our ML platforms so that any feature can be defined in terms of SQL, then features can be created on the fly in no time. A SQL query, along with a couple more configs, can serve our purpose.

Benefits:

  1. Less effort & more intuitive
  2. Saves time
  3. Anybody who knows SQL can create features
  4. Faster ML algorithm prototypes
  5. Reuse existing features to create new features
  6. Reuse existing infrastructure for computation & storage

Architecture

Data Sources:

There can be multiple sources where data resides. Message queues like Kafka contain streaming data, and realtime features are computed on it. Hive/S3 contains batch data, pre-aggregated data, output of ETL jobs, etc. Data which serves time travel queries (e.g. the change in a product's price a week prior vs. now) can be present in stores like Apache Hudi.

Feature Stores:

Computed features are stored in feature stores based on their type. Realtime features get stored in realtime feature stores (powered by Redis etc.). Batch features get stored in batch stores (Hive/S3 etc.). ML modeling reads features from these stores to train and serve models. Most organizations follow a similar architecture for feature stores.

One additional component has been added here: Feature Metadata. It contains metadata about feature groups. A feature group is a logical grouping of features; here, it is the set of features computed by the same SQL query. The store contains the following details for a given feature group (a sample entry is sketched after this list):

  • feature_group_id
  • feature_group_name
  • version
  • features: array of feature_name
  • features_type: array of data types of the feature values
  • features_defaults: array of default values
  • sql_query: query to compute the given set of features
  • source_details: data source details in JSON format
  • execution_frequency: scheduling frequency to compute these features, e.g. “daily”, “0 * * * *” etc.
  • upstream_dependencies: lists feature groups whose features are reused as inputs
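For illustration, a stored entry might look like the sketch below. The field names follow the list above; the storage format, identifiers, and the query itself are illustrative assumptions.

```python
# Hypothetical feature-group metadata entry (illustrative values only).
feature_group = {
    "feature_group_id": 42,
    "feature_group_name": "product_rating_stats",
    "version": 1,
    "features": ["avg_rating", "rating_count"],
    "features_type": ["double", "bigint"],
    "features_defaults": [0.0, 0],
    "sql_query": """
        SELECT product_id,
               AVG(rating)   AS avg_rating,
               COUNT(rating) AS rating_count
        FROM   reviews
        GROUP  BY product_id
    """,
    "source_details": {"type": "hive", "table": "reviews"},
    "execution_frequency": "daily",
    "upstream_dependencies": [],
}
```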

We will see how this metadata gets updated and how it is used for feature discovery and reusability.

Feature Engineering Components:

This is the most important part of the architecture; it tells us how we can leverage SQL for our feature engineering use case. We will visit each component in detail:

Feature Dashboard:

It primarily consists of the following 2 dashboards:

  1. Feature Definition:

It provides a provision to define a feature group in terms of SQL, along with details like feature group name, source, scheduling frequency, features (optional), features_type (optional), features_defaults (optional), and version (optional).

  • features, features_type, features_defaults are optional configurations. If not provided, they get inferred from query execution.
  • For realtime features, scheduling_frequency is ‘realtime’
  • Some attributes like today’s date (today), current_time etc are available to use in the query.

Sample config:
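A minimal definition payload might look like the following sketch. The exact schema is an assumption; only the required fields are shown, and the optional ones are left to be inferred on the first run.

```python
# Hypothetical feature-group definition submitted from the dashboard.
# Optional fields (features, features_type, features_defaults, version) are omitted here.
definition = {
    "feature_group_name": "user_order_stats",
    "source_details": {"type": "hive", "table": "orders"},
    "execution_frequency": "daily",   # or a cron string like "0 * * * *", or "realtime"
    "sql_query": """
        SELECT user_id,
               COUNT(*)          AS orders_last_30d,
               SUM(order_amount) AS spend_last_30d
        FROM   orders
        WHERE  order_date >= date_sub(today, 30)   -- 'today' attribute available in queries
        GROUP  BY user_id
    """,
}
```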

  • Also, a few features need transformations like categorical encoding, feature vectorization, label binarization, etc. Such functionalities are served by the Hivemall library through the same SQL interface (see the sketch after this list).
  • Each feature group is associated with a version (denoted by a number). Multiple instances of the same feature group can be created using versions. If not provided initially, the version is generated automatically and is auto-incremented.
  • Existing features can be reused to create new features (the data source in this case is the feature store itself). Dependent feature groups can be mentioned under ‘upstream_dependencies’, which takes care of the feature execution order.
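As a rough illustration of such a transformation, assuming Hivemall's feature-vectorization UDFs (e.g. categorical_features, quantitative_features) have been registered in the Spark/Hive session per Hivemall's setup instructions, a feature group query could call them directly:

```python
# Illustrative only: Hivemall feature-transformation UDFs invoked from the same SQL interface.
# Function names follow Hivemall's ftvec helpers; treat the exact signatures as assumptions.
transform_query = """
    SELECT user_id,
           categorical_features(array('device', 'country'), device, country) AS cat_features,
           quantitative_features(array('age', 'spend'), age, spend)          AS num_features
    FROM   user_profile
"""
features_df = spark.sql(transform_query)   # 'spark' is an active SparkSession with Hivemall registered
```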

2. Feature Discovery:
The second dashboard is feature discovery. With this dashboard, a user can search for existing features and modify or delete them. It is powered by the feature metadata store.

Web Server:

It is a server, written in Python Flask, which connects the UI dashboards to the underlying processing components and the feature metadata store through HTTP endpoints. A minimal endpoint sketch follows the list below.

  • It interacts with the feature metadata store to power the feature discovery functionality.
  • For feature group definition requests, it updates the feature metadata, creates an Airflow task, and adds it to the Airflow DAGs (created based on scheduling frequency) using Airflow's Python APIs.
  • In the case of realtime feature groups, the job is directly submitted to the computing cluster using bash commands or a Unix supervisor service. In case of feature group modification, these Spark streaming jobs are automatically restarted with the modified config. The feature group id (referencing the group entry in metadata) is passed as an application argument to the Spark job.
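A minimal sketch of such an endpoint, assuming hypothetical save_feature_group(), submit_streaming_job(), and register_airflow_task() helpers for the metadata store, streaming submission, and scheduler integration:

```python
# Hypothetical Flask endpoint for feature group definition requests.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/feature-groups", methods=["POST"])
def create_feature_group():
    definition = request.get_json()               # SQL query + configs from the dashboard
    group_id = save_feature_group(definition)     # hypothetical: insert/update the metadata entry
    if definition["execution_frequency"] == "realtime":
        submit_streaming_job(group_id)            # hypothetical: spark-submit via bash/supervisor
    else:
        # hypothetical: add a task to the DAG matching the scheduling frequency
        register_airflow_task(group_id, definition["execution_frequency"])
    return jsonify({"feature_group_id": group_id}), 201
```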

Airflow:

Airflow, a job workflow scheduler, consists of DAGs (Directed Acyclic Graphs), which are collections of tasks reflecting their relationships and dependencies.

  • Each feature group corresponds to one task in a DAG.
  • DAGs are created based on the execution frequency of jobs. The hourly DAG contains all feature groups to be computed on an hourly basis. Multiple DAGs can be created based on frequency, e.g. 5_minute_DAG, 15_minute_DAG, hourly, daily, etc.
  • A typical Airflow setup for feature groups (Spark batch jobs) is sketched after this list.
  • The feature group id (referencing the group entry in metadata) is passed as an application argument to the Spark job.
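A minimal sketch of such a DAG, assuming the Spark provider's SparkSubmitOperator is available; the feature group ids, entry-point path, and connection id below are placeholders:

```python
# Hypothetical hourly DAG: one SparkSubmitOperator task per feature group at this frequency.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="hourly_feature_groups",
    schedule_interval="@hourly",
    start_date=datetime(2021, 11, 1),
    catchup=False,
) as dag:
    # Feature group ids would come from the feature metadata store.
    for group_id in ["42", "43"]:                              # placeholder ids
        SparkSubmitOperator(
            task_id=f"feature_group_{group_id}",
            application="/opt/feature-engine/batch_job.py",    # placeholder path to the batch job
            application_args=["--feature-group-id", group_id],
            conn_id="spark_default",
        )
```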

Spark Jobs:

This is the core component of the whole feature engineering infrastructure. It takes care of the actual SQL execution over the compute cluster. Organizations can utilize their existing Spark clusters to power feature engineering.

  • Spark jobs (batch + streaming) support reading input data from various data sources like Kafka, Hive, S3, Hudi, feature stores, etc. New data sources can be added whenever required.
  • Features get computed & stored in the respective feature store.
  • These jobs can be run over Hadoop YARN clusters, AWS EMR, Kubernetes, etc.
  • Various SQL functions can be used from Hive, Spark, and custom libraries. Custom libraries consist of functions/UDFs for one's business use cases, and these are reusable across feature groups.
  • Additionally, Hivemall is used to power many data science functionalities. Hivemall is a collection of machine learning algorithms and versatile data analytics functions. It provides a number of easy-to-use machine learning functionalities through the Apache Hive UDF/UDAF/UDTF interface. The list of supported functions can be found here.
  • If the optional feature group config is not provided by the user, it is inferred after executing the feature group SQL in the Spark job, but only during the very first run of the given feature group, and is subsequently updated in the feature metadata.
1. Spark Batch Job Skeleton:
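A minimal PySpark sketch of what such a batch job might look like, assuming hypothetical fetch_feature_group() and write_to_feature_store() helpers for the metadata lookup and the feature store sink:

```python
# Minimal batch-job sketch: run the feature group's SQL and persist the result.
import sys

from pyspark.sql import SparkSession

def main(feature_group_id: str) -> None:
    spark = (SparkSession.builder
             .appName(f"feature-group-{feature_group_id}")
             .enableHiveSupport()
             .getOrCreate())

    meta = fetch_feature_group(feature_group_id)   # hypothetical: read entry from feature metadata
    df = spark.sql(meta["sql_query"])              # sources (Hive/S3/Hudi/feature store) resolved by the query

    # On the very first run, the optional config (features, types, defaults) can be
    # inferred from df.schema and written back to the metadata store (not shown).
    write_to_feature_store(df, meta)               # hypothetical: write to the batch feature store (Hive/S3)

    spark.stop()

if __name__ == "__main__":
    main(sys.argv[1])   # feature group id is passed as the application argument
```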

2. Spark Streaming Job Skeleton:
Generally, realtime feature computation comes with additional challenges compared to batch feature computation. With this SQL-based architecture, the complexity gets reduced to a large extent, and a similar approach (as for batch features) can be extended to it.
Temporary views are created on the streaming datasets, and the SQL queries are executed over them.
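A rough structured-streaming counterpart, again assuming hypothetical fetch_feature_group(), parse_events(), and write_to_realtime_store() helpers for metadata lookup, payload parsing, and the realtime (e.g. Redis-backed) feature store sink:

```python
# Minimal streaming-job sketch: register a temp view on the stream and run the group's SQL over it.
from pyspark.sql import SparkSession

def run(feature_group_id: str) -> None:
    spark = (SparkSession.builder
             .appName(f"rt-feature-group-{feature_group_id}")
             .getOrCreate())

    meta = fetch_feature_group(feature_group_id)            # hypothetical metadata lookup
    src = meta["source_details"]

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", src["brokers"])
           .option("subscribe", src["topic"])
           .load())

    events = parse_events(raw)                              # hypothetical: decode Kafka values into columns
    events.createOrReplaceTempView("events")                # temporary view over the stream
    features = spark.sql(meta["sql_query"])                 # the feature group's SQL runs on that view

    (features.writeStream
        .foreachBatch(lambda batch_df, _: write_to_realtime_store(batch_df, meta))  # hypothetical Redis sink
        .start()
        .awaitTermination())
```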

Advanced Spark Configurations:
It is not possible to use the same Spark configuration for all jobs. Additional Spark job configurations can be linked to the feature group entry in the metadata.
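For example, a per-group override could be stored alongside the metadata entry and passed through to spark-submit. The keys below are standard Spark properties; how they are attached to the entry is an assumption.

```python
# Hypothetical per-feature-group Spark overrides kept with the metadata entry.
spark_overrides = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "400",
}
```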

Thus, feature computation through SQL makes feature engineering accessible to a broader audience of data workers and can result in speed and efficiency boosts. Although it does not replace domain experts, it can assist them and cut down on the time spent developing new features. Given the productivity benefits, it is a handy tool for rapid and automated feature engineering.
