Representation: Feature Engineering  |  Machine Learning  |  Google for Developers (2024)


In traditional programming, the focus is on code. In machine learning projects, the focus shifts to representation. That is, one way developers hone a model is by adding and improving its features.

Mapping Raw Data to Features

The left side of Figure 1 illustrates raw data from an input data source; the right side illustrates a feature vector, which is the set of floating-point values comprising the examples in your data set. Feature engineering means transforming raw data into a feature vector. Expect to spend significant time doing feature engineering.

Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.


Figure 1. Feature engineering maps raw data to ML features.

Mapping numeric values

Integer and floating-point data don't need a special encoding because they can be multiplied by a numeric weight. As suggested in Figure 2, converting the raw integer value 6 to the feature value 6.0 is trivial:


Figure 2. Mapping integer values to floating-point values.
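In code, this numeric mapping is nothing more than a float cast (a minimal sketch; the variable names are illustrative):

```python
raw_value = 6                      # raw integer from the input data source
feature_value = float(raw_value)   # feature value the model can multiply by a weight

print(feature_value)  # 6.0
```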

Mapping categorical values

Categorical features have a discrete set of possible values. For example, there might be a feature called street_name with options that include:

{'Charleston Road', 'North Shoreline Boulevard', 'Shorebird Way', 'Rengstorff Avenue'}

Since models cannot multiply strings by the learned weights, we use feature engineering to convert strings to numeric values.

We can accomplish this by defining a mapping from the feature values, which we'll refer to as the vocabulary of possible values, to integers. Since not every street in the world will appear in our dataset, we can group all other streets into a catch-all "other" category, known as an OOV (out-of-vocabulary) bucket.

Using this approach, here's how we can map our street names to numbers:

  • map Charleston Road to 0
  • map North Shoreline Boulevard to 1
  • map Shorebird Way to 2
  • map Rengstorff Avenue to 3
  • map everything else (OOV) to 4
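This vocabulary-plus-OOV mapping can be sketched with a plain dictionary lookup (a minimal illustration; the function name is ours, not from the course):

```python
# Vocabulary of known street names; index 4 is the OOV bucket.
VOCAB = {
    'Charleston Road': 0,
    'North Shoreline Boulevard': 1,
    'Shorebird Way': 2,
    'Rengstorff Avenue': 3,
}
OOV_INDEX = 4

def street_to_index(street_name):
    """Map a street name to its vocabulary index, or to the OOV bucket if unseen."""
    return VOCAB.get(street_name, OOV_INDEX)

print(street_to_index('Shorebird Way'))  # 2
print(street_to_index('Main Street'))    # 4 (out of vocabulary)
```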

However, if we incorporate these index numbers directly into our model, it will impose some constraints that might be problematic:

  • We'll be learning a single weight that applies to all streets. For example, if we learn a weight of 6 for street_name, then we will multiply it by 0 for Charleston Road, by 1 for North Shoreline Boulevard, by 2 for Shorebird Way, and so on. Consider a model that predicts house prices using street_name as a feature. It is unlikely that there is a linear adjustment of price based on the street name, and furthermore this would assume you have ordered the streets based on their average house price. Our model needs the flexibility of learning different weights for each street that will be added to the price estimated using the other features.

  • We aren't accounting for cases where street_name may take multiple values. For example, many houses are located at the corner of two streets, and there's no way to encode that information in the street_name value if it contains a single index.

To remove both these constraints, we can instead create a binary vector for each categorical feature in our model that represents values as follows:

  • For values that apply to the example, set corresponding vector elements to 1.
  • Set all other elements to 0.

The length of this vector is equal to the number of elements in the vocabulary. This representation is called a one-hot encoding when a single value is 1, and a multi-hot encoding when multiple values are 1.

Figure 3 illustrates a one-hot encoding of a particular street: Shorebird Way. The element in the binary vector for Shorebird Way has a value of 1, while the elements for all other streets have values of 0.
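The encoding in Figure 3 can be sketched as follows (a minimal illustration using the five-element vocabulary above, where Shorebird Way has index 2):

```python
VOCAB_SIZE = 5  # 4 known streets + 1 OOV bucket

def one_hot(index, size=VOCAB_SIZE):
    """Return a binary vector with a single 1 at the given vocabulary index."""
    vec = [0] * size
    vec[index] = 1
    return vec

print(one_hot(2))  # Shorebird Way -> [0, 0, 1, 0, 0]
```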


Figure 3. Mapping street address via one-hot encoding.

This approach effectively creates a Boolean variable for every feature value (e.g., street name). Here, if a house is on Shorebird Way then the binary value is 1 only for Shorebird Way. Thus, the model uses only the weight for Shorebird Way.

Similarly, if a house is at the corner of two streets, then two binary valuesare set to 1, and the model uses both their respective weights.
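The corner-of-two-streets case is a multi-hot encoding, which we can sketch by setting multiple elements to 1 (street indices follow the illustrative vocabulary above):

```python
def multi_hot(indices, size=5):
    """Return a binary vector with a 1 at every index in `indices`."""
    vec = [0] * size
    for i in indices:
        vec[i] = 1
    return vec

# A house at the corner of Charleston Road (0) and Shorebird Way (2):
print(multi_hot([0, 2]))  # [1, 0, 1, 0, 0]
```

With this encoding, the model's prediction sums the weights for both streets, giving each street its own learned contribution.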

Sparse Representation

Suppose that you had 1,000,000 different street names in your data set that you wanted to include as values for street_name. Explicitly creating a binary vector of 1,000,000 elements where only 1 or 2 elements are true is a very inefficient representation in terms of both storage and computation time when processing these vectors. In this situation, a common approach is to use a sparse representation in which only nonzero values are stored. In sparse representations, an independent model weight is still learned for each feature value, as described above.
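A sparse representation can be sketched by storing only the indices of the nonzero elements instead of the full binary vector (a minimal illustration; libraries such as scipy.sparse provide production-grade versions of this idea):

```python
VOCAB_SIZE = 1_000_000  # hypothetical vocabulary of 1,000,000 street names

def sparse_multi_hot(indices):
    """Store only the positions whose value is 1, instead of a dense vector."""
    return sorted(set(indices))

# A house on two streets out of the 1,000,000-street vocabulary:
sparse = sparse_multi_hot([42, 901_233])
print(sparse)       # [42, 901233]
print(len(sparse))  # 2 stored values instead of 1,000,000
```

A per-street weight is still learned: the model simply looks up and sums the weights at the stored indices rather than multiplying a mostly-zero dense vector.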


Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2022-07-18 UTC.
