Feature types for data exploration and validation

About This Workshop

The feature type system lets data scientists separate how data is physically represented from what the data actually measures. That is, feature types classify data by what it represents, not by how it is stored in memory. Each set of data can have multiple feature types through a system of multiple inheritance. As a concrete example, an organization that sells cars might have a set of data that represents the purchase price of a car, that is, the wholesale price. That data could have the feature types wholesale_price, car_price, USD, and continuous.
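To make the multiple-inheritance idea concrete, here is a minimal sketch in plain Python. It is not the framework's own API: the FeatureType base class and the feature_type_names helper are hypothetical names introduced only for illustration. It shows how one feature type can inherit from several others, so a single column carries all of them at once.

```python
class FeatureType:
    """Hypothetical base class, standing in for whatever base the framework provides."""


class Continuous(FeatureType):
    """How the value is stored: a real-valued number."""


class USD(FeatureType):
    """What unit the value is expressed in."""


class CarPrice(FeatureType):
    """What the value describes: the price of a car."""


class WholesalePrice(CarPrice, USD, Continuous):
    """Most specific type; inherits from all three feature types above."""


def feature_type_names(feature_type):
    """Return the feature type and its ancestors, most specific first."""
    return [
        cls.__name__
        for cls in feature_type.__mro__
        if cls not in (FeatureType, object)
    ]


print(feature_type_names(WholesalePrice))
# → ['WholesalePrice', 'CarPrice', 'USD', 'Continuous']
```

The order comes from Python's method resolution order, so a method defined on the most specific type would take precedence over the same method on a more general one.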

All default feature types have methods for creating summary statistics and a plot to represent the data. This allows you to have summary information for each feature of your dataset while only using a single command. However, the default feature types may not provide the exact details needed in your specific use case. Therefore, feature types have been designed with expandability in mind. When creating a new feature type, the summary statistics and plots that are specific to your feature type can be customized.
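The idea of customizing summary statistics per feature type can be sketched with plain pandas. This is an illustration under stated assumptions, not the framework's API: default_stat, wholesale_price_stat, STAT_BY_TYPE, and feature_stat are hypothetical names, and the chosen statistics are examples only.

```python
import pandas as pd


def default_stat(series: pd.Series) -> pd.Series:
    """Generic summary used when a feature type defines nothing more specific."""
    return series.describe()


def wholesale_price_stat(series: pd.Series) -> pd.Series:
    """Hypothetical custom summary for a wholesale_price feature type."""
    return pd.Series(
        {
            "count": series.count(),         # non-missing values
            "missing": series.isna().sum(),  # missing values
            "median": series.median(),
            "total": series.sum(),           # total amount spent
        }
    )


# Hypothetical registry mapping a feature type name to its summary function.
STAT_BY_TYPE = {"wholesale_price": wholesale_price_stat}


def feature_stat(series: pd.Series, feature_type: str) -> pd.Series:
    """Dispatch to the custom summary if one exists, else use the default."""
    return STAT_BY_TYPE.get(feature_type, default_stat)(series)


prices = pd.Series([18500.0, 22000.0, None, 30500.0], name="wholesale_price")
summary = feature_stat(prices, "wholesale_price")
print(summary)
```

A feature type without an entry in the registry falls back to the generic summary, which mirrors the way default feature types supply statistics until you override them.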

The feature type system works at the Pandas dataframe and series levels. This allows you to create summary information across all of your data and at the same time dig into the specifics of one feature.

The feature type framework comes with some common feature types. However, the power of using feature types is that you can easily create your own and apply them to your specific data. You don't need to represent your data in a synthetic way that does not match its nature. The framework allows you to create methods that validate whether the data fits the specifications of your organization. For example, for a medical record type, you could create methods to validate that the data is properly formatted. You can also have the system generate warnings to ensure the data is valid as a whole, or create graphs for summary plots.

The framework allows you to create and assign multiple feature types. For example, a medical record ID could also have the ID feature type and the integer feature type. It also allows you to customize summary statistics, plots, and correlations, and to select columns based on feature types.
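Selecting columns by feature type can be pictured with a small pandas sketch. The mapping of columns to feature types and the select_by_feature_type helper are hypothetical and exist only for this illustration; the framework provides its own mechanism.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "record_id": [101, 102, 103],
        "age": [34, 58, 41],
        "visit_cost": [120.50, 89.99, 310.00],
    }
)

# Hypothetical assignment of several feature types to each column.
feature_types = {
    "record_id": ["medical_record_id", "id", "integer"],
    "age": ["integer"],
    "visit_cost": ["usd", "continuous"],
}


def select_by_feature_type(frame, types_by_column, include):
    """Keep only the columns that carry the requested feature type."""
    keep = [
        column
        for column in frame.columns
        if include in types_by_column.get(column, [])
    ]
    return frame[keep]


integer_columns = select_by_feature_type(df, feature_types, "integer")
print(list(integer_columns.columns))
# → ['record_id', 'age']
```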

Feature type warnings are used for rapid validation of the data. For example, the wholesale_price feature type might have a check that ensures the value is a positive number, because you can't purchase a car with negative money. The car_price feature type may have a check to ensure that the value is within a reasonable price range. The USD feature type can check that the value represents a valid US dollar amount; for example, it can't have values below one cent. The continuous feature type is the default feature type, and it represents the way the data is stored internally.
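The checks described above could be sketched as small warning functions that each inspect a whole series and report a problem if they find one. This is a plain pandas illustration, not the framework's API: the function names, the collect_warnings helper, and the price bounds of 500 and 500,000 are all assumptions made for the example.

```python
import pandas as pd


def negative_price_warning(series: pd.Series):
    """wholesale_price check: a purchase price cannot be negative."""
    count = int((series < 0).sum())
    return f"{count} negative value(s)" if count else None


def price_range_warning(series: pd.Series, low=500.0, high=500_000.0):
    """car_price check: values should fall in a plausible range (bounds assumed)."""
    count = int(((series < low) | (series > high)).sum())
    return f"{count} value(s) outside [{low}, {high}]" if count else None


def collect_warnings(series: pd.Series, checks) -> pd.DataFrame:
    """Run every check and tabulate the ones that reported a problem."""
    rows = []
    for name, check in checks.items():
        message = check(series)
        if message is not None:
            rows.append({"warning": name, "message": message})
    return pd.DataFrame(rows, columns=["warning", "message"])


prices = pd.Series([18500.0, -250.0, 22000.0, 4_000_000.0])
report = collect_warnings(
    prices,
    {"negative_price": negative_price_warning, "price_range": price_range_warning},
)
print(report)
```

Because each feature type contributes its own checks, a column that carries several feature types would accumulate the warnings from all of them in one report.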

The feature type validators are a set of is_* methods, where * is generally the name of the feature type. For example, the method .is_wholesale_price() can create a boolean Pandas series that indicates which values meet the validation criteria. This lets you quickly identify which values need to be filtered or require further examination for problems in the data pipeline. The feature type validators can be as complex as they need to be. For example, they might take a client ID and call an API to validate that each client ID is active.
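A validator of this kind can be sketched as a function that returns one boolean per value. The version below is a standalone pandas function written for illustration, not the framework's .is_wholesale_price() method, and its rule (a value must be numeric and positive) is an assumption.

```python
import pandas as pd


def is_wholesale_price(series: pd.Series) -> pd.Series:
    """Return True for each value that is numeric and strictly positive."""
    numeric = pd.to_numeric(series, errors="coerce")  # non-numeric becomes NaN
    return numeric.gt(0)  # comparisons against NaN evaluate to False


raw = pd.Series([18500.0, -250.0, None, "n/a", 22000.0])
valid = is_wholesale_price(raw)

print(valid.tolist())
# → [True, False, False, False, True]

# The boolean series works directly as a mask to keep or inspect values.
clean = raw[valid]
suspect = raw[~valid]
```

Keeping the suspect rows rather than dropping them silently makes it easier to trace where in the data pipeline the bad values came from.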

Workshop Info

1 hour, 30 minutes
  • Lab 1: Configure the Data Science Service
  • Lab 2: Create a Project
  • Lab 3: Create a Notebook Session
  • Lab 4: Accelerated Data Science SDK
  • Lab 5: Feature Type Tutorial
  • Lab 6: Shutting Down a Notebook Session

Prerequisites
  • Familiarity with Oracle Cloud Infrastructure (OCI) is helpful.
  • Familiarity with the Data Science service is desirable, but not required.
  • Basic understanding of the Python programming language.
  • Experience with exploratory data analysis (EDA) and data cleaning is desirable.
