Data Preprocessing with Orange Tool
Preprocessing is a key step in data science, and Orange offers several ways to do it.
Introduction
Preprocessing is crucial for achieving better-quality analysis results. The Preprocess widget offers several preprocessing methods that can be combined in a single preprocessing pipeline. Some methods are available as separate widgets, which offer advanced techniques and greater parameter tuning.
On the Orange canvas, drag the Python Script widget from the left panel and double-click it to open its editor.
Discretization
For certain tasks you may want to resort to binning, which is what Discretize does. It distributes the values of a continuous variable into a selected number of bins, turning the variable into a discrete one. You can either discretize all your variables at once using one selected discretization type, or choose a particular discretization method for each attribute.
import Orange

# Load the Iris dataset bundled with Orange
store = Orange.data.Table("iris.tab")
# Discretize every continuous feature into 3 equal-frequency bins
discretizer = Orange.preprocess.Discretize()
discretizer.method = Orange.preprocess.discretize.EqualFreq(n=3)
d_store = discretizer(store)

print("Original dataset:")
for e in store[:3]:
    print(e)

print("Discretized dataset:")
for e in d_store[:3]:
    print(e)
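To make the idea of equal-frequency binning concrete, here is a minimal pure-Python sketch. It is not Orange's implementation: the helper names `equal_freq_cut_points` and `assign_bin` and the sample values are hypothetical, and the thresholds are simply placed halfway between neighbouring sorted values.

```python
def equal_freq_cut_points(values, n_bins):
    """Return n_bins - 1 thresholds that split values into bins of roughly equal size."""
    ordered = sorted(values)
    size = len(ordered)
    points = []
    for k in range(1, n_bins):
        idx = k * size // n_bins
        # Assumption: put each threshold halfway between two neighbouring values
        points.append((ordered[idx - 1] + ordered[idx]) / 2)
    return points


def assign_bin(value, cut_points):
    """Return the bin index, i.e. how many thresholds the value reaches or exceeds."""
    return sum(value >= p for p in cut_points)


# Made-up measurements, used only for illustration
petal_lengths = [1.4, 1.3, 1.5, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9]
cuts = equal_freq_cut_points(petal_lengths, 3)
bins = [assign_bin(v, cuts) for v in petal_lengths]
print(cuts)
print(bins)
```

Each of the three bins ends up with the same number of values, which is the property that distinguishes equal-frequency binning from equal-width binning.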
Continuization
Continuize essentially creates new attributes out of your discrete ones. If you have, for example, an attribute for people’s eye color with the values blue, brown, or green, you would probably want three separate attributes, ‘blue’, ‘green’, and ‘brown’, each set to 0 or 1 depending on whether a person has that eye color. Some learners perform much better when data is transformed this way. Alternatively, you can treat one value as the normal condition, encoded as 0, and record only deviations from it: the base can be the target or first value (‘target or first value as base’) or the most common value (‘most frequent value as base’). The Continuize widget offers a lot of room to experiment.
import Orange
titanic = Orange.data.Table("titanic")
continuizer = Orange.preprocess.Continuize()
titanic1 = continuizer(titanic)
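The eye-color example can be sketched in plain Python to show what the two encodings look like. This is only an illustration of the idea, not how Orange implements Continuize; the helpers `one_hot` and `encode_with_base` and the sample data are hypothetical.

```python
from collections import Counter


def one_hot(value, categories):
    """One indicator (0 or 1) per category."""
    return [1 if value == c else 0 for c in categories]


def encode_with_base(value, categories, base):
    """Drop the indicator of the base value, so the base is encoded as all zeros."""
    return [1 if value == c else 0 for c in categories if c != base]


# Made-up sample, used only for illustration
eye_colors = ["blue", "brown", "green", "brown"]
categories = ["blue", "brown", "green"]

# One indicator attribute per value
full = [one_hot(c, categories) for c in eye_colors]

# 'Most frequent value as base': the most common value becomes the all-zero row
most_frequent = Counter(eye_colors).most_common(1)[0][0]
reduced = [encode_with_base(c, categories, most_frequent) for c in eye_colors]

print(full)
print(most_frequent, reduced)
```

With a base value, every remaining indicator reads as a deviation from the normal state, and the encoding needs one attribute fewer.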
Normalization
Normalize constructs a preprocessor for normalizing features. Given a data table, the preprocessor returns a new table in which the continuous attributes are normalized.
from Orange.data import Table
from Orange.preprocess import Normalize
data = Table("iris.tab")
normalizer = Normalize(norm_type=Normalize.NormalizeBySpan)
normalized_data = normalizer(data)
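As a rough illustration of normalization by span, the sketch below rescales one column so that its minimum maps to 0 and its maximum to 1. It assumes that this is the effect intended by `NormalizeBySpan`; the function `normalize_by_span` and the sample values are hypothetical and do not come from Orange.

```python
def normalize_by_span(column):
    """Rescale values to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        # Assumption: a constant column is mapped to zeros to avoid dividing by zero
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]


# Made-up measurements, used only for illustration
sepal_lengths = [4.0, 5.0, 6.0, 8.0]
scaled = normalize_by_span(sepal_lengths)
print(scaled)
```

Rescaling this way keeps the relative spacing of the values while putting attributes measured in different units on a comparable scale.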
Randomization
Given a data table, the Randomize preprocessor returns a new table in which the data is shuffled. It is provided by the Randomize class of the Orange library:
class Orange.preprocess.Randomize(rand_type=Randomize.RandomizeClasses, rand_seed=None)
Construct a preprocessor for randomization of classes, attributes and/or metas.
from Orange.data import Table
from Orange.preprocess import Randomize
data = Table("iris")
randomizer = Randomize(Randomize.RandomizeClasses)
randomized_data = randomizer(data)
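What shuffling the classes means can be sketched without Orange: the feature values stay with their rows while the class labels are permuted among them. The function `shuffle_classes` and the sample rows are hypothetical, and the `seed` argument only plays a role similar to `rand_seed` in the signature above.

```python
import random


def shuffle_classes(rows, seed=None):
    """Return new rows with the same features but class labels permuted."""
    rng = random.Random(seed)  # a fixed seed makes the shuffle reproducible
    labels = [label for _, label in rows]
    rng.shuffle(labels)
    return [(features, label) for (features, _), label in zip(rows, labels)]


# Made-up rows of (features, class label), used only for illustration
rows = [
    ((5.1, 3.5), "setosa"),
    ((7.0, 3.2), "versicolor"),
    ((6.3, 3.3), "virginica"),
    ((4.9, 3.0), "setosa"),
]
shuffled = shuffle_classes(rows, seed=42)
for features, label in shuffled:
    print(features, label)
```

Breaking the link between features and labels in this way is commonly used to check whether a model's score is better than what it would achieve on data with no real signal.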