Alternatives to Iris: Finding Drop-In Replacements for Overused Example Datasets
Are you a little tired of seeing the Iris dataset being used in so many code packages and tutorials? Me too. What follows is an exploration of why the Iris dataset is so common as an example dataset, what features we might want to replicate in drop-in replacements for it, how we might find (or make) such a replacement, and some options for sharing such a replacement with others.
What Do We Mean By Example Datasets?
Example datasets are datasets packaged with a software application or code library, used in tutorials for how to do something, or compiled in lists of “good starter datasets”. They are often used when the actual content represented by the dataset is secondary to just having one that is easy to work with. Example datasets are reused to an extreme degree, to the point where some of the “standards” appear in thousands of places.
What Makes for a Good Example Dataset?
- Column names are immediately understandable by everyone.
- Data content revolves around a question almost anyone could relate to.
- No nulls or other complications that require data preparation.
- Data type, range, and distribution make the tasks you’re using the example dataset for possible.
Basically, a good example dataset should require as close to nothing on the part of the end-user as possible in either operations or thought.
Why Might We Want Alternatives to the Few Highly Reused Example Datasets?
The Dataset Isn’t Great For Showing ____ Technique
Sometimes an example dataset just isn’t great at showing a particular technique or method. It can be done, but it might be a poor example of that specific thing. For example, the Iris dataset doesn’t have columns with properties that allow new features to be engineered from the raw data and used in machine learning. It is only a good example for doing operations based on the original columns. This is explained a bit more in this blog post by another author.
Certain datasets can also have problematic histories. The Iris dataset is a good example of this: it was originally published in the Annals of Eugenics by R. A. Fisher, who held racist views on genetics.
Seeing the same examples again and again can also lead to boredom. Some users simply wish they had a different dataset to work with; others wish another dataset were available for comparison purposes.
Example Datasets as a Way to Increase Engagement in a Subject/Problem/Organization
Example datasets can also be a way to bring eyeballs and brains to data on a particular topic. Providing a good example dataset could offer benefits to the dataset supplier.
Example datasets can be a way to help people engage with a topic, subject matter, or cause.
For open-data sites that provide a catalog of open-data from a city, state, governmental agency, or non-profit, being able to promote a few datasets as potential drop-in example dataset replacements could be a way to draw users to their content.
There aren’t many good examples of this, but there are datasets that people have taken, sometimes repackaged into easier-to-use forms, and reused in many different side projects and tutorials. HadCRUT4 is an example of a climate dataset that has been widely used by end users of a wide variety of skill levels. You might be familiar with it as the source data behind the climate stripes visualization. It and NASA’s GISS Surface Temperature Analysis (GISTEMP v4) dataset have been used as example datasets for multiple climate data visualization challenges and side projects.
Suggested Alternatives to Iris by Others
If you google “alternatives to Iris dataset”, a variety of things pop up. Some of them are lists of alternatives. Here are two lists:
- 4 alternatives to IRIS: https://www.meganstodel.com/posts/no-to-iris/
- 10 datasets including iris for ML: https://machinelearningmastery.com/standard-machine-learning-datasets/
The Penguin Dataset (A drop-in replacement for Iris)
There is also a “penguin dataset” that has been put forth by several authors as “the” replacement for Iris, as it replicates many of the traits of the original Iris dataset. The dataset is available in a GitHub repository, as a Kaggle dataset with an explanation, and as a tutorial on Towards Data Science, a Medium publication, and it has been used in many tutorials for methods or code packages, such as this Streamlit example. The original creator of the dataset, Allison Horst, has a nice GitHub Pages site that walks through using the dataset for data exploration and visualization. The page contains visuals that do a good job of showing just how similar the dataset is to Iris, in both data structure and class distribution.
Why is the Iris Dataset so Popular?
It is hard to definitively “know” this. What follows are guesses.
The Iris dataset was first published in 1936. Its author published several important works in biostatistics. I first became familiar with the dataset not while writing code but in my high school biology class while being introduced to biostatistics. It is small, easy to understand, and has been around for a long time.
In addition to the Iris dataset’s history predating widespread data analysis and data visualization with code, there’s its more recent history as an example dataset in all the places people find example datasets. Although there are many data archives out there, few of them specialize in example datasets. The University of California Irvine Machine Learning Repository was started in 1987 by David Aha and others. It is one of the options you’ll get in a Google search and probably the oldest of the results. It has Iris as one of its example dataset options. Scikit-learn, TensorFlow, and the R language all have the Iris dataset built in. Tableau, Kaggle, and other online tools also feature the Iris dataset.
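To see just how baked in the dataset is: in Python it ships inside scikit-learn itself, so loading it takes two lines and no download (assuming scikit-learn is installed; in R it is simply the built-in `iris` data frame).

```python
# Iris is bundled with scikit-learn -- no download or file handling needed.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # (150, 4): 150 rows, 4 numeric feature columns
print(iris.target_names)   # the three classes: setosa, versicolor, virginica
print(iris.feature_names)  # sepal/petal length and width, in cm
```

That zero-friction availability is a big part of why tutorial authors keep reaching for it.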
When people need a small dataset with numerical columns and categorical labels without any nulls and some overlap between the classes, they think of the Iris Dataset because they’ve already seen it so much.
Dataset Attributes (how the information is represented as data)
One of the main advantages of the Iris Dataset is that it is simple. There are no nulls. The data is all numerical. The presence of strings or categorical columns would make a dataset slightly harder to work with, and, in some applications, require converting the categories to numerical fields. The fact the user doesn’t have to deal with this, missing values, or any other complication makes it easy to work with as an example dataset.
Additionally, the three classes of Iris overlap but not completely. This is a useful characteristic as it makes it amenable to tasks that involve prediction of classes, uncertainty analysis, and visualization. If the classes were extremely different, the dataset wouldn’t show uncertainty as well, either numerically or visually. At the same time, if the three classes of Iris overlapped perfectly, applying prediction to it would feel like a waste.
Part of what makes Iris a good example dataset is the distribution of its data works for a variety of tasks.
Dataset Content (what the data represents)
To be a good example dataset, the data has to represent something people can easily and quickly relate to.
The Iris dataset revolves around the question “what type of flower is this?”, a simple, common question that pretty much everyone has pondered at one time or another.
What Data Attributes of Iris Might We Want to Mimic in a Drop-In Replacement?
The names may be in Latin, but there are only three classes. Even if the end user doesn’t have a lot of familiarity with different irises, they likely understand the concept that there are different types of them. The class names for any drop-in Iris replacement shouldn’t require any extra work by the end user to understand the dataset.
A flat data structure is generally easier for people to work with in the widest variety of tools. This means there is no nesting. There are only columns and rows. Each instance of an iris is a new row. All rows have the same number of columns.
The Iris dataset is stored in a variety of formats in different places by different parties. If one were storing a new dataset, CSV is likely the easiest file format to open, making the dataset usable by the largest number of people. If a dataset can be stored as CSV, it can be stored in anything more complicated.
Number of Columns:
The Iris dataset has four data columns. Datasets with only one or two data columns probably rule out some tasks. Datasets with 200 columns require too much investigation by the end user. Although there are probably exceptions, we’re likely looking for a dataset with 3–10 data columns in addition to the class column.
Data Type of Columns:
Numbers are the easiest data type for most people to use. Categorical strings are next easiest. Nulls require additional actions by the end-user. Arrays and dictionaries nested inside columns again require additional actions by the end-user and might be out of reach of some users’ technical abilities.
What Data Content Characteristics Might We Want to Mimic in a Drop-In Replacement?
These characteristics are harder to define. They have to do with whether people can understand the dataset without additional information and whether they can relate to the question it poses.
As an example, let’s imagine an example dataset with class labels “A”, “B”, “C” and feature columns named “something”, “somethingElse”, and “otherThing”. The data columns contain floats with a distribution identical to Iris. Would that make a good example dataset? Probably not. It isn’t relatable.
Ideally, column names should be understandable to anyone. Users shouldn’t have to read material about the dataset. Ironically, it could be argued the Iris dataset fails this, as not that many people could define a ‘sepal’, the length and width of which make up two columns in the Iris dataset. Wikipedia has a nice definition of sepal with pictures here.
What Question Does the Dataset Answer:
A dataset that answers an obvious question is preferable to one that does not. An abstract way to phrase the question the Iris dataset helps answer is “there are several types of ___, and this data can be used to distinguish each type”. That phrasing gives us some guidance on where to look for drop-in replacements that answer similar questions. There are many other things that come in classes, with multiple numerical characteristics describing instances of each type, whose data distributions only partially separate.
Animals, plants, minerals, rock types, planetary bodies, cars, book genres, and movie genres are all potential places to look for drop-in replacement example datasets.
Can we Find Alternatives Programmatically?
Downloading many datasets one after another to examine their characteristics by hand does not sound like a fun experience. Doing the same task programmatically might speed things up a bit.
Data Characteristics That Could Be Determined Using Code
- File format
- Data structure (flat or nested)
- Number of columns
- Number of rows
- Number of labels
- Data types of each column
- Level of overlap between the data for each label class
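Most of the items above can be checked with a short script. Here is a minimal sketch of such a profiler using only the Python standard library; the tiny inline CSV is a made-up stand-in for a real downloaded file, and the class-overlap check is omitted since it requires actual statistics rather than simple counting.

```python
import csv
import io

def profile_csv(text, label_column):
    """Profile a flat CSV: row/column counts, label count, column types, nulls."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    profile = {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "n_labels": len({r[label_column] for r in rows}),
        "has_nulls": any(v == "" for r in rows for v in r.values()),
        "column_types": {},
    }
    for col in columns:
        values = [r[col] for r in rows]
        try:
            [float(v) for v in values]          # every value parses as a number
            profile["column_types"][col] = "numeric"
        except ValueError:
            profile["column_types"][col] = "string"
    return profile

# A tiny Iris-like stand-in for a real downloaded file.
sample = """sepal_length,petal_length,species
5.1,1.4,setosa
7.0,4.7,versicolor
6.3,6.0,virginica
"""
print(profile_csv(sample, "species"))
```

Run over a folder of downloaded CSVs, a profiler like this could quickly shortlist the small, flat, null-free, mostly numeric candidates worth a human look.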
Difficulties Programmatically Profiling Datasets in Bulk
Although code exists to load CSVs and JSONs and determine the features above, getting the many files programmatically in the first place is often the bigger challenge. Many large data catalogs don’t hold the datasets themselves, only the metadata. Often this metadata lacks a direct link to the files and instead has a link back to the original data catalog’s landing page. Programmatically getting datasets from a data catalog that doesn’t hold the data files itself can get very complicated very quickly. Too many data catalogs assume a human will always be in the loop somewhere in the process. When data profiling works, it is often because you are working with a data catalog that holds the data itself and whose metadata contains both direct download links and file format information.
When Searching for Potential Alternative Datasets, What Likely Requires a Human?
Certain dataset characteristics are hard to get at programmatically. What question a dataset seeks to answer and whether that question is one a large percent of the user population will be able to relate to easily is difficult to get programmatically from the data and metadata alone. Whether column names are understandable on first glance is another question that is hard to answer programmatically.
Where to Find Potential Example Datasets?
When searching for open data that can be reused as example datasets, you can start in very large meta-catalogs that ingest smaller catalogs, or you can start in data catalogs focused on a single topic that hold the data themselves. The former is better for random discovery, but the filtering power is often poor. The latter option often has better filtering capabilities, and you’re more likely to be able to filter based on file format, number of columns, file size, etc.
Data.gov is a collection of data catalogs from various federal agencies. It is extremely large. Unfortunately, most of the datasets lack good metadata on file format or file size. They also frequently don’t have direct download links to files, requiring a human to click through a few pages to get to the actual dataset.
There are also datasets available for specific categories of things. For example, this is a mineral database. Although it is very large, it is possible some small subset of it could be used as a replacement for Iris. In fact, you might be able to create a program that could generate different combinations of subsets of this dataset for use as example datasets. There are likely other data catalogs focused on large categories of ___ that might be good searching grounds for iris replacements.
Instead of Finding Datasets that would Make Good Example Datasets, Could We Create One?
From a code standpoint, it should be very easy to create a small CSV with three string labels and four columns of numerical data. Generating data with the distributions necessary for a range of tasks is slightly more difficult. Perhaps the hardest part is coming up with a story about those fake values that is interesting, immediately understood, and not diminished by the fact that the data is fake.
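Sketching just that first, easy step: the snippet below generates an Iris-shaped CSV (150 rows, three classes, four numeric columns) with partially overlapping Gaussian clusters. The mineral class names, column names, and means are entirely made up for illustration, borrowing the mineral idea from the catalog example above, and the hard part (a believable story) is not something code can solve.

```python
import csv
import random

random.seed(42)  # reproducible fake data

# Hypothetical classes and per-class feature means (made up for illustration).
classes = {
    "quartz":   (2.0, 5.0, 1.0, 7.0),
    "feldspar": (2.6, 6.1, 1.5, 6.0),
    "mica":     (3.1, 6.8, 2.2, 2.5),
}
columns = ["density", "hardness", "luster_score", "clarity_score", "mineral"]

with open("fake_minerals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    for name, means in classes.items():
        for _ in range(50):  # 50 rows per class, 150 total, like Iris
            # A shared spread of 0.6 makes the clusters overlap partially,
            # mimicking Iris's partially separated classes.
            row = [round(random.gauss(mu, 0.6), 2) for mu in means]
            writer.writerow(row + [name])
```

Tuning the means and the spread controls how separable the classes are, which is exactly the knob that makes a dataset suitable (or not) for classification and uncertainty examples.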
How Might We Share Alternative Example Datasets with Others?
Submit a Pull Request to Add Your Example Dataset to the Example Datasets Used by a Heavily Used Code Package
Add Your Replacement Example Dataset to Popular Lists of Example Datasets
Provide Easy to Use Example Datasets on Front-page of a Data Catalog
Large data catalogs often have the problem that most members of the general public want simple, easy, small datasets to work with for a side project, a hackathon, or a homework assignment. Unfortunately, their search results are swamped by the many large, complex datasets meant for specialists that make up the bulk of the catalog.
Presenting a few example datasets from within a larger data catalog, and pointing out which well-known example datasets they’re similar to, might be a way to get lower-skill end users to take advantage of large data catalogs.
Related Things that Don’t Exist Yet But Could…
There are a variety of things that could be built that relate to the ideas discussed in this post.
Themed Collections of Example Datasets
For an organization that wants to attract users to its data catalog, it might be possible to find a handful of datasets that could serve as drop-in replacements for the standard example datasets. These could be included in tutorials and code packages and offer a way for end users to discover the catalog. For example, what if NASA had example datasets that served as drop-in replacements for the standards: Iris, Boston Housing, Wine Scores, and Titanic?
Tooling to Help Find Similar Datasets
What if there was a data profiling tool that scanned CSVs and JSONs and identified datasets most similar to a user provided dataset? What if it was a CKAN add-on (CKAN is a common software for running open-data portals) that could be easily added to various open-data catalogs that already exist?
Let Users Pick From Multiple Similar Example Datasets
What if code packages had example dataset tooling that let people pick from not just 3–5 datasets, but 5 example dataset types and 3 examples of each type? Slightly different data distributions might let different aspects of the package be better understood.
Tooling to Create Fake Example Datasets that Mimic Well-known Ones
What if there was a simple web-app that would help people create fake datasets with interesting stories whose data distributions fit with tutorial goals?
Maybe these are things you could create?