Introductory Datasets

Posted by Matteo Kimura on May 17, 2019


The world is in the middle of a data revolution, and the skills of a data scientist are becoming increasingly valuable in most, if not all, industries. Data is the foundation of every data science and machine learning project, so the most fundamental skill of a data scientist is gathering and manipulating data effectively; only then can their other abilities make that data meaningful. To learn and develop these skills, it helps to know where to find data. This blog post compiles some data sources that you can use to build your skills as a data scientist or for personal projects.

Iris Flower Dataset

To start this post, we have the Iris flower dataset. It contains 150 samples from three species of Iris (Iris Setosa, Iris Versicolour, and Iris Virginica), with 50 plants per species. Each sample has five attributes: four measurements (sepal length, sepal width, petal length, and petal width) plus the species label. The data is already cleaned up and easy to work with, which makes it an excellent introductory set for people beginning to learn machine learning, and it is possibly the best-known dataset in pattern recognition literature. However, with so few samples, it is too small for more complicated data science projects.
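To give a feel for how little preparation the Iris data needs, here is a minimal sketch of a nearest-centroid classifier using only the Python standard library. It assumes the comma-separated layout of the UCI `iris.data` file (four measurements followed by the species name) and works on a six-row excerpt rather than the full 150 rows. The function names and the query flower are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict

# A six-row excerpt in the comma-separated layout of the UCI `iris.data` file:
# sepal length, sepal width, petal length, petal width (all in cm), species.
IRIS_EXCERPT = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""


def parse_rows(text):
    """Turn CSV text into (measurements, species) pairs."""
    rows = []
    for line in text.strip().splitlines():
        *values, species = line.split(",")
        rows.append((tuple(float(v) for v in values), species))
    return rows


def species_centroids(rows):
    """Average the four measurements separately for each species."""
    grouped = defaultdict(list)
    for measurements, species in rows:
        grouped[species].append(measurements)
    return {
        species: tuple(sum(column) / len(column) for column in zip(*samples))
        for species, samples in grouped.items()
    }


def classify(measurements, centroids):
    """Return the species whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda s: math.dist(measurements, centroids[s]))


centroids = species_centroids(parse_rows(IRIS_EXCERPT))
# Hypothetical flower with short petals, which places it near the setosa centroid.
prediction = classify((5.0, 3.4, 1.5, 0.2), centroids)
print(prediction)
```

With the full file, the same functions would work unchanged; a real project would also hold out some rows to check how often the predictions are right.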


MNIST Dataset

The second dataset we will be covering is the MNIST dataset. It contains 70,000 grayscale images of handwritten digits (60,000 for training and 10,000 for testing), each 28 by 28 pixels, built from data collected by the National Institute of Standards and Technology. Like the Iris flower dataset, MNIST lets people practice pattern recognition techniques without all the preprocessing that a data science project usually requires. It also provides significantly more data than the Iris flower dataset, and it probably has more practical applications. However, it is still an introductory set, and there are other places where you can find a wider variety of data for your data science projects.
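The original MNIST files are stored in a simple binary layout called IDX: a big-endian header (a magic number followed by the size of each dimension) and then one unsigned byte per pixel or label. The sketch below decodes that layout with the standard library only. The function names are hypothetical, and the demo uses two synthetic 2 by 2 "images" as a stand-in for real 28 by 28 digits, so treat it as an illustration of the format rather than a full loader.

```python
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, three dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, one dimension


def parse_idx_images(raw):
    """Decode IDX image bytes into a list of images (each a list of pixel rows)."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    pixels = raw[16:]
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header")
    images = []
    for i in range(count):
        start = i * rows * cols
        images.append(
            [list(pixels[start + r * cols:start + (r + 1) * cols]) for r in range(rows)]
        )
    return images


def parse_idx_labels(raw):
    """Decode IDX label bytes into a list of integers."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError(f"unexpected magic number {magic}")
    labels = list(raw[8:])
    if len(labels) != count:
        raise ValueError("label data does not match the header")
    return labels


# Synthetic stand-in: two 2x2 "images" instead of real 28x28 MNIST digits.
fake_images = struct.pack(">IIII", IMAGE_MAGIC, 2, 2, 2) + bytes([0, 255, 128, 64, 1, 2, 3, 4])
fake_labels = struct.pack(">II", LABEL_MAGIC, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(images, labels)
```

The real files are distributed gzip-compressed (for example `train-images-idx3-ubyte.gz`), so in practice you would read the bytes with `gzip.open(path, "rb").read()` before passing them to these functions.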

UCI Machine Learning Repository

The UCI Machine Learning Repository is a public repository maintained by UC Irvine that, at the time of writing, hosts 473 datasets. The datasets span a wide variety of domains, ranging from car evaluation to bank marketing. As a result, it is a great place for aspiring data scientists to find datasets: beginners and more advanced data scientists alike can find datasets that fit their skills, and the breadth of domains allows for many different projects.
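Many datasets in the repository ship as a plain comma-separated data file with no header row, with the column names described in separate documentation. A minimal sketch of attaching those names while reading, using the car evaluation dataset as the example, might look like the following. The helper name is hypothetical, and the three rows are illustrative values written in the file's layout, not a verified excerpt of the real data.

```python
import csv
import io
from collections import Counter

# Column names as described in the car evaluation dataset's documentation;
# the data file itself has no header row.
CAR_COLUMNS = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# Illustrative rows in the layout of `car.data` (assumed values, not the real file).
CAR_EXCERPT = """\
vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
low,low,4,4,big,high,vgood
"""


def read_headerless_csv(text, columns):
    """Parse header-less CSV text into a list of dicts keyed by the given column names."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns)
    return list(reader)


cars = read_headerless_csv(CAR_EXCERPT, CAR_COLUMNS)
# Count how many rows fall into each acceptability class.
class_counts = Counter(row["class"] for row in cars)
print(class_counts)
```

Checking the class distribution like this is a reasonable first step with any new dataset, since heavily imbalanced classes change which models and metrics make sense.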

Twitter API

Another data source we will be discussing is Twitter’s API. Strictly speaking, it is an interface rather than a dataset: it provides access to several sources of data, such as tweets and account activity. It can be limiting, because some of the more comprehensive data plans are premium and cost money. Regardless, even Twitter’s free plans provide extensive and varied data that Twitter has made easy to navigate, so you can collect large amounts of data with a broad spectrum of possible applications. However, using the API requires significantly more skill than the first three examples, since you have to write code to request the data and then clean it yourself.


Kaggle

The final and perhaps most crucial resource for datasets that we will be discussing is Kaggle. Kaggle is a website where people can upload and share datasets, and where organizers host competitions built around specific tasks. When a competition closes, the submissions are ranked, and the winner can sometimes earn substantial prize money. As a result, the website is a haven for many data scientists: it provides a great environment to learn and practice data science skills on real datasets, and it sometimes comes with monetary compensation if you win a competition.


In conclusion, data is critical for any data science project, and there are many great places to find it. The resources in this post are a great start and can be used to practice and sharpen your skills as a data scientist or for your own projects. However, they are only a few spots in an ocean of public data that ranges from government data to data published by corporations. Just remember: if you’re not satisfied with the data from these sources, there is a lot more data beyond the scope of this blog post.