If you are a programmer, you have probably heard of Docker, and maybe you have already used it. In simple terms, Docker is an open-source product that automates the deployment of applications inside software containers. With Docker, you can package your application together with everything it needs, including libraries and other dependencies, and be confident that the resulting package will run on any Linux or Windows machine with Docker installed. Docker provides many benefits, such as version control, isolation, and easier continuous integration. That was a brief introduction to Docker; now, let's talk about the real problem.
These days, many companies leverage data science to analyze problems, understand them better, and make important decisions. These data science jobs can involve large amounts of data that must be prepared before any analysis can begin. Data preparation is time consuming: according to various studies, data scientists spend about 80% of their time on it alone. Data is often scattered, duplicated, and not in a form that saves a data scientist's time. Another challenge is maintaining versions of the data, both to track changes and to ensure that nothing gets lost.
To address most of these data problems for data scientists and businesses, much as Docker does for software applications, we would like to introduce you to Quilt Data, our latest portfolio startup from Y Combinator's Summer 2017 batch.
Quilt Data treats your data like code and lets you create reusable data packages from your files and folders. Every data package is versioned by Quilt Data, and you can review a package's complete version history. If you lose some data, you can easily roll back to a previous version. This version control helps customers generate consistent, reproducible results from their data.
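To see why versioned data packages make rollback trivial, here is a minimal sketch of content-addressed versioning in plain Python. This illustrates the general idea only; it is not Quilt Data's actual implementation, and the `VersionedPackage` class and its methods are hypothetical names for this example.

```python
import hashlib
import json

class VersionedPackage:
    """Toy sketch of content-addressed versioning: each snapshot of a
    data package is identified by a hash of its contents, so any earlier
    version can be recovered by its hash. Not Quilt's real internals."""

    def __init__(self):
        self.versions = {}   # hash -> snapshot
        self.history = []    # ordered list of committed hashes

    def commit(self, data):
        """Store a snapshot and return its content hash."""
        blob = json.dumps(data, sort_keys=True).encode()
        digest = hashlib.sha256(blob).hexdigest()
        self.versions[digest] = data
        self.history.append(digest)
        return digest

    def checkout(self, digest):
        """Roll back to any previously committed snapshot by its hash."""
        return self.versions[digest]

pkg = VersionedPackage()
v1 = pkg.commit({"rows": [1, 2, 3]})
v2 = pkg.commit({"rows": [1, 2, 3, 4]})

# The old version is still available, even after new commits.
assert pkg.checkout(v1) == {"rows": [1, 2, 3]}
```

Because the identifier is derived from the content itself, two identical snapshots always map to the same version, which is what makes results reproducible across a team.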
Moreover, Quilt Data stores data packages internally in Apache Parquet, a columnar storage format, which improves I/O performance and speeds up data analysis. Tools like Presto DB and Hive also benefit from the Parquet format and run even faster on it. Quilt Data also integrates deeply with Python: developers can import Quilt data packages directly, much as they use the pip command to install and manage Python packages.
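The performance benefit of a columnar format like Parquet comes from layout: an aggregate over one column only has to touch that column's values, instead of walking every field of every record. The following toy sketch contrasts the two layouts in plain Python; it illustrates the concept only, not Parquet's actual on-disk encoding or Quilt's internals, and the sample data is invented for the example.

```python
# Row-oriented layout: each record holds all fields together.
rows = [
    {"id": 1, "name": "a", "price": 10},
    {"id": 2, "name": "b", "price": 12},
    {"id": 3, "name": "c", "price": 9},
]

# Scanning one column in a row store still walks every record.
total_row = sum(r["price"] for r in rows)

# Column-oriented layout: each column is stored contiguously, so an
# aggregate reads only the one column it needs.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "price": [10, 12, 9],
}
total_col = sum(columns["price"])

assert total_row == total_col  # same answer, far less data touched
```

On disk, the columnar version also compresses better, since values of one type sit next to each other, which is part of why Parquet-backed tools like Presto DB and Hive scan data so quickly.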
Quilt Data offers affordable monthly pricing plans for individuals and businesses; any business interested in an on-premise service can contact Quilt Data for details. There is also a free plan for everyone, which allows unlimited data packages at no charge, provided those packages are public. By letting everyone create unlimited public data packages, Quilt Data encourages its users to build and share more data with each other.
On the corporate side, Quilt Data's technology also helps businesses integrate their data sources, keeping everyone in the company on the same page, and many businesses, including some of the largest banks, have already adopted it. You should give it a try, too.