Dask – A Better Way to Work with Large CSV Files in Python

In a recent post, I shared an approach I use when I have very large CSV files (and other file types) that are too large to load into memory. While the approach I previously highlighted works well, it can be tedious to first load data into SQLite (or any other database) and then access that database to analyze the data. I just found a better approach using Dask.

While looking around the web to learn about some parallel processing capabilities, I ran across Dask, which describes itself as:

…is a flexible parallel computing library for analytic computing.

When I saw that, I was intrigued. There's a lot packed into that statement, and I've got plans to introduce Dask into my various tool sets for data analytics.

While reading the docs, I ran across the 'dataframe' concept and immediately knew I'd found a new tool for working with large CSV files. With Dask's dataframe concept, you can do out-of-core analysis (e.g., analyze the data in a CSV without loading the entire file into memory). Beyond out-of-core manipulation, dask's dataframe uses the pandas API, which makes things extremely easy for those of us who use and love pandas.

With Dask and its dataframe construct, you set up the dataframe much like you would in pandas, but rather than loading the data into pandas, this approach keeps the dataframe as a sort of 'pointer' to the data file and doesn't load anything until you specifically tell it to do so.
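
To make the laziness concrete, here is a minimal sketch (it assumes the same 311_Service_Requests.csv file used in the walkthrough below): almost nothing is read from disk (only a small sample for schema inference) until compute() is called.

import dask.dataframe as dd

# nothing is loaded here: dask samples the file to infer a schema
# and records where the data lives
df = dd.read_csv('311_Service_Requests.csv', dtype='str')

# still lazy: filtering just extends the task graph
hpd = df[df.Agency == 'HPD']

# only this call actually reads the CSV and does the work
result = hpd.compute()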

One note (that I always have to share): if you are planning on working with your data set over time, it's probably best to get the data into a database of some type.
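
For reference, here is a minimal sketch of that kind of one-time load, using pandas' chunked reader and the standard library's sqlite3 (the database file and table names are illustrative):

import sqlite3

import pandas as pd

# stream the CSV into SQLite in bounded-memory chunks
conn = sqlite3.connect('311.db')
for chunk in pd.read_csv('311_Service_Requests.csv', dtype='str', chunksize=100000):
    chunk.to_sql('service_requests', conn, if_exists='append', index=False)
conn.close()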

An example using Dask and the Dataframe

First, let’s get everything installed. The documentation claims that you just need to install dask, but I had to install ‘toolz’ and ‘cloudpickle’ to get dask’s dataframe to import.  To install dask and its requirements, open a terminal and type (you need pip for this):

pip install dask toolz cloudpickle

Now, let's write some code to load the CSV data and start analyzing it. For this example, I'm using the 311 Service Requests dataset, which you can download from NYC Open Data.

Set up your dataframe so you can analyze the 311_Service_Requests.csv file. This file is assumed to be stored in the directory that you are working in.

import dask.dataframe as dd

filename = '311_Service_Requests.csv'
df = dd.read_csv(filename, dtype='str')

Unlike pandas, the data isn't read into memory; we've just set the dataframe up so it's ready to run computations on the data in the CSV file using familiar pandas functions. Note: I used dtype='str' in read_csv to get around some strange formatting issues in this particular file.
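
If the blanket dtype='str' feels too heavy-handed, read_csv also accepts a per-column mapping, so you can pin down only the columns that trigger type-inference problems. A sketch, where 'Incident Zip' is an assumed column name for this dataset:

# 'Incident Zip' is an assumed column name; adjust to your file's header
df = dd.read_csv(filename, dtype={'Incident Zip': 'object'})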

Let's take a look at the first few rows of the file using pandas' head() call. When you run this, dask reads the first X rows (however many you ask for with head(X)) and displays them.

df.head()

Note: a small subset of the columns is shown below for simplicity.

Unique Key   Created Date             Closed Date              Agency
25513481     05/09/2013 12:00:00 AM   05/14/2013 12:00:00 AM   HPD
25513482     05/09/2013 12:00:00 AM   05/13/2013 12:00:00 AM   HPD
25513483     05/09/2013 12:00:00 AM   05/22/2013 12:00:00 AM   HPD
25513484     05/09/2013 12:00:00 AM   05/12/2013 12:00:00 AM   HPD
25513485     05/09/2013 12:00:00 AM   05/11/2013 12:00:00 AM   HPD

We see that there are some spaces in the column names. Let's remove those spaces to make things easier to work with.

df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})

The cool thing about dask is that you can do things like renaming columns without loading all the data into memory.
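
Because the rename only touches the dataframe's metadata, you can verify it immediately without any real I/O. A quick sketch (the printed names assume the rename above succeeded):

# column names live in the schema, so this reads no data
print(df.columns[:4])
# e.g. Index(['UniqueKey', 'CreatedDate', 'ClosedDate', 'Agency'], dtype='object')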

There's a column in this data called 'Descriptor' that holds the problem types, and 'RADIATOR' is one of those problem types. Let's take a look at how many service requests were filed because of some problem with a radiator. To do this, you can filter the dataframe using standard pandas filtering (see below) to create a new dataframe.

# create a new dataframe with only 'RADIATOR' service calls
radiator_df = df[df.Descriptor == 'RADIATOR']

Let's see how many rows we have using the 'count' command:

radiator_df.Descriptor.count()

You'll notice that when you run the above command, you don't actually get the count returned. Instead, you get back a lazy placeholder that prints as something like "dd.Scalar<series-…, dtype=int64>".

To actually compute the count, you have to call “compute” to get dask to run through the dataframe and count the number of records.

radiator_df.compute()

When you run this command, you should get something like the following:

[52077 rows x 52 columns]
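
If you only want the number itself rather than the materialized dataframe, you can compute the lazy count directly. A small sketch; the printed value is the count this dataset produced above:

# computing the scalar returns a plain integer instead of the whole frame
num_radiator_calls = radiator_df.Descriptor.count().compute()
print(num_radiator_calls)  # 52077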

The above are just a few samples of using dask's dataframe construct. Remember, we built a new dataframe using pandas' filters without loading the entire original data set into memory. It may not seem like much, but when working with a 7GB+ file, you can save a great deal of time and effort using dask compared to the approach I highlighted previously.
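
The same pattern extends to much of the pandas API. For instance, a hedged sketch of an out-of-core groupby, using the UniqueKey column we renamed earlier:

# count requests per agency, then materialize only the small result
top_agencies = df.groupby('Agency')['UniqueKey'].count().nlargest(5).compute()
print(top_agencies)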

Dask seems to have a ton of other great features that I’ll be diving into at some point in the near future, but for now, the dataframe construct has been an awesome find.
