
In [324]: df = pd.read_html(
   .....:     html_table,
   .....:     extract_links="all",
   .....: )[0]
   .....: 

In [325]: df
Out[325]: 
                                 (GitHub, None)
0  (pandas, //github.com/pandas-dev/pandas)

In [326]: df[("GitHub", None)]
Out[326]: 
0    (pandas, //github.com/pandas-dev/pandas)
Name: (GitHub, None), dtype: object

In [327]: df[("GitHub", None)].str[1]
Out[327]: 
0    //github.com/pandas-dev/pandas
Name: (GitHub, None), dtype: object

New in version 1.5.0.

Writing to HTML files#

DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML table. The function arguments are as in the method to_string described above.

Note

Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See to_html() for the full set of options.

Note

In an HTML-rendering supported environment like a Jupyter Notebook, display(HTML(...)) will render the raw HTML into the environment.

In [328]: from IPython.display import display, HTML

In [329]: df = pd.DataFrame(np.random.randn(2, 2))

In [330]: df
Out[330]: 
          0         1
0  0.070319  1.773907
1  0.253908  0.414581

In [331]: html = df.to_html()

In [332]: print(html)  # raw html

          0         1
0  0.070319  1.773907
1  0.253908  0.414581

In [333]: display(HTML(html))

The columns argument will limit the columns shown:

In [334]: html = df.to_html(columns=[0])

In [335]: print(html)

          0
0  0.070319
1  0.253908

In [336]: display(HTML(html))

float_format takes a Python callable to control the precision of floating point values:

In [337]: html = df.to_html(float_format="{0:.10f}".format)

In [338]: print(html)

               0             1
0  0.0703192665  1.7739074228
1  0.2539083433  0.4145805920

In [339]: display(HTML(html))

bold_rows will make the row labels bold by default, but you can turn that off:

In [340]: html = df.to_html(bold_rows=False)

In [341]: print(html)

          0         1
0  0.070319  1.773907
1  0.253908  0.414581

In [342]: display(HTML(html))

The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes are appended to the existing 'dataframe' class.

In [343]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))

          0         1
0  0.070319  1.773907
1  0.253908  0.414581

The render_links argument provides the ability to add hyperlinks to cells that contain URLs.

In [344]: url_df = pd.DataFrame(
   .....:     {
   .....:         "name": ["Python", "pandas"],
   .....:         "url": ["//www.python.org/", "//pandas.pydata.org"],
   .....:     }
   .....: )
   .....: 

In [345]: html = url_df.to_html(render_links=True)

In [346]: print(html)

     name                  url
0  Python    //www.python.org/
1  pandas  //pandas.pydata.org

In [347]: display(HTML(html))

Finally, the escape argument allows you to control whether the "<", ">" and "&" characters are escaped in the resulting HTML (by default it is True). So to get the HTML without escaped characters pass escape=False.

In [348]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})

Escaped:

In [349]: html = df.to_html()

In [350]: print(html)

   a         b
0  &  0.842321
1  <  0.211337
2  > -1.055427

In [351]: display(HTML(html))

Not escaped:

In [352]: html = df.to_html(escape=False)

In [353]: print(html)

   a         b
0  &  0.842321
1  <  0.211337
2  > -1.055427

In [354]: display(HTML(html))

Note

Some browsers may not show a difference in the rendering of the previous two HTML tables.

HTML Table Parsing Gotchas#

There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas io function read_html.

Issues with lxml

  • Benefits

    • lxml is very fast.

    • lxml requires Cython to install correctly.

  • Drawbacks

    • lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.

    • In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend will use html5lib if lxml fails to parse.

    • It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will still get a valid result [provided everything else is valid] even if lxml fails.

Issues with BeautifulSoup4 using lxml as a backend

  • The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend.

Issues with BeautifulSoup4 using html5lib as a backend

  • Benefits

    • html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way rather than just, e.g., dropping an element without notifying you.

    • html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is “correct”, since the process of fixing markup does not have a single definition.

    • html5lib is pure Python and requires no additional build steps beyond its own installation.

  • Drawbacks

    • The biggest drawback to using html5lib is that it is slow as molasses. However consider the fact that many tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO [input-output]. For very large tables, this might not be true.

LaTeX#

New in version 1.3.0.

Currently there are no methods to read from LaTeX, only output methods.

Writing to LaTeX files#

Note

DataFrame and Styler objects currently have a to_latex method. We recommend using the Styler.to_latex() method over DataFrame.to_latex() due to the former’s greater flexibility with conditional styling, and the latter’s possible future deprecation.

Review the documentation for Styler.to_latex, which gives examples of conditional styling and explains the operation of its keyword arguments.

For simple application the following pattern is sufficient.

In [355]: df = pd.DataFrame([[1, 2], [3, 4]], index=["a", "b"], columns=["c", "d"])

In [356]: print(df.style.to_latex())
\begin{tabular}{lrr}
 & c & d \\
a & 1 & 2 \\
b & 3 & 4 \\
\end{tabular}

To format values before output, chain the Styler.format method.

In [357]: print(df.style.format("€ {}").to_latex())
\begin{tabular}{lrr}
 & c & d \\
a & € 1 & € 2 \\
b & € 3 & € 4 \\
\end{tabular}

XML#

Reading XML#

New in version 1.3.0.

The top-level read_xml() function can accept an XML string/file/URL and will parse nodes and attributes into a pandas DataFrame.

Note

Since there is no standard XML structure where design types can vary in many ways, read_xml works best with flatter, shallow versions. If an XML document is deeply nested, use the stylesheet feature to transform XML into a flatter version.

Let’s look at a few examples.

Read an XML string:

In [358]: xml = """ .....: .....: .....: Everyday Italian .....: Giada De Laurentiis .....: 2005 .....: 30.00 .....: .....: .....: Harry Potter .....: J K. Rowling .....: 2005 .....: 29.99 .....: .....: .....: Learning XML .....: Erik T. Ray .....: 2003 .....: 39.95 .....: .....: """ .....: In [359]: df = pd.read_xml[xml] In [360]: df Out[360]: category title author year price 0 cooking Everyday Italian Giada De Laurentiis 2005 30.00 1 children Harry Potter J K. Rowling 2005 29.99 2 web Learning XML Erik T. Ray 2003 39.95

Read a URL with no options:

In [361]: df = pd.read_xml["//www.w3schools.com/xml/books.xml"] In [362]: df Out[362]: category title author year price cover 0 cooking Everyday Italian Giada De Laurentiis 2005 30.00 None 1 children Harry Potter J K. Rowling 2005 29.99 None 2 web XQuery Kick Start Vaidyanathan Nagarajan 2003 49.99 None 3 web Learning XML Erik T. Ray 2003 39.95 paperback

Read in the content of the “books.xml” file and pass it to read_xml as a string:

In [363]: file_path = "books.xml" In [364]: with open[file_path, "w"] as f: .....: f.write[xml] .....: In [365]: with open[file_path, "r"] as f: .....: df = pd.read_xml[f.read[]] .....: In [366]: df Out[366]: category title author year price 0 cooking Everyday Italian Giada De Laurentiis 2005 30.00 1 children Harry Potter J K. Rowling 2005 29.99 2 web Learning XML Erik T. Ray 2003 39.95

Read in the content of the “books.xml” as instance of StringIO or BytesIO and pass it to read_xml:

In [367]: with open(file_path, "r") as f:
   .....:     sio = StringIO(f.read())
   .....: 

In [368]: df = pd.read_xml(sio)

In [369]: df
Out[369]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

In [370]: with open(file_path, "rb") as f:
   .....:     bio = BytesIO(f.read())
   .....: 

In [371]: df = pd.read_xml(bio)

In [372]: df
Out[372]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99
2       web      Learning XML          Erik T. Ray  2003  39.95

You can even read XML from AWS S3 buckets such as the NIH NCBI PMC Article Datasets providing Biomedical and Life Science Journals:

In [373]: df = pd.read_xml(
   .....:     "s3://pmc-oa-opendata/oa_comm/xml/all/PMC1236943.xml",
   .....:     xpath=".//journal-meta",
   .....: )
   .....: 

In [374]: df
Out[374]: 
              journal-id              journal-title       issn  publisher
0  Cardiovasc Ultrasound  Cardiovascular Ultrasound  1476-7120        NaN

With lxml as the default parser, you access the full-featured XML library that extends Python’s ElementTree API. One powerful tool is the ability to query nodes selectively or conditionally with more expressive XPath:

In [375]: df = pd.read_xml(file_path, xpath="//book[year=2005]")

In [376]: df
Out[376]: 
   category             title               author  year  price
0   cooking  Everyday Italian  Giada De Laurentiis  2005  30.00
1  children      Harry Potter         J K. Rowling  2005  29.99

Specify only elements or only attributes to parse:

In [377]: df = pd.read_xml(file_path, elems_only=True)

In [378]: df
Out[378]: 
              title               author  year  price
0  Everyday Italian  Giada De Laurentiis  2005  30.00
1      Harry Potter         J K. Rowling  2005  29.99
2      Learning XML          Erik T. Ray  2003  39.95

In [379]: df = pd.read_xml(file_path, attrs_only=True)

In [380]: df
Out[380]: 
   category
0   cooking
1  children
2       web

XML documents can have namespaces with prefixes and default namespaces without prefixes, both of which are denoted with the special attribute xmlns. In order to parse by node under a namespace context, xpath must reference a prefix.

For example, the XML below contains a namespace with the prefix doc and URI at //example.com. In order to parse doc:row nodes, namespaces must be used.

In [381]: xml = """ .....: .....: .....: square .....: 360 .....: 4.0 .....: .....: .....: circle .....: 360 .....: .....: .....: .....: triangle .....: 180 .....: 3.0 .....: .....: """ .....: In [382]: df = pd.read_xml[xml, .....: xpath="//doc:row", .....: namespaces={"doc": "//example.com"}] .....: In [383]: df Out[383]: shape degrees sides 0 square 360 4.0 1 circle 360 NaN 2 triangle 180 3.0

Similarly, an XML document can have a default namespace without a prefix. Failing to assign a temporary prefix will return no nodes and raise a ValueError. But assigning any temporary name to the correct URI allows parsing by nodes.

In [384]: xml = """ .....: .....: .....: square .....: 360 .....: 4.0 .....: .....: .....: circle .....: 360 .....: .....: .....: .....: triangle .....: 180 .....: 3.0 .....: .....: """ .....: In [385]: df = pd.read_xml[xml, .....: xpath="//pandas:row", .....: namespaces={"pandas": "//example.com"}] .....: In [386]: df Out[386]: shape degrees sides 0 square 360 4.0 1 circle 360 NaN 2 triangle 180 3.0

However, if XPath does not reference node names such as default, /*, then namespaces is not required.

With lxml as the parser, you can flatten nested XML documents with an XSLT script, which can also be a string/file/URL type. As background, XSLT is a special-purpose language written in a special XML file that can transform original XML documents into other XML, HTML, or even text (CSV, JSON, etc.) using an XSLT processor.

For example, consider this somewhat nested structure of Chicago “L” Rides where station and rides elements encapsulate data in their own sections. With the XSLT below, lxml can transform the original nested document into a flatter output (as shown below for demonstration) for easier parsing into a DataFrame:

In [387]: xml = """ .....: .....: .....: .....: 2020-09-01T00:00:00 .....: .....: 864.2 .....: 534 .....: 417.2 .....: .....: .....: .....: .....: 2020-09-01T00:00:00 .....: .....: 2707.4 .....: 1909.8 .....: 1438.6 .....: .....: .....: .....: .....: 2020-09-01T00:00:00 .....: .....: 2949.6 .....: 1657 .....: 1453.8 .....: .....: .....: """ .....: In [388]: xsl = """ .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: """ .....: In [389]: output = """ .....: .....: .....: 40850 .....: Library .....: 2020-09-01T00:00:00 .....: 864.2 .....: 534 .....: 417.2 .....: .....: .....: 41700 .....: Washington/Wabash .....: 2020-09-01T00:00:00 .....: 2707.4 .....: 1909.8 .....: 1438.6 .....: .....: .....: 40380 .....: Clark/Lake .....: 2020-09-01T00:00:00 .....: 2949.6 .....: 1657 .....: 1453.8 .....: .....: """ .....: In [390]: df = pd.read_xml[xml, stylesheet=xsl] In [391]: df Out[391]: station_id station_name ... avg_saturday_rides avg_sunday_holiday_rides 0 40850 Library ... 534.0 417.2 1 41700 Washington/Wabash ... 1909.8 1438.6 2 40380 Clark/Lake ... 1657.0 1453.8 [3 rows x 6 columns]

For very large XML files that can range from hundreds of megabytes to gigabytes, pandas.read_xml() supports parsing such sizeable files using lxml’s iterparse and etree’s iterparse, which are memory-efficient methods to iterate through an XML tree and extract specific elements and attributes without holding the entire tree in memory.

To use this feature, you must pass a physical XML file path into read_xml and use the iterparse argument. Files should not be compressed or point to online sources but be stored on local disk. Also, iterparse should be a dictionary where the key is the repeating node in the document (which becomes the rows) and the value is a list of any element or attribute that is a descendant (i.e., child, grandchild) of the repeating node. Since XPath is not used in this method, descendants do not need to share the same relationship with one another. Below is an example of reading in Wikipedia’s very large (12 GB+) latest article data dump.

In [1]: df = pd.read_xml(
...         "/path/to/downloaded/enwikisource-latest-pages-articles.xml",
...         iterparse = {"page": ["title", "ns", "id"]}
...     )
...     df
Out[2]: 
                                                      title   ns        id
0                                        Gettysburg Address    0     21450
1                                                 Main Page    0     42950
2                             Declaration by United Nations    0      8435
3              Constitution of the United States of America    0      8435
4                      Declaration of Independence (Israel)    0     17858
...                                                     ...  ...       ...
3578760                Page:Black cat 1897 07 v2 n10.pdf/17  104    219649
3578761                Page:Black cat 1897 07 v2 n10.pdf/43  104    219649
3578762                Page:Black cat 1897 07 v2 n10.pdf/44  104    219649
3578763        The History of Tom Jones, a Foundling/Book IX    0  12084291
3578764   Page:Shakespeare of Stratford (1926) Yale.djvu/91  104     21450

[3578765 rows x 3 columns]

Writing XML#

New in version 1.3.0.

DataFrame objects have an instance method to_xml which renders the contents of the DataFrame as an XML document.

Note

This method does not support special properties of XML including DTD, CData, XSD schemas, processing instructions, comments, and others. Only namespaces at the root level are supported. However, stylesheet allows design changes after initial output.

Let’s look at a few examples.

Write an XML without options:

In [392]: geom_df = pd.DataFrame(
   .....:     {
   .....:         "shape": ["square", "circle", "triangle"],
   .....:         "degrees": [360, 360, 180],
   .....:         "sides": [4, np.nan, 3],
   .....:     }
   .....: )
   .....: 

In [393]: print(geom_df.to_xml())
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>

Write an XML with new root and row name:

In [394]: print[geom_df.to_xml[root_name="geometry", row_name="objects"]] 0 square 360 4.0 1 circle 360 2 triangle 180 3.0

Write an attribute-centric XML:

In [395]: print(geom_df.to_xml(attr_cols=geom_df.columns.tolist()))

Write a mix of elements and attributes:

In [396]: print[ .....: geom_df.to_xml[ .....: index=False, .....: attr_cols=['shape'], .....: elem_cols=['degrees', 'sides']] .....: ] .....: 360 4.0 360 180 3.0

Any DataFrames with hierarchical columns will be flattened for XML element names with levels delimited by underscores:

In [397]: ext_geom_df = pd.DataFrame[ .....: { .....: "type": ["polygon", "other", "polygon"], .....: "shape": ["square", "circle", "triangle"], .....: "degrees": [360, 360, 180], .....: "sides": [4, np.nan, 3], .....: } .....: ] .....: In [398]: pvt_df = ext_geom_df.pivot_table[index='shape', .....: columns='type', .....: values=['degrees', 'sides'], .....: aggfunc='sum'] .....: In [399]: pvt_df Out[399]: degrees sides type other polygon other polygon shape circle 360.0 NaN 0.0 NaN square NaN 360.0 NaN 4.0 triangle NaN 180.0 NaN 3.0 In [400]: print[pvt_df.to_xml[]] circle 360.0 0.0 square 360.0 4.0 triangle 180.0 3.0

Write an XML with default namespace:

In [401]: print[geom_df.to_xml[namespaces={"": "//example.com"}]] 0 square 360 4.0 1 circle 360 2 triangle 180 3.0

Write an XML with namespace prefix:

In [402]: print[ .....: geom_df.to_xml[namespaces={"doc": "//example.com"}, .....: prefix="doc"] .....: ] .....: 0 square 360 4.0 1 circle 360 2 triangle 180 3.0

Write an XML without declaration or pretty print:

In [403]: print[ .....: geom_df.to_xml[xml_declaration=False, .....: pretty_print=False] .....: ] .....: 0square3604.01circle3602triangle1803.0

Write an XML and transform with stylesheet:

In [404]: xsl = """ .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: .....: polygon .....: .....: .....: .....: .....: .....: .....: .....: """ .....: In [405]: print[geom_df.to_xml[stylesheet=xsl]] square 360 4.0 circle 360 triangle 180 3.0

XML Final Notes#

  • All XML documents adhere to W3C specifications. Both etree and lxml parsers will fail to parse any markup document that is not well-formed or follows XML syntax rules. Do be aware HTML is not an XML document unless it follows XHTML specs. However, other popular markup types including KML, XAML, RSS, MusicML, MathML are compliant XML schemas.

  • For the above reason, if your application builds XML prior to pandas operations, use appropriate DOM libraries like etree and lxml to build the necessary document rather than string concatenation or regex adjustments. Always remember XML is a special text file with markup rules.

  • With very large XML files (several hundred MBs to GBs), XPath and XSLT can become memory-intensive operations. Be sure to have enough available RAM for reading and writing to large XML files (roughly about 5 times the size of the text).

  • Because XSLT is a programming language, use it with caution since such scripts can pose a security risk in your environment and can run large or infinite recursive operations. Always test scripts on small fragments before full run.

  • The etree parser supports all functionality of both read_xml and to_xml except for complex XPath and any XSLT. Though limited in features, etree is still a reliable and capable parser and tree builder. Its performance may trail lxml to a certain degree for larger files but is relatively unnoticeable on small to medium size files. See the sketch after this list.
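As a minimal sketch (reusing the books.xml file and geom_df created in the examples above), the parser argument switches both read_xml and to_xml to the standard-library etree backend:

# Parse with Python's built-in ElementTree instead of lxml; only simple
# XPath expressions are supported with this parser.
df = pd.read_xml(file_path, parser="etree")

# to_xml accepts the same argument for writing.
xml_out = geom_df.to_xml(parser="etree")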

Excel files#

The read_excel() method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003 (.xls) files can be read using xlrd. Binary Excel (.xlsb) files can be read using pyxlsb. The to_excel() instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv data. See the cookbook for some advanced strategies.

Warning

The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is now only for reading old-style .xls files.

Before pandas 1.3.0, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. pandas will now default to using the openpyxl engine.

It is strongly encouraged to install openpyxl to read Excel 2007+ [.xlsx] files. Please do not report issues when using ``xlrd`` to read ``.xlsx`` files. This is no longer supported, switch to using openpyxl instead.

Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl engine instead.

Reading Excel files#

In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which sheet to parse.

# Returns a DataFrame pd.read_excel["path_to_file.xls", sheet_name="Sheet1"]

ExcelFile class#

To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and can be passed into read_excel. There will be a performance benefit for reading multiple sheets as the file is read into memory only once.

xlsx = pd.ExcelFile("path_to_file.xls")
df = pd.read_excel(xlsx, "Sheet1")

The ExcelFile class can also be used as a context manager.

with pd.ExcelFile("path_to_file.xls") as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")

The sheet_names property will generate a list of the sheet names in the file.
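For example, to inspect the available sheets before deciding what to parse (a rough sketch using the same placeholder file name as above):

with pd.ExcelFile("path_to_file.xls") as xls:
    # List the sheet names contained in the workbook.
    print(xls.sheet_names)  # e.g. ['Sheet1', 'Sheet2']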

The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:

data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)

Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to read_excel with no loss in performance.

# using the ExcelFile class
data = {}
with pd.ExcelFile("path_to_file.xls") as xls:
    data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
    data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=None, na_values=["NA"])

# equivalent using the read_excel function
data = pd.read_excel(
    "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"]
)

ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control how the Excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook() with on_demand=True.

import xlrd

xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")

Specifying sheets#

Note

The second argument is sheet_name, not to be confused with ExcelFile.sheet_names.

Note

An ExcelFile’s attribute sheet_names provides access to a list of sheets.

  • The argument sheet_name allows specifying the sheet or sheets to read.

  • The default value for sheet_name is 0, indicating to read the first sheet.

  • Pass a string to refer to the name of a particular sheet in the workbook.

  • Pass an integer to refer to the index of a sheet. Indices follow Python convention, beginning at 0.

  • Pass a list of either strings or integers, to return a dictionary of specified sheets.

  • Pass None to return a dictionary of all available sheets.

# Returns a DataFrame pd.read_excel["path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"]]

Using the sheet index:

# Returns a DataFrame pd.read_excel["path_to_file.xls", 0, index_col=None, na_values=["NA"]]

Using all default values:

# Returns a DataFrame pd.read_excel["path_to_file.xls"]

Using None to get all sheets:

# Returns a dictionary of DataFrames
pd.read_excel("path_to_file.xls", sheet_name=None)

Using a list to get multiple sheets:

# Returns the 1st and 4th sheet, as a dictionary of DataFrames.
pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3])

read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of sheet positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string, respectively.

Reading a MultiIndex#

read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex column by passing a list of rows to header. If either the index or columns have serialized level names those will be read in as well by specifying the rows/columns that make up the levels.

For example, to read in a MultiIndex index without names:

In [406]: df = pd.DataFrame[ .....: {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, .....: index=pd.MultiIndex.from_product[[["a", "b"], ["c", "d"]]], .....: ] .....: In [407]: df.to_excel["path_to_file.xlsx"] In [408]: df = pd.read_excel["path_to_file.xlsx", index_col=[0, 1]] In [409]: df Out[409]: a b a c 1 5 d 2 6 b c 3 7 d 4 8

If the index has level names, they will be parsed as well, using the same parameters.

In [410]: df.index = df.index.set_names[["lvl1", "lvl2"]] In [411]: df.to_excel["path_to_file.xlsx"] In [412]: df = pd.read_excel["path_to_file.xlsx", index_col=[0, 1]] In [413]: df Out[413]: a b lvl1 lvl2 a c 1 5 d 2 6 b c 3 7 d 4 8

If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col and header:

In [414]: df.columns = pd.MultiIndex.from_product[[["a"], ["b", "d"]], names=["c1", "c2"]] In [415]: df.to_excel["path_to_file.xlsx"] In [416]: df = pd.read_excel["path_to_file.xlsx", index_col=[0, 1], header=[0, 1]] In [417]: df Out[417]: c1 a c2 b d lvl1 lvl2 a c 1 5 d 2 6 b c 3 7 d 4 8

Missing values in columns specified in index_col will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values use set_index after reading the data instead of index_col.
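A minimal sketch of that alternative, assuming a sheet laid out like the lvl1/lvl2 example above with a single header row:

# Read the level columns as ordinary columns (no index_col), then build
# the MultiIndex explicitly so missing level cells are not forward filled.
df = pd.read_excel("path_to_file.xlsx")   # columns: lvl1, lvl2, a, b
df = df.set_index(["lvl1", "lvl2"])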

Parsing specific columns#

It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read in those columns. read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.

Changed in version 1.0.0.

Passing in an integer for usecols will no longer work. Please pass in a list of ints from 0 to usecols inclusive instead.

You can specify a comma-delimited set of Excel columns and ranges as a string:

pd.read_excel["path_to_file.xls", "Sheet1", usecols="A,C:E"]

If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.

pd.read_excel["path_to_file.xls", "Sheet1", usecols=[0, 2, 3]]

Element order is ignored, so usecols=[0, 1] is the same as [1, 0].

If usecols is a list of strings, it is assumed that each string corresponds to a column name provided either by the user in names or inferred from the document header row(s). Those strings define which columns will be parsed:

pd.read_excel["path_to_file.xls", "Sheet1", usecols=["foo", "bar"]]

Element order is ignored, so usecols=['baz', 'joe'] is the same as ['joe', 'baz'].

If usecols is callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.

pd.read_excel["path_to_file.xls", "Sheet1", usecols=lambda x: x.isalpha[]]

Parsing dates#

Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel file. But if you have a column of strings that look like dates [but are not actually formatted as dates in excel], you can use the parse_dates keyword to parse those strings to datetimes:

pd.read_excel["path_to_file.xls", "Sheet1", parse_dates=["date_strings"]]

Cell converters#

It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column to boolean:

pd.read_excel["path_to_file.xls", "Sheet1", converters={"MyBools": bool}]

This option handles missing values and treats exceptions in the converters as missing data. Transformations are applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float. You can manually mask missing data to recover integer dtype:

def cfun(x):
    return int(x) if x else -1


pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun})

Dtype specifications#

As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.

pd.read_excel["path_to_file.xls", dtype={"MyInts": "int64", "MyText": str}]

Writing Excel files#

Writing Excel files to disk#

To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments are largely the same as to_csv described above, the first argument being the name of the excel file, and the optional second argument the name of the sheet to which the DataFrame should be written. For example:

df.to_excel["path_to_file.xlsx", sheet_name="Sheet1"]

Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using xlsxwriter (if available) or openpyxl.

The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed in the second row instead of the first. You can place it in the first row by setting the merge_cells option in to_excel[] to False:

df.to_excel["path_to_file.xlsx", index_label="label", merge_cells=False]

In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.

with pd.ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

Writing Excel files to memory#

pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.

from io import BytesIO

bio = BytesIO()

# By setting the 'engine' in the ExcelWriter constructor.
writer = pd.ExcelWriter(bio, engine="xlsxwriter")
df.to_excel(writer, sheet_name="Sheet1")

# Save the workbook
writer.save()

# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()

Note

engine is optional but recommended. Setting the engine determines the version of workbook produced. Setting engine='xlrd' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl' or 'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook is produced.

Excel writer engines#

Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed from a future version of pandas. This is the only engine in pandas that supports writing to .xls files.

pandas chooses an Excel writer via two methods:

  1. the engine keyword argument

  2. the filename extension [via the default specified in config options]

By default, pandas uses the XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have multiple engines installed, you can set the default engine through setting the config options io.excel.xlsx.writer and io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not available.

To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to ExcelWriter. The built-in engines are:

  • openpyxl: version 2.4 or higher is required

  • xlsxwriter

  • xlwt

# By setting the 'engine' in the DataFrame 'to_excel()' methods.
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1", engine="xlsxwriter")

# By setting the 'engine' in the ExcelWriter constructor.
writer = pd.ExcelWriter("path_to_file.xlsx", engine="xlsxwriter")

# Or via pandas configuration.
from pandas import options  # noqa: E402

options.io.excel.xlsx.writer = "xlsxwriter"

df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")

Style and formatting#

The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the DataFrame’s to_excel method.

  • float_format : Format string for floating point numbers [default None].

  • freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to freeze. Each of these parameters is one-based, so [1, 1] will freeze the first row and first column [default None].

Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created with the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here: //xlsxwriter.readthedocs.io/working_with_pandas.html
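A minimal sketch combining the two parameters above (the file name is the same placeholder used elsewhere in this section):

# Write floats with two decimals and keep the header row and index
# column visible while scrolling; freeze_panes is one-based.
df.to_excel(
    "path_to_file.xlsx",
    sheet_name="Sheet1",
    float_format="%.2f",
    freeze_panes=(1, 1),
)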

OpenDocument Spreadsheets#

New in version 0.25.

The read_excel[] method can also read OpenDocument spreadsheets using the odfpy module. The semantics and features for reading OpenDocument spreadsheets match what can be done for Excel files using engine='odf'.

# Returns a DataFrame pd.read_excel["path_to_file.ods", engine="odf"]

Note

Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.

Binary Excel [.xlsb] files#

New in version 1.0.0.

The read_excel[] method can also read binary Excel files using the pyxlsb module. The semantics and features for reading binary Excel files mostly match what can be done for Excel files using engine='pyxlsb'. pyxlsb does not recognize datetime types in files and will return floats instead.

# Returns a DataFrame pd.read_excel["path_to_file.xlsb", engine="pyxlsb"]

Note

Currently pandas only supports reading binary Excel files. Writing is not implemented.

Clipboard#

A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard buffer and passes them to the read_csv method. For instance, you can copy the following text to the clipboard (CTRL-C on many operating systems):

  A B C
x 1 4 p
y 2 5 q
z 3 6 r

And then import the data directly to a DataFrame by calling:

>>> clipdf = pd.read_clipboard[] >>> clipdf A B C x 1 4 p y 2 5 q z 3 6 r

The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. Following this, you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Here we illustrate writing a DataFrame into the clipboard and reading it back.

>>> df = pd.DataFrame[ ... {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"] ... ] >>> df A B C x 1 4 p y 2 5 q z 3 6 r >>> df.to_clipboard[] >>> pd.read_clipboard[] A B C x 1 4 p y 2 5 q z 3 6 r

We can see that we got the same content back, which we had earlier written to the clipboard.

Note

You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.

Pickling#

All pandas objects are equipped with to_pickle methods which use Python’s cPickle module to save data structures to disk using the pickle format.

In [418]: df Out[418]: c1 a c2 b d lvl1 lvl2 a c 1 5 d 2 6 b c 3 7 d 4 8 In [419]: df.to_pickle["foo.pkl"]

The read_pickle function in the pandas namespace can be used to load any pickled pandas object [or any other pickled object] from file:

In [420]: pd.read_pickle["foo.pkl"] Out[420]: c1 a c2 b d lvl1 lvl2 a c 1 5 d 2 6 b c 3 7 d 4 8

Warning

read_pickle() is only guaranteed to be backwards compatible to pandas version 0.20.3.

Compressed pickle files#

read_pickle(), DataFrame.to_pickle() and Series.to_pickle() can read and write compressed pickle files. The compression types of gzip, bz2, xz, zstd are supported for reading and writing. The zip file format only supports reading and must contain only one data file to be read.

The compression type can be an explicit parameter or be inferred from the file extension. If ‘infer’, then use gzip, bz2, zip, xz, zstd if filename ends in '.gz', '.bz2', '.zip', '.xz', or '.zst', respectively.

The compression parameter can also be a dict in order to pass options to the compression protocol. It must have a 'method' key set to the name of the compression protocol, which must be one of {'zip', 'gzip', 'bz2', 'xz', 'zstd'}. All other key-value pairs are passed to the underlying compression library.

In [421]: df = pd.DataFrame[ .....: { .....: "A": np.random.randn[1000], .....: "B": "foo", .....: "C": pd.date_range["20130101", periods=1000, freq="s"], .....: } .....: ] .....: In [422]: df Out[422]: A B C 0 -0.828876 foo 2013-01-01 00:00:00 1 -0.110383 foo 2013-01-01 00:00:01 2 2.357598 foo 2013-01-01 00:00:02 3 -1.620073 foo 2013-01-01 00:00:03 4 0.440903 foo 2013-01-01 00:00:04 .. ... ... ... 995 -1.177365 foo 2013-01-01 00:16:35 996 1.236988 foo 2013-01-01 00:16:36 997 0.743946 foo 2013-01-01 00:16:37 998 -0.533097 foo 2013-01-01 00:16:38 999 -0.140850 foo 2013-01-01 00:16:39 [1000 rows x 3 columns]

Using an explicit compression type:

In [423]: df.to_pickle["data.pkl.compress", compression="gzip"] In [424]: rt = pd.read_pickle["data.pkl.compress", compression="gzip"] In [425]: rt Out[425]: A B C 0 -0.828876 foo 2013-01-01 00:00:00 1 -0.110383 foo 2013-01-01 00:00:01 2 2.357598 foo 2013-01-01 00:00:02 3 -1.620073 foo 2013-01-01 00:00:03 4 0.440903 foo 2013-01-01 00:00:04 .. ... ... ... 995 -1.177365 foo 2013-01-01 00:16:35 996 1.236988 foo 2013-01-01 00:16:36 997 0.743946 foo 2013-01-01 00:16:37 998 -0.533097 foo 2013-01-01 00:16:38 999 -0.140850 foo 2013-01-01 00:16:39 [1000 rows x 3 columns]

Inferring compression type from the extension:

In [426]: df.to_pickle["data.pkl.xz", compression="infer"] In [427]: rt = pd.read_pickle["data.pkl.xz", compression="infer"] In [428]: rt Out[428]: A B C 0 -0.828876 foo 2013-01-01 00:00:00 1 -0.110383 foo 2013-01-01 00:00:01 2 2.357598 foo 2013-01-01 00:00:02 3 -1.620073 foo 2013-01-01 00:00:03 4 0.440903 foo 2013-01-01 00:00:04 .. ... ... ... 995 -1.177365 foo 2013-01-01 00:16:35 996 1.236988 foo 2013-01-01 00:16:36 997 0.743946 foo 2013-01-01 00:16:37 998 -0.533097 foo 2013-01-01 00:16:38 999 -0.140850 foo 2013-01-01 00:16:39 [1000 rows x 3 columns]

The default is to ‘infer’:

In [429]: df.to_pickle["data.pkl.gz"] In [430]: rt = pd.read_pickle["data.pkl.gz"] In [431]: rt Out[431]: A B C 0 -0.828876 foo 2013-01-01 00:00:00 1 -0.110383 foo 2013-01-01 00:00:01 2 2.357598 foo 2013-01-01 00:00:02 3 -1.620073 foo 2013-01-01 00:00:03 4 0.440903 foo 2013-01-01 00:00:04 .. ... ... ... 995 -1.177365 foo 2013-01-01 00:16:35 996 1.236988 foo 2013-01-01 00:16:36 997 0.743946 foo 2013-01-01 00:16:37 998 -0.533097 foo 2013-01-01 00:16:38 999 -0.140850 foo 2013-01-01 00:16:39 [1000 rows x 3 columns] In [432]: df["A"].to_pickle["s1.pkl.bz2"] In [433]: rt = pd.read_pickle["s1.pkl.bz2"] In [434]: rt Out[434]: 0 -0.828876 1 -0.110383 2 2.357598 3 -1.620073 4 0.440903 ... 995 -1.177365 996 1.236988 997 0.743946 998 -0.533097 999 -0.140850 Name: A, Length: 1000, dtype: float64

Passing options to the compression protocol in order to speed up compression:

In [435]: df.to_pickle["data.pkl.gz", compression={"method": "gzip", "compresslevel": 1}]

msgpack#

pandas support for msgpack has been removed in version 1.0.0. It is recommended to use pickle instead.

Alternatively, you can also use the Arrow IPC serialization format for on-the-wire transmission of pandas objects. For documentation on pyarrow, see here.
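As a rough sketch of that route (assuming pyarrow is installed; the names here are illustrative), a DataFrame can be round-tripped through an in-memory Arrow IPC stream:

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize to the Arrow IPC stream format in memory ...
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
buf = sink.getvalue()

# ... and read it back into a DataFrame.
roundtrip = pa.ipc.open_stream(buf).read_all().to_pandas()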

HDF5 [PyTables]#

HDFStore is a dict-like object which reads and writes pandas using the high performance HDF5 format using the excellent PyTables library. See the cookbook for some advanced strategies.

Warning

pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle. Loading pickled data received from untrusted sources can be unsafe.

See: //docs.python.org/3/library/pickle.html for more.

In [436]: store = pd.HDFStore("store.h5")

In [437]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Objects can be written to the file just like adding key-value pairs to a dict:

In [438]: index = pd.date_range["1/1/2000", periods=8] In [439]: s = pd.Series[np.random.randn[5], index=["a", "b", "c", "d", "e"]] In [440]: df = pd.DataFrame[np.random.randn[8, 3], index=index, columns=["A", "B", "C"]] # store.put['s', s] is an equivalent method In [441]: store["s"] = s In [442]: store["df"] = df In [443]: store Out[443]: File path: store.h5

In a current or later Python session, you can retrieve stored objects:

# store.get['df'] is an equivalent method In [444]: store["df"] Out[444]: A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908 # dotted [attribute] access provides get as well In [445]: store.df Out[445]: A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908

Deletion of the object specified by the key:

# store.remove("df") is an equivalent method
In [446]: del store["df"]

In [447]: store
Out[447]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

Closing a Store and using a context manager:

In [448]: store.close[] In [449]: store Out[449]: File path: store.h5 In [450]: store.is_open Out[450]: False # Working with, and automatically closing the store using a context manager In [451]: with pd.HDFStore["store.h5"] as store: .....: store.keys[] .....:

Read/write API#

HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.

In [452]: df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})

In [453]: df_tl.to_hdf("store_tl.h5", "table", append=True)

In [454]: pd.read_hdf("store_tl.h5", "table", where=["index>2"])
Out[454]: 
   A  B
3  3  3
4  4  4

HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting dropna=True.

In [455]: df_with_missing = pd.DataFrame[ .....: { .....: "col1": [0, np.nan, 2], .....: "col2": [1, np.nan, np.nan], .....: } .....: ] .....: In [456]: df_with_missing Out[456]: col1 col2 0 0.0 1.0 1 NaN NaN 2 2.0 NaN In [457]: df_with_missing.to_hdf["file.h5", "df_with_missing", format="table", mode="w"] In [458]: pd.read_hdf["file.h5", "df_with_missing"] Out[458]: col1 col2 0 0.0 1.0 1 NaN NaN 2 2.0 NaN In [459]: df_with_missing.to_hdf[ .....: "file.h5", "df_with_missing", format="table", mode="w", dropna=True .....: ] .....: In [460]: pd.read_hdf["file.h5", "df_with_missing"] Out[460]: col1 col2 0 0.0 1.0 2 2.0 NaN

Fixed format#

The examples above show storing using put, which writes the HDF5 to PyTables in a fixed array format, called the fixed format. These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores. This format is specified by default when using put or to_hdf or by format='fixed' or format='f'.

Warning

A fixed format will raise a TypeError if you try to retrieve using a where:

>>> pd.DataFrame[np.random.randn[10, 2]].to_hdf["test_fixed.h5", "df"] >>> pd.read_hdf["test_fixed.h5", "df", where="index>5"] TypeError: cannot pass a where specification when reading a fixed format. this store must be selected in its entirety

Table format#

HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is specified by format='table' or format='t' to append or put or to_hdf.

This format can also be set as an option, pd.set_option("io.hdf.default_format", "table"), to enable put/append/to_hdf to store in the table format by default.

In [461]: store = pd.HDFStore["store.h5"] In [462]: df1 = df[0:4] In [463]: df2 = df[4:] # append data [creates a table automatically] In [464]: store.append["df", df1] In [465]: store.append["df", df2] In [466]: store Out[466]: File path: store.h5 # select the entire object In [467]: store.select["df"] Out[467]: A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908 # the type of stored data In [468]: store.root.df._v_attrs.pandas_type Out[468]: 'frame_table'

Note

You can also create a table by passing format='table' or format='t' to a put operation.
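For example (the key name is illustrative):

# put with format="table" creates an appendable, queryable table store
# directly, instead of the default fixed format.
store.put("df_as_table", df, format="table")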

Hierarchical keys#

Keys to a store can be specified as a string. These can be in a hierarchical path-name like format [e.g. foo/bar/bah], which will generate a hierarchy of sub-stores [or Groups in PyTables parlance]. Keys can be specified without the leading ‘/’ and are always absolute [e.g. ‘foo’ refers to ‘/foo’]. Removal operations can remove everything in the sub-store and below, so be careful.

In [469]: store.put["foo/bar/bah", df] In [470]: store.append["food/orange", df] In [471]: store.append["food/apple", df] In [472]: store Out[472]: File path: store.h5 # a list of keys are returned In [473]: store.keys[] Out[473]: ['/df', '/food/apple', '/food/orange', '/foo/bar/bah'] # remove all nodes under this level In [474]: store.remove["food"] In [475]: store Out[475]: File path: store.h5

You can walk through the group hierarchy using the walk method which will yield a tuple for each group key along with the relative keys of its contents.

In [476]: for [path, subgroups, subkeys] in store.walk[]: .....: for subgroup in subgroups: .....: print["GROUP: {}/{}".format[path, subgroup]] .....: for subkey in subkeys: .....: key = "/".join[[path, subkey]] .....: print["KEY: {}".format[key]] .....: print[store.get[key]] .....: GROUP: /foo KEY: /df A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908 GROUP: /foo/bar KEY: /foo/bar/bah A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908

Warning

Hierarchical keys cannot be retrieved as dotted [attribute] access as described above for items stored under the root node.

In [8]: store.foo.bar.bah AttributeError: 'HDFStore' object has no attribute 'foo' # you can directly access the actual PyTables node but using the root node In [9]: store.root.foo.bar.bah Out[9]: /foo/bar/bah [Group] '' children := ['block0_items' [Array], 'block0_values' [Array], 'axis0' [Array], 'axis1' [Array]]

Instead, use explicit string based keys:

In [477]: store["foo/bar/bah"] Out[477]: A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908

Storing types#

Storing mixed types in a table#

Storing mixed-dtype data is supported. Strings are stored as a fixed-width using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a ValueError.

Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string columns. Storing floats, strings, ints, bools, datetime64 are currently supported. For string columns, passing nan_rep = 'nan' to append will change the default nan representation on disk (which converts to/from np.nan); this defaults to nan.

In [478]: df_mixed = pd.DataFrame[ .....: { .....: "A": np.random.randn[8], .....: "B": np.random.randn[8], .....: "C": np.array[np.random.randn[8], dtype="float32"], .....: "string": "string", .....: "int": 1, .....: "bool": True, .....: "datetime64": pd.Timestamp["20010102"], .....: }, .....: index=list[range[8]], .....: ] .....: In [479]: df_mixed.loc[df_mixed.index[3:5], ["A", "B", "string", "datetime64"]] = np.nan In [480]: store.append["df_mixed", df_mixed, min_itemsize={"values": 50}] In [481]: df_mixed1 = store.select["df_mixed"] In [482]: df_mixed1 Out[482]: A B C string int bool datetime64 0 1.778161 -0.898283 -0.263043 string 1 True 2001-01-02 1 -0.913867 -0.218499 -0.639244 string 1 True 2001-01-02 2 -0.030004 1.408028 -0.866305 string 1 True 2001-01-02 3 NaN NaN -0.225250 NaN 1 True NaT 4 NaN NaN -0.890978 NaN 1 True NaT 5 0.081323 0.520995 -0.553839 string 1 True 2001-01-02 6 -0.268494 0.620028 -2.762875 string 1 True 2001-01-02 7 0.168016 0.159416 -1.244763 string 1 True 2001-01-02 In [483]: df_mixed1.dtypes.value_counts[] Out[483]: float64 2 float32 1 object 1 int64 1 bool 1 datetime64[ns] 1 dtype: int64 # we have provided a minimum string column size In [484]: store.root.df_mixed.table Out[484]: /df_mixed/table [Table[8,]] '' description := { "index": Int64Col[shape=[], dflt=0, pos=0], "values_block_0": Float64Col[shape=[2,], dflt=0.0, pos=1], "values_block_1": Float32Col[shape=[1,], dflt=0.0, pos=2], "values_block_2": StringCol[itemsize=50, shape=[1,], dflt=b'', pos=3], "values_block_3": Int64Col[shape=[1,], dflt=0, pos=4], "values_block_4": BoolCol[shape=[1,], dflt=False, pos=5], "values_block_5": Int64Col[shape=[1,], dflt=0, pos=6]} byteorder := 'little' chunkshape := [689,] autoindex := True colindexes := { "index": Index[6, mediumshuffle, zlib[1]].is_csi=False}

Storing MultiIndex DataFrames#

Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

In [485]: index = pd.MultiIndex[ .....: levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]], .....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]], .....: names=["foo", "bar"], .....: ] .....: In [486]: df_mi = pd.DataFrame[np.random.randn[10, 3], index=index, columns=["A", "B", "C"]] In [487]: df_mi Out[487]: A B C foo bar foo one -1.280289 0.692545 -0.536722 two 1.005707 0.296917 0.139796 three -1.083889 0.811865 1.648435 bar one -0.164377 -0.402227 1.618922 two -1.424723 -0.023232 0.948196 baz two 0.183573 0.145277 0.308146 three -1.043530 -0.708145 1.430905 qux one -0.850136 0.813949 1.508891 two -1.556154 0.187597 1.176488 three -1.246093 -0.002726 -0.444249 In [488]: store.append["df_mi", df_mi] In [489]: store.select["df_mi"] Out[489]: A B C foo bar foo one -1.280289 0.692545 -0.536722 two 1.005707 0.296917 0.139796 three -1.083889 0.811865 1.648435 bar one -0.164377 -0.402227 1.618922 two -1.424723 -0.023232 0.948196 baz two 0.183573 0.145277 0.308146 three -1.043530 -0.708145 1.430905 qux one -0.850136 0.813949 1.508891 two -1.556154 0.187597 1.176488 three -1.246093 -0.002726 -0.444249 # the levels are automatically included as data columns In [490]: store.select["df_mi", "foo=bar"] Out[490]: A B C foo bar bar one -0.164377 -0.402227 1.618922 two -1.424723 -0.023232 0.948196

Note

The index keyword is reserved and cannot be used as a level name.

Querying#

Querying a table#

select and delete operations have an optional criterion that can be specified to select/delete only a subset of the data. This allows one to have a very large on-disk table and retrieve only a portion of the data.

A query is specified using the Term class under the hood, as a boolean expression.

  • index and columns are supported indexers of DataFrames.

  • if data_columns are specified, these can be used as additional indexers.

  • level name in a MultiIndex, with default name level_0, level_1, … if not provided.

Valid comparison operators are:

=, ==, !=, >, >=, 0] & [df_dc.C > 0] & [df_dc.string == "foo"]] Out[533]: A B C string string2 2000-01-02 -1.167564 1.0 1.0 foo cool 2000-01-03 -0.131959 1.0 1.0 foo cool # we have automagically created this index and the B/C/string/string2 # columns are stored separately as ``PyTables`` columns In [534]: store.root.df_dc.table Out[534]: /df_dc/table [Table[8,]] '' description := { "index": Int64Col[shape=[], dflt=0, pos=0], "values_block_0": Float64Col[shape=[1,], dflt=0.0, pos=1], "B": Float64Col[shape=[], dflt=0.0, pos=2], "C": Float64Col[shape=[], dflt=0.0, pos=3], "string": StringCol[itemsize=3, shape=[], dflt=b'', pos=4], "string2": StringCol[itemsize=4, shape=[], dflt=b'', pos=5]} byteorder := 'little' chunkshape := [1680,] autoindex := True colindexes := { "index": Index[6, mediumshuffle, zlib[1]].is_csi=False, "B": Index[6, mediumshuffle, zlib[1]].is_csi=False, "C": Index[6, mediumshuffle, zlib[1]].is_csi=False, "string": Index[6, mediumshuffle, zlib[1]].is_csi=False, "string2": Index[6, mediumshuffle, zlib[1]].is_csi=False}

There is some performance degradation by making lots of columns into data columns, so it is up to the user to designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (of course you can simply read in the data and create a new table!).

Iterator#

You can pass iterator=True or chunksize=number_in_a_chunk to select and select_as_multiple to return an iterator on the results. The default is 50,000 rows returned in a chunk.

In [535]: for df in store.select["df", chunksize=3]: .....: print[df] .....: A B C 2000-01-01 -0.398501 -0.677311 -0.874991 2000-01-02 -1.167564 -0.593353 0.146262 2000-01-03 -0.131959 0.089012 0.667450 A B C 2000-01-04 0.169405 -1.358046 -0.105563 2000-01-05 0.492195 0.076693 0.213685 2000-01-06 -0.285283 -1.210529 -1.408386 A B C 2000-01-07 0.941577 -0.342447 0.222031 2000-01-08 0.052607 2.093214 1.064908

Note

You can also use the iterator with read_hdf which will open, then automatically close the store when finished iterating.

for df in pd.read_hdf("store.h5", "df", chunksize=3):
    print(df)

Note that the chunksize keyword applies to the source rows. So if you are doing a query, the chunksize will subdivide the total rows in the table and the query will be applied, returning an iterator on potentially unequal sized chunks.

Here is a recipe for generating a query and using it to create equal sized return chunks.

In [536]: dfeq = pd.DataFrame[{"number": np.arange[1, 11]}] In [537]: dfeq Out[537]: number 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 In [538]: store.append["dfeq", dfeq, data_columns=["number"]] In [539]: def chunks[l, n]: .....: return [l[i: i + n] for i in range[0, len[l], n]] .....: In [540]: evens = [2, 4, 6, 8, 10] In [541]: coordinates = store.select_as_coordinates["dfeq", "number=evens"] In [542]: for c in chunks[coordinates, 2]: .....: print[store.select["dfeq", where=c]] .....: number 1 2 3 4 number 5 6 7 8 number 9 10

Advanced queries#

Select a single column#

To retrieve a single indexable or data column, use the method select_column. This will, for example, enable you to get the index very quickly. These return a Series of the result, indexed by the row number. These do not currently accept the where selector.

In [543]: store.select_column["df_dc", "index"] Out[543]: 0 2000-01-01 1 2000-01-02 2 2000-01-03 3 2000-01-04 4 2000-01-05 5 2000-01-06 6 2000-01-07 7 2000-01-08 Name: index, dtype: datetime64[ns] In [544]: store.select_column["df_dc", "string"] Out[544]: 0 foo 1 foo 2 foo 3 foo 4 NaN 5 NaN 6 foo 7 bar Name: string, dtype: object

Selecting coordinates#

Sometimes you want to get the coordinates (a.k.a. the index locations) of your query. This returns an Int64Index of the resulting locations. These coordinates can also be passed to subsequent where operations.

In [545]: df_coord = pd.DataFrame[ .....: np.random.randn[1000, 2], index=pd.date_range["20000101", periods=1000] .....: ] .....: In [546]: store.append["df_coord", df_coord] In [547]: c = store.select_as_coordinates["df_coord", "index > 20020101"] In [548]: c Out[548]: Int64Index[[732, 733, 734, 735, 736, 737, 738, 739, 740, 741, ... 990, 991, 992, 993, 994, 995, 996, 997, 998, 999], dtype='int64', length=268] In [549]: store.select["df_coord", where=c] Out[549]: 0 1 2002-01-02 0.009035 0.921784 2002-01-03 -1.476563 -1.376375 2002-01-04 1.266731 2.173681 2002-01-05 0.147621 0.616468 2002-01-06 0.008611 2.136001 ... ... ... 2002-09-22 0.781169 -0.791687 2002-09-23 -0.764810 -2.000933 2002-09-24 -0.345662 0.393915 2002-09-25 -0.116661 0.834638 2002-09-26 -1.341780 0.686366 [268 rows x 2 columns]

Selecting using a where mask#

Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index from an indexing operation. This example selects the rows of a DatetimeIndex whose month is 5.

In [550]: df_mask = pd.DataFrame(
   .....:     np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
   .....: )
   .....: 

In [551]: store.append("df_mask", df_mask)

In [552]: c = store.select_column("df_mask", "index")

In [553]: where = c[pd.DatetimeIndex(c).month == 5].index

In [554]: store.select("df_mask", where=where)
Out[554]: 
                   0         1
2000-05-01 -0.386742 -0.977433
2000-05-02 -0.228819  0.471671
2000-05-03  0.337307  1.840494
2000-05-04  0.050249  0.307149
2000-05-05 -0.802947 -0.946730
...              ...       ...
2002-05-27  1.605281  1.741415
2002-05-28 -0.804450 -0.715040
2002-05-29 -0.874851  0.037178
2002-05-30 -0.161167 -1.294944
2002-05-31 -0.258463 -0.731969

[93 rows x 2 columns]

Storer object#

If you want to inspect the stored object, retrieve it via get_storer. You could use this programmatically to, say, get the number of rows in an object.

In [555]: store.get_storer("df_dc").nrows
Out[555]: 8

Multiple table queries#

The methods append_to_multiple and select_as_multiple can perform appending/selecting from multiple tables at once. The idea is to have one table (call it the selector table) on which you index most or all of the columns and perform your queries. The other table(s) are data tables with an index matching the selector table’s index. You can then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very wide table, but enables more efficient queries.

The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a dictionary that maps the table names to a list of ‘columns’ you want in that table. If None is used in place of a list, that table will have the remaining unspecified columns of the given DataFrame. The argument selector defines which table is the selector table (which you can make queries from). The argument dropna will drop rows from the input DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is entirely np.nan, that row will be dropped from all tables.

If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that entirely np.nan rows are not written to the HDFStore, so if you choose to pass dropna=False, some tables may have more rows than others, and therefore select_as_multiple may not work or may return unexpected results.

In [556]: df_mt = pd.DataFrame(
   .....:     np.random.randn(8, 6),
   .....:     index=pd.date_range("1/1/2000", periods=8),
   .....:     columns=["A", "B", "C", "D", "E", "F"],
   .....: )
   .....: 

In [557]: df_mt["foo"] = "bar"

In [558]: df_mt.loc[df_mt.index[1], ["A", "B"]] = np.nan

# you can also create the tables individually
In [559]: store.append_to_multiple(
   .....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   .....: )
   .....: 

In [560]: store
Out[560]: 
File path: store.h5

# individual tables were created
In [561]: store.select("df1_mt")
Out[561]: 
                   A         B
2000-01-01  0.079529 -1.459471
2000-01-02       NaN       NaN
2000-01-03 -0.423113  2.314361
2000-01-04  0.756744 -0.792372
2000-01-05 -0.184971  0.170852
2000-01-06  0.678830  0.633974
2000-01-07  0.034973  0.974369
2000-01-08 -2.110103  0.243062

In [562]: store.select("df2_mt")
Out[562]: 
                   C         D         E         F  foo
2000-01-01 -0.596306 -0.910022 -1.057072 -0.864360  bar
2000-01-02  0.477849  0.283128 -2.045700 -0.338206  bar
2000-01-03 -0.033100 -0.965461 -0.001079 -0.351689  bar
2000-01-04 -0.513555 -1.484776 -0.796280 -0.182321  bar
2000-01-05 -0.872407 -1.751515  0.934334  0.938818  bar
2000-01-06 -1.398256  1.347142 -0.029520  0.082738  bar
2000-01-07 -0.755544  0.380786 -1.634116  1.293610  bar
2000-01-08  1.453064  0.500558 -0.574475  0.694324  bar

# as a multiple
In [563]: store.select_as_multiple(
   .....:     ["df1_mt", "df2_mt"],
   .....:     where=["A>0", "B>0"],
   .....:     selector="df1_mt",
   .....: )
   .....: 
Out[563]: 
                   A         B         C         D         E         F  foo
2000-01-06  0.678830  0.633974 -1.398256  1.347142 -0.029520  0.082738  bar
2000-01-07  0.034973  0.974369 -0.755544  0.380786 -1.634116  1.293610  bar

Delete from a table#

You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand that PyTables deletes rows by erasing them and then moving the following data. Thus deleting can potentially be a very expensive operation depending on the orientation of your data. To get optimal performance, it’s worthwhile to have the dimension you are deleting be the first of the indexables.
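For example, a minimal sketch of a where-based delete (the key name and date range are illustrative only, and the object must have been stored in table format):

import pandas as pd

with pd.HDFStore("store.h5") as store:
    # remove only the rows whose index falls inside the given range;
    # rows outside the range stay in place
    store.remove("df", where="index > 20000105 & index < 20000108")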

Data is ordered (on the disk) in terms of the indexables. Here’s a simple use case. You store panel-type data, with dates in the major_axis and ids in the minor_axis. The data is then interleaved like this:

  • date_1
    • id_1

    • id_2

    • .

    • id_n

  • date_2
    • id_1

    • .

    • id_n

It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed, then the following data moved. On the other hand a delete operation on the minor_axis will be very expensive. In this case it would almost certainly be faster to rewrite the table using a where that selects all but the missing data.

Warning

Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly deleting (or removing nodes) and adding again WILL TEND TO INCREASE THE FILE SIZE.

To repack and clean the file, use ptrepack.
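ptrepack is PyTables’ command-line repacking tool. As a rough pandas-only alternative, the sketch below copies the surviving contents into a freshly written file (an assumption-laden workaround, not ptrepack itself: it loads each object fully into memory and assumes every key can be rewritten as a table):

import pandas as pd

# copy every object into a new, freshly written (and compressed) file;
# deleted rows are not copied, so their space is not carried over
with pd.HDFStore("store.h5") as src, pd.HDFStore(
    "store_packed.h5", mode="w", complevel=9, complib="blosc"
) as dst:
    for key in src.keys():
        dst.put(key, src[key], format="table")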

Notes & caveats#

Compression#

PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters are used to control compression: complevel and complib.

  • complevel specifies if and how hard data is to be compressed. complevel=0 and complevel=None disable compression, while 0 < complevel < 10 enables compression (a usage sketch follows this list).

  • complib specifies which compression library to use. If nothing is specified the default library zlib is used.
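A minimal usage sketch (the file and key names are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 2), columns=["A", "B"])

# compress everything written to this store with blosc at level 9
store = pd.HDFStore("compressed_store.h5", complevel=9, complib="blosc")
store.put("df", df, format="table")
store.close()

# or enable compression for a single write
df.to_hdf("compressed_file.h5", "df", complevel=5, complib="zlib", format="table")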

Sqlite fallback#

The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter which respects the Python DB-API.

You can create connections like so:

import sqlite3

con = sqlite3.connect(":memory:")

And then issue the following queries:

data.to_sql("data", con)
pd.read_sql_query("SELECT * FROM data", con)

Google BigQuery#

Warning

Starting in 0.20.0, pandas has split off Google BigQuery support into the separate package pandas-gbq. You can pip install pandas-gbq to get it.

The pandas-gbq package provides functionality to read/write from Google BigQuery.

pandas integrates with this external package. If pandas-gbq is installed, you can use the pandas methods pd.read_gbq and DataFrame.to_gbq, which will call the respective functions from pandas-gbq.
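A sketch of how those wrappers are typically called (the project id, dataset and table names are placeholders, and pandas-gbq must be installed and authenticated for this to run):

import pandas as pd

# read the result of a query into a DataFrame (placeholder project id)
df = pd.read_gbq("SELECT name, age FROM my_dataset.my_table", project_id="my-project")

# write a DataFrame back to a BigQuery table (placeholder destination)
df.to_gbq("my_dataset.my_table_copy", project_id="my-project", if_exists="replace")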

Full documentation can be found here.

Stata format#

Writing to stata format#

The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always 115 (Stata 12).

In [648]: df = pd.DataFrame(np.random.randn(10, 2), columns=list("AB"))

In [649]: df.to_stata("stata.dta")

Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16, int32, float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to represent missing data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and so variables with values above 100 will trigger a conversion to int16. nan values in floating point data types are stored as the basic missing data type (. in Stata).
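For instance, the sketch below (column and file names are arbitrary) writes an int8 column containing a value above 100, which the limits described above force the writer to store as a wider integer type:

import numpy as np
import pandas as pd

df = pd.DataFrame({"small": np.array([1, 50, 120], dtype=np.int8)})
df.to_stata("upcast.dta")

# reading the file back is expected to show the widened type (int16),
# given the Stata range limits described above
print(pd.read_stata("upcast.dta")["small"].dtype)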

Note

It is not possible to export missing data values for integer data types.

The Stata writer gracefully handles other data types including int64, bool, uint8, uint16, uint32 by casting to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of this range, the variable is cast to int16.

Warning

Conversion from int64 to float64 may result in a loss of precision if int64 values are larger than 2**53.

Warning

StataWriter and to_stata() only support fixed-width strings containing up to 244 characters, a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with strings longer than 244 characters raises a ValueError.

Reading from Stata format#

The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that can be used to read the file incrementally.

In [650]: pd.read_stata("stata.dta")
Out[650]: 
   index         A         B
0      0 -1.690072  0.405144
1      1 -1.511309 -1.531396
2      2  0.572698 -1.106845
3      3 -1.185859  0.174564
4      4  0.603797 -1.796129
5      5 -0.791679  1.173795
6      6 -0.277710  1.859988
7      7 -0.258413  1.251808
8      8  1.443262  0.441553
9      9  1.168163 -2.054946

Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the file at a time. The StataReader object can be used as an iterator.

In [651]: with pd.read_stata("stata.dta", chunksize=3) as reader:
   .....:     for df in reader:
   .....:         print(df.shape)
   .....: 
(3, 3)
(3, 3)
(3, 3)
(1, 3)

For more fine-grained control, use iterator=True and specify chunksize with each call to read().

In [652]: with pd.read_stata("stata.dta", iterator=True) as reader:
   .....:     chunk1 = reader.read(5)
   .....:     chunk2 = reader.read(5)
   .....: 

Currently the index is retrieved as a column.
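If you would rather have it back as the DataFrame index, one option is the sketch below; it assumes read_stata accepts an index_col argument naming the written column, which is worth verifying against your pandas version:

import pandas as pd

# re-use the written "index" column as the DataFrame index
df = pd.read_stata("stata.dta", index_col="index")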

The parameter convert_categoricals indicates whether value labels should be read and used to create a Categorical variable from them. Value labels can also be retrieved by the function value_labels, which requires read[] to be called before use.

The parameter convert_missing indicates whether missing value representations in Stata should be preserved. If False (the default), missing values are represented as np.nan. If True, missing values are represented using StataMissingValue objects, and columns containing missing values will have object data type.
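A sketch combining these options (the file name matches the earlier examples; whether any value labels or missing values actually exist depends on the file):

import pandas as pd

# keep Stata's missing-value representations instead of converting to np.nan
df = pd.read_stata("stata.dta", convert_missing=True)

# value labels can be inspected through the reader object once read() has run
with pd.read_stata("stata.dta", iterator=True) as reader:
    data = reader.read()
    labels = reader.value_labels()  # dict mapping variable names to label mappings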

Note

read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and 118 (Stata 14).

Note

Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all integer types and float64 for floating point data. By default, the Stata data types are preserved when importing.

Categorical data#

Categorical data can be exported to Stata data files as value labeled data. The exported data consists of the underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit equivalent to a Categorical and information about whether the variable is ordered is lost when exporting.

Warning

Stata only supports string value labels, and so str is called on the categories when exporting data. Exporting Categorical variables with non-string categories produces a warning, and can result in a loss of information if the str representations of the categories are not unique.

Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword argument convert_categoricals (True by default). The keyword argument order_categoricals (True by default) determines whether imported Categorical variables are ordered.
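A round-trip sketch (column and file names are arbitrary) that exports a Categorical as value labeled data and reads it back:

import pandas as pd

df = pd.DataFrame({"grade": pd.Categorical(["low", "high", "low"], ordered=True)})

# the categories are written as value labels, the codes as the data
df.to_stata("grades.dta")

# value labels become a Categorical again on import; any ordering is
# re-derived on import, since Stata does not store it
back = pd.read_stata("grades.dta", convert_categoricals=True, order_categoricals=True)
print(back["grade"])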

Note

When importing categorical data, the values of the variables in the Stata data file are not preserved since Categorical variables always use integer data types between -1 and n-1 where n is the number of categories. If the original values in the Stata data file are required, these can be imported by setting convert_categoricals=False, which will import original data (but not the variable labels). The original values can be matched to the imported categorical data since there is a simple mapping between the original Stata data values and the category codes of imported Categorical variables: missing values are assigned code -1, and the smallest original value is assigned 0, the second smallest is assigned 1 and so on until the largest original value is assigned the code n-1.

Note

Stata supports partially labeled series. These series have value labels for some but not all data values. Importing a partially labeled series will produce a Categorical with string categories for the values that are labeled and numeric categories for values with no label.

SAS formats#

The top-level function read_sas() can read (but not write) SAS XPORT (.xpt) and (since v0.18.0) SAS7BDAT (.sas7bdat) format files.

SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated). For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the format codes may allow date variables to be automatically converted to dates. By default the whole file is read and returned as a DataFrame.

Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader) for incrementally reading the file. The reader objects also have attributes that contain additional information about the file and its variables.

Read a SAS7BDAT file:

df = pd.read_sas("sas_data.sas7bdat")

Obtain an iterator and read an XPORT file 100,000 lines at a time:

def do_something(chunk):
    pass


with pd.read_sas("sas_xport.xpt", chunksize=100000) as rdr:
    for chunk in rdr:
        do_something(chunk)

The specification for the xport file format is available from the SAS web site.

No official documentation is available for the SAS7BDAT format.

SPSS formats#

New in version 0.25.0.

The top-level function read_spss() can read (but not write) SPSS SAV (.sav) and ZSAV (.zsav) format files.

SPSS files contain column names. By default the whole file is read, categorical columns are converted into pd.Categorical, and a DataFrame with all columns is returned.

Specify the usecols parameter to obtain a subset of columns. Specify convert_categoricals=False to avoid converting categorical columns into pd.Categorical.

Read an SPSS file:

df = pd.read_spss("spss_data.sav")

Extract a subset of columns contained in usecols from an SPSS file and avoid converting categorical columns into pd.Categorical:

df = pd.read_spss(
    "spss_data.sav",
    usecols=["foo", "bar"],
    convert_categoricals=False,
)

More information about the SAV and ZSAV file formats is available here.

Other file formats#

pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading and writing other file formats into and from pandas, we recommend these packages from the broader community.

netCDF#

xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional datasets, with a focus on the netCDF file format and easy conversion to and from pandas.
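A conversion round trip might look like the following sketch (it assumes xarray is installed; the variable and index names are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

df = pd.DataFrame(
    {"temperature": np.random.randn(4)},
    index=pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["station", "day"]),
)

# a MultiIndexed DataFrame becomes a Dataset with one dimension per index level
ds = xr.Dataset.from_dataframe(df)

# and back again
df_round_trip = ds.to_dataframe()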

Performance considerations#

This is an informal comparison of various IO methods, using pandas 0.24.2. Timings are machine dependent and small differences should be ignored.

In [1]: sz = 1000000

In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})

In [3]: df.info()
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null float64
B    1000000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.3 MB

The following test functions will be used below to compare the performance of several IO methods:

import os
import sqlite3

import numpy as np

import pandas as pd

sz = 1000000
np.random.seed(42)
df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})


def test_sql_write(df):
    if os.path.exists("test.sql"):
        os.remove("test.sql")
    sql_db = sqlite3.connect("test.sql")
    df.to_sql(name="test_table", con=sql_db)
    sql_db.close()


def test_sql_read():
    sql_db = sqlite3.connect("test.sql")
    pd.read_sql_query("select * from test_table", sql_db)
    sql_db.close()


def test_hdf_fixed_write(df):
    df.to_hdf("test_fixed.hdf", "test", mode="w")


def test_hdf_fixed_read():
    pd.read_hdf("test_fixed.hdf", "test")


def test_hdf_fixed_write_compress(df):
    df.to_hdf("test_fixed_compress.hdf", "test", mode="w", complib="blosc")


def test_hdf_fixed_read_compress():
    pd.read_hdf("test_fixed_compress.hdf", "test")


def test_hdf_table_write(df):
    df.to_hdf("test_table.hdf", "test", mode="w", format="table")


def test_hdf_table_read():
    pd.read_hdf("test_table.hdf", "test")


def test_hdf_table_write_compress(df):
    df.to_hdf(
        "test_table_compress.hdf", "test", mode="w", complib="blosc", format="table"
    )


def test_hdf_table_read_compress():
    pd.read_hdf("test_table_compress.hdf", "test")


def test_csv_write(df):
    df.to_csv("test.csv", mode="w")


def test_csv_read():
    pd.read_csv("test.csv", index_col=0)


def test_feather_write(df):
    df.to_feather("test.feather")


def test_feather_read():
    pd.read_feather("test.feather")


def test_pickle_write(df):
    df.to_pickle("test.pkl")


def test_pickle_read():
    pd.read_pickle("test.pkl")


def test_pickle_write_compress(df):
    df.to_pickle("test.pkl.compress", compression="xz")


def test_pickle_read_compress():
    pd.read_pickle("test.pkl.compress", compression="xz")


def test_parquet_write(df):
    df.to_parquet("test.parquet")


def test_parquet_read():
    pd.read_parquet("test.parquet")

When writing, the top three functions in terms of speed are test_feather_write, test_hdf_fixed_write and test_hdf_fixed_write_compress.

In [4]: %timeit test_sql_write(df)
3.29 s ± 43.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit test_hdf_fixed_write(df)
19.4 ms ± 560 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit test_hdf_fixed_write_compress(df)
19.6 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit test_hdf_table_write(df)
449 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit test_hdf_table_write_compress(df)
448 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit test_csv_write(df)
3.66 s ± 26.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit test_feather_write(df)
9.75 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [11]: %timeit test_pickle_write(df)
30.1 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [12]: %timeit test_pickle_write_compress(df)
4.29 s ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [13]: %timeit test_parquet_write(df)
67.6 ms ± 706 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

When reading, the top three functions in terms of speed are test_feather_read, test_pickle_read and test_hdf_fixed_read.

In [14]: %timeit test_sql_read()
1.77 s ± 17.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [15]: %timeit test_hdf_fixed_read()
19.4 ms ± 436 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [16]: %timeit test_hdf_fixed_read_compress()
19.5 ms ± 222 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [17]: %timeit test_hdf_table_read()
38.6 ms ± 857 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [18]: %timeit test_hdf_table_read_compress()
38.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [19]: %timeit test_csv_read()
452 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [20]: %timeit test_feather_read()
12.4 ms ± 99.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit test_pickle_read()
18.4 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [22]: %timeit test_pickle_read_compress()
915 ms ± 7.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [23]: %timeit test_parquet_read()
24.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

The files test.pkl.compress, test.parquet and test.feather took the least space on disk [in bytes].

29519500 Oct 10 06:45 test.csv
16000248 Oct 10 06:45 test.feather
 8281983 Oct 10 06:49 test.parquet
16000857 Oct 10 06:47 test.pkl
 7552144 Oct 10 06:48 test.pkl.compress
34816000 Oct 10 06:42 test.sql
24009288 Oct 10 06:43 test_fixed.hdf
24009288 Oct 10 06:43 test_fixed_compress.hdf
24458940 Oct 10 06:44 test_table.hdf
24458940 Oct 10 06:44 test_table_compress.hdf
