Version 0.16.2 (June 12, 2015)
This is a minor bug-fix release from 0.16.1 and includes a large number of
bug fixes along with some new features (the pipe() method), enhancements, and performance improvements.
We recommend that all users upgrade to this version.
Highlights include the new pipe() method, described under New features below.
New features
Pipe
We’ve introduced a new method DataFrame.pipe(). As suggested by the name, pipe
should be used to pipe data through a chain of function calls.
The goal is to avoid confusing nested function calls like
# df is a DataFrame
# f, g, and h are functions that take and return DataFrames
f(g(h(df), arg1=1), arg2=2, arg3=3)
The logic flows from inside out, and function names are separated from their keyword arguments. This can be rewritten as
(
    df.pipe(h)
    .pipe(g, arg1=1)
    .pipe(f, arg2=2, arg3=3)
)
Now both the code and the logic flow from top to bottom. Keyword arguments are next to their functions. Overall the code is much more readable.
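As a minimal runnable sketch of the same pattern, the functions h, g, and f below are hypothetical stand-ins (they are not part of pandas), chosen only so that the nested and piped forms can be compared on a small frame:

```python
import pandas as pd


# Hypothetical stand-ins for h, g, and f: each takes a DataFrame as its
# first positional argument and returns a new DataFrame.
def h(df):
    return df.dropna()


def g(df, arg1):
    return df.assign(b=df["a"] + arg1)


def f(df, arg2, arg3):
    return df.assign(c=df["b"] * arg2 - arg3)


df = pd.DataFrame({"a": [1.0, 2.0, None]})

# Nested form: reads from the inside out.
nested = f(g(h(df), arg1=1), arg2=2, arg3=3)

# Piped form: reads from top to bottom.
piped = (
    df.pipe(h)
    .pipe(g, arg1=1)
    .pipe(f, arg2=2, arg3=3)
)

print(piped.equals(nested))
```

Both forms call the same functions with the same arguments; only the order in which the calls are written differs.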
In the example above, the functions f, g, and h each expected the DataFrame as the first positional argument.
When the function you wish to apply takes its data anywhere other than the first argument, pass a tuple
of (function, keyword) indicating where the DataFrame should flow. For example:
In [1]: import statsmodels.formula.api as sm
In [2]: bb = pd.read_csv("data/baseball.csv", index_col="id")
# sm.ols takes (formula, data)
In [3]: (
...: bb.query("h > 0")
...: .assign(ln_h=lambda df: np.log(df.h))
...: .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
...: .fit()
...: .summary()
...: )
...:
The pipe method is inspired by Unix pipes, which stream text through
processes. More recently dplyr and magrittr have introduced the
popular (%>%) pipe operator for R.
See the documentation for more. (GH 10129)
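The (function, keyword) tuple form can also be tried without statsmodels or the baseball data set. In this sketch, column_mean is a hypothetical function (not a pandas or statsmodels API) that, like sm.ols, expects its data under a keyword rather than as the first positional argument:

```python
import pandas as pd


# Hypothetical function: the column name comes first, and the DataFrame is
# expected under the keyword "data".
def column_mean(column, data):
    return data[column].mean()


df = pd.DataFrame({"hr": [10, 20, 30], "g": [1, 2, 3]})

# The tuple tells pipe to pass the DataFrame as the keyword argument "data";
# the remaining positional arguments are forwarded unchanged.
result = df.pipe((column_mean, "data"), "hr")

# Equivalent direct call.
direct = column_mean("hr", data=df)

print(result == direct)
```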
Other enhancements
Added rsplit to Index/Series StringMethods (GH 10303)
Removed the hard-coded size limits on the DataFrame HTML representation in the IPython notebook, and leave this to IPython itself (only for IPython v3.0 or greater). This eliminates the duplicate scroll bars that appeared in the notebook with large frames (GH 10231).
Note that the notebook has a toggle output scrolling feature to limit the display of very large frames (by clicking left of the output). You can also configure the way DataFrames are displayed using the pandas options; see the options documentation.
axis parameter of DataFrame.quantile now also accepts index and column. (GH 9543)
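A short sketch of two of these enhancements, using made-up data: Series.str.rsplit splits from the right-hand side of each string, and DataFrame.quantile accepts the axis by name as well as by number. Only axis="index" is shown here.

```python
import pandas as pd

s = pd.Series(["a_b_c", "d_e_f"])

# rsplit starts from the right, so with n=1 only the last "_" is used.
from_right = s.str.rsplit("_", n=1)

# split starts from the left, so with n=1 only the first "_" is used.
from_left = s.str.split("_", n=1)

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# The axis can be given by name or by number.
by_name = df.quantile(0.5, axis="index")
by_number = df.quantile(0.5, axis=0)

print(from_right.tolist())
print(from_left.tolist())
print(by_name.equals(by_number))
```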
API changes
Holiday now raises NotImplementedError if both offset and observance are used in the constructor instead of returning an incorrect result (GH 10217).
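The following sketch shows the behaviour described above. The holiday name and date are arbitrary illustration values; nearest_workday is one of the observance functions shipped in pandas.tseries.holiday.

```python
import pandas as pd
from pandas.tseries.holiday import Holiday, nearest_workday

# Passing both offset and observance is rejected up front.
try:
    Holiday(
        "Example Day",  # arbitrary name, for illustration only
        month=7,
        day=4,
        offset=pd.DateOffset(days=1),
        observance=nearest_workday,
    )
    raised = False
except NotImplementedError:
    raised = True

print(raised)
```

Supplying only one of the two arguments constructs the holiday rule as usual.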
Performance improvements
Bug fixes
Bug in Series.hist raises an error when a one row Series was given (GH 10214)
Bug where HDFStore.select modifies the passed columns list (GH 7212)
Bug in Categorical repr with display.width of None in Python 3 (GH 10087)
Bug in to_json with certain orients and a CategoricalIndex would segfault (GH 10317)
Bug where some of the nan functions do not have consistent return dtypes (GH 10251)
Bug in DataFrame.quantile on checking that a valid axis was passed (GH 9543)
Bug in groupby.apply aggregation for Categorical not preserving categories (GH 10138)
Bug in to_csv where date_format is ignored if the datetime is fractional (GH 10209)
Bug in DataFrame.to_json with mixed data types (GH 10289)
Bug in cache updating when consolidating (GH 10264)
Bug in mean() where integer dtypes can overflow (GH 10172)
Bug where Panel.from_dict does not set dtype when specified (GH 10058)
Bug in Index.union raises AttributeError when passing array-likes. (GH 10149)
Bug in Timestamp’s microsecond, quarter, dayofyear, week and daysinmonth properties return np.int type, not built-in int. (GH 10050)
Bug in NaT raises AttributeError when accessing the daysinmonth and dayofweek properties. (GH 10096)
Bug in Index repr when using the max_seq_items=None setting (GH 10182).
Bug in getting timezone data with dateutil on various platforms (GH 9059, GH 8639, GH 9663, GH 10121)
Bug in displaying datetimes with mixed frequencies; display ‘ms’ datetimes to the proper precision. (GH 10170)
Bug in setitem where type promotion is applied to the entire block (GH 10280)
Bug in Series arithmetic methods may incorrectly hold names (GH 10068)
Bug in GroupBy.get_group when grouping on multiple keys, one of which is categorical. (GH 10132)
Bug in DatetimeIndex and TimedeltaIndex names are lost after timedelta arithmetic (GH 9926)
Bug in DataFrame construction from nested dict with datetime64 (GH 10160)
Bug in Series construction from dict with datetime64 keys (GH 9456)
Bug in Series.plot(label="LABEL") not correctly setting the label (GH 10119)
Bug in plot not defaulting to matplotlib axes.grid setting (GH 9792)
Bug causing strings containing an exponent, but no decimal to be parsed as int instead of float in engine='python' for the read_csv parser (GH 9565)
Bug in Series.align resets name when fill_value is specified (GH 10067)
Bug in read_csv causing index name not to be set on an empty DataFrame (GH 10184)
Bug in SparseSeries.abs resets name (GH 10241)
Bug in TimedeltaIndex slicing may reset freq (GH 10292)
Bug in GroupBy.get_group raises ValueError when group key contains NaT (GH 6992)
Bug in SparseSeries constructor ignores input data name (GH 10258)
Bug in Categorical.remove_categories causing a ValueError when removing the NaN category if underlying dtype is floating-point (GH 10156)
Bug where infer_freq infers time rule (WOM-5XXX) unsupported by to_offset (GH 9425)
Bug in DataFrame.to_hdf() where table format would raise a seemingly unrelated error for invalid (non-string) column names. This is now explicitly forbidden. (GH 9057)
Bug to handle masking empty DataFrame (GH 10126).
Bug where MySQL interface could not handle numeric table/column names (GH 10255)
Bug in read_csv with a date_parser that returned a datetime64 array of other time resolution than [ns] (GH 10245)
Bug in Panel.apply when the result has ndim=0 (GH 10332)
Bug in read_hdf where auto_close could not be passed (GH 9327).
Bug in read_hdf where open stores could not be used (GH 10330).
Bug in adding empty DataFrames, now results in a DataFrame that .equals an empty DataFrame (GH 10181).
Bug in to_hdf and HDFStore which did not check that complib choices were valid (GH 4582, GH 8874).
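As one illustration from the list above, the mean() overflow fix (GH 10172) can be checked with a small sketch. It assumes a pandas version that includes the fix, and the values are chosen so that their integer sum would not fit in int64 even though their mean does:

```python
import pandas as pd

# Two values whose exact sum (2**63) is just outside the int64 range.
big = 2**62
s = pd.Series([big, big], dtype="int64")

# With the fix, the mean is accumulated without overflowing.
mean_value = s.mean()

print(mean_value == float(big))
```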