carabiner.pd package

Submodules

carabiner.pd.utils module

Utilities for Pandas.

class carabiner.pd.utils.IOFormat(in_delim: str, out_delim: str | None = None, strict: bool = True)[source]

Bases: object

Stores delimiters for reading and writing.

Parameters:

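in_delim (str) – Delimiter when reading.
out_delim (str, optional) – Delimiter when writing.
strict (bool, optional) – Whether strict delimiter matching was requested (see format2delim). Default: True.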

in_delim: str

out_delim: str | None = None

strict: bool = True

carabiner.pd.utils.format2delim(format: str, default: str | None = None, allow_excel: bool = True, strict: bool = True) → IOFormat | None[source]

Return a delimiter from its format name or extension.

Parameters:

format (str) – Format name.
default (str, optional) – Default delimiter to return if delimiter not supported.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.
strict (bool, optional) – If False, allow any run of whitespace as the delimiter when reading TSV. Default: True.

Returns:

IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.

Return type:

IOFormat or None

Examples

>>> format2delim(".csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> format2delim("tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> format2delim("tsv", strict=False)
IOFormat(in_delim='\\s+', out_delim='\t', strict=False)
>>> format2delim(".xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> format2delim(".cool", default=".")
IOFormat(in_delim='.', out_delim='.', strict=True)
>>> format2delim(".cool") is None
True
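The lookup behaviour shown in these examples can be approximated in a few lines. The following is a minimal standard-library sketch, not carabiner's actual implementation: IOFormatSketch, format2delim_sketch and the _DELIMS table are hypothetical names, and the '.txt' format is left out because its delimiter is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IOFormatSketch:
    """Hypothetical stand-in for IOFormat: delimiters for reading and writing."""
    in_delim: str
    out_delim: Optional[str] = None
    strict: bool = True


# Assumed lookup table; only the formats shown in the examples above.
_DELIMS = {"csv": ",", "tsv": "\t", "xlsx": "xlsx"}


def format2delim_sketch(format: str,
                        default: Optional[str] = None,
                        allow_excel: bool = True,
                        strict: bool = True) -> Optional[IOFormatSketch]:
    """Map a format name or extension to reading and writing delimiters."""
    name = format.lower().lstrip(".")  # accept both ".csv" and "csv"
    delim = _DELIMS.get(name)
    if name == "xlsx" and not allow_excel:
        delim = None  # treat Excel as unsupported
    if delim is None:
        delim = default  # fall back to the caller's default, if any
    if delim is None:
        return None
    # With strict=False, TSV input is split on any run of whitespace.
    in_delim = r"\s+" if (name == "tsv" and not strict) else delim
    return IOFormatSketch(in_delim=in_delim, out_delim=delim, strict=strict)
```

Note that only the reading delimiter changes when strict=False; the writing delimiter stays a single tab, which matches the third example above.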

carabiner.pd.utils.get_formats(allow_excel: bool = True) → Tuple[str][source]

List the supported table formats.

Parameters:

allow_excel (bool, optional) – Whether to include XLSX formats. Default: True.

Returns:

The supported table formats.

Return type:

tuple

Examples

>>> get_formats()
('.txt', '.tsv', '.csv', '.xlsx', 'txt', 'tsv', 'csv', 'xlsx')

carabiner.pd.utils.read_csv(filename: str | TextIO, rows: int | Sequence[int] | None = None, progress: bool = True, *args, **kwargs) → DataFrame[source]

Read a delimited file, optionally GZIPped, optionally only specific rows.

Provides a progress bar by default. Additional arguments are passed to pd.read_csv.

Parameters:

filename (str or file-like) – Path of file to read, or an open text stream. Optionally GZIP compressed.
rows (int or list of int, optional) – Rows to read. If None (default), read all rows.
progress (bool) – Whether to display a progress bar. Default: True.

Returns:

Pandas DataFrame of the input file.

Return type:

pd.DataFrame
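To illustrate the idea of reading only selected rows from an optionally GZIP-compressed file, here is a minimal sketch using only the standard library. It is not how read_csv is implemented: read_rows_sketch is a hypothetical name, it returns plain dictionaries rather than a DataFrame, and both the name-based compression check and the zero-based row numbering are assumptions.

```python
import csv
import gzip
import os
import tempfile
from typing import Iterable, List, Optional, Union


def read_rows_sketch(filename: str,
                     rows: Optional[Union[int, Iterable[int]]] = None,
                     delimiter: str = ",") -> List[dict]:
    """Read a delimited file, transparently handling a .gz suffix."""
    # Assumption: compression is detected from the file name alone.
    opener = gzip.open if filename.endswith(".gz") else open
    if isinstance(rows, int):
        rows = [rows]  # a single row number is treated as a one-item list
    wanted = None if rows is None else set(rows)
    with opener(filename, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        # Assumption: row numbers are zero-based and exclude the header.
        return [record for i, record in enumerate(reader)
                if wanted is None or i in wanted]


# Usage: write a small gzipped CSV, then read back rows 0 and 2.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(path, "wt", newline="") as out:
        out.write("a,b\n1,2\n3,4\n5,6\n")
    selected = read_rows_sketch(path, rows=[0, 2])
    everything = read_rows_sketch(path)
```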

Universal reader of tabular data files.

Additional arguments are passed to read_csv or pd.read_excel. If chunksize is provided, returns an iterator of chunks, which is lazy for non-Excel files but greedy for Excel files. Otherwise returns a DataFrame.

Parameters:

filename (str) – Path of file to read in CSV, TSV or Excel format. Optionally GZIP compressed.
format (str) – Format name. Default: infer from filename
sheet_name (str, int or list, optional) – If reading an XLSX file, which sheets to read. Default: read all sheets.

Returns:

Pandas DataFrame of the input file.

Return type:

pd.DataFrame or iterator of pd.DataFrame
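The difference between a lazy and a greedy chunk iterator can be shown with a small generator. This is an illustrative standard-library sketch only; iter_chunks_sketch is a hypothetical helper, and it yields lists of dictionaries instead of DataFrames.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_chunks_sketch(handle: TextIO, chunksize: int,
                       delimiter: str = ",") -> Iterator[List[dict]]:
    """Yield lists of at most `chunksize` records, reading lazily."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


data = io.StringIO("x\ty\n1\t2\n3\t4\n5\t6\n")
chunks = iter_chunks_sketch(data, chunksize=2, delimiter="\t")
first = next(chunks)  # nothing is parsed until the first chunk is requested
rest = list(chunks)   # a greedy reader would have built every chunk up front
```

A greedy reader, by contrast, has to load the whole table before it can hand out the first chunk, so chunking an Excel file saves no memory.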

carabiner.pd.utils.resolve_delim(file: str | IO, format: str | None = None, default: str | None = None, allow_excel: bool = True) → IOFormat | None[source]

Identify the delimiter of a file.

Uses the file extension, unless an explicit format is provided.

Parameters:

file (str or file-like) – File whose delimiter should be identified.
format (str, optional) – Format name to use instead of the file extension.
default (str, optional) – Delimiter to use if the format cannot be identified. If not given, None is returned in that case.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.

Returns:

IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.

Return type:

IOFormat or None

Examples

>>> resolve_delim("test.tsv", format="csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> resolve_delim("test.tsv", format="tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> resolve_delim("test.cool") is None
True
>>> resolve_delim("test.cool", default="\t")
IOFormat(in_delim='\t', out_delim='\t', strict=True)

carabiner.pd.utils.sniff(file: str | IO, default: str | None = None, allow_excel: bool = True, strict: bool = True) → IOFormat | None[source]

Identify the delimiter of a file from its extension.

Parameters:

file (str or file-like) – Input path to file or a file-like object.
default (str, optional) – Default delimiter to return if delimiter not supported.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.
strict (bool, optional) – If False, allow any run of whitespace as the delimiter when reading TSV. Default: True.

Returns:

IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.

Return type:

IOFormat or None

Examples

>>> sniff("test.csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> sniff("test.tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> sniff("test.tsv", strict=False)
IOFormat(in_delim='\\s+', out_delim='\t', strict=False)
>>> sniff("test.xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> sniff("test.cool", default=".")
IOFormat(in_delim='.', out_delim='.', strict=True)
>>> sniff("test.cool") is None
True
>>> sniff("test.xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> sniff("test.xlsx", allow_excel=False) is None
True
>>> sniff("test.tsv.gz")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
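The last example suggests that a compression suffix is skipped when looking for the format. One way this could work is sketched below; extension_sketch is a hypothetical helper, and reading the path of a file-like object from its name attribute is an assumption rather than documented behaviour.

```python
import os
from typing import IO, Optional, Union


def extension_sketch(file: Union[str, IO]) -> Optional[str]:
    """Return the format-bearing extension of a path or file-like object."""
    # Assumption: file-like objects expose their path via a `name` attribute.
    name = file if isinstance(file, str) else getattr(file, "name", None)
    if not isinstance(name, str):
        return None  # e.g. an in-memory stream without a name
    root, ext = os.path.splitext(name)
    if ext.lower() == ".gz":
        # Look past the compression suffix: "test.tsv.gz" -> ".tsv"
        ext = os.path.splitext(root)[1]
    return ext.lower() or None
```

The returned extension could then be passed to a lookup such as format2delim to obtain the delimiters.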

carabiner.pd.utils.write_stream(df: DataFrame, output: TextIO | str = sys.stdout, format: str | None = None, *args, **kwargs) → None[source]

Write a Pandas DataFrame to a file or stdout.

Similar to pd.DataFrame.to_csv(), but excludes the index and writes to stdout by default, and tolerates truncated output without complaining about broken pipes.

Additional arguments are passed to pd.DataFrame.to_csv.

Parameters:

df (pd.DataFrame) – Input Pandas DataFrame to write out.
output (str or file-like, optional) – Output file path or open text stream. Default: stdout.
format (str) – Format name. Default: infer from filename

Return type:

None
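The broken-pipe handling described above can be sketched with the standard library alone. This is an assumption-laden illustration and not the body of write_stream: write_rows_sketch is a hypothetical name, it takes plain rows rather than a DataFrame, and it reports a closed pipe through its return value.

```python
import csv
import io
import sys
from typing import Iterable, Sequence, TextIO


def write_rows_sketch(header: Sequence[str],
                      rows: Iterable[Sequence],
                      output: TextIO = sys.stdout,
                      delimiter: str = "\t") -> bool:
    """Write rows to a stream; return False if the reader closed the pipe."""
    writer = csv.writer(output, delimiter=delimiter, lineterminator="\n")
    try:
        writer.writerow(header)
        writer.writerows(rows)
        output.flush()
    except BrokenPipeError:
        # The downstream consumer exited early: stop quietly.
        return False
    return True


buffer = io.StringIO()
ok = write_rows_sketch(["a", "b"], [(1, 2), (3, 4)], output=buffer)
written = buffer.getvalue()
```

A real command-line tool would also have to deal with the final flush of stdout at interpreter exit, as discussed in the note on SIGPIPE in the Python signal module documentation.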
