carabiner.pd package
Submodules
carabiner.pd.utils module
Utilities for Pandas.
- class carabiner.pd.utils.IOFormat(in_delim: str, out_delim: str | None = None, strict: bool = True)
Bases: object
Stores delimiters for reading and writing.
- Parameters:
in_delim (str) – Delimiter when reading.
out_delim (str, optional) – Delimiter when writing.
- in_delim: str
- out_delim: str | None = None
- strict: bool = True
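The documented examples for this module show in_delim set to '\t' when strict is True and to the regular expression '\\s+' when strict is False. As a minimal sketch of what that difference means for one line of text, the snippet below uses the standard-library re.split as a stand-in; how the library itself applies these delimiters is not shown in this reference, so treat it as an illustration only.

```python
import re

# A tab-separated line whose first field contains a space.
line = "sample one\t3.2\tpass"

# Tab-only splitting (the strict delimiter) keeps the space inside the field.
strict_fields = re.split("\t", line)

# Whitespace-regex splitting (the non-strict delimiter) also breaks on the space.
loose_fields = re.split(r"\s+", line)

print(strict_fields)
print(loose_fields)
```

Non-strict reading is therefore only safe when no field contains spaces.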
- carabiner.pd.utils.format2delim(format: str, default: str | None = None, allow_excel: bool = True, strict: bool = True) → IOFormat | None
Return a delimiter from its format name or extension.
- Parameters:
format (str) – Format name.
default (str, optional) – Default delimiter to return if delimiter not supported.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.
strict (bool, optional) – If True, TSV input is split on tab characters only. If False, any run of whitespace is accepted as the delimiter when reading. Default: True.
- Returns:
IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.
- Return type:
IOFormat or None
Examples
>>> format2delim(".csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> format2delim("tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> format2delim("tsv", strict=False)
IOFormat(in_delim='\\s+', out_delim='\t', strict=False)
>>> format2delim(".xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> format2delim(".cool", default=".")
IOFormat(in_delim='.', out_delim='.', strict=True)
>>> format2delim(".cool") is None
True
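The behaviour shown in these examples can be approximated with a small lookup. The sketch below is not the library's implementation: DelimPair and lookup_delim are hypothetical names, and treating "txt" the same as "tsv" is an assumption based only on "txt" appearing among the supported formats.

```python
from typing import NamedTuple, Optional


class DelimPair(NamedTuple):
    """Hypothetical stand-in for IOFormat."""
    in_delim: str
    out_delim: str
    strict: bool = True


def lookup_delim(fmt: str,
                 default: Optional[str] = None,
                 allow_excel: bool = True,
                 strict: bool = True) -> Optional[DelimPair]:
    # Accept both "csv" and ".csv" by dropping any leading dot.
    key = fmt.lower().lstrip(".")
    if key == "csv":
        return DelimPair(",", ",", strict)
    if key in ("tsv", "txt"):  # assumption: txt is handled like tsv
        # Non-strict reading uses a whitespace regex; writing always uses a tab.
        return DelimPair("\t" if strict else r"\s+", "\t", strict)
    if key == "xlsx" and allow_excel:
        return DelimPair("xlsx", "xlsx", strict)
    if default is not None:
        return DelimPair(default, default, strict)
    return None


print(lookup_delim(".csv"))
print(lookup_delim("tsv", strict=False))
print(lookup_delim(".cool"))
```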
- carabiner.pd.utils.get_formats(allow_excel: bool = True) → Tuple[str]
List the supported table formats.
- Parameters:
allow_excel (bool, optional) – Whether to include XLSX formats. Default: True.
- Returns:
The supported table formats.
- Return type:
tuple
Examples
>>> get_formats()
('.txt', '.tsv', '.csv', '.xlsx', 'txt', 'tsv', 'csv', 'xlsx')
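One plausible use of this tuple is to check a path's extension before attempting to read it. The sketch below copies the documented output into a constant rather than calling the library, and is_supported is a hypothetical helper, not part of carabiner.

```python
from pathlib import Path

# Copied from the documented output of get_formats().
SUPPORTED = ('.txt', '.tsv', '.csv', '.xlsx', 'txt', 'tsv', 'csv', 'xlsx')


def is_supported(path: str, formats=SUPPORTED) -> bool:
    # Path.suffix returns the final extension including its dot, e.g. ".csv".
    return Path(path).suffix.lower() in formats


print(is_supported("data.csv"))
print(is_supported("notes.md"))
```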
- carabiner.pd.utils.read_csv(filename: str | TextIO, rows: int | Sequence[int] | None = None, progress: bool = True, *args, **kwargs) → DataFrame
Read a delimited file, optionally GZIPped, optionally only specific rows.
Provides a progress bar by default. Additional arguments are passed to pd.read_csv.
- Parameters:
filename (str or file-like) – Path of the file to read, or an open text stream. Optionally GZIP compressed.
rows (int or sequence of int, optional) – Rows to read. If None (default), read all rows.
progress (bool) – Whether to display a progress bar. Default: True.
- Returns:
Pandas DataFrame of the input file.
- Return type:
pd.DataFrame
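To illustrate the idea of reading only selected rows from an optionally GZIP-compressed delimited file, here is a standard-library sketch. It is not how the library does it: read_selected_rows is a hypothetical helper, it returns a list of dicts rather than a DataFrame, it accepts only a sequence of row numbers, and counting data rows from 0 after the header is an assumption.

```python
import csv
import gzip
import os
import tempfile


def read_selected_rows(path, rows=None, delimiter=","):
    # Choose a gzip-aware opener based on the file name.
    opener = gzip.open if str(path).endswith(".gz") else open
    wanted = None if rows is None else set(rows)
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)
        # Keep every record when no rows are requested, else only the wanted ones.
        return [
            dict(zip(header, record))
            for index, record in enumerate(reader)
            if wanted is None or index in wanted
        ]


# Demo: write a small gzipped CSV, then read back rows 0 and 2.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(path, "wt", newline="") as handle:
        handle.write("id,name\n1,a\n2,b\n3,c\n")
    selected = read_selected_rows(path, rows=[0, 2])
    everything = read_selected_rows(path)

print(selected)
```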
- carabiner.pd.utils.read_table(file: str | IO, format: str | None = None, progress: bool = False, sheet_name: str | int | list | None = None, chunksize: int | None = None, *args, **kwargs) → DataFrame | Iterator[DataFrame]
Universal reader of tabular data files.
Additional arguments are passed to read_csv or pd.read_excel. If chunksize is provided, returns an iterator of chunks, which is lazy for non-Excel files but eager for Excel files. Otherwise, returns a DataFrame.
- Parameters:
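file (str or file-like) – Path of the file to read, or an open file object, in CSV, TSV or Excel format. Optionally GZIP compressed.
format (str, optional) – Format name. Default: infer from the filename.
progress (bool, optional) – Whether to display a progress bar. Default: False.
sheet_name (str, int or list, optional) – If reading an XLSX file, which sheets to read. Default: read all sheets.
chunksize (int, optional) – If provided, return an iterator of chunks instead of a single DataFrame. Default: None.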
- Returns:
Pandas DataFrame of the input file, or an iterator of DataFrames if chunksize is provided.
- Return type:
pd.DataFrame or iterator of pd.DataFrame
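The distinction between a lazy iterator of chunks and a single in-memory table can be sketched with a generator. This is a standard-library illustration under stated assumptions, not the library's code: iter_chunks is a hypothetical helper, and each chunk is a list of dicts rather than a DataFrame.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunksize, delimiter=","):
    reader = csv.DictReader(handle, delimiter=delimiter)
    while True:
        # islice pulls at most `chunksize` records; nothing beyond that is read yet.
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


stream = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunks = iter_chunks(stream, chunksize=2)
first = next(chunks)   # reads the header and the first two records only
rest = list(chunks)    # consumes whatever remains

print(first)
print(rest)
```

An eager reader, by contrast, would parse the whole input before handing back the first chunk, which is the behaviour described above for Excel files.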
- carabiner.pd.utils.resolve_delim(file: str | IO, format: str | None = None, default: str | None = None, allow_excel: bool = True) → IOFormat | None
Identify the delimiter of a file.
Uses the file extension, unless an explicit format is provided.
- Parameters:
file (str or file-like) – File whose delimiter should be identified.
format (str, optional) – Override the file extension to return a format.
default (str, optional) – Provide this default if the extension cannot be identified, otherwise return None.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.
- Returns:
IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.
- Return type:
IOFormat or None
Examples
>>> resolve_delim("test.tsv", format="csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> resolve_delim("test.tsv", format="tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> resolve_delim("test.cool") is None
True
>>> resolve_delim("test.cool", default="\t")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
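The precedence described here, where an explicit format wins over the file extension, might be sketched as follows. pick_format is a hypothetical helper, and using a file-like object's name attribute to find its extension is an assumption; this reference does not say how the library handles file-like inputs.

```python
import io
import os


def pick_format(file, format=None):
    # An explicit format always takes precedence over the file name.
    if format is not None:
        return format.lower().lstrip(".")
    # Assumption: file-like objects expose their path via a `name` attribute.
    name = file if isinstance(file, str) else getattr(file, "name", "")
    return os.path.splitext(str(name))[1].lower().lstrip(".")


print(pick_format("test.tsv", format="csv"))
print(pick_format("test.tsv"))
print(repr(pick_format(io.StringIO("a,b"))))
```

An empty result corresponds to the case where the caller would fall back to the default delimiter, or to None.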
- carabiner.pd.utils.sniff(file: str | IO, default: str | None = None, allow_excel: bool = True, strict: bool = True) → IOFormat | None
Identify the delimiter of a file from its extension.
- Parameters:
file (str or file-like) – Input path to file or a file-like object.
default (str, optional) – Default delimiter to return if delimiter not supported.
allow_excel (bool, optional) – Whether to return ‘xlsx’ for Excel files. If False, returns default or None. Default: True.
strict (bool, optional) – If True, TSV input is split on tab characters only. If False, any run of whitespace is accepted as the delimiter when reading. Default: True.
- Returns:
IOFormat holding the delimiters for TSV or CSV, or “xlsx” for Excel. Returns None if the format is not supported and no default is given.
- Return type:
IOFormat or None
Examples
>>> sniff("test.csv")
IOFormat(in_delim=',', out_delim=',', strict=True)
>>> sniff("test.tsv")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
>>> sniff("test.tsv", strict=False)
IOFormat(in_delim='\\s+', out_delim='\t', strict=False)
>>> sniff("test.xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> sniff("test.cool", default=".")
IOFormat(in_delim='.', out_delim='.', strict=True)
>>> sniff("test.cool") is None
True
>>> sniff("test.xlsx")
IOFormat(in_delim='xlsx', out_delim='xlsx', strict=True)
>>> sniff("test.xlsx", allow_excel=False) is None
True
>>> sniff("test.tsv.gz")
IOFormat(in_delim='\t', out_delim='\t', strict=True)
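The last example shows that a trailing .gz does not hide the underlying table format. One way to obtain that behaviour is to peel compression suffixes off before inspecting the extension. In this sketch table_extension and COMPRESSION_SUFFIXES are hypothetical names, and recognising only ".gz" is an assumption drawn from the GZIP support mentioned in this module.

```python
from pathlib import Path

# Assumption: only gzip is treated as a compression suffix.
COMPRESSION_SUFFIXES = {".gz"}


def table_extension(path: str) -> str:
    # Path.suffixes lists every extension, e.g. ['.tsv', '.gz'].
    suffixes = [s.lower() for s in Path(path).suffixes]
    # Drop trailing compression suffixes to expose the table format.
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    return suffixes[-1] if suffixes else ""


print(table_extension("test.tsv.gz"))
print(table_extension("test.csv"))
```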
- carabiner.pd.utils.write_stream(df: DataFrame, output: TextIO | str = sys.stdout, format: str | None = None, *args, **kwargs) → None
Write a Pandas DataFrame to a file or stdout.
Similar to pd.DataFrame.to_csv(), but excludes the index by default, writes to stdout by default, and tolerates truncated output without raising broken-pipe errors.
Additional arguments are passed to pd.DataFrame.to_csv.
- Parameters:
df (pd.DataFrame) – Input Pandas DataFrame to write out.
output (str or file-like, optional) – Path to the output file, or an open text stream. Default: stdout.
format (str, optional) – Format name. Default: infer from the filename.
- Return type:
None
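Tolerating truncated output usually means catching BrokenPipeError when the consumer of stdout (for example, a pipeline stage that exits early) closes the pipe. The sketch below shows that pattern with the standard-library csv module. It is an illustration under stated assumptions rather than the library's implementation: write_rows is a hypothetical helper that takes plain rows instead of a DataFrame.

```python
import csv
import io
import os
import sys


def write_rows(rows, header, output=None, delimiter="\t"):
    stream = sys.stdout if output is None else output
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    try:
        writer.writerow(header)
        writer.writerows(rows)
        stream.flush()
    except BrokenPipeError:
        # The reader closed the pipe early; stop quietly instead of raising.
        if stream is sys.stdout:
            # Point stdout at devnull so the interpreter's exit-time flush
            # does not raise a second BrokenPipeError.
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, sys.stdout.fileno())


buffer = io.StringIO()
write_rows([[1, "a"], [2, "b"]], ["id", "name"], output=buffer)
print(buffer.getvalue(), end="")
```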