.. _dev_guide:

Developer Guide
===============

Repository Layout
-----------------

.. code-block:: text

   ragged/
   ├── src/ragged/
   │   ├── __init__.py                        # public API / __all__
   │   ├── _spec_array_object.py              # array class + helpers
   │   ├── _spec_creation_functions.py        # zeros, ones, arange, …
   │   ├── _spec_elementwise_functions.py     # sqrt, add, sin, …
   │   ├── _spec_manipulation_functions.py    # reshape, roll, stack, …
   │   ├── _spec_linear_algebra_functions.py  # matmul, tensordot, vecdot
   │   ├── _spec_statistical_functions.py     # sum, mean, std, …
   │   ├── _spec_searching_functions.py       # argmax, nonzero, where, …
   │   ├── _spec_sorting_functions.py         # sort, argsort
   │   ├── _spec_set_functions.py             # unique_*
   │   ├── _spec_indexing_functions.py        # take
   │   ├── _spec_data_type_functions.py       # astype, can_cast, …
   │   ├── _spec_constants.py                 # e, pi, inf, nan, newaxis
   │   ├── _typing.py                         # type aliases (Shape, Dtype, …)
   │   ├── _import.py                         # lazy cupy import helper
   │   ├── _helper_functions.py               # shared internal utilities
   │   └── io/
   │       ├── __init__.py
   │       └── cf.py                          # CF Conventions I/O
   ├── docs/
   │   ├── conf.py                            # Sphinx config (MyST + autodoc)
   │   ├── index.md
   │   ├── user_guide.rst
   │   └── dev_guide.rst
   ├── tests/
   │   ├── conftest.py
   │   ├── test_spec_*.py                     # spec-driven test suites
   │   └── test_*.py                          # feature-specific suites
   ├── pyproject.toml
   └── noxfile.py

Each ``_spec_*.py`` module corresponds to one section of the
`Array API specification <https://data-apis.org/array-api/latest/API_specification/>`_.
Module names intentionally mirror the spec URL slugs so that grep-based
cross-referencing is easy.

The ``array`` Class
--------------------

The ``array`` class (lower-case, matching the Array API convention) lives in
``_spec_array_object.py``.

Instance attributes
~~~~~~~~~~~~~~~~~~~

Every ``array`` instance carries exactly four private attributes:

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Attribute
     - Type
     - Description
   * - ``_impl``
     - ``ak.Array | numpy.ndarray | cupy.ndarray``
     - The underlying data buffer. Almost always ``ak.Array``; scalar (0-D)
       arrays may hold a raw ``numpy.ndarray``.
   * - ``_shape``
     - ``tuple[int | None, ...]``
     - Cached shape. Computed once by ``_shape_dtype`` and kept in sync
       manually after any mutation.
   * - ``_dtype``
     - ``numpy.dtype``
     - Cached dtype. Derived from the leaf ``NumpyArray`` inside the layout.
   * - ``_device``
     - ``"cpu" | "cuda"``
     - String identifier for the compute backend.

These are **not** part of the public API. Read them only inside ``_spec_*.py``
modules; external code should use the ``.shape``, ``.dtype``, and ``.device``
properties.

``_new`` class method
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   @classmethod
   def _new(cls, impl, shape, dtype, device) -> array:
       ...

A fast constructor that bypasses ``__init__`` validation. It is used in hot
paths such as ``__iter__``, where shape and dtype are already known. Do not
call it from user-facing code.

Layout Types and ``_shape_dtype``
----------------------------------

Internally, ``_impl`` is an ``ak.Array`` whose **layout** is one of a small set
of Awkward Array content types:

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Layout class
     - Meaning for ``ragged.array``
   * - ``NumpyArray``
     - 1-D (or packed N-D) contiguous numeric data. ``shape`` has no ``None``
       entries. ``ak.to_numpy`` always succeeds.
   * - ``RegularArray``
     - Fixed inner dimension. Produced by
       ``ak.from_numpy(..., regulararray=True)`` on an N-D array (the default
       ``regulararray=False`` yields an N-D ``NumpyArray`` instead). ``shape``
       has no ``None`` entries.
   * - ``ListOffsetArray``
     - Variable-length rows (the common case for user-constructed arrays). May
       be truly ragged (different row lengths → ``shape[i] is None``) *or*
       incidentally uniform (all rows the same length but still a
       ``ListOffsetArray``). ``ak.to_numpy`` succeeds if and only if all rows
       are the same length.

``_shape_dtype(layout)`` walks the layout tree once to extract ``shape`` and
``dtype``:

.. code-block:: python

   # simplified pseudocode
   from awkward.contents import ListArray, ListOffsetArray, RegularArray

   def _shape_dtype(layout):
       shape = (len(layout),)
       node = layout
       while isinstance(node, (ListOffsetArray, ListArray, RegularArray)):
           # RegularArray has a fixed size; list types are variable-length (None)
           shape += (node.size if isinstance(node, RegularArray) else None,)
           node = node.content
       # node is now NumpyArray; its trailing dimensions are always regular
       return shape + node.data.shape[1:], node.data.dtype

**Key rule**: call ``_shape_dtype`` only when the layout actually changes.
After ``ak.values_astype`` (a dtype cast only), the shape is unchanged, so
update ``_dtype`` directly without re-traversing the layout.

The Box / Unbox Pattern
------------------------

Every function that consumes or produces ``ragged.array`` objects uses two
module-level helpers to move between the public type and its ``ak.Array``
implementation:

``_unbox(*inputs)``
   Extract ``._impl`` from each input ``array``. Raises ``TypeError`` on mixed
   array subclasses or device mismatches.

   .. code-block:: python

      (impl,) = _unbox(x)
      left_impl, right_impl = _unbox(a, b)

``_box(cls, output, *, dtype=None, device=None)``
   Wrap an ``ak.Array`` result back into a ``ragged.array`` (or subclass).
   Calls ``_shape_dtype`` to populate ``_shape`` and ``_dtype``.

   .. code-block:: python

      return _box(type(x), some_ak_array)

Always pass ``type(x)`` (not ``array``) as the first argument to ``_box`` so
that subclasses round-trip correctly.

Writing a New Function
-----------------------

The following checklist applies to any new Array API function or extension.

1. **Choose the right module**: pick the ``_spec_*.py`` file whose name matches
   the spec section the function belongs to.

2. **Signature**: match the Array API signature exactly (keyword-only
   arguments, ``/`` positional-only markers):

   .. code-block:: python

      def my_func(x: array, /, *, axis: int | None = None) -> array:
          ...

3. **Unbox inputs**:

   .. code-block:: python

      (impl,) = _unbox(x)

4. **Fast path for uniform arrays**: try ``ak.to_numpy`` and apply the NumPy
   equivalent inside ``contextlib.suppress(TypeError, ValueError)``:

   .. code-block:: python

      with contextlib.suppress(TypeError, ValueError):
          np_arr = ak.to_numpy(impl)
          return _box(type(x), ak.from_numpy(np.my_func(np_arr)))

   ``ak.to_numpy`` succeeds for ``NumpyArray``, ``RegularArray``, and
   ``ListOffsetArray`` with uniform row lengths. It raises ``TypeError`` or
   ``ValueError`` for genuinely ragged arrays.

5. **Ragged / general path**: implement using Awkward Array primitives
   (``ak.flatten``, ``ak.unflatten``, ``ak.num``, etc.) where possible. Use a
   ``tolist()`` / list-based fallback only as a last resort for complex shapes
   that have no efficient Awkward equivalent.

6. **Box the result**:

   .. code-block:: python

      return _box(type(x), result_ak)

7. **Export**: add the function name to the ``__all__`` list in
   ``__init__.py`` and to the relevant import block.

8. **Docstring**: include the Array API URL:

   .. code-block:: python

      """
      Short description.

      https://data-apis.org/array-api/latest/API_specification/generated/array_api.my_func.html
      """

9. **Tests**: add a ``tests/test_<feature>.py`` file (see
   :ref:`testing_conventions`).

.. _testing_conventions:

Testing Conventions
--------------------

Structure
~~~~~~~~~

Tests are grouped by feature in ``tests/test_<feature>.py``. Within each file,
group related cases into classes:

.. code-block:: python

   class TestMyFunc1D:
       def test_basic(self): ...
       def test_dtype_preserved(self): ...

   class TestMyFunc2DRagged:
       def test_integer_index(self): ...

Helper
~~~~~~

Every test file should define a local factory to avoid repeating
``ragged.array(...)``:

.. code-block:: python

   def _make(data, dtype=None) -> ragged.array:
       return ragged.array(data, dtype=dtype)

Coverage checklist
~~~~~~~~~~~~~~~~~~

For each new function, cover:

- 1-D uniform input
- 2-D uniform input (created from ``np.ndarray``)
- 2-D ragged input (created from Python lists)
- dtype preservation (``np.float32`` should stay ``np.float32``)
- result type (``isinstance(result, ragged.array)``)
- error cases (wrong shape, wrong dtype, unsupported key type, …)
- copy / isolation (mutations via ``__setitem__`` or ``.at`` do not affect the
  original)

Running tests
~~~~~~~~~~~~~

.. code-block:: bash

   pip install -e ".[test]"
   pytest tests/

With coverage::

   pytest tests/ --cov=ragged --cov-report=term-missing

The full test matrix (multiple Python / NumPy versions) is run via ``nox``::

   nox

Performance Patterns
---------------------

The following patterns are used consistently throughout the codebase. New code
should follow them.

Single try/except for fast-path detection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Determine whether an array is uniform by probing ``ak.to_numpy`` once, in a
single ``try/except``. Do not wrap value unwrapping in a separate
``try/except``; instead, branch on the result of the single probe:

.. code-block:: python

   try:
       arr_np = ak.to_numpy(self._impl)
   except (TypeError, ValueError):
       arr_np = None

   if arr_np is not None:
       # fast path: unwrap value as numpy
       val = ak.to_numpy(value._impl) if isinstance(value, array) else value
       ...
   else:
       # slow path: unwrap value as list
       val = value._impl.tolist() if isinstance(value, array) else value
       ...

Do not use ``isinstance(layout, NumpyArray | RegularArray)`` as the sole
fast-path gate. It misses ``ListOffsetArray`` arrays with incidentally uniform
rows, which are common when the user constructs an array from Python lists.

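
The difference between the two gates can be checked in isolation. The following
is a minimal sketch, assuming Awkward Array 2.x and NumPy are installed;
``probe_uniform`` is a hypothetical helper name used only for illustration and
is not part of the ``ragged`` codebase:

.. code-block:: python

   from __future__ import annotations

   import awkward as ak
   import numpy as np
   from awkward.contents import ListOffsetArray, NumpyArray, RegularArray


   def probe_uniform(impl: ak.Array) -> np.ndarray | None:
       """Hypothetical helper: NumPy array if all rows are uniform, else None."""
       try:
           return ak.to_numpy(impl)
       except (TypeError, ValueError):
           return None


   # Built from Python lists, so the layout is a ListOffsetArray even though
   # every row happens to have the same length.
   uniform = ak.Array([[1, 2, 3], [4, 5, 6]])
   ragged_rows = ak.Array([[1, 2, 3], [4, 5]])

   # A layout-only gate rejects the uniform array.
   layout_gate = isinstance(uniform.layout, (NumpyArray, RegularArray))

   # The single probe accepts the uniform array and rejects the ragged one.
   uniform_np = probe_uniform(uniform)
   ragged_np = probe_uniform(ragged_rows)

Here ``layout_gate`` is ``False`` while ``uniform_np`` is a regular 2-D NumPy
array, which is the case the ``isinstance`` check would send down the slow
path.
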
Avoid full ``tolist()`` in ragged paths
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Prefer iterating over ``ak.Array`` sub-blocks and calling ``ak.to_numpy`` per
sub-block over calling ``ak.to_list()`` on the whole array. ``ak.to_list``
allocates a Python object for every scalar; sub-block ``ak.to_numpy`` stays in
C for uniform chunks.

.. code-block:: python

   # Preferred
   def _process(a: ak.Array, b: ak.Array) -> Any:
       try:
           return np_func(ak.to_numpy(a), ak.to_numpy(b))
       except (TypeError, ValueError):
           pass
       # Ragged at this level: recurse into each pair of sub-blocks
       return [
           _process(
               ai if isinstance(ai, ak.Array) else ak.Array(ai),
               bi if isinstance(bi, ak.Array) else ak.Array(bi),
           )
           for ai, bi in zip(a, b, strict=False)
       ]

O(D) layout walks for nested structures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When restoring a nested structure from a flat array (e.g. after
``ak.flatten(axis=None)``), collect counts at each nesting level with a
**single top-down walk** rather than calling ``ak.num(impl, axis=depth)`` from
the root for each depth:

.. code-block:: python

   # O(D): peel one level at a time
   level_counts: list[np.ndarray] = []
   cur = impl
   for _ in range(ndim - 1):
       level_counts.append(ak.to_numpy(ak.num(cur, axis=1)))
       cur = ak.flatten(cur, axis=1)

   # Rebuild from the innermost level outwards
   result = flat_rolled
   for counts in reversed(level_counts):
       result = ak.unflatten(result, counts)

Shape is invariant under ``ak.values_astype``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After a dtype cast via ``ak.values_astype``, shape does not change. Update
``_dtype`` directly instead of re-running ``_shape_dtype``:

.. code-block:: python

   self._impl = ak.values_astype(self._impl, new_dtype)
   self._dtype = new_dtype  # shape unchanged, so no _shape_dtype call is needed

Zero-copy dummies for broadcast helpers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a dummy array is needed only to drive ``ak.broadcast_arrays`` (its values
are discarded), use a zero-copy broadcast view instead of allocating a full
array:

.. code-block:: python

   dummy = ak.from_numpy(np.broadcast_to(np.zeros((), dtype=np.int8), target_shape))

``_apply_inplace`` and in-place operators
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``_apply_inplace`` copies ``_impl``, ``_shape``, ``_dtype``, and ``_device``
directly from the already-computed result; it does not call ``_shape_dtype``
again. This is safe because all in-place operators (``__iadd__``, etc.) are
elementwise and therefore shape-preserving.

Awkward Array Gotchas
----------------------

``ak.from_numpy`` on N-D arrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``ak.from_numpy`` on a 2-D (or higher) NumPy array produces a ``NumpyArray``
layout, **not** a ``ListOffsetArray``. The resulting ``ragged.array`` will have
a concrete integer for the inner dimension (e.g. ``shape == (3, 4)``).

However, the helper ``_ak_from_numpy`` (defined in
``_spec_manipulation_functions.py``) calls ``ak.from_regular(..., axis=None)``
afterwards to convert every regular dimension to variable-length. Use it when
the ragged convention (``shape[-1] is None``) is required:

.. code-block:: python

   from ._spec_manipulation_functions import _ak_from_numpy

   impl = _ak_from_numpy(np_result)

``ak.flatten(axis=1)`` is O(1) for ``ListOffsetArray``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Peeling one nesting level with ``ak.flatten(impl, axis=1)`` returns the content
buffer of the outer ``ListOffsetArray``; it does not copy data. This makes
iterative level-peeling (as in the ``roll`` ``axis=None`` path) effectively
O(1) per level.

``ak.to_numpy`` on uniform ``ListOffsetArray``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``ak.to_numpy`` succeeds on a ``ListOffsetArray`` whose rows all have the same
length; it is not restricted to ``NumpyArray`` or ``RegularArray`` layouts.
This is why the fast-path probe uses ``try/except`` rather than a layout
``isinstance`` check.
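
The ``ak.from_numpy`` and ``ak.to_numpy`` gotchas in this section can be
observed together in a short round trip. This is a sketch under the assumption
that Awkward Array 2.x is installed. It calls the public ``ak.from_regular``
directly instead of the internal ``_ak_from_numpy`` helper, whose exact
behaviour is not reproduced here:

.. code-block:: python

   import awkward as ak
   import numpy as np
   from awkward.contents import ListOffsetArray, NumpyArray

   np_result = np.arange(12).reshape(3, 4)

   # With the default regulararray=False, the N-D buffer stays in a single
   # NumpyArray node, so the inner dimension is a concrete integer.
   regular = ak.from_numpy(np_result)
   regular_is_numpy = isinstance(regular.layout, NumpyArray)

   # Convert every regular dimension to variable-length lists.
   variable = ak.from_regular(regular, axis=None)
   variable_is_list = isinstance(variable.layout, ListOffsetArray)

   # The rows are still uniform, so ak.to_numpy accepts the ListOffsetArray.
   round_trip = ak.to_numpy(variable)

The round trip returns the original values even though the intermediate layout
is a ``ListOffsetArray``, which is consistent with probing ``ak.to_numpy``
rather than inspecting the layout class.
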