ROOT / ROOT-10076

New TTree.AsNumpy, supporting both primitive and array-of-primitive branches

    Details

    • Type: New Feature
    • Status: Open
    • Priority: High
    • Resolution: Unresolved
    • Affects Version/s: 6.18/00
    • Fix Version/s: None
    • Component/s: I/O, PyROOT
    • Labels:
      None
    • Environment:

      Python with Numpy

      Description

      In keeping with a convention set by the `RDataFrame.AsNumpy` Pythonization, `TTree.AsNumpy` would return a dict of arrays, rather than the contiguous 2-dimensional array returned by `TTree.AsMatrix`.

      For branches of primitive types, this only means that the arrays need not be contiguous in memory or converted to a common type (requesting an integer branch and a floating point branch doesn't force the integers to floating point type).
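
      The difference between the two return shapes can be sketched with plain Numpy. The branch names and values below are made up for illustration, and the snippet does not call ROOT at all; it only mimics what a contiguous matrix and a dict of arrays would look like.

```python
import numpy as np

# Hypothetical branch data: one integer branch and one floating point branch.
event_number = np.array([1, 2, 3], dtype=np.int32)
energy = np.array([10.5, 20.25, 30.0], dtype=np.float64)

# AsMatrix-style result: one contiguous 2-D array, so every column must
# share a single dtype and the integers are promoted to floating point.
as_matrix = np.stack([event_number, energy], axis=1)

# AsNumpy-style result: a dict of independent arrays, each keeping its dtype.
as_dict = {"event_number": event_number, "energy": energy}

print(as_matrix.dtype)                # common type for all columns
print(as_dict["event_number"].dtype)  # integer type is preserved
```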

      However, it also enables branches of arrays of primitive types, including `std::vector`s of primitive types. Such arrays must be returned as more complex objects, jagged arrays, which consist of a content array and an offsets array of different lengths.
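
      As a minimal sketch of the content-and-offsets representation, using Numpy only: the data and the helper `entry` are hypothetical, not part of any ROOT or library API.

```python
import numpy as np

# Three entries with 2, 0, and 3 elements: [[1.1, 2.2], [], [3.3, 4.4, 5.5]]
content = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
offsets = np.array([0, 2, 2, 5])  # one more element than the number of entries


def entry(i):
    """Return entry i as a view into content (no copy is made)."""
    return content[offsets[i]:offsets[i + 1]]


# Number of elements in each entry, recovered from the offsets alone.
counts = np.diff(offsets)
```

      Here `content` has 5 elements and `offsets` has 4, which is why the two cannot be packed into a single rectangular Numpy array.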

      The interpretation of basket data as content and offsets requires checking the `fLast` byte position, subtracting `TKey` bytes from the offsets, and dividing by content (primitive type) width (which is always a power of 2, so "divide" should be "bit shift").
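
      The offset arithmetic described above might look like the following sketch. The function name, its parameters, and the sample numbers are assumptions made for illustration; it also assumes that entries carry no per-entry header bytes, so that byte positions map directly onto elements.

```python
import numpy as np


def element_offsets(byte_offsets, f_last, key_length, itemsize):
    """Convert per-entry byte positions into element offsets.

    byte_offsets: start position of each entry (assumed to include the key bytes)
    f_last:       byte position just past the last entry
    key_length:   number of TKey bytes to subtract from every position
    itemsize:     width of the primitive type in bytes (must be a power of 2)
    """
    itemsize = int(itemsize)
    if itemsize <= 0 or itemsize & (itemsize - 1):
        raise ValueError("itemsize must be a power of 2")
    shift = itemsize.bit_length() - 1  # log2(itemsize)

    # Close the last entry with f_last, then move the origin past the key.
    edges = np.append(np.asarray(byte_offsets, dtype=np.int64), f_last) - key_length

    # Divide by the item width with a bit shift.
    return edges >> shift


# Three entries of 4-byte values behind a hypothetical 70-byte key.
offsets = element_offsets([70, 78, 78], f_last=90, key_length=70, itemsize=4)
```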

      Arrays of primitives within C++ classes defined by a streamer should remove 1 byte from the beginning of each entry. `std::vector`s of primitives should remove 10 bytes from the beginning of each entry (a 6-byte header plus a 4-byte size, the latter being redundant because the entry offsets are already known). This non-uniform array compaction should be performed in C++. The sizes of input buffers and output buffers are known in advance (unless strings are also considered in scope, for which a strict upper bound on the size of the output buffer is known in advance).
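
      The compaction logic can be sketched in Python to show the bookkeeping, even though the real implementation is meant to be in C++. The function `strip_entry_headers` and the sample bytes are hypothetical; the header bytes in the example are placeholders, not a real ROOT header.

```python
import numpy as np


def strip_entry_headers(raw, byte_offsets, header_size):
    """Drop header_size bytes from the start of every entry.

    raw:          uint8 array holding the entries back to back
    byte_offsets: n + 1 entry boundaries into raw
    header_size:  bytes to remove per entry (e.g. 1 or 10 as described above)
    Returns the compacted bytes and the new byte offsets.
    """
    raw = np.asarray(raw, dtype=np.uint8)
    byte_offsets = np.asarray(byte_offsets, dtype=np.int64)

    starts = byte_offsets[:-1] + header_size
    stops = byte_offsets[1:]
    lengths = stops - starts
    if (lengths < 0).any():
        raise ValueError("an entry is shorter than the header")

    # The output size is known before any bytes are copied.
    new_offsets = np.zeros(len(byte_offsets), dtype=np.int64)
    new_offsets[1:] = np.cumsum(lengths)
    out = np.empty(int(new_offsets[-1]), dtype=np.uint8)

    for i in range(len(lengths)):
        out[new_offsets[i]:new_offsets[i + 1]] = raw[starts[i]:stops[i]]
    return out, new_offsets


# Placeholder 1-byte headers (255) in front of entries [1, 2], [], [3, 4, 5].
raw = np.array([255, 1, 2, 255, 255, 3, 4, 5], dtype=np.uint8)
compacted, compacted_offsets = strip_entry_headers(raw, [0, 3, 4, 8], header_size=1)
```

      The new byte offsets would then be divided by the primitive width to obtain element offsets.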

      The most open-ended question about ROOT producing jagged arrays is, "What library should be used to represent jagged arrays in Python?" Numpy has no concept of jagged arrays. Several libraries provide jagged (a.k.a. "ragged") arrays in different contexts:

      • zarr (https://zarr.readthedocs.io/en/stable/tutorial.html#ragged-arrays) has a ragged array object, but its focus is storage and we already have a storage format.
      • jagged (https://github.com/sdvillal/jagged) is focused purely on storing jagged arrays in various formats, so the same objection applies.
      • XND (https://xnd.io/) is a next-generation Numpy (by Numpy's author, Travis Oliphant) which focuses, among other things, on ragged arrays. It includes a math library similar to Numpy's universal functions ("ufuncs"), but none of the structure manipulations necessary for particle physics (as of my last interaction with the group in July 2018; there has been no visible progress online since).
      • Arrow (https://arrow.apache.org/docs/python/) is a widely-used memory format that includes jagged arrays, and it has well-supported C++ and Python bindings. The development team has large overlaps with Parquet-C++ and Pandas (led by Pandas's author, Wes McKinney). However, Arrow's focus is just representing data in memory; data manipulation is left to Pandas, which doesn't currently handle jagged arrays well (it converts them all into Python lists).
      • Awkward-array (https://github.com/scikit-hep/awkward-array) is used by uproot to solve the same problem: representing jagged arrays from ROOT files and providing structure-manipulation tools for physicists. Numpy's ufuncs are supported, as are methods for combinatorics. Choosing Awkward would allow `TTree.AsNumpy` to be a drop-in replacement for uproot's `TTree.arrays`.

      Only the last two, Arrow and Awkward, seem to be good choices. Arrow has compile-time dependencies, but they might already be included in ROOT for the sake of RDataFrame's Arrow RDataSource. Producing an Arrow buffer does not require pyarrow as a strict dependency, since it's just a binary data specification, though a user would need to have pyarrow to make use of it in Python. Awkward has no dependencies other than Numpy >= 1.13.1.
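
      To illustrate why pyarrow is not strictly needed on the producing side, the sketch below builds the two raw buffers of an Arrow-style variable-size list (32-bit offsets plus a values buffer) using Numpy alone. The function `arrow_list_buffers` is hypothetical, and the sketch leaves out everything else a complete Arrow array needs: the optional validity bitmap, buffer padding and alignment, and the schema metadata. It assumes little-endian output, with the input converted from ROOT's big-endian serialization.

```python
import numpy as np


def arrow_list_buffers(content, offsets):
    """Return raw bytes for the offsets and values buffers of a list array."""
    content = np.asarray(content)
    # 32-bit little-endian offsets, one more than the number of entries.
    offsets32 = np.asarray(offsets).astype("<i4")
    # Same values, stored little-endian regardless of the input byte order.
    values = content.astype(content.dtype.newbyteorder("<"))
    return {"offsets": offsets32.tobytes(), "values": values.tobytes()}


# Big-endian doubles, as they would come out of a ROOT basket.
content = np.array([1.0, 2.0, 3.0], dtype=">f8")
buffers = arrow_list_buffers(content, [0, 2, 3])
```

      A consumer holding pyarrow or Awkward could then wrap these buffers; how that wrapping is done is outside the scope of this sketch.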

      Presenting jagged data as Arrow buffers would not preclude users from manipulating them with Awkward: Awkward can view Arrow buffers without copying them. As an alternative to this proposal, returning all arrays, jagged or not, as Arrow buffers would be a more uniform option and would avoid misusing the name `AsNumpy` to describe the result: it can be `AsArrow`. Then the output need not deal with Python dicts, as an Arrow Struct of List types can present the same data in a more unified way. (This gives up on reusing the convention set by RDataFrame, but perhaps RDataFrame should include an `AsArrow` method, rather than both having `AsNumpy`.)

      A choice of output format and name (`AsNumpy` vs `AsArrow`) should be made before work begins on implementation.

            People

            • Assignee: pcanal Philippe Canal
            • Reporter: pivarski Jim Pivarski
            • Votes: 0
            • Watchers: 2

                Time Tracking

                • Original Estimate: 2 weeks
                • Remaining Estimate: 2 weeks
                • Time Spent: Not Specified