{ "cells": [ { "cell_type": "markdown", "id": "257f4ff7-0a37-45e7-a701-bab07d2b757f", "metadata": {}, "source": [ "# Speed Up Analysis Code with Parquet Cache\n", "Looping through the XML-like LHE text file format and reconstructing the objects in memory is a slow process. If the in-memory analysis tool you use for studying the LHE files is the awkward library, one can avoid this by caching the awkward-form of the LHE data in a data file format that is much faster to read than the raw LHE file.\n", "\n", "The code below is a small function that will store a parquet cache file alongside any LHE file you wish to read, so any subsequent reads can go through the faster parquet. The parquet cache file will be re-created if anything modifies the original LHE file." ] }, { "cell_type": "code", "execution_count": 1, "id": "e2ae414e-d09a-4792-a60f-b8c4d1a8644e", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import awkward as ak\n", "\n", "import pylhe\n", "\n", "\n", "def _parquet_cache(lhe_fp):\n", " \"\"\"Determine the parquet cache file name by replacing the LHE extension.\"\"\"\n", " return os.path.splitext(os.path.splitext(lhe_fp)[0])[0] + \".parquet\"\n", "\n", "\n", "def _from_pylhe(lhe_fp):\n", " \"\"\"Read an LHE file into an awkward array in memory.\"\"\"\n", " return pylhe.to_awkward(pylhe.read_lhe(lhe_fp))\n", "\n", "\n", "def convert_to_parquet(lhe_fp):\n", " \"\"\"Convert the input LHE file into a parquet file of the same name and location\n", " but with the extension updated.\n", "\n", " Converting the LHE file to a parquet file is beneficial because the resulting\n", " parquet file is about the same size as the gzipped LHE file but it offers about\n", " 2 orders of magnitude speed up when reading the data back into an awkward array\n", " in memory.\n", "\n", " Parameters\n", " ----------\n", " lhe_fp : str\n", " path to LHE file to convert\n", " \"\"\"\n", "\n", " ak.to_parquet(_from_pylhe(lhe_fp), _parquet_cache(lhe_fp))\n", "\n", "\n", "def from_lhe(filepath, *, parquet_cache=True):\n", " \"\"\"Load an awkward array of the events in the passed LHE file\n", "\n", " Parameters\n", " ----------\n", " filepath : str\n", " Path to LHE file to load\n", " parquet_cache : bool, optional\n", " If true, use a parquet file alongside the LHE file to cache the parsing.\n", " This caching makes sure to update the cache if the LHE file timestamp is\n", " newer than the parquet cache timestamp. If false, never use a cache.\n", " \"\"\"\n", "\n", " # need the file to exist\n", " if not os.path.exists(filepath):\n", " msg = f\"Input LHE file {filepath} does not exist.\"\n", " raise FileNotFoundError(msg)\n", "\n", " # leave early without even thinking about cache if user doesn't want it\n", " if not parquet_cache:\n", " return _from_pylhe(filepath)\n", "\n", " # if cache doesn't exist or its last modification time is earlier than\n", " # the last modification time of the original LHE file, we need to create\n", " # the cache file\n", " cache_fp = _parquet_cache(filepath)\n", " if not os.path.exists(cache_fp) or os.path.getmtime(cache_fp) < os.path.getmtime(\n", " filepath\n", " ):\n", " convert_to_parquet(filepath)\n", "\n", " # load the data from the cache\n", " return ak.from_parquet(cache_fp)" ] }, { "cell_type": "markdown", "id": "63c527ef-4bb9-4982-badc-2145ff81d031", "metadata": {}, "source": [ "Just as an example, we can use the scikit-hep test data to show how much faster the parquet reading is." ] }, { "cell_type": "code", "execution_count": 2, "id": "705a9b59-3044-456c-b9b9-3a0e1f5bf711", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 4.44 s, sys: 178 ms, total: 4.62 s\n", "Wall time: 4.62 s\n", "CPU times: user 4.43 s, sys: 133 ms, total: 4.56 s\n", "Wall time: 4.6 s\n", "CPU times: user 11.8 ms, sys: 3.98 ms, total: 15.8 ms\n", "Wall time: 15.3 ms\n" ] } ], "source": [ "from skhep_testdata import data_path\n", "\n", "lhe_file = data_path(\"pylhe-drell-yan-ll-lhe.gz\")\n", "\n", "%time events = _from_pylhe(lhe_file)\n", "# first run needs to generate the cache\n", "# so it will be about as slow as normal LHE reading\n", "%time events = from_lhe(lhe_file)\n", "# later runs will be faster\n", "%time events = from_lhe(lhe_file)" ] }, { "cell_type": "code", "execution_count": null, "id": "f7efdbf7-f40c-4b29-8b00-f455b4a25684", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }