{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "257f4ff7-0a37-45e7-a701-bab07d2b757f",
   "metadata": {},
   "source": [
    "# Speed Up Analysis Code with Parquet Cache\n",
    "Looping through the XML-like LHE text file format and reconstructing the objects in memory is a slow process. If the in-memory analysis tool you use for studying the LHE files is the awkward library, one can avoid this by caching the awkward-form of the LHE data in a data file format that is much faster to read than the raw LHE file.\n",
    "\n",
    "The code below is a small function that will store a parquet cache file alongside any LHE file you wish to read, so any subsequent reads can go through the faster parquet. The parquet cache file will be re-created if anything modifies the original LHE file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e2ae414e-d09a-4792-a60f-b8c4d1a8644e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import awkward as ak\n",
    "\n",
    "import pylhe\n",
    "\n",
    "\n",
    "def _parquet_cache(lhe_fp):\n",
    "    \"\"\"Determine the parquet cache file name by replacing the LHE extension.\"\"\"\n",
    "    return os.path.splitext(os.path.splitext(lhe_fp)[0])[0] + \".parquet\"\n",
    "\n",
    "\n",
    "def _from_pylhe(lhe_fp):\n",
    "    \"\"\"Read an LHE file into an awkward array in memory.\"\"\"\n",
    "    return pylhe.to_awkward(pylhe.read_lhe(lhe_fp))\n",
    "\n",
    "\n",
    "def convert_to_parquet(lhe_fp):\n",
    "    \"\"\"Convert the input LHE file into a parquet file of the same name and location\n",
    "    but with the extension updated.\n",
    "\n",
    "    Converting the LHE file to a parquet file is beneficial because the resulting\n",
    "    parquet file is about the same size as the gzipped LHE file but it offers about\n",
    "    2 orders of magnitude speed up when reading the data back into an awkward array\n",
    "    in memory.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    lhe_fp : str\n",
    "        path to LHE file to convert\n",
    "    \"\"\"\n",
    "\n",
    "    ak.to_parquet(_from_pylhe(lhe_fp), _parquet_cache(lhe_fp))\n",
    "\n",
    "\n",
    "def from_lhe(filepath, *, parquet_cache=True):\n",
    "    \"\"\"Load an awkward array of the events in the passed LHE file\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    filepath : str\n",
    "        Path to LHE file to load\n",
    "    parquet_cache : bool, optional\n",
    "        If true, use a parquet file alongside the LHE file to cache the parsing.\n",
    "        This caching makes sure to update the cache if the LHE file timestamp is\n",
    "        newer than the parquet cache timestamp. If false, never use a cache.\n",
    "    \"\"\"\n",
    "\n",
    "    # need the file to exist\n",
    "    if not os.path.exists(filepath):\n",
    "        msg = f\"Input LHE file {filepath} does not exist.\"\n",
    "        raise FileNotFoundError(msg)\n",
    "\n",
    "    # leave early without even thinking about cache if user doesn't want it\n",
    "    if not parquet_cache:\n",
    "        return _from_pylhe(filepath)\n",
    "\n",
    "    # if cache doesn't exist or its last modification time is earlier than\n",
    "    # the last modification time of the original LHE file, we need to create\n",
    "    # the cache file\n",
    "    cache_fp = _parquet_cache(filepath)\n",
    "    if not os.path.exists(cache_fp) or os.path.getmtime(cache_fp) < os.path.getmtime(\n",
    "        filepath\n",
    "    ):\n",
    "        convert_to_parquet(filepath)\n",
    "\n",
    "    # load the data from the cache\n",
    "    return ak.from_parquet(cache_fp)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "63c527ef-4bb9-4982-badc-2145ff81d031",
   "metadata": {},
   "source": [
    "Just as an example, we can use the scikit-hep test data to show how much faster the parquet reading is."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "705a9b59-3044-456c-b9b9-3a0e1f5bf711",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 4.44 s, sys: 178 ms, total: 4.62 s\n",
      "Wall time: 4.62 s\n",
      "CPU times: user 4.43 s, sys: 133 ms, total: 4.56 s\n",
      "Wall time: 4.6 s\n",
      "CPU times: user 11.8 ms, sys: 3.98 ms, total: 15.8 ms\n",
      "Wall time: 15.3 ms\n"
     ]
    }
   ],
   "source": [
    "from skhep_testdata import data_path\n",
    "\n",
    "lhe_file = data_path(\"pylhe-drell-yan-ll-lhe.gz\")\n",
    "\n",
    "%time events = _from_pylhe(lhe_file)\n",
    "# first run needs to generate the cache\n",
    "# so it will be about as slow as normal LHE reading\n",
    "%time events = from_lhe(lhe_file)\n",
    "# later runs will be faster\n",
    "%time events = from_lhe(lhe_file)"
   ]
   }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}