13 Jun 2011

Python, OneNote & You, Pt. 2 - Parsing OneNote XML to JSON

In the last post, we looked at how to access the COM API from Python and retrieve OneNote notebooks as XML data. Now, we're going to use an XML parsing library to convert this OneNote data into native Python data structures, which we can then serialize to JSON.

All the code from this tutorial is available on this GitHub repo: a9538dfb5b99767896d7 


Notebooks

The first thing we need to do is parse the XML. We're going to use the ElementTree module because it's fast and lightweight, though you could opt for minidom or SAX if you prefer. Let's start off by importing the libraries we need and initializing the COM interface. Note that I also declare a namespace variable, NS - just add this for now, it will come in handy later when comparing XML tags.
import win32com.client
from xml.etree import ElementTree

onapp = win32com.client.gencache.EnsureDispatch('OneNote.Application')
NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"
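
To see why the namespace variable matters, here is a small sketch that runs without OneNote installed. The XML string is a hand-written sample (not real GetHierarchy output), and the names SAMPLE, root and first are just for illustration. ElementTree expands every prefixed tag into the form {namespace-uri}LocalName, which is what we compare against later.

```python
from xml.etree import ElementTree

NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"

# Hand-written sample for illustration; real output carries many more attributes.
SAMPLE = (
    '<one:Notebooks xmlns:one="http://schemas.microsoft.com/office/onenote/2010/onenote">'
    '<one:Notebook name="Work"/>'
    '</one:Notebooks>'
)

root = ElementTree.fromstring(SAMPLE)
first = root[0]

# The "one:" prefix is gone; the tag now starts with the namespace URI in braces.
print(first.tag)

# So a tag check is a plain string comparison against NS + local name.
is_notebook = first.tag == NS + "Notebook"
print(is_notebook)
```

Comparing against a bare "Notebook" would never match, which is the usual surprise the first time you parse namespaced XML with ElementTree.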

To store all this data and manipulate it, we're going to use two of Python's basic data structures - lists & dictionaries. Lists are great for storing notebooks, sections & pages, which are homogeneous collections that need to be iterable. Dictionaries, on the other hand, are better suited to handling the attributes, since they are essentially hash tables that let us look up specific keys quickly.

First off, we need a simple method that will parse each ElementTree node into a Python dictionary:
def parseAttributes(obj):
    # Copy the node's XML attributes into a plain dict (all values are strings).
    tempDict = {}
    for key, value in obj.items():
        tempDict[key] = value
    return tempDict
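
Here is a quick way to try the attribute-to-dictionary idea on its own. This is a minimal sketch: parse_attributes_sketch is a hypothetical stand-in for the function above, and the Page element is a made-up sample rather than real OneNote output.

```python
from xml.etree import ElementTree

def parse_attributes_sketch(node):
    # Element.items() returns (name, value) pairs, so dict() can consume it directly.
    return dict(node.items())

# Made-up sample element for illustration.
page = ElementTree.fromstring('<Page name="Meeting notes" pageLevel="1"/>')
attrs = parse_attributes_sketch(page)
print(attrs["name"])

# Every attribute value is a string, so convert before doing arithmetic.
level = int(attrs["pageLevel"])
print(level + 1)
```

The thing to remember is that nothing is typed at this stage: numbers and booleans all come back as strings until you convert them yourself.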

 

Now, we call ElementTree to parse the XML and use a for loop to process each notebook, storing its attributes as key/value pairs in that notebook's dictionary.

def getHierarchy():
    oneTree = ElementTree.fromstring(onapp.GetHierarchy("",win32com.client.constants.hsPages))
    notebooks = []
    for notebook in oneTree:
        nbk = parseAttributes(notebook)          
        notebooks.append(nbk)
    return notebooks 

Sections & Section Groups

This is great, but the OneNote Hierarchy also has sections, section groups and pages - and to make things more complicated Section Groups can contain Sections or other Section Groups, so we will need some recursive helpers to parse the whole tree.

def getSections(notebook):
    sections = []
    sectionGroups = []
    for section in notebook:
        if section.tag == NS + "SectionGroup":
            newSectionGroup = parseAttributes(section)
            # len(element) counts child nodes; getchildren() was removed in Python 3.9.
            if len(section):
                s, sg = getSections(section)
                if sg:
                    newSectionGroup['sectionGroups'] = sg
                if s:
                    newSectionGroup['sections'] = s
            sectionGroups.append(newSectionGroup)

        elif section.tag == NS + "Section":
            newSection = parseAttributes(section)
            if len(section):
                newSection['pages'] = getPages(section)
            sections.append(newSection)

    return sections, sectionGroups
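
If you want to watch the recursion work without a OneNote install, here is a condensed sketch that runs against a hand-written XML sample. The name split_children and the sample notebook are hypothetical, and unlike the function above this version always sets both keys on a section group, even when they are empty.

```python
from xml.etree import ElementTree

NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"

# Hand-written sample: one top-level section plus a group nested two levels deep.
SAMPLE = """<one:Notebook xmlns:one="http://schemas.microsoft.com/office/onenote/2010/onenote" name="Work">
  <one:Section name="Inbox"/>
  <one:SectionGroup name="Projects">
    <one:Section name="Alpha"/>
    <one:SectionGroup name="Archive">
      <one:Section name="2010"/>
    </one:SectionGroup>
  </one:SectionGroup>
</one:Notebook>"""

def split_children(node):
    """Return (sections, section_groups) for one node, recursing into groups."""
    sections, groups = [], []
    for child in node:
        entry = dict(child.items())
        if child.tag == NS + "Section":
            sections.append(entry)
        elif child.tag == NS + "SectionGroup":
            # A group can hold sections and further groups, so recurse.
            entry["sections"], entry["sectionGroups"] = split_children(child)
            groups.append(entry)
    return sections, groups

sections, groups = split_children(ElementTree.fromstring(SAMPLE))
print([s["name"] for s in sections])
print(groups[0]["sectionGroups"][0]["sections"][0]["name"])
```

Each call only looks at its direct children and hands nested groups back to itself, so the depth of the notebook never matters.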

 

 

You'll also want to go back and edit the getHierarchy method so that it calls getSections and adds the sections and section groups as children of the notebook. We also have a special case here we might want to handle - the Recycle Bin. The recycle bin is a normal section group, with the isRecycleBin attribute set to true. It's highly likely that you want to exclude the RecycleBin from normal Section operations, so we store it as a separate recycleBin property on the notebook, instead of with the other children. 

 

def getHierarchy():
    oneTree = ElementTree.fromstring(onapp.GetHierarchy("", win32com.client.constants.hsPages))
    notebooks = []
    for notebook in oneTree:
        nbk = parseAttributes(notebook)

        # len(element) counts child nodes; getchildren() was removed in Python 3.9.
        if len(notebook):
            s, sg = getSections(notebook)
            if s:
                nbk['sections'] = s

            # Move the recycle bin out of the regular section groups.
            # Build a new list rather than popping from sg while looping over it.
            groups = []
            for group in sg:
                if 'isRecycleBin' in group:
                    nbk['recycleBin'] = group
                else:
                    groups.append(group)
            if groups:
                nbk['sectionGroups'] = groups

        notebooks.append(nbk)
    return notebooks
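
The recycle bin separation can also be pulled out into a small standalone helper, which makes it easy to test on plain dictionaries. This is only a sketch: split_recycle_bin is a hypothetical name, the sample group names are made up, and it assumes the attribute value arrives as the string "true".

```python
def split_recycle_bin(section_groups):
    """Return (recycle_bin_or_None, remaining_groups) without mutating the input."""
    recycle_bin = None
    remaining = []
    for group in section_groups:
        # Assumption: a recycle bin group carries isRecycleBin="true" (a string).
        if group.get("isRecycleBin") == "true":
            recycle_bin = group
        else:
            remaining.append(group)
    return recycle_bin, remaining

# Made-up sample data in the same shape getSections produces.
sample_groups = [
    {"name": "Projects"},
    {"name": "Deleted Stuff", "isRecycleBin": "true"},
    {"name": "Archive"},
]

bin_group, others = split_recycle_bin(sample_groups)
print(bin_group["name"])
print([g["name"] for g in others])
```

Building a second list keeps the loop simple, since the list being iterated never changes length underneath it.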

Pages & Metadata

Now, we just need to add some helpers to parse pages & metadata in a similar manner. Like Section Groups & Sections, pages also have multiple levels of hierarchy called sub-pages. Unlike section groups & sections, sub-pages are not stored as children of pages - they have a pageLevel attribute which defines how 'deep' in the hierarchy they are. This makes things a lot simpler, since we don't need to recurse as much.

def getPages(section):
    pages = []
    for page in section:
        newPage = parseAttributes(page)
        # len(element) counts child nodes; getchildren() was removed in Python 3.9.
        if len(page):
            newPage['meta'] = getMeta(page)
        pages.append(newPage)
    return pages


def getMeta(page):
    metas = []
    for meta in page:
        metas.append(parseAttributes(meta))
    return metas
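
Since sub-pages come back as a flat list, you may eventually want to rebuild the nesting from pageLevel yourself. Here is one possible sketch. It assumes pages arrive in display order and that pageLevel starts at "1" for top-level pages; nest_pages and the subPages key are hypothetical names, not part of the OneNote API.

```python
def nest_pages(pages):
    """Turn a flat, ordered page list into a tree using each page's pageLevel."""
    roots = []
    stack = []  # ancestors of the current page, shallowest first
    for page in pages:
        level = int(page.get("pageLevel", "1"))
        node = dict(page, subPages=[])
        # Discard entries at the same depth or deeper; what remains ends with the parent.
        del stack[level - 1:]
        if stack:
            stack[-1]["subPages"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots

# Made-up sample in the shape getPages produces.
flat = [
    {"name": "A", "pageLevel": "1"},
    {"name": "B", "pageLevel": "2"},
    {"name": "C", "pageLevel": "3"},
    {"name": "D", "pageLevel": "2"},
    {"name": "E", "pageLevel": "1"},
]

tree = nest_pages(flat)
print([p["name"] for p in tree])
print([p["name"] for p in tree[0]["subPages"]])
```

The original dictionaries are copied rather than modified, so the flat list stays usable for anything that does not care about nesting.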

 

And we're almost done! Now you have the entire XML hierarchy as a set of Python lists & dictionaries which are much easier to manipulate. If you want to give this a spin, create a new .py file in the same folder as your onepy.py file & add the following code:

import onepy
notebooks = onepy.getHierarchy()
# Empty notebooks and sections have no 'sections' / 'pages' key, so use .get().
for section in notebooks[0].get("sections", []):
    for page in section.get("pages", []):
        print(page["name"])

This snippet will print the name of every page in every top-level section of the first notebook. Sections nested inside section groups are not included.
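
To reach pages that live inside section groups as well, you can walk the structure recursively. The following is a minimal sketch on made-up sample data; iter_pages is a hypothetical helper, and it relies only on the sections, sectionGroups and pages keys built earlier in this post.

```python
def iter_pages(container):
    """Yield every page dict under a notebook or section group, depth first."""
    for section in container.get("sections", []):
        for page in section.get("pages", []):
            yield page
    for group in container.get("sectionGroups", []):
        # Section groups have the same shape as notebooks, so recurse.
        yield from iter_pages(group)

# Made-up notebook in the shape getHierarchy produces.
notebook = {
    "name": "Work",
    "sections": [{"name": "Inbox", "pages": [{"name": "Todo"}]}],
    "sectionGroups": [
        {
            "name": "Projects",
            "sections": [{"name": "Alpha", "pages": [{"name": "Kickoff"}, {"name": "Notes"}]}],
            "sectionGroups": [{"name": "Empty group"}],
        }
    ],
}

names = [page["name"] for page in iter_pages(notebook)]
print(names)
```

Because missing keys fall back to an empty list, empty sections and empty groups are skipped without any special handling.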

Converting Lists to JSON

And finally, let's convert this Python data structure into some JSON. This is the easy part - the json library in Python will serialize lists, dictionaries and strings straight into JSON, so it's just a matter of importing the library and calling the appropriate function.

import json

def getHierarchyJson():
    return json.dumps(getHierarchy(), indent=4)
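
You can check the serialization step without OneNote by feeding json.dumps a hand-built hierarchy. This sketch uses made-up data; the ensure_ascii=False argument is my own addition so that non-ASCII page titles stay readable in the output instead of being escaped.

```python
import json

# Made-up hierarchy in the shape getHierarchy produces.
hierarchy = [
    {
        "name": "Work",
        "sections": [
            {"name": "Inbox", "pages": [{"name": "Café menu", "pageLevel": "1"}]}
        ],
    }
]

# indent=4 pretty-prints; ensure_ascii=False keeps "é" as-is rather than "\u00e9".
text = json.dumps(hierarchy, indent=4, ensure_ascii=False)

# Loading the text back gives an equal structure, so nothing is lost on the way.
restored = json.loads(text)
print(restored == hierarchy)
```

Since the hierarchy contains only lists, dictionaries and strings, the round trip is lossless.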

 

 

Now, your code should look like this:

I'm working on creating a OneNote Object Model for Python (much like the C# version), so that you can just import the module and work with your OneNote data directly as Python objects. That way, you can avoid worrying about parsing XML and interfacing with the COM API, and focus on building powerful add-ins. I'll post another tutorial with instructions when it's done.