Python, OneNote & You, Pt. 2 - Parsing OneNote XML to JSON
In the last post, we looked at how to access the COM API from Python and retrieve OneNote notebooks as XML data. Now, we're going to use an XML parsing library to convert this OneNote data into native Python data structures, which we can then serialize to JSON.
All the code from this tutorial is available on this GitHub repo: a9538dfb5b99767896d7
Notebooks
import win32com.client
from xml.etree import ElementTree

onapp = win32com.client.gencache.EnsureDispatch('OneNote.Application')
NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"

def parseAttributes(obj):
    # Copy the element's XML attributes into a plain dictionary
    tempDict = {}
    for key, value in obj.items():
        tempDict[key] = value
    return tempDict
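If you want to see what parseAttributes returns without OneNote running, here is a minimal sketch that feeds it a hand-written XML fragment. The sample XML, its attribute values and the ID are made up for illustration; real OneNote output carries more attributes.

```python
from xml.etree import ElementTree

NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"

# Hand-written sample, NOT real OneNote output (the ID is hypothetical).
SAMPLE_XML = """<one:Notebooks xmlns:one="http://schemas.microsoft.com/office/onenote/2010/onenote">
  <one:Notebook name="Work" ID="{hypothetical-id-1}" />
</one:Notebooks>"""

def parseAttributes(obj):
    # Same behaviour as the tutorial function: copy attributes into a dict.
    return dict(obj.items())

root = ElementTree.fromstring(SAMPLE_XML)
notebook = root[0]                      # first child of <one:Notebooks>
attrs = parseAttributes(notebook)

print(notebook.tag == NS + "Notebook")  # tags carry the namespace prefix
print(attrs)
```

Note that the attribute names come through without a namespace prefix, while the element tag includes it. That is why NS is only needed when comparing tags.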
Now, we call ElementTree to parse the XML and use a for loop to process the attributes of each notebook, which are stored as dictionary items within the notebook.
def getHierarchy():
    oneTree = ElementTree.fromstring(onapp.GetHierarchy("", win32com.client.constants.hsPages))
    notebooks = []
    for notebook in oneTree:
        nbk = parseAttributes(notebook)
        notebooks.append(nbk)
    return notebooks

Sections & Section Groups
def getSections(notebook):
    sections = []
    sectionGroups = []
    for section in notebook:
        if section.tag == NS + "SectionGroup":
            newSectionGroup = parseAttributes(section)
            # len(element) counts child elements; getchildren() was removed in Python 3.9
            if len(section):
                # Section groups can nest, so recurse into them
                s, sg = getSections(section)
                if sg != []:
                    newSectionGroup['sectionGroups'] = sg
                if s != []:
                    newSectionGroup['sections'] = s
            sectionGroups.append(newSectionGroup)
        if section.tag == NS + "Section":
            newSection = parseAttributes(section)
            if len(section):
                newSection['pages'] = getPages(section)
            sections.append(newSection)
    return sections, sectionGroups
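The tag checks above rely on ElementTree reporting namespaced tags in "{uri}LocalName" form. As a rough sketch of that behaviour, the snippet below walks a hand-written fragment recursively, the same way getSections does, and prints an indented outline. The sample XML and the helper names local_name and outline are hypothetical and not part of the tutorial code.

```python
from xml.etree import ElementTree

NS = "{http://schemas.microsoft.com/office/onenote/2010/onenote}"

# Hand-written sample, NOT real OneNote output.
SAMPLE_XML = """<one:Notebook xmlns:one="http://schemas.microsoft.com/office/onenote/2010/onenote" name="Work">
  <one:Section name="Inbox" />
  <one:SectionGroup name="Projects">
    <one:Section name="Alpha" />
  </one:SectionGroup>
</one:Notebook>"""

def local_name(element):
    # Strip the "{namespace}" prefix that ElementTree puts on the tag.
    if element.tag.startswith(NS):
        return element.tag[len(NS):]
    return element.tag

def outline(element, depth=0):
    # One indented line per element, recursing into child elements.
    lines = ["  " * depth + local_name(element) + ": " + element.get("name", "")]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

notebook = ElementTree.fromstring(SAMPLE_XML)
tree_lines = outline(notebook)
print("\n".join(tree_lines))
```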
You'll also want to go back and edit the getHierarchy function so that it calls getSections and adds the sections and section groups as children of the notebook. We also have a special case here we might want to handle: the Recycle Bin. The Recycle Bin is a normal section group with the isRecycleBin attribute set to true. You will most likely want to exclude it from normal section operations, so we store it as a separate recycleBin property on the notebook instead of with the other section groups.
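Before the full function, the Recycle Bin split can be sketched on its own with plain dictionaries. The helper name splitRecycleBin and the sample data are hypothetical. Like the tutorial code, the sketch only checks that the isRecycleBin key is present.

```python
def splitRecycleBin(sectionGroups):
    """Hypothetical helper: return (recycleBin, otherGroups)."""
    recycleBin = None
    otherGroups = []
    for group in sectionGroups:
        if 'isRecycleBin' in group:
            recycleBin = group         # set aside, not a regular group
        else:
            otherGroups.append(group)  # keep regular section groups in order
    return recycleBin, otherGroups

# Hand-written sample data, shaped like the output of getSections.
sampleGroups = [
    {'name': 'Projects'},
    {'name': 'Recycle Bin (sample)', 'isRecycleBin': 'true'},
    {'name': 'Archive'},
]

recycleBin, otherGroups = splitRecycleBin(sampleGroups)
print(recycleBin['name'])
print([g['name'] for g in otherGroups])
```

Building a second list avoids removing items from a list while looping over it, which is easy to get wrong.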
def getHierarchy():
    oneTree = ElementTree.fromstring(onapp.GetHierarchy("", win32com.client.constants.hsPages))
    notebooks = []
    for notebook in oneTree:
        nbk = parseAttributes(notebook)
        # len(element) counts child elements; getchildren() was removed in Python 3.9
        if len(notebook):
            s, sg = getSections(notebook)
            if s != []:
                nbk['sections'] = s
            # Keep the Recycle Bin apart from the regular section groups.
            # Build a new list rather than popping from sg while looping over it.
            groups = []
            for group in sg:
                if 'isRecycleBin' in group:
                    nbk['recycleBin'] = group
                else:
                    groups.append(group)
            if groups != []:
                nbk['sectionGroups'] = groups
        notebooks.append(nbk)
    return notebooks

Pages & Metadata
def getPages(section):
    pages = []
    for page in section:
        newPage = parseAttributes(page)
        # len(element) counts child elements; getchildren() was removed in Python 3.9
        if len(page):
            newPage['meta'] = getMeta(page)
        pages.append(newPage)
    return pages

def getMeta(page):
    metas = []
    for meta in page:
        metas.append(parseAttributes(meta))
    return metas
And we're almost done! Now you have the entire XML hierarchy as a set of Python lists & dictionaries which are much easier to manipulate. If you want to give this a spin, create a new .py file in the same folder as your onepy.py file & add the following code:
import onepy

notebooks = onepy.getHierarchy()
for section in notebooks[0]["sections"]:
    for page in section["pages"]:
        print(page["name"])

This snippet will print the name of every page in every top-level section of the first notebook.
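The loop above only visits sections that sit directly under the notebook, and it raises a KeyError for a section with no pages. As a possible extension, here is a sketch of a recursive walk that also descends into section groups. The helper name iterPageNames and the sample dictionary are hypothetical. The sample is shaped like the output of getHierarchy so the sketch runs without OneNote.

```python
def iterPageNames(container):
    """Hypothetical helper: yield page names from a notebook or section group dict."""
    # .get() with a default covers containers that have no sections or pages.
    for section in container.get("sections", []):
        for page in section.get("pages", []):
            yield page["name"]
    # Section groups can hold sections and further groups, so recurse.
    for group in container.get("sectionGroups", []):
        yield from iterPageNames(group)

# Hand-written sample shaped like one notebook from getHierarchy().
sample_notebook = {
    "name": "Work",
    "sections": [
        {"name": "Inbox", "pages": [{"name": "To do"}]},
        {"name": "Empty section"},
    ],
    "sectionGroups": [
        {
            "name": "Projects",
            "sections": [
                {"name": "Alpha", "pages": [{"name": "Kickoff"}, {"name": "Notes"}]},
            ],
        },
    ],
}

page_names = list(iterPageNames(sample_notebook))
print(page_names)
```

With real data you would pass notebooks[0] from onepy.getHierarchy() instead of the sample dictionary.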
Converting Lists to JSON
And finally, let's convert this Pythonic data structure into some JSON. This is the easy part: the json library in Python will serialize native Python data structures such as lists, dictionaries and strings into JSON, so it's just a matter of importing the library and calling the appropriate function.
import json

def getHierarchyJson():
    return json.dumps(getHierarchy(), indent=4)
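To get a feel for the output without a OneNote install, the sketch below runs json.dumps on a small hand-written structure shaped like the result of getHierarchy, then reads it back with json.loads. The sample data is made up for illustration.

```python
import json

# Hand-written sample shaped like the list returned by getHierarchy().
sample_hierarchy = [
    {
        "name": "Work",
        "sections": [
            {"name": "Inbox", "pages": [{"name": "To do"}]},
        ],
    },
]

# indent=4 pretty-prints the JSON, as in getHierarchyJson.
text = json.dumps(sample_hierarchy, indent=4)
print(text)

# Lists, dicts and strings survive the round trip unchanged.
restored = json.loads(text)
print(restored == sample_hierarchy)
```

Because parseAttributes stores every attribute value as a string, the whole hierarchy consists of lists, dictionaries and strings, all of which json.dumps handles directly.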
Now, your code should look like this
I'm working on creating a OneNote Object Model for Python (much like the C# version), so that you can just import the module and work with your OneNote data directly as Python objects. That way, you can avoid worrying about serializing data to XML and interfacing with the COM API, and focus on building powerful add-ins. I'll post another tutorial with instructions when it's done.
