How to extract URLs with Apache OpenOffice, from formatted text and HTML tables
Posted by Jim DeLaHunt on 31 Mar 2014 at 11:46 pm | Tagged as: robobait, software engineering
I use and value a good spreadsheet application the way chefs use and value good knives. I have countless occasions to do ad-hoc data processing and conversion, and I tend to turn to spreadsheets even more often I turn to a good text editor. I know a lot of ways to get the job done with spreadsheets. But recently I learned a new trick. I’m delighted to share it with you here.
The situation: you have an HTML document, with a list of linked text. Imagine a list of projects, each with a link to a project URL (the names aren’t meaningful):
The task is to convert this list of formatted links into a table, with the project name in column A, and the URL in column B. The trick is to use an OpenOffice macro, which exposes the URL (and other facets of formatted text) as OpenOffice functions.
(I described my recent real-life instance of this situation in the blog Open Data Day 2014, and a dataset dataset for Vancouver. The HTML document was the City of Vancouver’s Open Data Catalogue. It was an HTML table, not a list, which I wanted to convert to a spreadsheet — but that difference wasn’t all that material.)
An obvious way to start converting such a list is to select the list, Copy, and Paste into an empty Apache OpenOffice spreadsheet. OK so far. Unfortunately, each list item becomes one cell in the spreadsheet. The text in the cell is formatted, and the link is buried in the text formatting. It looks like this:
A | B | C | |
1 | Example “Aphids” Project | ||
2 | Example “Beeswax” Project | ||
3 | Example “Clover” Project |
The tool we want is a spreadsheet function =CELL_URL( A1 ) , in column C, which extracts the URL from the cell in column A and exposes it for further manipulation. Column B is the place for the link text, unformatted. The formula =A1 accomplishes that. If we had a CELL_URL() function, our spreadsheet would look like this:
A | B | C | |
1 | Example “Aphids” Project | Example “Aphids” Project | http://example.com/pr/aphids/ |
2 | Example “Beeswax” Project | Example “Beeswax” Project | http://example.com/pr/beeswax/ |
3 | Example “Clover” Project | Example “Clover” Project | http://example.com/pr/clover/ |
Unfortunately, the standard OpenOffice spreadsheet doesn’t include a function like CELL_URL().
I’m indebted to OpenOffice.org forum user Villeroy for their post [Calc, Basic] Introspective cell functions (Apache OpenOffice forums, Code Snippets forum, Mon Feb 18, 2008 9:53 pm). They, bless their hearts, posted a set of macros which implement pretty much the CELL_URL() function we want. Instructions for how to add a macro to an OpenOffice spreadsheet are out of scope here, and available from the OpenOffice project.
The CELL_URL() function from Villeroy’s code has the arguments: CELL_URL( sheetnum, rownum, columnnum, paranum ). The parameters are absolute coordinates for a particular paragraph of text in a particular cell within the entire spreadsheet. sheetnum is an integer, with 1 for the first sheet in a document, 2 for the second, and so on. rownum is 1 for row 1 of the current worksheet, 2 for row 2, etc. columnnum is 1 for column A, 2 for column B, etc. paranum is an index into the paragraphs (strings separated by newlines) in a cell. We don’t have them in this example, but I did have them in the HTML table of the Vancouver Open Data catalogue.
Absolute coordinates are awkward. In spreadsheets we are used to relative cell references. In C1, a cell reference “A1” in a formula means “the cell in the same row but two columns to the left”. Paste the same formula into cell E4, and the cell reference changes to C4. Fortunately, there are OpenOffice functions SHEET(), ROW(), and COLUMN(). Each takes a cell reference, and returns the absolute coordinate of that cell in the sheet, row, and column dimension. So, I found it convenient to always use these functions with relative or absolute cell references when I called CELL_URL(). The result is an expression which has the relative cell reference behaviour we want. My usage looked like: =CELL_URL(SHEET(A1);ROW(A1);COLUMN(A1);1)
Filling a formula like this into column C of our example spreadsheet gives us cells with string values containing the URLs we seek. That’s the trick. From that point, processing the URLs and link text into some useful form is straightforward. For me, as detailed in the other blog, my product was wikitext for a table.
Here is the macro code posted by OpenOffice.org forum user Villeroy in their post [Calc, Basic] Introspective cell functions (Apache OpenOffice forums, Code Snippets forum, Mon Feb 18, 2008 9:53 pm). I’m keeping a copy here as insurance against the original forum disappearing. Unfortunately, Villeroy didn’t post a formal licence with the code. It’s clear from context that they are comfortable sharing their code.
REM ***** BASIC *****
'from: https://forum.openoffice.org/en/forum/viewtopic.php?f=21&t=2762 'written by: Villeroy, https://forum.openoffice.org/en/forum/memberlist.php?mode=viewprofile&u=107 'Retrieved by Jim DeLaHunt, 2014-02-22, for Vancouver Open Data Day 'Usage: =CELL_URL(SHEET(A1);ROW(A1);COLUMN(A1);B1) 'Note usage of SHEET() to get current sheet number, 'ROW( ) and COL( ) to be able to use relative references. '4th parameter (optional) is 1-based index of text item in cell.
REM ***** BASIC ***** REM ################### RETURNING STRING ################################################# Function CELL_NOTE(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns annotation text Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_NOTE = v.Annotation.getText.getString else CELL_NOTE = v endif End Function Function CELL_URL(vSheet,lRowIndex&,iColIndex%,optional n%) 'calls: getSheetCell REM returns URL of Nth text-hyperlink from a cell, default N=1) Dim v If isMissing(n) then n= 1 If n < 1 then CELL_URL = Null exit function endif v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then if v.Textfields.Count >= n then CELL_URL = v.getTextfields.getByIndex(n -1).URL else Cell_URL = Null endif else CELL_URL = v endif End Function Function CELL_FORMULA(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM return unlocalized (English) formula Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_FORMULA = v.getFormula() else CELL_FORMULA = v endif End Function Function CELL_STYLE(vSheet,lRowIndex&,iColIndex%,optional bLocalized) 'calls: getSheetCell REM return name of cell-style, optionally localized Dim v,s$,bLocal as Boolean if not isMissing(bLocalized) then bLocal=cBool(bLocalized) v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then if bLocal then s = thisComponent.StyleFamilies("CellStyles").getByName(v.CellStyle).DisplayName else s = v.CellStyle endif CELL_STYLE = s else CELL_STYLE = v endif End Function Function CELL_LINE(vSheet,lRowIndex&,iColIndex%,optional n) 'calls: getSheetCell REM Split by line breaks, missing or zero line number returns whole string. REM =CELL_LINE(SHEET(),1,1,2) -> second line of A1 in this sheet Dim v,s$,a(),i% v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then s = v.getString if not isMissing(n) then i = cInt(n) if i > 0 then a() = Split(s,chr(10)) If (i <= uBound(a())+1)then CELL_LINE = a(i -1) else CELL_LINE = NULL endif else CELL_LINE = s endif else CELL_LINE = v endif end Function
REM ################### RETURNING NUMBER ################################################# Function CELL_ISHORIZONTALPAGEBREAK(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_ISHORIZONTALPAGEBREAK = Abs(cINT(v.Rows.getByIndex(0).IsStartOfNewPage)) else CELL_ISHORIZONTALPAGEBREAK = v endif End Function Function CELL_ISVERTICALPAGEBREAK(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_ISVERTICALPAGEBREAK = Abs(cINT(v.Columns.getByIndex(0).IsStartOfNewPage)) else CELL_ISVERTICALPAGEBREAK = v endif End Function Function CELL_CHARCOLOR(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns color code as number Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_CHARCOLOR = v.CharColor else CELL_CHARCOLOR = v endif End Function Function CELL_BACKCOLOR(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns color code as number Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_BACKCOLOR = v.CellBackColor else CELL_BACKCOLOR = v endif End Function Function CELL_VISIBLE(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns visibility state as number 0|1 Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_VISIBLE = Abs(v.Rows.isVisible) else CELL_VISIBLE = v endif End Function Function CELL_LOCKED(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns locked state as number 0|1 Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_LOCKED = Abs(v.CellProtection.isLocked) else CELL_LOCKED = v endif End Function Function CELL_NumberFormat(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM returns the number format index Dim v v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then CELL_NumberFormat = v.NumberFormat else CELL_NumberFormat = v endif End Function Function CELL_NumberFormatType(vSheet,lRowIndex&,iColIndex%) 'calls: getSheetCell REM return a numeric com.sun.star.util.NumberFormat which describes a format category Dim v,lNF& v = getSheetCell(vSheet,lRowIndex&,iColIndex%) if vartype(v) = 9 then lNF = v.NumberFormat CELL_NumberFormatType = ThisComponent.getNumberFormats.getByKey(lNF).Type else CELL_NumberFormatType = v endif End Function
'################### HELPERS FOR ABOVE CELL FUNCTIONS ######################################### Function getSheet(byVal vSheet) REM Helper for sheet functions. Get cell from sheet's name or position; cell's row-position; cell's col-position on error goto exitErr select case varType(vSheet) case is = 8 if thisComponent.sheets.hasbyName(vSheet) then getSheet = thisComponent.sheets.getByName(vSheet) else getSheet = NULL endif case 2 to 5 vSheet = cInt(vSheet) 'Wow! Calc has sheets with no name at index < 0, ' so NOT isNull(oSheet), if vSheet <= lbound(sheets) = CRASH! 'http://www.openoffice.org/issues/show_bug.cgi?id=58796 if(vSheet <= thisComponent.getSheets.getCount)AND(vSheet > 0) then getSheet = thisComponent.sheets.getByIndex(vSheet -1) else getSheet = NULL endif end select exit function exitErr: getSheet = NULL End Function
Function getSheetCell(byVal vSheet,byVal lRowIndex&,byVal iColIndex%) dim oSheet ' print vartype(vsheet) oSheet = getSheet(vSheet) if varType(oSheet) <>9 then getSheetCell = NULL elseif (lRowIndex > oSheet.rows.count)OR(lRowIndex < 1) then getSheetCell = NULL elseif (iColIndex > oSheet.columns.count)OR(iColIndex < 1) then getSheetCell = NULL else getSheetCell = oSheet.getCellByPosition(iColIndex -1,lRowIndex -1) endif End Function