How to extract URLs with Apache OpenOffice, from formatted text and HTML tables
Posted by Jim DeLaHunt on 31 Mar 2014 at 11:46 pm | Tagged as: robobait, software engineering
I use and value a good spreadsheet application the way chefs use and value good knives. I have countless occasions to do ad-hoc data processing and conversion, and I tend to turn to spreadsheets even more often I turn to a good text editor. I know a lot of ways to get the job done with spreadsheets. But recently I learned a new trick. I’m delighted to share it with you here.
The situation: you have an HTML document, with a list of linked text. Imagine a list of projects, each with a link to a project URL (the names aren’t meaningful):
The task is to convert this list of formatted links into a table, with the project name in column A, and the URL in column B. The trick is to use an OpenOffice macro, which exposes the URL (and other facets of formatted text) as OpenOffice functions.
(I described my recent real-life instance of this situation in the blog Open Data Day 2014, and a dataset dataset for Vancouver. The HTML document was the City of Vancouver’s Open Data Catalogue. It was an HTML table, not a list, which I wanted to convert to a spreadsheet — but that difference wasn’t all that material.)
An obvious way to start converting such a list is to select the list, Copy, and Paste into an empty Apache OpenOffice spreadsheet. OK so far. Unfortunately, each list item becomes one cell in the spreadsheet. The text in the cell is formatted, and the link is buried in the text formatting. It looks like this:
| A | B | C | |
| 1 | Example “Aphids” Project | ||
| 2 | Example “Beeswax” Project | ||
| 3 | Example “Clover” Project |
The tool we want is a spreadsheet function =CELL_URL( A1 ) , in column C, which extracts the URL from the cell in column A and exposes it for further manipulation. Column B is the place for the link text, unformatted. The formula =A1 accomplishes that. If we had a CELL_URL() function, our spreadsheet would look like this:
| A | B | C | |
| 1 | Example “Aphids” Project | Example “Aphids” Project | http://example.com/pr/aphids/ |
| 2 | Example “Beeswax” Project | Example “Beeswax” Project | http://example.com/pr/beeswax/ |
| 3 | Example “Clover” Project | Example “Clover” Project | http://example.com/pr/clover/ |
Unfortunately, the standard OpenOffice spreadsheet doesn’t include a function like CELL_URL().
I’m indebted to OpenOffice.org forum user Villeroy for their post [Calc, Basic] Introspective cell functions (Apache OpenOffice forums, Code Snippets forum, Mon Feb 18, 2008 9:53 pm). They, bless their hearts, posted a set of macros which implement pretty much the CELL_URL() function we want. Instructions for how to add a macro to an OpenOffice spreadsheet are out of scope here, and available from the OpenOffice project.
The CELL_URL() function from Villeroy’s code has the arguments: CELL_URL( sheetnum, rownum, columnnum, paranum ). The parameters are absolute coordinates for a particular paragraph of text in a particular cell within the entire spreadsheet. sheetnum is an integer, with 1 for the first sheet in a document, 2 for the second, and so on. rownum is 1 for row 1 of the current worksheet, 2 for row 2, etc. columnnum is 1 for column A, 2 for column B, etc. paranum is an index into the paragraphs (strings separated by newlines) in a cell. We don’t have them in this example, but I did have them in the HTML table of the Vancouver Open Data catalogue.
Absolute coordinates are awkward. In spreadsheets we are used to relative cell references. In C1, a cell reference “A1” in a formula means “the cell in the same row but two columns to the left”. Paste the same formula into cell E4, and the cell reference changes to C4. Fortunately, there are OpenOffice functions SHEET(), ROW(), and COLUMN(). Each takes a cell reference, and returns the absolute coordinate of that cell in the sheet, row, and column dimension. So, I found it convenient to always use these functions with relative or absolute cell references when I called CELL_URL(). The result is an expression which has the relative cell reference behaviour we want. My usage looked like: =CELL_URL(SHEET(A1);ROW(A1);COLUMN(A1);1)
Filling a formula like this into column C of our example spreadsheet gives us cells with string values containing the URLs we seek. That’s the trick. From that point, processing the URLs and link text into some useful form is straightforward. For me, as detailed in the other blog, my product was wikitext for a table.
Here is the macro code posted by OpenOffice.org forum user Villeroy in their post [Calc, Basic] Introspective cell functions (Apache OpenOffice forums, Code Snippets forum, Mon Feb 18, 2008 9:53 pm). I’m keeping a copy here as insurance against the original forum disappearing. Unfortunately, Villeroy didn’t post a formal licence with the code. It’s clear from context that they are comfortable sharing their code.
REM ***** BASIC *****
'from: https://forum.openoffice.org/en/forum/viewtopic.php?f=21&t=2762 'written by: Villeroy, https://forum.openoffice.org/en/forum/memberlist.php?mode=viewprofile&u=107 'Retrieved by Jim DeLaHunt, 2014-02-22, for Vancouver Open Data Day 'Usage: =CELL_URL(SHEET(A1);ROW(A1);COLUMN(A1);B1) 'Note usage of SHEET() to get current sheet number, 'ROW( ) and COL( ) to be able to use relative references. '4th parameter (optional) is 1-based index of text item in cell.
REM ***** BASIC *****
REM ################### RETURNING STRING #################################################
Function CELL_NOTE(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns annotation text
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_NOTE = v.Annotation.getText.getString
else
CELL_NOTE = v
endif
End Function
Function CELL_URL(vSheet,lRowIndex&,iColIndex%,optional n%)
'calls: getSheetCell
REM returns URL of Nth text-hyperlink from a cell, default N=1)
Dim v
If isMissing(n) then n= 1
If n < 1 then
CELL_URL = Null
exit function
endif
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
if v.Textfields.Count >= n then
CELL_URL = v.getTextfields.getByIndex(n -1).URL
else
Cell_URL = Null
endif
else
CELL_URL = v
endif
End Function
Function CELL_FORMULA(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM return unlocalized (English) formula
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_FORMULA = v.getFormula()
else
CELL_FORMULA = v
endif
End Function
Function CELL_STYLE(vSheet,lRowIndex&,iColIndex%,optional bLocalized)
'calls: getSheetCell
REM return name of cell-style, optionally localized
Dim v,s$,bLocal as Boolean
if not isMissing(bLocalized) then bLocal=cBool(bLocalized)
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
if bLocal then
s = thisComponent.StyleFamilies("CellStyles").getByName(v.CellStyle).DisplayName
else
s = v.CellStyle
endif
CELL_STYLE = s
else
CELL_STYLE = v
endif
End Function
Function CELL_LINE(vSheet,lRowIndex&,iColIndex%,optional n)
'calls: getSheetCell
REM Split by line breaks, missing or zero line number returns whole string.
REM =CELL_LINE(SHEET(),1,1,2) -> second line of A1 in this sheet
Dim v,s$,a(),i%
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
s = v.getString
if not isMissing(n) then i = cInt(n)
if i > 0 then
a() = Split(s,chr(10))
If (i <= uBound(a())+1)then
CELL_LINE = a(i -1)
else
CELL_LINE = NULL
endif
else
CELL_LINE = s
endif
else
CELL_LINE = v
endif
end Function
REM ################### RETURNING NUMBER #################################################
Function CELL_ISHORIZONTALPAGEBREAK(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_ISHORIZONTALPAGEBREAK = Abs(cINT(v.Rows.getByIndex(0).IsStartOfNewPage))
else
CELL_ISHORIZONTALPAGEBREAK = v
endif
End Function
Function CELL_ISVERTICALPAGEBREAK(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_ISVERTICALPAGEBREAK = Abs(cINT(v.Columns.getByIndex(0).IsStartOfNewPage))
else
CELL_ISVERTICALPAGEBREAK = v
endif
End Function
Function CELL_CHARCOLOR(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns color code as number
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_CHARCOLOR = v.CharColor
else
CELL_CHARCOLOR = v
endif
End Function
Function CELL_BACKCOLOR(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns color code as number
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_BACKCOLOR = v.CellBackColor
else
CELL_BACKCOLOR = v
endif
End Function
Function CELL_VISIBLE(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns visibility state as number 0|1
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_VISIBLE = Abs(v.Rows.isVisible)
else
CELL_VISIBLE = v
endif
End Function
Function CELL_LOCKED(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns locked state as number 0|1
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_LOCKED = Abs(v.CellProtection.isLocked)
else
CELL_LOCKED = v
endif
End Function
Function CELL_NumberFormat(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM returns the number format index
Dim v
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
CELL_NumberFormat = v.NumberFormat
else
CELL_NumberFormat = v
endif
End Function
Function CELL_NumberFormatType(vSheet,lRowIndex&,iColIndex%)
'calls: getSheetCell
REM return a numeric com.sun.star.util.NumberFormat which describes a format category
Dim v,lNF&
v = getSheetCell(vSheet,lRowIndex&,iColIndex%)
if vartype(v) = 9 then
lNF = v.NumberFormat
CELL_NumberFormatType = ThisComponent.getNumberFormats.getByKey(lNF).Type
else
CELL_NumberFormatType = v
endif
End Function
'################### HELPERS FOR ABOVE CELL FUNCTIONS #########################################
Function getSheet(byVal vSheet)
REM Helper for sheet functions. Get cell from sheet's name or position; cell's row-position; cell's col-position
on error goto exitErr
select case varType(vSheet)
case is = 8
if thisComponent.sheets.hasbyName(vSheet) then
getSheet = thisComponent.sheets.getByName(vSheet)
else
getSheet = NULL
endif
case 2 to 5
vSheet = cInt(vSheet)
'Wow! Calc has sheets with no name at index < 0,
' so NOT isNull(oSheet), if vSheet <= lbound(sheets) = CRASH!
'http://www.openoffice.org/issues/show_bug.cgi?id=58796
if(vSheet <= thisComponent.getSheets.getCount)AND(vSheet > 0) then
getSheet = thisComponent.sheets.getByIndex(vSheet -1)
else
getSheet = NULL
endif
end select
exit function
exitErr:
getSheet = NULL
End Function
Function getSheetCell(byVal vSheet,byVal lRowIndex&,byVal iColIndex%)
dim oSheet
' print vartype(vsheet)
oSheet = getSheet(vSheet)
if varType(oSheet) <>9 then
getSheetCell = NULL
elseif (lRowIndex > oSheet.rows.count)OR(lRowIndex < 1) then
getSheetCell = NULL
elseif (iColIndex > oSheet.columns.count)OR(iColIndex < 1) then
getSheetCell = NULL
else
getSheetCell = oSheet.getCellByPosition(iColIndex -1,lRowIndex -1)
endif
End Function