[GIP] dropping ogc-server-statistics and analytics #5

Open
fvanderbiest opened this issue Mar 23, 2023 · 17 comments
Labels
GIP Pending Waiting for review

Comments

@fvanderbiest
Member

fvanderbiest commented Mar 23, 2023

Who ?

Camptocamp, with funding from MEL

Target Module

The ogc-server-statistics logger will be removed.
This also implies removing the analytics webapp and reworking the front console application.

What ?

As said above, we plan to remove the OGC logging feature from geOrchestra core.

Why ?

We're preparing the replacement of the older security-proxy by the georchestra-gateway.

  1. the georchestra-gateway is not able to host a log4j1 logger like ogc-server-statistics
  2. storing ogc access logs in a postgresql database was not a good idea, since it quickly became "big data"

How ?

Essentially git rm -rf analytics and git rm -rf ogc-server-statistics.

Any potential pitfalls and ways to circumvent them ?

There's no plan yet to provide an equivalent feature by <<insert here any fancy tech like ELK or ... >>
Maybe we should ?

When ?

One should expect geOrchestra 24.0 to be free from ogc-server-statistics, which means funding will be required to get the equivalent feature by then.

State of the vote:

PSC members vote
Fabrice Phung
François Van Der Biest
Pierre Mauduit
Landry Breuil
Stéphane Mével-Viannay
Maël Reboux
Pierre Jégo
Jean Pommier
Catherine Piton-Morales
@fvanderbiest fvanderbiest added GIP Pending Waiting for review labels Mar 23, 2023
@jeanpommier
Member

I agree on removing it, but we should consider a replacement solution. Overall, I have the feeling we should ease, or at least document, the integration of analytics tools, since it is quite a common need.

Maybe this could be an interesting workshop during the geocom/following codesprint

@fvanderbiest
Member Author

Overall, I have the feeling we should ease, or at least document, the integration of analytics tools, since it is quite a common need.

agreed

@jusabatier

storing ogc access logs in a postgresql database was not a good idea, since it quickly became "big data"

Concerning this point, what about using https://github.com/timescale/timescaledb ?
It allows compressing data and managing retention, if configured well.

From what I could see ES is a sinkhole for resources and not easy to use.

Using a timescaledb database with well-configured data retention, combined with a solution like Grafana and its dashboards, can be an alternative to the current analytics tools.

@fvanderbiest
Member Author

Concerning this point, what about using https://github.com/timescale/timescaledb ?

Love it !
Thanks for the hint.

@jeanpommier
Member

Concerning this point, what about using https://github.com/timescale/timescaledb ?

@jusabatier Can you remind us how you use it / feed the logs into it ?

@landryb
Member

landryb commented Mar 24, 2023

fwiw i use influxdb for similar needs, but it's on the same level as timescaledb. For log "ingestion", promtail & https://github.com/grafana/loki are used to send metrics to influxdb, but you can also use telegraf or fluentd for the logs.

https://linuxfr.org/news/loki-centralisation-de-logs-a-la-sauce-prometheus
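
To make the "ingestion" step concrete, here is a minimal Python sketch of the kind of field extraction such a pipeline performs on each access-log line. It is not promtail or telegraf configuration. It assumes nginx's default "combined" log format and OGC key-value request parameters (SERVICE, REQUEST, LAYERS, TYPENAME). The function name `parse_ogc_request` and the sample line are hypothetical.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Pattern for the leading fields of nginx's default "combined" log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def parse_ogc_request(line):
    """Return a dict of OGC-related fields, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    url = urlsplit(m.group("target"))
    # OGC KVP parameter names are case-insensitive, so normalise them.
    params = {k.lower(): v[0] for k, v in parse_qs(url.query).items()}
    layers = (params.get("layers") or params.get("typename")
              or params.get("typenames") or "")
    return {
        "time": m.group("time"),
        "user": m.group("user"),
        "path": url.path,
        "service": params.get("service", "").upper(),
        "request": params.get("request", ""),
        "layers": [name for name in layers.split(",") if name],
        "status": int(m.group("status")),
    }


# Hypothetical sample line, for illustration only.
SAMPLE = (
    '192.0.2.10 - - [24/Mar/2023:10:15:32 +0100] '
    '"GET /geoserver/wms?SERVICE=WMS&REQUEST=GetMap&LAYERS=ns:roads,ns:rivers HTTP/1.1" '
    '200 5120 "-" "QGIS"'
)

if __name__ == "__main__":
    print(parse_ogc_request(SAMPLE))
```

Whichever shipper is used, the resulting fields (service, request, layers, user) are what a dashboard would aggregate on.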

@jusabatier

Can you remind us how you use it / feed the logs into it ?

Here is an example config to feed a database via log4j2 using JDBC appenders (commented) : https://github.com/georchestra/cadastrapp/blob/master/cadastrapp/src/main/resources/log4j2.properties

It's fed the same way as postgresql, since timescaledb is a postgresql extension.

And you can find how to configure retention in the timescaledb docs : https://docs.timescale.com/timescaledb/latest/how-to-guides/data-retention/
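
As a rough illustration of the retention setup mentioned above, the Python sketch below only builds the SQL statements one might run against a TimescaleDB-enabled database. The table name `ogc_access_log`, its columns and the intervals are hypothetical placeholders, not the schema used by ogc-server-statistics. `create_hypertable`, `add_compression_policy` and `add_retention_policy` are TimescaleDB functions, but their exact options should be checked against the docs for the installed version.

```python
def timescale_setup_sql(table="ogc_access_log", time_col="ts",
                        retention="6 months", compress_after="7 days"):
    """Build SQL for a hypertable with compression and retention policies.

    Table name, columns and intervals are hypothetical placeholders.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS timescaledb;",
        # Plain PostgreSQL table; the columns are only an example.
        f"CREATE TABLE IF NOT EXISTS {table} ("
        f"{time_col} TIMESTAMPTZ NOT NULL, username TEXT, service TEXT, "
        f"request TEXT, layer TEXT, org TEXT);",
        # Turn the table into a hypertable partitioned on the time column.
        f"SELECT create_hypertable('{table}', '{time_col}', if_not_exists => TRUE);",
        # Enable compression, then compress chunks older than compress_after.
        f"ALTER TABLE {table} SET (timescaledb.compress, "
        f"timescaledb.compress_segmentby = 'service');",
        f"SELECT add_compression_policy('{table}', INTERVAL '{compress_after}');",
        # Drop chunks older than the retention interval.
        f"SELECT add_retention_policy('{table}', INTERVAL '{retention}');",
    ]


if __name__ == "__main__":
    for statement in timescale_setup_sql():
        print(statement)
```

Since inserts are ordinary SQL, the JDBC appender itself would not need to know that the target table is a hypertable.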

@pierrejego
Member

Any potential pitfalls and ways to circumvent them ?
There's no plan yet to provide an equivalent feature by <<insert here any fancy tech like ELK or ... >>
Maybe we should ?

Like @jusabatier, @landryb and @jeanpommier, I think ogcstatistics should not be fully removed until some replacement is found.
For example, we could remove it from the console, but keep the possibility to install the ogcstatistics apps when using security-proxy.

Another idea, without changing the architecture or adding more frameworks: since elasticsearch and kibana are already installed, we could probably test simple log ingestion via logstash and create a route to make Kibana accessible to admin users, with a specific dashboard.

Why ?
We're preparing the replacement of the older security-proxy by the georchestra-gateway.

This point will probably need more explanation. I know we already spoke about it, but it deserves a dedicated discussion when the time comes. Do you have an idea of when you'd like to replace security-proxy with georchestra-gateway ?

@pierrejego
Member

From what I could see ES is a sinkhole for resources

True, but already installed for Geonetwork4 so could be shared

@landryb
Member

landryb commented Mar 28, 2023

From what I could see ES is a sinkhole for resources

True, but already installed for Geonetwork4 so could be shared

that's more or less discouraged, as kibana is configured for GN indexes only, and there's some hairy url rewriting being done too...

@jeanpommier
Member

And the default usage made by GN4 (index metadata only) is quite moderate, which allows a "relatively light" ES setup.

Logs are known to quickly become massive data, especially if we want some retention time, which we will need for analytics.

I'd rather have some experiments first with lighter tools like loki. How far did you go with Loki, @landryb ?

@landryb
Member

landryb commented Mar 28, 2023

How far did you go with Loki, @landryb ?

i have a promtail/loki/grafana dashboard with nginx metrics for mapserver/mapproxy logs. This was a poc done by students in 2021. It's been running in production for 2 years.

i never got around to digging further into it, to expand it for other needs and fine-tune it, but the logic is sound, and it's lightweight.

$ ps aux | egrep '(loki|grafana|promtail)'
loki          91  0.1  0.8 965396 74720 ?        Ssl   2022 492:09 /opt/loki/loki-linux-amd64 -config.file /opt/loki/config.yml
promtail      94  0.2  0.3 1508404 27440 ?       Ssl   2022 880:21 /opt/promtail/promtail-linux-amd64 -config.file /opt/promtail/config.yml
grafana   385891  0.1  1.0 1937616 88932 ?       Ssl  Mar23   8:15 /usr/share/grafana/bin/grafana server --config=/etc/grafana/grafana.ini --pidfile=/run/grafana/grafana-server.pid --packaging=deb cfg:default.paths.logs=/var/log/grafana cfg:default.paths.data=/var/lib/grafana cfg:default.paths.plugins=/var/lib/grafana/plugins cfg:default.paths.provisioning=/etc/grafana/provisioning

the loki datadir takes 14Gb with only those nginx metrics from 2 years.

i've other truenas/proxmox dashboards in grafana but those are not related to logs parsing.
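
For sizing purposes, the figure above can be turned into a growth rate with a back-of-the-envelope calculation. The sketch assumes "14Gb" means 14 gigabytes (decimal units), a 365-day year, and constant growth over the two years. None of these assumptions is confirmed in the thread, and the function names are hypothetical.

```python
def daily_growth_mb(total_gb, years):
    """Average daily growth in MB, using decimal units and 365-day years."""
    return total_gb * 1000 / (years * 365)


def projected_size_gb(total_gb, years, retention_days):
    """Size a retention window would hold if the growth rate stayed constant."""
    return daily_growth_mb(total_gb, years) * retention_days / 1000


if __name__ == "__main__":
    print(f"{daily_growth_mb(14, 2):.1f} MB/day")
    print(f"{projected_size_gb(14, 2, 365):.1f} GB for one year of retention")
```

Under those assumptions this comes to roughly 19 MB per day, or about 7 GB for a one-year retention window.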

@fvanderbiest
Member Author

OK, so, to sum up, we have two technical solutions here:

  • one based on postgresql/timescaledb/grafana
  • one based on promtail/loki/grafana

Do we do POCs or is there one emerging from a technical / strategic POV ?

Naturally, I would favor timescaledb since it adds fewer stacks on top of the already existing components, and offers the potential to rewrite the analytics backend if we want to provide key metrics in the console or anywhere else.

@bchartier

I apologize in advance for inserting noise in this discussion:

  • any update from the discussions on this topic that occurred yesterday during geOcom 2023?
  • any thought about InfluxDb for this specific use case?

@landryb
Member

landryb commented Jun 2, 2023

I apologize in advance for inserting noise in this discussion:

* any update from the discussions on this topic that occurred yesterday during geOcom 2023?

there were some tests during the community sprint with loki and ES as alternatives, @jeanpommier can give more details

* any thought about InfluxDb for this specific use case?

iirc influxdb is more used for metrics coming from telegraf, loki stores its data in its own database

@fvanderbiest
Member Author

Florent Berault from MEL says that his analytics needs go beyond OGC WMS/WFS/etc.
Ideally the Data API should also be part of it.
Since it is "OGC API Features"-based, it makes a lot of sense to me.
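
To illustrate what covering both families of requests could mean in practice, here is a minimal Python sketch that classifies a request target either as a classic OGC key-value call (WMS, WFS, ...) or as an "OGC API Features" call. The `/collections/{collectionId}/items` path shape comes from the OGC API - Features standard. The `/data/ogcapi` prefix in the usage example, the function name and the returned field names are assumptions for illustration only, not the actual Data API layout.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Path shape from OGC API - Features:
# .../collections/{collectionId}/items[/{featureId}]
FEATURES_ITEMS = re.compile(
    r"/collections/(?P<collection>[^/]+)/items(?:/(?P<feature>[^/]+))?/?$"
)


def classify_request(target):
    """Classify a request target as OGC KVP, OGC API Features, or other."""
    url = urlsplit(target)
    # KVP parameter names are case-insensitive, so normalise them.
    params = {k.lower(): v[0] for k, v in parse_qs(url.query).items()}
    if "service" in params:
        return {
            "kind": "ogc-kvp",
            "service": params["service"].upper(),
            "operation": params.get("request", ""),
        }
    m = FEATURES_ITEMS.search(url.path)
    if m:
        return {
            "kind": "ogc-api-features",
            "collection": m.group("collection"),
            "operation": "feature" if m.group("feature") else "items",
        }
    return {"kind": "other"}


if __name__ == "__main__":
    # Hypothetical request targets.
    print(classify_request("/geoserver/wfs?service=WFS&request=GetFeature"))
    print(classify_request("/data/ogcapi/collections/roads/items?limit=10"))
```

A single log pipeline could then emit the same kind of record for both families, with the collection id playing the role that the layer name plays for WMS/WFS.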

@jeanpommier
Member

Hi @fvanderbiest
Could you be more specific about what it would imply ? What would be expected ?

@MaelREBOUX MaelREBOUX added In review This proposal is currently reviewed and removed In review This proposal is currently reviewed labels Oct 10, 2023

7 participants